SalfMix: A Novel Single Image-Based Data Augmentation Technique Using a Saliency Map

Modern data augmentation strategies such as Cutout, Mixup, and CutMix have achieved good performance in image recognition tasks. In particular, data augmentation approaches that mix two images to generate a mixed training image, such as Mixup and CutMix, generalize convolutional neural networks better than single image-based approaches such as Cutout. We focus on the fact that such mixed images improve generalization ability, and we ask whether mixing can also be effective within a single image. Consequently, we propose a new data augmentation method, called SalfMix, that produces a self-mixed image based on a saliency map. Furthermore, we combine SalfMix with state-of-the-art two images-based approaches, such as Mixup, SaliencyMix, and CutMix, to further increase performance; we call this combination HybridMix. The proposed SalfMix achieves better accuracy than Cutout, and HybridMix achieves state-of-the-art performance on three classification datasets: CIFAR-10, CIFAR-100, and TinyImageNet-200. Furthermore, HybridMix achieves the best accuracy, in terms of mean average precision, in object detection tasks on the VOC dataset.


Introduction
Deep learning has achieved remarkable performance in various computer vision tasks such as image classification [1][2][3][4], segmentation [5,6], detection [7][8][9][10][11], and image quality assessment [12]. Generally, deep neural networks (DNNs) require large amounts of training data to achieve high performance. Data augmentation techniques can increase the limited size of training data and are important elements of the DNN training process for improving generalization performance. Data augmentation techniques were used to train AlexNet [13], and geometric approaches such as flip, rotation, crop, and translation have been used to reduce Top-5 error rates on ImageNet classification tasks [13,14]. In 2014, VGG neural networks were proposed, and the scale jittering data augmentation technique was introduced by [15]. The Cutout method, a representative data augmentation approach, performs regional dropout, where the pixel values of a randomly selected region of an input image are removed [16]. Regional dropout approaches have shown better recognition rates than previous geometric transformation strategies [16,17]. These data augmentation approaches operate on a single image, as shown in Figure 1.
In recent data augmentation studies, two training images are selected and mixed during network training, and the mixed images are used to train a convolutional neural network (CNN); Mixup [18] and CutMix [19] are representative examples. These techniques improve generalization performance beyond traditional single image-based approaches. Most recent works, such as SaliencyMix [20], PuzzleMix [21], ResizeMix [22], and SnapMix [23], focus on mixing two images. In particular, while CutMix cuts random patches and pastes them onto other images, saliency-guided approaches have recently been proposed and achieve better performance than the original CutMix. In this paper, we present two data augmentation strategies. SalfMix uses a saliency map for self-guidance. It is important to extract the features that allow the class of an input image to be predicted. The saliency map represents the importance of each region of the image for training and can be used to augment the image with more of its important features. SalfMix produces a self-mixed image from a single training image. It outperforms the Cutout method, one of the state-of-the-art single image-based data augmentation techniques [16] (e.g., a 19.89% error rate for SalfMix vs. 21.46% for Cutout on CIFAR-100 with PreActResNet-101). Additionally, we show that SalfMix can increase the state-of-the-art performance of two images-based data augmentation approaches, such as Mixup, SaliencyMix, and CutMix, by linearly combining the two approaches without any modification. We call this combination HybridMix. HybridMix does not lose important features even if the salient regions of the two images to be mixed overlap at the same location, because it applies SalfMix as well. We summarize the key differences among state-of-the-art data augmentation techniques, including our proposed approaches, in Table 1.
Table 1. Key differences among state-of-the-art data augmentation techniques: Cutout [16], Mixup [18], CutMix [19], SaliencyMix [20], SnapMix [23], ResizeMix [22], PuzzleMix [21], Self-Augmentation [24], SalfMix (ours), and HybridMix (ours).

The contributions of this paper can be summarized as follows:
• A new single image-based data augmentation method, SalfMix, uses a self-guidance method that finds the important region and conducts meaningful mixing within a single image.
• The proposed SalfMix replaces the least salient region with the most salient region, considering the spatial importance derived from a saliency map.
• We propose HybridMix, which simply combines SalfMix with two images-based approaches and shows the best performance among state-of-the-art data augmentation methods.

Randomly Patched Data Augmentation
Cutout and random regional dropout [16,17] are techniques that remove random rectangular parts of an input image and fill them with zeros or random values. Unlike dropout [25], these approaches remove regionally continuous parts at the input layer. They improve the generalization performance of a model in classification and object localization tasks.
Mixup [18] is a data augmentation technique that mixes two random images. Mixup blends the images and their labels using linear interpolation. It improves the classification performance of a model and demonstrates robustness to adversarial attacks when training with mixed images. However, Mixup produces images with locally ambiguous and unnatural characteristics that can confuse localization.
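Mixup's interpolation can be sketched as follows. In the original method the mixing coefficient λ is drawn from a Beta(α, α) distribution; a fixed value and toy-sized arrays are used here for illustration only.

```python
import numpy as np

def mixup(x1, y1, x2, y2, lam):
    """Blend two images and their one-hot labels by linear interpolation."""
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y

# toy example: two 2x2 grayscale "images" with one-hot labels
x1, y1 = np.zeros((2, 2)), np.array([1.0, 0.0])
x2, y2 = np.ones((2, 2)), np.array([0.0, 1.0])
x, y = mixup(x1, y1, x2, y2, lam=0.75)
# x is filled with 0.25 and y is [0.75, 0.25]
```

Because the labels are interpolated with the same coefficient as the pixels, the mixed label always remains a valid probability distribution.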
CutMix [19] was published in 2019, complementing Cutout and Mixup. CutMix is a data augmentation strategy in which a patch is cut from one image and pasted onto another: the patch location is randomly selected, and the image labels are mixed in proportion to the patch area. Compared to Cutout, this strategy minimizes information loss by replacing the removed area with a patch from another image, and it provides good performance in various image recognition tasks.
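A minimal NumPy sketch of the cut-and-paste step follows. The Beta-distributed area ratio of the original method is simplified to a uniform draw here, and the label is mixed by the actually pasted area fraction; these simplifications are ours.

```python
import numpy as np

def cutmix(x1, y1, x2, y2, rng):
    """Paste a random rectangular patch of x2 into x1 at the same location,
    and mix the one-hot labels in proportion to the pasted area."""
    h, w = x1.shape[:2]
    lam = rng.uniform(0.0, 1.0)                      # fraction of x1 to keep
    cut_h, cut_w = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    top = rng.integers(0, h - cut_h + 1)
    left = rng.integers(0, w - cut_w + 1)
    x = x1.copy()
    x[top:top + cut_h, left:left + cut_w] = x2[top:top + cut_h, left:left + cut_w]
    area = (cut_h * cut_w) / (h * w)                 # actually pasted fraction
    y = (1.0 - area) * y1 + area * y2
    return x, y

rng = np.random.default_rng(0)
x1, y1 = np.zeros((8, 8)), np.array([1.0, 0.0])
x2, y2 = np.ones((8, 8)), np.array([0.0, 1.0])
x, y = cutmix(x1, y1, x2, y2, rng)
# the mean of x equals the pasted area fraction, which equals y[1]
```

Mixing the label by the realized patch area (rather than the sampled λ) keeps the label consistent with the pixels even after the patch size is rounded to integers.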
ResizeMix [22] was published in 2020 to preserve more of the object information than CutMix. ResizeMix mixes a resized image rather than a cut patch. The resized image retains the original object's identity, which helps improve performance in image classification and object detection compared to CutMix. Self-Augmentation [24] first proposed a self-mix method that cuts a random patch from an image and pastes it back into the same image. The goal of that research was to improve generalization in few-shot learning scenarios; it did not show improvements in large-scale image recognition tasks such as classification, segmentation, and object detection. Although our approach looks similar to this method, it differs in that we use a saliency map that encodes the importance of each pixel, and we extend the method into a form combined with two images-based data augmentation.

Saliency-Guided Data Augmentation
PuzzleMix [21] significantly improves the performance of Mixup by using saliency information. In other words, a novel Mixup variant that explicitly utilizes saliency information was proposed, achieving better generalization and adversarial robustness than other Mixup methods.
SnapMix [23] is a semantically proportional mixing method that uses a class activation map [26]. In CutMix, noisy labels can occur because CutMix may select a random region that does not contain the object. SnapMix determines the label ratio based on semantic percentage maps, so such noisy labels are prevented. SnapMix achieves state-of-the-art performance for fine-grained recognition.
SaliencyMix [20] uses a hand-crafted saliency detection method originally designed for fast human detection in a scene [27]. SaliencyMix studies how to mix patches based on this saliency map and achieves state-of-the-art performance on various image classification datasets, such as CIFAR-10, CIFAR-100, and ImageNet.

Motivation and Problem Statement
We denote T as a data augmentation function. T_s(x_1) and T_t(x_1, y_1, x_2, y_2) represent single image-based and two images-based data augmentation functions, respectively, where x_1, x_2 are arbitrary training images and y_1, y_2 are their labels. The functions can be combined as follows:

x̂ = T_t(T_s(x_1), y_1, T_s(x_2), y_2), (1)

and x̂ is used for training DNNs. Based on this equation, single image-based data augmentation techniques such as Cutout can be integrated with two images-based approaches. Currently, most research focuses on developing two images-based data augmentation techniques. However, we believe that single image-based data augmentation still plays a large role in improving the generalization performance of DNNs. Our goal in this research is to develop a new single image-based data augmentation technique that synergizes with traditional two images-based approaches (e.g., Mixup, SaliencyMix, CutMix).
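The composition of T_s and T_t can be sketched directly. The stand-in functions below (a horizontal flip for T_s and an equal-weight Mixup for T_t) are illustrative placeholders, not the paper's actual choices.

```python
import numpy as np

def hybrid_augment(t_single, t_two, x1, y1, x2, y2):
    """Apply a single image-based augmentation T_s to each image,
    then mix the two results with a two images-based augmentation T_t."""
    return t_two(t_single(x1), y1, t_single(x2), y2)

# toy stand-ins for T_s and T_t (the real choices would be, e.g., SalfMix and Mixup)
t_single = lambda x: np.flip(x, axis=1)                        # horizontal flip
t_two = lambda a, ya, b, yb: (0.5 * (a + b), 0.5 * (ya + yb))  # 50/50 Mixup

x1, y1 = np.array([[0.0, 1.0]]), np.array([1.0, 0.0])
x2, y2 = np.array([[1.0, 0.0]]), np.array([0.0, 1.0])
x, y = hybrid_augment(t_single, t_two, x1, y1, x2, y2)
```

Because the two stages only share the image/label interface, any single image-based method can be slotted in as t_single without modifying the two images-based method.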

Proposed Method
To develop a new single image-based data augmentation technique, we adopt a saliency map-based self-guidance method. The self-guidance method indicates which part of an image should be removed or duplicated. Based on this self-guidance, we generate combined training samples at every epoch, which are used as training data. Figure 2 shows the overall process of the proposed method. First, two images are randomly selected from the training dataset. Then, a saliency map for each image is computed based on the network being trained. The most salient region is cropped and pasted over the least salient region in the same image. This step is repeated for the other selected image. The two images-based data augmentation is then applied to the two synthesized images.

Figure 2. Overall process of the proposed approach. Self-mixed images are created by replacing the least salient region with the most salient region. Then, a two images-based data augmentation approach is applied to create the HybridMix result.

Saliency Map Extraction
In this step, we calculate a spatial importance score of a given image for self-guidance. The least important region will be removed and replaced with the most important region. A saliency map is one choice for computing the spatial importance score. We compute gradients to generate the saliency map as follows [28]:

g = ∂L(f(x; θ), y) / ∂x, (2)

where L(·) is a cross-entropy loss function, and x ∈ R^(W×H×3) and y denote an input image and its true label, respectively. W and H are the width and height of the input image, respectively. f: x → y denotes a CNN function, and θ is the set of parameters of f. The loss value is obtained by a forward pass through the CNN. Next, we backpropagate the loss and obtain the gradient g.
Based on g ∈ R^(W×H×3), we create a saliency map M ∈ R^(W×H) that indicates the importance of each pixel of the input image. g(i, j, c) is the gradient value at the i-th column and j-th row of the c-th channel (e.g., an RGB channel) of the input image. The (i, j)-th pixel value of the saliency map is the maximum absolute value over all channels:

M(i, j) = max_{c ∈ {0, 1, 2}} |g(i, j, c)|, (3)

where the numbers 0, 1, and 2 indicate the red, green, and blue channels, respectively.
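The channel-wise reduction of Equation (3) can be sketched in NumPy as follows, assuming the gradient g of Equation (2) has already been computed by a framework-specific backward pass:

```python
import numpy as np

def saliency_map(g):
    """Per-pixel importance as the maximum absolute gradient value
    over the RGB channels; g has shape (W, H, 3)."""
    return np.abs(g).max(axis=-1)

g = np.array([[[0.1, -0.5, 0.2]],
              [[-0.3, 0.0, 0.1]]])   # a 2x1 "image" with 3 channels
M = saliency_map(g)
# M[0, 0] = 0.5 and M[1, 0] = 0.3
```

Taking the maximum of the absolute values makes a pixel salient if any of its channels strongly influences the loss, regardless of the gradient's sign.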

SalfMix
The next step is to apply average pooling to the saliency map. Average pooling is used to determine the importance of a region rather than a single pixel. M is transformed as illustrated in the average-pooled image of Figure 2; we denote the pooled map as M̃. In Figure 2, the second image (labeled "saliency map") shows M and the third image (labeled "average pooled") shows M̃. For the average pooling, we use H/n as both the kernel size and the stride, with no padding, so the size of M̃ is n × n. We compute the locations of the largest and smallest values in M̃:

(i*_m, j*_m) = argmax_{(i, j)} M̃(i, j), (i*_l, j*_l) = argmin_{(i, j)} M̃(i, j). (4)

To convert these coordinates back to the original image coordinates, we compute

(i_m, j_m) = (i*_m × W/n, j*_m × H/n), (i_l, j_l) = (i*_l × W/n, j*_l × H/n). (5)

We uniformly and randomly select a square patch P_m that includes the area R_m spanning from (i_m, j_m) to (i_m + W/n, j_m + H/n); in other words, P_m covers a wider area than R_m, with R_m ⊂ P_m. A parameter r determines the patch size o × o via o = W × r. P_l is selected in the same way, using i_l and j_l instead of i_m and j_m; the region R_l spans from (i_l, j_l) to (i_l + W/n, j_l + H/n). Finally, we replace patch P_l with patch P_m. The procedure of SalfMix is illustrated in Figure 3, and Algorithm 1 shows the final algorithm. Here, f is the model being trained, and it is used to calculate the saliency map through Equation (2) in line 4 of Algorithm 1. We obtain the self-mixed image x̂ using this algorithm.

Figure 3. The SalfMix procedure. A saliency map is computed by Equation (2); next, an average-pooled map is created through average pooling. Then, the most and least salient regions are detected in the average-pooled map. Patches P_m and P_l are randomly chosen such that they include the detected regions R_m and R_l. Finally, the pixel values in P_m are copied to the region of P_l in the same image.

Algorithm 1 SalfMix
1: Input: f, x, y, n, r // model, training image, label, parameters
2: Output: x̂ // self-mixed image
3:
4: M ← Compute the saliency map from x and y using Equations (2) and (3).
5: M̃ ← Apply average pooling to produce an n × n map.
6: P_m, P_l ← Obtain the most and least salient patches with r.
7: x̂ ← Replace P_l with P_m in the input image x.
8: return x̂
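Algorithm 1 can be sketched in NumPy as follows. The saliency map M is assumed to be precomputed from Equations (2) and (3), the image is assumed square, and the patch-corner sampling bounds are our own simplification of the constraint that each patch must contain its detected region.

```python
import numpy as np

def salfmix(x, M, n, r, rng):
    """SalfMix sketch: average-pool the saliency map M to n x n, locate the
    most and least salient cells, and copy an o x o patch around the most
    salient cell over the least salient one.  x: (W, H, C) image, M: (W, H)."""
    W, H = M.shape
    kw, kh = W // n, H // n
    # average pooling with kernel = stride = (W/n, H/n), no padding
    Mt = M[:n * kw, :n * kh].reshape(n, kw, n, kh).mean(axis=(1, 3))
    im, jm = np.unravel_index(Mt.argmax(), Mt.shape)   # most salient cell
    il, jl = np.unravel_index(Mt.argmin(), Mt.shape)   # least salient cell
    o = int(W * r)                                     # patch side length

    def corner(i, j):
        # sample a corner so the o x o patch contains the cell and stays in bounds
        ci = rng.integers(max(0, i * kw + kw - o), min(i * kw, W - o) + 1)
        cj = rng.integers(max(0, j * kh + kh - o), min(j * kh, H - o) + 1)
        return ci, cj

    mi, mj = corner(im, jm)
    li, lj = corner(il, jl)
    x_hat = x.copy()
    x_hat[li:li + o, lj:lj + o] = x[mi:mi + o, mj:mj + o]
    return x_hat

rng = np.random.default_rng(0)
x = np.arange(64, dtype=float).reshape(8, 8, 1)      # toy 8x8 single-channel image
M = np.zeros((8, 8))
M[0, 0] = 10.0                                       # most salient pixel (top-left cell)
M[7, 7] = -10.0                                      # least salient pixel (bottom-right cell)
x_hat = salfmix(x, M, n=4, r=0.5, rng=rng)
# the top-left 4x4 patch is copied over the bottom-right 4x4 patch
```

The sketch requires r ≥ 1/n so that the o × o patch is at least as large as one pooled cell, mirroring the paper's R ⊂ P relationship.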

HybridMix
A self-mixed image x̂ is generated through the SalfMix process described in Section 4.3. SalfMix can easily be integrated with two images-based data augmentation approaches, such as Mixup, SaliencyMix, and CutMix, as shown in Equation (1). Consequently, we present three types of HybridMix, as shown in Table 2. Table 2. Three types of HybridMix.

Version: Two-Image-Based Mix Function T_t
HybridMix v1: Mixup(x̂_1, y_1, x̂_2, y_2)
HybridMix v2: SaliencyMix(x̂_1, y_1, x̂_2, y_2)
HybridMix v3: CutMix(x̂_1, y_1, x̂_2, y_2)

Finally, the training algorithm with HybridMix is shown in Algorithm 2. In this algorithm, X = {X_1, ..., X_N_b} and Y = {Y_1, ..., Y_N_b} denote the total training set, where X_1, ..., X_N_b and Y_1, ..., Y_N_b represent the mini-batches of training data and labels, respectively. N_b is the number of mini-batches, and N_e is the total number of epochs. In line 8 of Algorithm 2, z is an integer index of another image to be mixed. In line 9, the subscript id denotes the d-th sample of the i-th mini-batch, and iz denotes the z-th sample of the i-th mini-batch, where z is a randomly sampled index.
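One HybridMix v1 training step over a mini-batch can be sketched as follows. The self-mix function t_s is left as a placeholder for the SalfMix of Algorithm 1, the partner index z is drawn as a random permutation of the batch, and α = 1.0 is an illustrative Beta parameter; these are our assumptions, not settings from the paper.

```python
import numpy as np

def hybridmix_v1_batch(X, Y, t_s, rng, alpha=1.0):
    """HybridMix v1 on one mini-batch: self-mix every image with t_s (SalfMix),
    then Mixup each sample with a randomly chosen partner from the same batch."""
    z = rng.permutation(X.shape[0])          # partner index z for each sample
    lam = rng.beta(alpha, alpha)             # Mixup coefficient
    Xs = np.stack([t_s(x) for x in X])       # self-mixed images x̂
    X_out = lam * Xs + (1.0 - lam) * Xs[z]
    Y_out = lam * Y + (1.0 - lam) * Y[z]
    return X_out, Y_out

rng = np.random.default_rng(0)
X = rng.random((4, 8, 8, 3))                 # toy mini-batch of 4 images
Y = np.eye(4)                                # one-hot labels
X_out, Y_out = hybridmix_v1_batch(X, Y, t_s=lambda x: x, rng=rng)
```

Swapping the Mixup step for a CutMix- or SaliencyMix-style patch paste yields v3 and v2, respectively, without changing the self-mix stage.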

Experimental Settings
Dataset. To demonstrate the effectiveness of the proposed method, we evaluated it on multiple datasets: CIFAR-10, CIFAR-100, TinyImageNet-200, and VOC 2007/2012 [14,29,30]. The CIFAR-10, CIFAR-100, and TinyImageNet-200 datasets are used to evaluate classification performance, and the VOC 2007/2012 datasets are used to evaluate object detection performance. CIFAR-10 and CIFAR-100 each comprise 50 K training images and 10 K test images of size 32 × 32; they have 10 and 100 classes, respectively. TinyImageNet-200, a smaller dataset than the original ImageNet [14] with fewer classes (200), comprises 100 K training images and 10 K test images of size 64 × 64. The VOC 2007/2012 dataset contains 20 object categories, and we used 16 K and 1.2 K images for training and testing, respectively. The datasets are summarized in Table 3. Implementation. Our code was implemented using PyTorch [31] for the classification tasks and Detectron2 [32] for detection. The experiments were conducted on a Quadro RTX 8000 GPU for the classification tasks and four A100 GPUs for the object detection tasks.
Parameter Settings for Classification Task. We used PreActResNet-18, PreActResNet-50, and PreActResNet-101 as baseline architectures for the classification experiments [33]. The number denotes the number of layers (e.g., PreActResNet-50 has 50 layers), and the models have 11 M, 24 M, and 43 M parameters, respectively. We trained the models for 300 epochs. On the CIFAR [29] datasets, the initial learning rate was 0.05, multiplied by 0.1 at epochs 150 and 225. On TinyImageNet-200 [34], the initial learning rate was the same as for CIFAR, but it was multiplied by 0.1 at epochs 75, 150, and 225.
The mini-batch size was set to 64 and 128 for CIFAR and TinyImageNet-200, respectively. Stochastic gradient descent was used with a momentum of 0.9 and weight decay of 1 × 10 −4 . The size of the saliency map that passed average pooling was 4 × 4 and 8 × 8 for CIFAR and TinyImageNet-200, respectively. The patch size of P m and P l were set according to r = 0.3.
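The CIFAR learning-rate schedule described above (initial rate 0.05, decayed by 0.1 at epochs 150 and 225) can be written as a small helper. This is an equivalent re-expression of the stated settings, not code from the paper.

```python
def learning_rate(epoch, base_lr=0.05, milestones=(150, 225), gamma=0.1):
    """Step learning-rate schedule: multiply by gamma at each milestone epoch."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

# CIFAR schedule: 0.05 for epochs 0-149, 0.005 for 150-224, 0.0005 afterwards
```

The TinyImageNet-200 schedule follows the same pattern with milestones (75, 150, 225).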
The parameter values for each data augmentation method were set as the default values used in their corresponding studies [16,18,19,22]. Cutout uses regional dropout, and its mask size is set to 0.5 ratio of the input image size. Mixup, CutMix, and SaliencyMix use the same beta distribution as a ratio to mix two images. In ResizeMix, α and β to determine the image patch size were set to 0.1 and 0.8, respectively. For geometric data augmentation, random resized crop, random horizontal flip and normalization were used for the CIFAR dataset, and additional color jittering and brightness were used for TinyImageNet-200.
Parameter Settings for Detection Task. For the object detection task, ResNet architecture with 50 layers was used. We trained the ResNet on the TinyImageNet-200 dataset, and the experimental setting is the same as the TinyImageNet-200 experiment for classification tasks. We transferred it to the detection task of VOC 2007/2012 datasets. Parameters for batch size, learning rate, and training iterations were set to 8, 0.02 and 24 k, respectively. The learning rate was multiplied by 0.1 at 18 k and 22 k iterations.
We also conducted additional experiments using other architectures: PyramidNet-110 [35] and RegNet-200M [36], which have 1.7 and 2.3 million parameters, respectively. The experimental settings for PyramidNet-110 are the same as in the CutMix paper [19], and the settings for RegNet-200M were adopted from the source code (https://github.com/yhhhli/RegNet-Pytorch, accessed on 1 August 2021). Table 7 shows the experimental results on the TinyImageNet-200 dataset. The largest performance improvement was achieved with PreActResNet-101, the deepest architecture: an improvement of 7.33% over the baseline and 1.58% over CutMix using HybridMix v3. Similarly, HybridMix v1 and HybridMix v2 showed 2.29% and 0.58% improvements over Mixup and SaliencyMix, respectively.

Transferring to Object Detection Task
We used ResNet50 models trained with HybridMix v1, v2, and v3 as the backbone of Faster R-CNN [37]. Data augmentation techniques were used only to train the backbone: the ResNet50 models were pre-trained on TinyImageNet-200 and then used as backbone networks for Faster R-CNN, which was fine-tuned on the VOC 2007/2012 training data. To confirm the effect of HybridMix on object detection, ResNet models were also independently trained using other two images-based augmentation approaches: Mixup, SaliencyMix, CutMix, and ResizeMix [18][19][20][22]. We used mean average precision (mAP) as the evaluation metric; mAP_50 and mAP_75 denote mAP calculated at IoU = 0.5 and IoU = 0.75, respectively. Table 8 shows the object detection results on the VOC 2007/2012 dataset. The HybridMix v1-, v2-, and v3-trained backbones showed 1.47%, 3.65%, and 4.52% improvements in terms of mAP_50 over the baseline, respectively. Each version of HybridMix also outperformed the other two images-based data augmentations. HybridMix v3, the combination of CutMix and SalfMix, achieved a 1.34% improvement over CutMix and the best performance in terms of mAP_50. Compared to the Cutout-trained model, the SalfMix-trained model showed a 0.89% improvement in terms of mAP_50, which indicates that copying and pasting within a single image is effective for improving object detection accuracy.

Visualization of Training Graph
We visualize training graphs for the TinyImageNet-200 experiments of Section 5.3, as shown in Figure 4. The graphs are based on the PreActResNet-50 model, and we compare HybridMix v1, v2, and v3 with the corresponding two images-based approaches Mixup, SaliencyMix, and CutMix, respectively. Interestingly, the training error rates of HybridMix are higher than those of Mixup and CutMix, but HybridMix v1 and v3 achieve lower test error rates. The higher training error rates are caused by adding the SalfMix process, which provides better generalization capability.

Figure 4. Training graphs of two images-based approaches (Mixup [18], SaliencyMix [20], and CutMix [19]) and HybridMix v1, v2, and v3 on TinyImageNet-200 with PreActResNet-50.

Effectiveness of the Saliency Map
Cutout [16], a representative single image-based data augmentation approach, performs regional dropout without considering saliency when it drops a region of an image. We compared Cutout with and without the saliency map to show the effectiveness of the saliency map: we obtained a saliency map from an image and dropped out the least salient region P_l. Table 9 shows the experimental results using PreActResNet-101. Interestingly, the saliency map is also effective in improving Cutout: Cutout with the saliency map achieves 0.11%, 0.59%, and 0.31% performance improvements over Cutout without it on the CIFAR-10, CIFAR-100, and TinyImageNet-200 datasets, respectively. Similarly, we examined the effectiveness of the saliency map in SalfMix. The original SalfMix replaces the P_l patch with the P_m patch using the saliency map; for SalfMix without the saliency map, P_m and P_l are randomly selected. SalfMix with the saliency map achieved 0.07%, 0.81%, and 1.86% improvements on the CIFAR-10, CIFAR-100, and TinyImageNet-200 datasets, respectively.

Effectiveness of SalfMix
We demonstrated that dropping the least salient region performs better than randomly dropping a region in Cutout. This means that salient regions are important for training, and it is better to remove the least salient regions during data augmentation. A drawback of Cutout is that the dropped region is filled with zeros or random values, whereas SalfMix fills it with a copy of the most salient region. We compared the Top-1 classification error rates of Cutout with a saliency map and SalfMix on the three datasets; the results are listed in Table 9. SalfMix achieved improvements of 0.05%, 0.98%, and 2.16% over Cutout with a saliency map on CIFAR-10, CIFAR-100, and TinyImageNet-200, respectively.

Hyperparameter r
The hyperparameter r, which determines the patch size, is important for the performance of the proposed method, so we varied it on CIFAR-10. Varying r over the interval [0.1, 0.9] in increments of 0.1, the best performance was achieved for r between 0.3 and 0.5, as shown in Figure 5. The experiment shows that our method can achieve better performance than CutMix, Mixup, SaliencyMix, and ResizeMix, which are state-of-the-art data augmentation techniques, without carefully choosing r.

Effectiveness of Average Pooling
When SalfMix is applied to an input image during training, a saliency map is obtained and then average-pooled to a certain size (e.g., 4 × 4 or 8 × 8), as shown in Figure 2. To assess the effect of this pooling on finding (i_m, j_m) and (i_l, j_l) in the SalfMix process, we compare the proposed method with and without average pooling applied to the saliency map.
The experimental setup is the same as in Section 5.1, and the datasets used are CIFAR-10, CIFAR-100, and TinyImageNet-200. We used HybridMix v3 with PreActResNet-101. Without average pooling, the saliency map is 32 × 32 for the CIFAR datasets and 64 × 64 for TinyImageNet-200, and the SalfMix process is otherwise unchanged. As shown in Table 10, the method using average pooling performed better on CIFAR-10, CIFAR-100, and TinyImageNet-200 by 0.05%, 0.11%, and 0.64%, respectively.
Without average pooling, the saliency map has the same size as the image, so the highest- and lowest-scoring locations are found at the pixel level, where single noisy pixels can dominate. With average pooling, the highest- and lowest-scoring regions are found at the region level. We consider this smoothing of the saliency map to be the reason for the better performance.
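The smoothing effect can be illustrated with a toy map: a pixel-level argmax latches onto an isolated spike, while the pooled argmax selects the consistently salient region. The map values below are synthetic.

```python
import numpy as np

def avg_pool(M, n):
    """Average-pool a (W, H) saliency map down to n x n."""
    W, H = M.shape
    kw, kh = W // n, H // n
    return M[:n * kw, :n * kh].reshape(n, kw, n, kh).mean(axis=(1, 3))

# a noisy single-pixel spike at (0, 0) vs. a broadly salient region at bottom-right
M = np.zeros((8, 8))
M[0, 0] = 1.0          # isolated noise
M[4:8, 4:8] = 0.4      # consistently salient region
pixel_peak = np.unravel_index(M.argmax(), M.shape)      # (0, 0): the noise wins
Mt = avg_pool(M, 2)
region_peak = np.unravel_index(Mt.argmax(), Mt.shape)   # (1, 1): the region wins
```

In the pooled map, the spike is diluted over its cell (1/16 = 0.0625) while the broad region keeps its full value (0.4), so the region-level argmax is more robust.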

Visualizing Augmented Data
In the case of SalfMix, images are mixed almost randomly at the beginning of training because the saliency map is not yet accurate. The visualization of augmented data shows that a relatively accurate saliency map is found after a small number of epochs (e.g., e < 10), as shown in Figure 6. For the flagpole image in the first, second, and third columns, SalfMix selected the patches randomly at epoch 1; after 10 epochs, the part containing the flagpole was correctly detected as the most salient region. Similarly, for the second image shown in the fourth, fifth, and sixth columns, the extracted region was poor at epoch 1, but after a few epochs, SalfMix estimated the saliency map accurately. The tendency was similar for the two labels lady bug and labrador retriever.

Conclusions and Future Works
In this paper, we proposed a novel single image-based data augmentation technique using a saliency map. The proposed technique replaces the least salient region of an image with its most salient region during training. Through several ablation studies, we demonstrated the effectiveness of the saliency map and of SalfMix. We showed that SalfMix outperforms Cutout, one of the state-of-the-art single image-based data augmentation techniques, on the CIFAR-10, CIFAR-100, TinyImageNet-200, and VOC 2007/2012 datasets. Furthermore, the HybridMix technique, which combines SalfMix with two images-based data augmentation strategies, achieved performance improvements on CIFAR-10, CIFAR-100, and TinyImageNet-200. In particular, HybridMix v3 with PreActResNet-101 showed state-of-the-art performance on these three datasets. Additionally, the HybridMix v3-trained model improved performance by 1.34% in terms of mAP_50 over CutMix on the VOC 2007/2012 dataset. Our future work will consider improving the training speed and experimenting on segmentation tasks.