The Impact of Data Augmentations on Deep Learning-Based Marine Object Classification in Benthic Image Transects

Data augmentation is an established technique in computer vision to foster the generalization of training and to deal with low data volumes. Most data augmentation and computer vision research is focused on everyday images such as traffic data. In the past, the application of computer vision techniques in domains like the marine sciences has proven less straightforward due to special characteristics such as very low data volumes and class imbalance, caused by costly manual annotation by human domain experts and generally low species abundances. However, the data volume acquired today with moving platforms that collect large image collections from remote marine habitats, like the deep benthos, for marine biodiversity assessment and monitoring makes the use of automatic computer vision-based detection and classification inevitable. In this work, we investigate the effect of data augmentation in the context of taxonomic classification in underwater, i.e., benthic, images. First, we show that established data augmentation methods (i.e., geometric and photometric transformations) perform differently on marine image collections than on established image collections like the Cityscapes dataset, which shows everyday traffic images. Some of the methods even decrease the learning performance when applied to marine image collections. Second, we propose new data augmentation combination policies motivated by our observations, compare their effect to those proposed by the AutoAugment algorithm, and show that the proposed augmentation policy outperforms the AutoAugment results for marine image collections. We conclude that in the case of small marine image datasets, background knowledge and heuristics should sometimes be applied to design an effective data augmentation method.


Introduction
Underwater imaging with autonomous or remotely operated vehicles such as AUVs (autonomous underwater vehicles [1]) or ROVs (remotely operated vehicles [2]) allows visual assessments of large remote marine habitats through large image collections, with 10²–10⁴ images collected in one dive. One of the main purposes of these image collections is to provide valuable information about the biodiversity of marine life. To detect and classify species or morphotypes in these images, machine learning has been proposed, with some early promising results obtained in the last decade with pre-deep learning methods such as support vector machines (SVM) [3][4][5] or, more recently, with convolutional neural networks (CNN) [6][7][8][9][10][11][12][13][14][15][16][17][18][19]. The application of CNNs to taxonomic classification problems in this marine image domain features some characteristics that separate this field of research from the majority of CNN applications in the context of human civilization (like traffic image classification/identification, quality control, observation, manufacturing). First, the process of image collection is very expensive, as it involves costs for ship cruises, ROV/AUV hardware, operator personnel, advanced planning, maneuvering, and special camera equipment. Second, the number of domain experts available to collect the taxonomic labels needed to build training and test data is limited. In addition, the task of taxonomic classification in the images is very difficult, as the domain experts usually have only a single image of an organism, in contrast to the traditional approach of collecting samples from the seafloor and intensive visual inspection of the specimen in the laboratory. Another problem is that, due to the natural structure of food webs and communities, some species are rare by nature and appear in just a few images, while others can be observed in hundreds of images, which leads to a strong class imbalance.
Similar problems can be observed in aerial images collected for environmental monitoring. As a consequence, the field of marine image classification may sometimes require individual approaches to pattern recognition problems shaped by the characteristics listed above.
Data augmentation (DA) is a standard approach to overcome training problems caused by limitations in training data and by over-fitting. In the case of image classification, image augmentation algorithms can be broadly classified into deep learning approaches, for example, adversarial training [20], neural style transfer [21], feature space augmentation [22,23], generative adversarial networks (GAN) [24,25], and meta-learning (AutoAugment [26], Smart Augmentation [27]), and basic image manipulation augmentations, for instance, geometric transformations (horizontal flipping, vertical flipping, random rotation, shearing), color space transformations (contrast modulation, brightness adjustment, hue variation), random erasing, mixing images [28,29], and kernel filters [30]. Deep learning approaches and basic image manipulation augmentations do not form a mutually exclusive dichotomy. In this work, we focus on the effectiveness of the most broadly used and readily available basic image manipulation operations on marine images. However, little literature is available on methodological approaches to (i) select one or more data augmentation method(s) for a given image domain, or (ii) employ (or improve) the effect of DA in the marine image domain in particular. While individual successful applications of DA have been reported [14,17,19,31], there are also reports of unsuccessful DA applications that led to decreasing performance [31][32][33].
Recently, a small number of concepts have been proposed for combining DA methods. Shorten and Khoshgoftaar [33] describe that it is important to consider the 'safety' of an augmentation, which is somewhat domain-dependent and provides a challenge for developing generalizable augmentation policies. Cubuk et al. met this challenge by proposing the AutoAugment algorithm [26] to automatically search for augmentation policies from a dataset. Building on this work, Fast AutoAugment [34] optimized the search strategy, which speeds up the search time. Shijie et al. [35] compared the performance of several DA methods and their combinations on the CIFAR-10 and ImageNet datasets. They found that four individual methods generally perform better than others, and that some appropriate combinations of methods are slightly more effective than the individual methods. However, although all these approaches provide highly valuable new methods, a domain-specific investigation of DA effectiveness is missing, especially for special domains like medical imaging, aerial images, digital microscopy, or, as in this case, benthic marine images.
As already explained, the marine imaging domain challenges deep learning applications with a permanent lack of annotated data. On the other hand, marine benthic images are often recorded with a downward-looking camera (sometimes referred to as "orthoimages") and feature a higher degree of regularity (for instance, regarding illumination or camera-object distance). This can increase the potential of machine learning-based classification in benthic images compared to everyday benchmark image data in computer vision documenting, for instance, different aspects of human activities (e.g., traffic scenes, social activities). Such images constitute the mainstream in CNN application domains and show less regularity in this regard due to changing weather, light conditions, camera distance to the object, etc. Thus, we hypothesize that marine images may show special characteristics regarding the effectiveness of particular DA methods. These characteristics need to be thoroughly investigated, as DA can be considered one of the most promising tools to overcome problems of missing or unbalanced training data in marine imaging. To show the effect of different DA methods in the context of deep learning classification in marine images, we first report results from exhaustive comparative experiments using single DA methods. Based on our findings, we propose different combinations of augmentation methods, referred to as augmentation policies, and show a significant improvement for our marine-domain datasets.

Data Sets
To show the effect of different DA approaches, we conduct experiments with two marine-domain datasets: the Porcupine Abyssal Plain (PAP) dataset, recorded in the northeast Atlantic Ocean to the southwest of the United Kingdom [36][37][38], and the Clarion Clipperton Zone (CCZ) dataset, recorded in the Pacific Ocean in a region known for its rich deposits of manganese nodules [39]. Both were collected with AUVs at depths of several thousand meters and show deep-sea benthos with different species. In addition to these two marine-domain datasets, we conduct a series of data augmentation experiments on the Cityscapes dataset [40], referred to as CSD, which was collected from annotated traffic videos of urban street scenes.

PAP
In this work, we choose the following four categories (see Figure 1) to form Γ PAP for our experiments, ensuring that a trustworthy number of test samples remains.

CCZ

We choose the categories shown in Figure 4 to form the Γ CCZ dataset for the experiments in this work. Sample sizes of each class in the Γ CCZ dataset are shown in Table 2.

CSD
We choose two classes (see Figure 5) from the vehicle group of the CSD to generate the dataset Γ CSD (shown in Table 3) for our experiments in this work.

Model and Evaluation Criteria
In this work, we use a MobileNet-v2 pre-trained on ImageNet [41] to investigate several augmentation policies. Image data are resized to 224 px × 224 px and normalized based on the ImageNet dataset. We use Adam as the optimizer in our experiments and set a learning rate of 1 × 10⁻⁴, accompanied by a step decay with a step size of 1 and a gamma of 0.1. The loss function used in this work is cross-entropy loss. For each epoch, we perform a train and a validation phase and compute the prediction accuracy Acc e,j = n e,j / N e,j , where n e,j and N e,j stand for the number of correct predictions and the total number of samples in phase j ∈ {train, validation} of epoch e, respectively. In the test phase, since the sample sizes of the classes in the test set are not consistent, we compute the prediction accuracy Acc k = n k / N k for each class, where n k and N k stand for the number of correct predictions for class k and the sample size of class k in the test set, respectively. In each experiment, we record the two trained models corresponding to the highest Acc e,train and the highest Acc e,validation over all epochs and apply them to the test set separately. The two inference results are averaged to obtain the average accuracy AA k for each class. As the last step, we compute the mean average accuracy mAA = 1/K ∑ k AA k , with K the number of classes, as the prediction accuracy in the test phase. The flowchart of our work is shown in Figure 6.

Figure 6. Flowchart of the work. Each DA method was applied to each train dataset and an individual model was trained and optimized using the validation data. The accuracy of the individually trained model was evaluated using the test data.
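As a minimal sketch, the evaluation described above can be expressed in a few lines of Python; the function names and the toy per-class counts are illustrative assumptions, not taken from the paper's code:

```python
def accuracy(n_correct, n_total):
    # Acc = n / N, used per epoch (train/validation) and per class (test)
    return n_correct / n_total

def mean_average_accuracy(correct_best_train, correct_best_val, class_sizes):
    """AA_k: average the per-class test accuracies of the two recorded models
    (highest train accuracy and highest validation accuracy), then
    mAA = (1/K) * sum_k AA_k over the K classes."""
    aa = [(accuracy(a, n) + accuracy(b, n)) / 2.0
          for a, b, n in zip(correct_best_train, correct_best_val, class_sizes)]
    return sum(aa) / len(aa)

# Toy example: 3 classes with 10, 50, and 40 test samples.
maa = mean_average_accuracy([8, 45, 30], [10, 40, 32], [10, 50, 40])
```

The averaging over the two models smooths out the choice between the best-train and best-validation checkpoints before the per-class accuracies are pooled into one score.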

Methods
In this work, we investigate the data augmentations implemented in torchvision of PyTorch during the training process. We describe data augmentation in the way that a training image X i is given, defined by X i = f (x i ), with x i the original image with index i and f () the transformation function executing the augmentation method. In this work, we apply RandomRotation f RR (x i , d) to rotate the image randomly within the angle range represented by d, RandomVerticalFlip f RVF (x i , p) to vertically flip the given image randomly with a given probability p, RandomHorizontalFlip f RHF (x i , p) to horizontally flip the given image randomly with a given probability p, and RandomAffine f RA (x i , t, s) to apply a random affine transformation (translation and shearing) to the image, keeping the center invariant, according to the parameters t and s. The color transformation f CT (x i , b, c, s, h) is used to randomly change the brightness, contrast, saturation, and hue of an image according to the values of the parameters b, c, s, and h, respectively.
We investigate the performance of DA methods on Γ PAP and Γ CCZ , determining the four best-performing ones. We propose six DA combination policies and apply them to Γ PAP , Γ CCZ and Γ CSD . To avoid randomness affecting the results during the experiments, we set a seed for fixing the following random number generators: CUDA, NumPy, Python, PyTorch, and cudnn.
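A common recipe for fixing the random number generators listed above is sketched below; the exact seed handling in the paper's code is not shown, so treat this as an assumed, typical implementation:

```python
import os
import random

import numpy as np
import torch

def set_seed(seed: int) -> None:
    """Fix the Python, NumPy, PyTorch, CUDA, and cuDNN random sources."""
    random.seed(seed)                          # Python
    np.random.seed(seed)                       # NumPy
    torch.manual_seed(seed)                    # PyTorch (CPU)
    torch.cuda.manual_seed_all(seed)           # CUDA (all devices; no-op without GPU)
    torch.backends.cudnn.deterministic = True  # cuDNN: deterministic kernels only
    torch.backends.cudnn.benchmark = False     # disable the cuDNN autotuner
    os.environ["PYTHONHASHSEED"] = str(seed)

set_seed(350)
a = torch.rand(2)
set_seed(350)
b = torch.rand(2)  # identical to a, since the generators were re-seeded
```

Disabling the cuDNN autotuner trades some speed for run-to-run reproducibility, which is what makes the DA comparisons in the experiments meaningful.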

Results
The experimental results are represented by the change in mAA between applying and not applying DA. We generate a heatmap for each experiment based on the change in mAA, with blue indicating positive increments and orange indicating decreases. The darker the color, the greater the change.

Experiment A: Performance of Data Augmentations on Γ PAP

Table 4. Performance of DA methods on Γ train 50 PAP. The impact of different DA methods and parameters on Γ train 50 PAP is revealed. The displayed percentage values describe the increase/decrease of mAA in percent. The value is color-coded from blue (increase in accuracy) over white (no effect) to orange (decrease in accuracy).

Table 4 reveals the performance of different DA methods and parameters when applied on Γ train 50 PAP with the seed set to 350. From this heatmap, we can see that almost all of the DA methods and parameters used in our experiments perform well. The best-performing DA methods and parameters on Γ train 50 PAP are RandomRotation with a parameter of 100°, Contrast with a parameter of 5, Brightness with a parameter of 2, and Shear with a parameter of (−30°, 30°, −30°, 30°). By applying these four, a significant improvement can be achieved on Γ train 50 PAP. When increasing the number of training images to 200 per class, we can see from Table 5 that the effect of DA on the improvement of average accuracy diminishes further. The four best-performing DA methods remain the same, and the best-performing parameters of RandomRotation, Brightness, Contrast, and Shear are 170°, 1.9, 4.5, and (−40°, 40°, −40°, 40°), respectively.

Table 5. Performance of DA methods on Γ train 200 PAP. The impact of different DA methods and parameters on Γ train 200 PAP classification performance is shown. The displayed percentage values describe the increase/decrease of mAA in percent. The value is color-coded from blue (increase in accuracy) over white (no effect) to orange (decrease in accuracy).

As the size of the training set increases, the best-performing parameters of the DA methods vary. The increment of mAA shows an almost proportional trend to the magnitude of the parameters of RandomRotation, Brightness, Contrast, and Saturation.
The performance of Shear reveals that parameters introducing more deformation yield a better augmentation effect, whether the deformation comes from shearing in one direction at a larger angle or from shearing in both directions at once.
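The blue-to-orange coding described above corresponds to a diverging colormap centered at zero. A minimal sketch of how such a heatmap could be produced is given below; the mAA changes, method names, and output filename are illustrative, not the paper's measurements:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# Hypothetical mAA changes in percent (rows: DA methods, columns:
# parameter settings) -- illustrative values only.
delta_maa = np.array([
    [1.2, 2.5, 3.1, 3.4],     # RandomRotation
    [0.8, 1.5, 2.0, 2.6],     # Brightness
    [-0.4, -1.1, -1.9, -2.5]  # Translate
])

fig, ax = plt.subplots()
# Diverging colormap symmetric around zero: positive changes render
# blue, negative ones in red/orange tones, white means no effect.
lim = np.abs(delta_maa).max()
im = ax.imshow(delta_maa, cmap="RdBu", vmin=-lim, vmax=lim)
ax.set_yticks(range(delta_maa.shape[0]))
ax.set_yticklabels(["RandomRotation", "Brightness", "Translate"])
# Print the percentage value into each cell, as in the paper's tables.
for (r, c), v in np.ndenumerate(delta_maa):
    ax.text(c, r, f"{v:+.1f}", ha="center", va="center")
fig.colorbar(im, ax=ax, label="change of mAA (%)")
fig.savefig("da_heatmap.png")
```

Pinning `vmin`/`vmax` symmetrically around zero is what keeps "no effect" mapped to the white center of the colormap.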

We also conduct experiments on Γ train 50 PAP when setting the seed to 3500, finding that the heatmap shows a similar trend to that with the seed set to 350. The results are shown in Appendix B.

Experiment B: Performance of Data Augmentations on Γ CCZ
In Experiment B, we apply the same DA methods and parameters to Γ CCZ to verify whether the observations obtained in Experiment A can be seen on another marine-domain dataset as well. The experimental results with the seed set to 350 are shown in Tables 6 and 7, and the results for Γ train 50 CCZ with the seed set to 3500 are shown in Appendix B.

Table 6. Performance of DA methods on Γ train 50 CCZ. The impact of different DA methods and parameters on Γ train 50 CCZ classification performance is displayed. The displayed percentage values describe the increase/decrease of mAA in percent. The value is color-coded from blue (increase in accuracy) over white (no effect) to orange (decrease in accuracy).

The heatmap shown in Table 6 presents the performance of different DA methods and parameters when applied on Γ train 50 CCZ with the seed set to 350. We find that the best-performing DA methods on Γ train 50 CCZ are the same as in Experiment A: RandomRotation, Brightness, Contrast, and Shear. On Γ train 50 CCZ, RandomRotation shows the best results with parameter 150°. Brightness, Contrast, and Shear work best with parameters 1.9, 4.5, and (−40°, 40°, −40°, 40°), respectively. Table 7 shows the performance of different DA methods and parameters when applied on Γ train 100 CCZ with the seed set to 350, where we increase the number of training samples from 50 to 100 per class. The effect of the DA methods, except for Saturation, becomes weaker or even negative as the number of training samples increases, as also shown in Table 5. The four most effective DA methods are still RandomRotation, Brightness, Contrast, and Shear, whose best-performing parameters are 170°, 2, 5, and 40°, respectively.

Table 7. Performance of DA methods on Γ train 100 CCZ. The impact of different DA methods and parameters on Γ train 100 CCZ classification mAA improvement is shown. The displayed percentage values describe the increase/decrease of mAA in percent. The value is color-coded from blue (increase in accuracy) over white (no effect) to orange (decrease in accuracy).
Experiment B investigates the effect of different DA methods and parameters on Γ CCZ with different training set sizes and shows results similar to the observations on Γ PAP. We find that RandomRotation, Contrast, Brightness, and Shear are always the four best-performing DA methods on both the Γ PAP and Γ CCZ marine-domain datasets. The effect of RandomRotation, Brightness, and Contrast becomes more significant as the parameters increase. Similarly, as the amount of training data increases, the effects of almost all DA methods diminish.

Experiment C: Performance of Data Augmentations on Γ CSD
We conduct experiments on Γ CSD, which differs from the marine domain, to demonstrate that the effect of DA is domain-dependent. The seed is set to 350 in Experiment C, and the experimental results for Γ train 50 CSD and Γ train 150 CSD are shown in Tables 8 and 9. The results for Γ train 100 CSD are shown in Appendix A. Table 8 illustrates the effect of different DA methods and parameters on the average accuracy improvement of Γ train 50 CSD. From this heatmap, we can see that RandomRotation no longer works as well as it did on the marine-domain datasets; only with small parameters does it improve the average accuracy slightly. Similarly, Shear with large angles decreases the mAA. In addition, RandomVerticalFlip is also no longer suitable for this dataset.

Table 8. Performance of DA methods on Γ train 50 CSD. The impact of different DA methods and parameters on Γ train 50 CSD classification performance, which is very different from the impact on Γ train PAP and Γ train CCZ, is displayed. The displayed percentage values describe the increase/decrease of mAA in percent. The value is color-coded from blue (increase in accuracy) over white (no effect) to orange (decrease in accuracy).

When the training set is increased to 150 samples per class, we can see from Table 9 that RandomRotation with large angles and RandomVerticalFlip still have a negative impact. The effect of Brightness, Contrast, and Saturation on Γ train 150 CSD improves with increasing parameter values, similar to Γ train 50 PAP and Γ train 50 CCZ. Experiment C investigates the performance of different DA methods and parameters on Γ CSD with different training set sizes. Overall, Saturation shows an effect almost proportional to the parameter value on all three datasets. The performance of Contrast and Brightness improves with increasing training data size.
RandomRotation can slightly increase mAA with parameters smaller than 10°, but has an increasingly negative effect on mAA as the parameters become larger. RandomHorizontalFlip and Hue can slightly improve mAA, while RandomVerticalFlip and Translate almost always reduce mAA. The effect of Shear no longer matches that in Experiments A and B, showing a negative effect at large deformation angles.

To compare the impact of the different DA methods on the three datasets more intuitively, the percentage change of classification mAA is plotted for different parameters in Figures 7 and 8 and Appendix C. In each plot, the x-axis represents the parameters, the y-axis represents the change in mAA, and circles, stars, and triangles stand for Γ PAP, Γ CCZ, and Γ CSD, respectively.

Table 9. Performance of DA methods on Γ train 150 CSD. The heatmap shows the impact of the DA methods and parameters on Γ train 150 CSD classification performance. The displayed percentage values describe the increase/decrease of mAA in percent. The value is color-coded from blue (increase in accuracy) over white (no effect) to orange (decrease in accuracy).

Experiment D: Data Augmentation Policies
According to the above experimental results, the four best-performing DA methods on both the Γ PAP and Γ CCZ datasets are RandomRotation, Contrast, Brightness, and Shear. In most cases, the best performance is achieved when the parameter of RandomRotation is around 150°, the parameters of Brightness and Contrast are around 2 and 5, respectively, and the parameters of Shear are those that produce larger deformations (we experiment with parameter values of 40°). To verify the effect of combining these DA methods, we propose the DA combination policies in Table 10, where, for example, RBC_1 indicates that RandomRotation 150°, Brightness 1.9, and Contrast 5 are applied to the training images successively.

Table 10. DA combination policies.

The performance of our policies on Γ PAP, Γ CCZ, and Γ CSD is shown in Table 11, where AA_IP and AA_CP represent the ImageNet policy and the CIFAR-10 policy proposed by AutoAugment [26], respectively. All our policies trained on Γ train 50 PAP, Γ train 100 PAP, Γ train 200 PAP, Γ train PAP, and Γ train 50 CCZ outperform the AA_IP and AA_CP policies. RBC_3, RBC_5, and RBCS trained on Γ train 100 CCZ and Γ train CCZ also outperform AA_IP and AA_CP. In contrast to their effect on Γ PAP and Γ CCZ, our policies have a negative effect on the Γ CSD dataset; here, the AutoAugment policies outperform ours.

Discussion
In this paper, we showed a clear domain dependence for the application of augmentation. While Experiments A and B, applying augmentation to marine data, show similar results, Experiment C, applied to the more established everyday traffic data, shows different trends. The same observation applies to Experiment D when comparing the AutoAugment policies fitted on everyday data (ImageNet and CIFAR-10) with our manual combination policies: the AutoAugment policies work best on the traffic data but lead to results inferior to the policies proposed by our experiments (RBC_*) on the marine data. From Experiment A (marine), we find that the effect of DA diminishes as the number of samples in the training set increases. This is because the additional training images allow the model to learn more features, which also indicates that DA is an effective way to address a lack of training data. The results of Experiments A and B (marine) indicate that for the marine domain, with increasing training sample size and different parameter choices, some DA methods can decrease the mAA (e.g., flip, translate), but RandomRotation, Brightness, Contrast, and Shear always show good results. This may be due to the natural variation in the orientation and position of the underwater target objects relative to the camera. In addition, illumination during underwater data acquisition has a significant effect on the image data. Overall, Experiments A and B show similar trends. However, due to the insufficient sample size of Γ test CCZ, the results of Experiment B are not as reliable as those of Experiment A.
In Experiment C (traffic), RandomRotation, RandomVerticalFlip, and Shear have a significantly different effect than in Experiments A and B, often even decreasing the performance. This is likely caused by the fact that a vertical flip or a rotation by a large angle is unrealistic in the urban traffic domain. On the Γ CSD dataset, the color transformations perform better than the geometric transformations.
From Table 11, we can see that the performance of our policies can reach or outperform that of the policies proposed by AutoAugment [26] when transferred to marine data, whereas our policies have a negative effect on the traffic dataset. As can be seen in Tables 4 and 5, the underlying DA methods perform well on the marine data; however, they have the opposite effect on the traffic-domain dataset Γ CSD, revealing that DA methods have a strong domain dependency.

Conclusions
In our work, we observed a clear difference between the effects of DA applied to our domain-specific marine datasets and to the more established everyday urban traffic dataset. Therefore, we propose to use DA with lower expectations, especially when applied to image domains that differ from the mainstream image domains CNNs are applied to. As we can show good results with a customized DA policy, we conclude that DA can definitely be the tool of choice, especially for small training sets, but increased efforts are required to make data augmentation more adaptive and domain-aware.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

Appendix A

Table A1. Performance of DA methods on Γ train 100 PAP when setting seed to 350. The displayed percentage values describe the increase/decrease of mAA in percent. The value is color-coded from blue (increase in accuracy) over white (no effect) to orange (decrease in accuracy). The four DA methods with the best effect are still RandomRotation, Brightness, Contrast, and Shear. These four methods obtain their best enhancement effect on Γ train 100 PAP with parameters 160°, 2, 5, and (−20°, 20°, −20°, 20°), respectively. The trend of the effect of the different parameters is similar to that of Γ train 50 PAP, but with smaller increments in average accuracy.

Table A2. Performance of DA methods on Γ train 100 CSD. This heatmap shows the results when we increase the training set to 100 samples per class. The displayed percentage values describe the increase/decrease of mAA in percent. The value is color-coded from blue (increase in accuracy) over white (no effect) to orange (decrease in accuracy). We can see that RandomRotation and RandomVerticalFlip have a negative impact. Saturation with a parameter of 9 and Brightness with a parameter of 1.7 are the two most effective DA methods and parameters.

Appendix B

Table A3. Performance of DA methods on Γ train 50 PAP with setting seed to 3500. The displayed percentage values describe the increase/decrease of mAA in percent. The value is color-coded from blue (increase in accuracy) over white (no effect) to orange (decrease in accuracy). The DA methods show a similar trend to that with the seed set to 350.

Table A4. Performance of DA methods on Γ train 50 CCZ with setting seed to 3500. The displayed percentage values describe the increase/decrease of mAA in percent. The value is color-coded from blue (increase in accuracy) over white (no effect) to orange (decrease in accuracy). This heatmap shows the effect of different DA methods and parameters on the average accuracy improvement of Γ train 50 CCZ when setting seed to 3500.

As with the experimental results on Γ train 50 PAP, the performance of DA is similar under the different seeds.