Use of Generative Adversarial Networks (GAN) for Taphonomic Image Augmentation and Model Protocol for the Deep Learning Analysis of Bone Surface Modiﬁcations

: Deep learning models are based on a combination of neural network architectures, op-timization parameters and activation functions. All of them provide exponential combinations whose computational ﬁtness is difﬁcult to pinpoint. The intricate resemblance of the microscopic features that are found in bone surface modiﬁcations make their differentiation challenging, and determining a baseline combination of optimizers and activation functions for modeling seems necessary for computational economy. Here, we experiment with combinations of the most resolutive activation functions (relu, swish, and mish) and the most efﬁcient optimizers (stochastic gradient descent (SGD) and Adam) for bone surface modiﬁcation analysis. We show that despite a wide variability of outcomes, a baseline of relu–SGD is advised for raw bone surface modiﬁcation data. For imbalanced samples, augmented datasets generated through generative adversarial networks are implemented, resulting in balanced accuracy and an inherent bias regarding mark replication. In summary, although baseline procedures are advised, these do not prevent to overcome Wolpert’s “no free lunch” theorem and extend it beyond model architectures.


Introduction
The use of computer vision (CV) through deep learning (DL) methods has substantially modified the resolution of taphonomic studies. It initially showed that, on small samples (n ≤ 100), CV yielded an accuracy of classification > 90% when human experts were systematically producing < 60% correct identifications of tested bone surface modifications (BSM). In larger samples, the difference between CV and experts becomes exponential. Preliminary models have been produced, using deep convolutional neural networks (DCNN), that correctly identify images of tooth, cut, and trampling marks [1]. DCNN models go as far as to differentiate cut marks imparted on bones when carcasses were fleshed or defleshed [2]. These methods are even capable of detecting BSM morphing through dynamic impact of biostratinomic abrasion processes [3]. Traditional taphonomic studies do not have the power to identify carnivore agency when carnivorous mammals affect bone assemblages. DL models have successfully and variably differentiated among several diverse carnivore types [4]. Some DL models have even successfully classified tooth marks from different felid types [5,6]. All this shows the promising path ahead in the use of these techniques for taphonomic research.
However, optimal DL model construction is not easy, since it involves combinations of multiple variables, namely the architecture of the model, the choice of transfer and ensemble learning procedures, the selection of activation functions and the choice of optimizers, among others. There currently is no protocol for which of these combinations is most adequate in the use of BSM images for taphonomic analyses. Most of the models mentioned above are based on imbalanced samples and their balanced accuracy is variable. In the present work, our objectives are two-fold: we intend to provide a baseline protocol based on combinations of different architectures, activation functions and optimizers, and we will show the convenience of including, in these protocols, data augmentation procedures for coping with unbalanced samples.
Image augmentation has frequently been done using morphing of the currently existing datasets by modifying spatial properties, such as rotation, zoom, flipping, cropping, translation, kernel filters, noise introduction, and others [7][8][9][10]. DCNNs need large datasets for training in order to efficiently learn, and small samples hinder the process by either biasing it or by overfitting the training and underperforming on the testing. Data augmentation has been essential to avoid these issues. One major improvement over traditional image augmentation techniques has been the development of generative adversarial networks (GAN) [11,12]. GANs are capable of, not just modifying existing images, but creating new images that could diversify within sample variability. This has boosted the efficiency of medical imaging, where samples for specific pathologies are always limited [13][14][15][16]. Here, we will use GAN data augmentation with the goal of improving existing models and making accuracy more balanced. However, we will present some problems associated with GAN methods that require some caution in how these methods are implemented in BSM studies.

Phase 1: Parameter Selection and Model Protocol
In order to test the best architecture and parameters for BSM DL analysis, we selected some of the most powerful existing DCNN models, some of them successfully used for BSM classification [1]: VGG16, ResNet50, Densenet, and Jason2. All these models, but the latter, were used through transfer learning. Transfer learning consists of using models that were trained for a different problem and use their feature-learning weights, which are already pre-trained [17], for retraining on new image datasets. Here, some of the most high-performing models trained on more than 1,000,000 images for the 1000-image category ILSVRC competition were used. These pre-trained models were used as standalone feature extractors and classifiers. The layers of each pre-trained model with their weights were integrated within the new models used here containing an output dense layer containing 128 neurons. This was implemented through the Keras API. For a summary of the description of these architectures see [1,5]. In previous modeling, BSM images were high resolution (80 × 400 pixels) [1]. Here, we adopted a lower resolution approach, since experimentation showed that model accuracy was not affected. For this reason, we reshaped the original rectangular BSM images into 64 × 64-pixel images. The original images were captured using a binocular microscope (Optika) at 30×. The resulting BSM image data bank was composed of 488 cut marks, 106 tooth marks, plus 63 marks from trampling experiments. Cut marks were made with simple flakes. Tooth marks were obtained in experiments of bones modified by lions and wolves. Trampling marks were obtained from experiments using different times of bone exposure to diverse sand-grain sizes. For a detailed description of the BSM samples refer to [1]. For a more in-depth description of the experiments, see [2,5,18]. All images were transformed into black and white during image processing in the Keras platform (with a Tensorflow backend), by using bidimensional matrices for standardization and centering. The architectures were designed to address a multiple classification problem. For this reason, the "softmax" activation was used on the last dense layer. This function is specific for multinomial classification. It provides the probabilities of each input element of pertaining to a specific label. During compilation, categorical cross-entropy was used as the loss function. We used Tensorflow 2.4.1 and Keras 2.4.2.
The architectures were trained on 70% of the original BSM image set (n = 459). The resulting models were subsequently tested against the 30% remaining sample, which constituted the testing set (n = 197). Training and testing were performed through minibatch kernels (size = 32) because it is the minimum recommended. Models were run using a backpropagation process for 50 epochs.
The two parameters selected for intra-and inter-model performance comparison were the activation functions and the optimizers. Following experimentally-based recommendations, we selected the "relu" activation as the baseline function [7,17,19,20]. Relu (rectified linear activation) has become the default activation function for DL models because it deals efficiently with vanishing gradient problems (decreased error through backpropagation), the saturation on the ends and the sensitivity focused in the mid-points of other activation functions like sigmoid or tanh [20]. The "relu" function returns a value equivalent to the input if activated or zero if inactivated. The formula is rather simple: Recently, a purported improvement over "relu" has been suggested by the Google Brain team under a new activation function named "swish", whose formula is f (x) = x · sigmoid(x) [21,22]. Swish is a monotonic smooth function that implements a sigmoid curvature at the base of the ramp and it is more flexible for activation/deactivation. It is supposed to work better in deep architectures (>40 layers) because of the problems in correctly activating deep layers. Here, we will compare the performance of both functions for BSM image analysis.
"Mish" is another recent function innovation. The formula is a little more complex: . This latter is the softplus activation function [23]. "Mish" is not limited in its upper slope values, which avoids saturation and deals efficiently with zero-gradient problems. In contrast, the slope is bounded at the bottom, which enables better performance at avoiding overfitting during training. It has been argued that this function outperforms "relu" and "swish" [23]. We tried alternative functions, such as "tanh" and "LeakyReLU" and the results were not as good as with "relu", "swish" and "mish".
The second parameter that we will analyze and compare is the optimizer type. Optimizers try to minimize the loss function by tweaking weight parameters. In the case of the optimizers, the responsibility of the analyst is bigger than in the selection of the activation functions, because optimizers need to be tuned. For example, convergence to an optimum depends on how fast or slow gradients operate. This is why selection of the learning rate is important. Regularization can also be implemented to avoid overfitting. Other parameters may also be selected. Gradients (which are partial derivatives connecting the loss function to the weights) show any minor or major impact that any modification of parameters does to the weights. The goal is to reduce the loss function through gradient descent. One of the most widely used optimizers is stochastic gradient descent (SGD). This optimizer uses only batches or random selection of cases in the original training sample, instead of all the cases contained therein. It is used in combination with another technique called "momentum", which consists of accumulating the gradient values of past steps along the descent trajectory in order to determine directionality. It is initiated at 0.5 and it increasingly must reach 0.9 over subsequent iterations. Here, we will use SGD as the baseline optimizer. The default parameters were: Learning rate = 1 × 10 −3 ; momentum = 0.9.
Adam (adaptive moment estimation) has been proposed as another of the most efficient optimizers [24]. It takes another optimizer (Adagrad, adaptive gradient algorithm) as a baseline and modifies it by scaling the learning rate instead of averaging it. It maintains an exponentially decreasing average of previous gradients. The algorithm implements an exponential modification of the average of the moving gradients and squares the gradient. We selected the VGG16 and Jason 2 models and tested them with alternative optimizers: SGD, Adam, Adagrad, Adadelta, and RMSProp. Given that the best results were obtained with SGD and Adam, we decided to use these two optimizers as competitors in combination with different activation functions. This created a network of combinations including: four different DL architectures, three activations functions and two optimizer types. This resulted in 24 different combinations or models.
Model evaluation was based on the joint consideration of accuracy, loss and F1-scores ( Table 1). The latter was determinant given that it provides a good indication on how good balanced accuracy is. We considered the two best combinations for each model. The first one was labeled "best" and the second one as "good", with the remainder being labeled as "normal". This was applied to the six function-optimizer combinations of each model. To properly assess the importance of each function and optimizer, a Bayesian network was used. This was based on score-searching structure learning algorithms, namely a hill-climbing algorithm with different scores [25]: the Bayesian information criterion (BIC) score, which is equivalent to the minimum description length (MDL), the multinomial log-likelihood (loglik), the Bayesian Dirichlet equivalent (BDE) score, the Bayesian Dirichlet sparse score (BDS), and the locally averaged Bayesian Dirichlet (BDLA) score [26]. The final network selected was the one showing the lowest score (i.e., Akaike-BIC). The "bnlearn" R library was used for this analysis (www.r-project.org, accessed on 2 February 2021).

Phase 2: GAN-Augmented Sampling and Model Testing
Subsequently to the parameter and model testing, the most successful combinations of model-function-optimizer were selected and used as the protocol for the complete model analysis with image augmentation. For this last stage, the original highly unbalanced sample was augmented with GAN. Given that, originally, they were between five and eight times smaller than the cut mark subsample, we artificially augmented only the tooth mark and trampling mark datasets by adding 500 images for each category. In order to generate images for each class, two main requirements were to be fulfilled. First, the generated images should be large enough to be fed to the discriminator. On the other hand, in order to train the algorithm, few images were made available. Despite that modern image generation models complying with these two requisites do exist, as the recently published F2GAN [27], technical constraints made this and other approaches, such as DAGAN [28] non-viable.
As a result, in order to tackle the problem, a divide and conquer strategy was adopted: a simple GAN for creating high-fidelity low-resolution images and a super resolution GAN (SRGAN) that would upscale the generated images into a resolution optimum to train the model. Regarding the GAN, a simple approach was taken, using a simple GAN [11], but with small modifications. In the case of the generator, the initialization of the input vector followed a standard normal distribution. This vector was convolved into a squared array with 128 channels and width 64. After a LeakyReLU activation, the data were convolved and upsampled using a Conv2DTranspose layer, normalized using batch normalization, and then activated with a LeaykyReLU. This process was done twice, using a momentum of 0.8 in batch normalization and an alpha of 0.2 in LeakyReLU in both cases. As a result, the generation returned 64 × 64 grayscale images ( Figure 1).

Phase 2: GAN-Augmented Sampling and Model Testing
Subsequently to the parameter and model testing, the most successful combinations of model-function-optimizer were selected and used as the protocol for the complete model analysis with image augmentation. For this last stage, the original highly unbalanced sample was augmented with GAN. Given that, originally, they were between five and eight times smaller than the cut mark subsample, we artificially augmented only the tooth mark and trampling mark datasets by adding 500 images for each category.
In order to generate images for each class, two main requirements were to be fulfilled. First, the generated images should be large enough to be fed to the discriminator. On the other hand, in order to train the algorithm, few images were made available. Despite that modern image generation models complying with these two requisites do exist, as the recently published F2GAN [27], technical constraints made this and other approaches, such as DAGAN [28] non-viable.
As a result, in order to tackle the problem, a divide and conquer strategy was adopted: a simple GAN for creating high-fidelity low-resolution images and a super resolution GAN (SRGAN) that would upscale the generated images into a resolution optimum to train the model. Regarding the GAN, a simple approach was taken, using a simple GAN [11], but with small modifications. In the case of the generator, the initialization of the input vector followed a standard normal distribution. This vector was convolved into a squared array with 128 channels and width 64. After a LeakyReLU activation, the data were convolved and upsampled using a Conv2DTranspose layer, normalized using batch normalization, and then activated with a LeaykyReLU. This process was done twice, using a momentum of 0.8 in batch normalization and an alpha of 0.2 in LeakyReLU in both cases. As a result, the generation returned 64 × 64 grayscale images ( Figure 1). In order to train the discriminator, both real and fake images were used, applying a sequence of convolution, LeakyReLU, and a dropout at a rate of 0.5. This process was done four times. Including the dropout layer is a small change from the original implementation, which avoided the discriminator to learn the images and create overfitting or chances of collapse mode, improving the model's performance. For the generation of trampling and tooth marks, the GAN was run through 7000 epochs of training. After this, the model ended up generating realistic images (Figure 2). In order to train the discriminator, both real and fake images were used, applying a sequence of convolution, LeakyReLU, and a dropout at a rate of 0.5. This process was done four times. Including the dropout layer is a small change from the original implementation, which avoided the discriminator to learn the images and create overfitting or chances of collapse mode, improving the model's performance. For the generation of trampling and tooth marks, the GAN was run through 7000 epochs of training. After this, the model ended up generating realistic images (Figure 2). Appl. Sci. 2021, 11, x FOR PEER REVIEW 6 of 13 Regarding the execution of the super resolution GAN, a literal implementation of the defined super-resolution GAN [29] was made. As a result, the generated images significantly increased their resolution, going from 64 × 64-pixel images to 256 × 256 pixel and, eventually, to 1024 × 1024-pixel images (Figure 3). Finally, it was decided that the 256 × 256 images were the ones to be added to the model. The code structure is added to this paper.
After completing the GAN-augmented dataset, the two most successful classification models obtained from the combination of activation function/optimizer were used to test their efficiency on the augmented data.

Phase 1: Parameter Selection and Model Protocol
There was a major improvement over previous modeling of the same dataset [1], with most models providing an accuracy close to 95% (Table 1). In this case, the differences among models were rather minor. Sequential models like VGG16 and Jason 2 performed very well and at least one of the parallel architectures (ResNet50) also performed very efficiently. As a matter of fact, the latter model provided the highest accuracy, the lowest loss and the best-balanced accuracy, as reported by the F1-scores. Within these architectures, the choice of function had a minor impact in model performance ( Table 1). The optimizer was more relevant to the final results. In most models, if considering the best two Regarding the execution of the super resolution GAN, a literal implementation of the defined super-resolution GAN [29] was made. As a result, the generated images significantly increased their resolution, going from 64 × 64-pixel images to 256 × 256 pixel and, eventually, to 1024 × 1024-pixel images (Figure 3). Finally, it was decided that the 256 × 256 images were the ones to be added to the model. The code structure is added to this paper.
After completing the GAN-augmented dataset, the two most successful classification models obtained from the combination of activation function/optimizer were used to test their efficiency on the augmented data. Regarding the execution of the super resolution GAN, a literal implementation of the defined super-resolution GAN [29] was made. As a result, the generated images significantly increased their resolution, going from 64 × 64-pixel images to 256 × 256 pixel and, eventually, to 1024 × 1024-pixel images ( Figure 3). Finally, it was decided that the 256 × 256 images were the ones to be added to the model. The code structure is added to this paper.
After completing the GAN-augmented dataset, the two most successful classification models obtained from the combination of activation function/optimizer were used to test their efficiency on the augmented data.

Phase 1: Parameter Selection and Model Protocol
There was a major improvement over previous modeling of the same dataset [1], with most models providing an accuracy close to 95% (Table 1). In this case, the differences among models were rather minor. Sequential models like VGG16 and Jason 2 performed very well and at least one of the parallel architectures (ResNet50) also performed very efficiently. As a matter of fact, the latter model provided the highest accuracy, the lowest loss and the best-balanced accuracy, as reported by the F1-scores. Within these architectures, the choice of function had a minor impact in model performance ( Table 1). The optimizer was more relevant to the final results. In most models, if considering the best two

Phase 1: Parameter Selection and Model Protocol
There was a major improvement over previous modeling of the same dataset [1], with most models providing an accuracy close to 95% (Table 1). In this case, the differences among models were rather minor. Sequential models like VGG16 and Jason 2 performed very well and at least one of the parallel architectures (ResNet50) also performed very efficiently. As a matter of fact, the latter model provided the highest accuracy, the lowest loss and the best-balanced accuracy, as reported by the F1-scores. Within these architectures, the choice of function had a minor impact in model performance ( Table 1). The optimizer was more relevant to the final results. In most models, if considering the best two combinations for each model, SGD produced better performance of the network resulting in higher accuracy and lower loss (Table 1). SGD displayed the best performance of the "relu" (37.5%), the "swish" (12.5%) and the "mish" (12.5%) functions (adding results qualified as "best" and "good"). In contrast, Adam only managed to get a lower performance of the "relu" (25%) and the "swish" (12.5%) functions (Table 2). In total, 62.5% of the best performing combinations were achieved by SGD and only 37.5% was obtained with Adam. This, despite that three of the four best single combinations were obtained with Adam: Adam-relu (2), Adam-swish (1) and SGD-relu (1) ( Table 2). The Bayesian networks supported this interpretation and expanded it. The "BIC" (score = −71.52577), the "loglik" (score = −56.68172), the "bde" (score = −72.85762), the "bda" (score = −72.85762), and the "bdla" (score = −72.71031) hill-climbing score factors indicated by a majority of four to one, that the differences among the activation functions were not that relevant (Figure 4). It was the optimizer that made the difference.
ci. 2021, 11, x FOR PEER REVIEW 7 of 13 combinations for each model, SGD produced better performance of the network resulting in higher accuracy and lower loss (Table 1). SGD displayed the best performance of the "relu" (37.5%), the "swish" (12.5%) and the "mish" (12.5%) functions (adding results qualified as "best" and "good"). In contrast, Adam only managed to get a lower performance of the "relu" (25%) and the "swish" (12.5%) functions (Table 2). In total, 62.5% of the best performing combinations were achieved by SGD and only 37.5% was obtained with Adam. This, despite that three of the four best single combinations were obtained with Adam: Adam-relu (2), Adam-swish (1) and SGD-relu (1) ( Table 2). The Bayesian networks supported this interpretation and expanded it. The "BIC" (score = −71.52577), the "loglik" (score = −56.68172), the "bde" (score = −72.85762), the "bda" (score = −72.85762), and the "bdla" (score = −72.71031) hill-climbing score factors indicated by a majority of four to one, that the differences among the activation functions were not that relevant (Figure 4). It was the optimizer that made the difference. This analysis of parameter combinations suggests that, although several combinations must be tried for every dataset, the most parsimonious baseline model should be a combination of "relu" and SGD. Tuning is mandatory, since the best model obtained in this study was the combination of "relu-SGD" (accuracy = 97.9; F1-score = 0.8) and "swish-SGD" (accuracy = 98.4; F1-score = 0.77) under the ResNet50 architecture. This analysis of parameter combinations suggests that, although several combinations must be tried for every dataset, the most parsimonious baseline model should be a combination of "relu" and SGD. Tuning is mandatory, since the best model obtained in this study was the combination of "relu-SGD" (accuracy = 97.9; F1-score = 0.8) and "swish-SGD" (accuracy = 98.4; F1-score = 0.77) under the ResNet50 architecture.

Phase 2: GAN-Augmented Sampling and Model Testing
The GAN-augmented data yielded slightly lower accuracy rates for the testing sets than the raw data (Table 3). It produced similar estimates to previous testing using highresolution images [1]. In contrast, it also yielded higher balanced accuracy estimates ( Table 3). The greater increase in F1-scores is the most modified outcome. In this case, the results are highly variable depending on the combination and not a clear strategy is visible. With "SGD", "swish" produced better results with some models but not others, and the same is observed for "Adam" and "relu". Overall, VGG16 and ResNet50 were the best models. In contrast, the Jason2 model was less successful and its accuracy stagnated several points below, ranging from 57.1 to 88.9 (Table 3). It also yielded lower estimates of balanced accuracy; however, it still outperforms its efficiency when using the non-augmented dataset, by having improved the balanced accuracy by almost ten points in some of the combinations [1]. Within each model, the activation function plays a minor role in the outcome, and the optimizer seems more relevant, but less so than with the raw dataset. It is the model architecture that makes a difference.

Phase 2: GAN-Augmented Sampling and Model Testing
The GAN-augmented data yielded slightly lower accuracy rates for the testing sets than the raw data (Table 3). It produced similar estimates to previous testing using highresolution images [1]. In contrast, it also yielded higher balanced accuracy estimates ( Table 3). The greater increase in F1-scores is the most modified outcome. In this case, the results are highly variable depending on the combination and not a clear strategy is visible. With "SGD", "swish" produced better results with some models but not others, and the same is observed for "Adam" and "relu". Overall, VGG16 and ResNet50 were the best models. In contrast, the Jason2 model was less successful and its accuracy stagnated several points below, ranging from 57.1 to 88.9 (Table 3). It also yielded lower estimates of balanced accuracy; however, it still outperforms its efficiency when using the nonaugmented dataset, by having improved the balanced accuracy by almost ten points in some of the combinations [1]. Within each model, the activation function plays a minor role in the outcome, and the optimizer seems more relevant, but less so than with the raw dataset. It is the model architecture that makes a difference. Table 3. Accuracy loss and F1-score of the models using the augmented dataset, according to combination of activation function and optimizer. The Bayesian networks supported these interpretations. Several algorithms provide different insights into the variable relationship network. The "BIC" (score = −217.1542), the "loglik" (score = −76.27329), the "bde" (score = −202.0743), and the "bdla" (score = −203.2623) hill-climbing score factors indicated different options. The "loglik" factor suggests that the model architecture has the major impact on all the other elements ( Figure 6); however, this model has the largest AIC (Akaike information criterion) score. The "bde" score factor suggests a more complex picture, in which the parameters affect the model performance. This is more specifically nuanced by the "bdla" factor, which indicates that although the model architecture has a major impact on the performance of the activation function and the resulting balanced accuracy, it is the optimizer that has a major role in model performance ( Figure 6). If we select the optimal network, resulting from the BIC score factor, it can be concluded that all these relationships are of interest, but in general, the only meaningful relationship is that between the accuracy and its impact on loss and the F1-score, underscoring that with the GAN-augmented dataset, the activation function and the optimizer do not create significant differences within each model architecture performance.

Activation Function
It could be argued, based on these results, that the augmented dataset contributed to building a more reliable model, because even if the general accuracy did not improve or became slightly lower, the balanced accuracy was significantly higher. Appl. Sci. 2021, 11, x FOR PEER REVIEW 10 of 13 It could be argued, based on these results, that the augmented dataset contributed to building a more reliable model, because even if the general accuracy did not improve or became slightly lower, the balanced accuracy was significantly higher.

Discussion
The augmented dataset contained 1657 images, in which the three classes appeared balanced: 606 tooth marks, 488 cut marks and 563 trampling marks. The original imbalanced model, based on a smaller raw data sample, classified correctly 92% of the testing set [1]; however, its F1-score (0.73) was substantially lower. Split by class, cut marks were the best classified (F1 = 0.97), followed by tooth marks (F1 = 0.80), but trampling marks were frequently misclassified (F1 = 0.42), most of the time as tooth marks. This creates uncertainty as to the power of the classifications because of the imbalanced dataset. The original number of cut marks in the sample was eight times bigger than the trampling dataset. Likewise, it was almost five times bigger than the tooth mark subsample. Data

Discussion
The augmented dataset contained 1657 images, in which the three classes appeared balanced: 606 tooth marks, 488 cut marks and 563 trampling marks. The original imbalanced model, based on a smaller raw data sample, classified correctly 92% of the testing set [1]; however, its F1-score (0.73) was substantially lower. Split by class, cut marks were the best classified (F1 = 0.97), followed by tooth marks (F1 = 0.80), but trampling marks were frequently misclassified (F1 = 0.42), most of the time as tooth marks. This creates uncertainty as to the power of the classifications because of the imbalanced dataset. The original number of cut marks in the sample was eight times bigger than the trampling dataset. Likewise, it was almost five times bigger than the tooth mark subsample. Data augmentation using the generation of new images is essential to balance each class dataset. This usually results in higher accuracy in classification and more reliability in classification probabilities. In the present work, the GAN-augmented sample and models have yielded a slightly lower global accuracy than when using just the raw data, but the balanced accuracy was systematically higher. The swish-Adam combination in the ResNet50 model yielded the most accurate (91.29%) architecture and the F1.score yielded the highest balanced accuracy (87%) ( Table 3); this latter was significantly higher than when using the raw data. One biasing feature that we observed in our augmented sample is that the generator created preferentially trampling images from the original sample which showed the least resemblance to either cut marks or tooth marks. This created an artificial sample where these originally minority marks became predominant in the augmented sample; hence the higher balanced accuracy.
Previously, the VGG16 model was used to preliminarily classify some controversial bone surface modifications from the archaeological record [1]. These marked bone specimens are of extraordinary importance because they could potentially attest some of the "first" traces of human butchery in the locations where they were found. These involved specimens from Bluefish Caves (18,000 B.P., Yukon, Canada), purportedly belonging to the earliest presence of humans in the American continent [30], Anjohibe (1400-2000 B.C.) [31], Itampolo (1100-1800 B.P.) [32] and Christmas River (>10,000 B.P.) [33] (Madagascar)interpreted as some of the earliest evidence of human presence on the island-, Dikika (3.2 Ma, Ethiopia)-potentially the first evidence of tone tool use-and Barranco León and Fuentenueva 3 (1.4 Ma, Orce, Spain)-presented as the oldest cut marks in Europe [34]. In order to assess architecture-model variability, here we used the most successful model (ResNet50) to tentatively classify some of these marks, bearing in mind that since these BSM were taken from published photographs not following the protocol applied to the experimental BSM, the conclusions are of limited value. Instead of using the swish-SGD combination, which yielded the highest accuracy (98.44%), we used the relu-SGD combination because its balanced accuracy was higher (80%) and then, less prone to misclassify classes. We only used a few of the archaeological marks that were interpreted by human experts as cut marks and that with the VGG16 model had previously been classified as non-anthropogenic [1]. The resulting classification did not vary from that obtained previously with the VGG16 model, using higher resolution images (Table 4). However, the probabilities became smaller (and so did the variance) because we were dealing with lower information images that also imparted some deformation over the original photographs.

Conclusions
The present study shows that when using image data augmentation, even if the resolution of the images is substantially reduced (which enhances computation), the accuracy can be balanced. For BSM, the augmented samples can be biasing if expanding the least common types of marks only because they are the ones that avoid confusion with the other categories. The application of the protocols described in the comparison of combinations of activation functions and optimizers to the artificially-augmented data also shows that protocols should be taken only as a baseline procedure, since what worked best in the same architectures with pre-augmented data, does not necessarily work best with augmented datasets. Although the augmented data enabled that the function/optimizer combination was virtually irrelevant in the final results, impacting them only in decimal modification, it also showed that the best hyper-parameter combination is contingent on the characteristics of each dataset. This also expands Wolpert's "no-free lunch" theorem [35] from model selection to hyper-parameter selection.