Geometric Morphometric Data Augmentation using Generative Computational Learning Algorithms

Abstract: The fossil record is notoriously incomplete and distorted, frequently conditioning the type of knowledge that can be extracted from it. This often leads to issues when performing complex statistical analyses, such as the classification tasks, predictive modelling and variance analyses used in Geometric Morphometrics. Here, different Generative Adversarial Network architectures are experimented with, testing the effects of sample size and domain dimensionality on model performance. Robust statistical methods were used for model evaluation. Each of the algorithms was observed to produce realistic data. Generative Adversarial Networks using different loss functions produced multidimensional synthetic data significantly equivalent to the original training data, while Conditional Generative Adversarial Networks were not as successful. The methods proposed are likely to reduce the impact of sample size and bias on a number of statistical learning applications. While Generative Adversarial Networks are not the solution to all sample-size-related issues, combined with other pre-processing steps these limitations may be overcome. This presents a valuable means of augmenting geometric morphometric datasets for greater predictive visualization.


Introduction
Geometric Morphometrics (GM) is a powerful multivariate statistical toolset for the analysis of morphology [1]. These methods are of growing importance in fields such as biology and physical anthropology, with many implications for evolutionary theory and systematics. GM applications employ two- or three-dimensional homologous points of interest, known as landmarks, to quantify geometric variance among individuals [1][2][3][4].
GM practices first project landmark configurations onto a common coordinate system. This process is carried out via a series of superimposition procedures, including scaling, rotation and translation, frequently known as Generalised Procrustes Analysis (GPA). GPA is a powerful technique that allows for the direct comparison of landmark configurations, quantifying minute displacements of individual landmarks in space [5,6]. These distortions and deformations can then be used to highlight geometric variations among organisms and can be visualised with ease. From these superimposed configurations, matrix operations from linear algebra can be performed to project each element under study as a single multidimensional (ℝⁿ) point in a newly constructed feature space. This procedure, known as Principal Components Analysis (PCA), is useful for dimensionality reduction and for converting landmarks into more manageable data for complex statistical applications [7,8].
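The GPA-plus-PCA pipeline described above can be sketched in a few lines of Python/NumPy. This is an illustrative simplification (the study itself used R for pre-processing), covering only translation, scaling and rotation removal followed by projection into PC feature space:

```python
import numpy as np

def gpa(configs, max_iter=100, tol=1e-8):
    """Generalised Procrustes Analysis on an array of landmark
    configurations, shape (n_specimens, n_landmarks, n_dims).
    Returns the superimposed (shape-space) coordinates."""
    X = np.asarray(configs, dtype=float).copy()
    # Remove position and size: centre each configuration and
    # scale it to unit centroid size.
    X -= X.mean(axis=1, keepdims=True)
    X /= np.linalg.norm(X, axis=(1, 2), keepdims=True)
    mean = X[0]
    for _ in range(max_iter):
        # Rotate every configuration onto the current mean shape
        # (orthogonal Procrustes solution via SVD).
        for i in range(len(X)):
            u, _, vt = np.linalg.svd(X[i].T @ mean)
            X[i] = X[i] @ (u @ vt)
        new_mean = X.mean(axis=0)
        new_mean /= np.linalg.norm(new_mean)
        if np.linalg.norm(new_mean - mean) < tol:
            break
        mean = new_mean
    return X

def pca_scores(X):
    """Project superimposed configurations into a PCA feature space."""
    flat = X.reshape(len(X), -1)
    flat = flat - flat.mean(axis=0)
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    return flat @ vt.T
```

Configurations that differ only in position, size and orientation collapse onto a single point in the resulting shape space, which is the property the GAN experiments below rely on.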
A wide array of techniques is known for different pattern recognition and classification tasks in GM. From one perspective, more traditional parametric and non-parametric multivariate statistical analyses can be performed to assess differences and similarities among sample distributions [7].
Likewise, generalised distances and group association probabilities can be used to compare groups of organisms and trends in variation and covariation [9]. Moreover, many popular classification tasks rely on parametric discriminant functions [10,11].
In more recent years, tasks in pattern recognition and classification have gained efficiency and precision through the implementation of Artificially Intelligent Algorithms (AIAs), reporting >90% accuracy in GM applications. Among these, the most popular AIAs for classification purposes in GM currently include Support Vector Machines (SVM) [12][13][14][15] and Artificial Neural Networks (ANN) [16][17][18][19][20]. Both algorithms present distinct advantages, especially in the processing of complex high-dimensional data. As opposed to traditional Linear and/or Partial Least-Squares Discriminant Analyses (LDA & PLSDA), SVMs and ANNs are less susceptible to underlying assumptions within model properties. SVMs, for example, are able to use numerous different kernel functions to overcome issues imposed by linearity [21,22]. ANNs, on the other hand, are highly versatile, achieving above-human performance in a multitude of real-life situations [22][23][24].
Nevertheless, each of these types of analyses is susceptible to a number of different problems, all of which can affect the reliability of the extracted data. From one perspective, numerous studies have focused on the error produced through data collection procedures, whether this error is induced by analyst experience, collection protocols or the definition of the landmark itself [25][26][27][28]. More often than not, however, the preservation rate of fossils results in the loss of landmarks, impeding many types of analyses [29,30].
The completeness of the fossil record is thus a major conditioning factor in archaeological and palaeontological GM analyses. Considering the number of available fossils for certain species, the construction of reliable datasets is difficult, resulting in sample bias. Statistical tests such as Canonical Variate Analysis (CVA), for example, are highly sensitive to small or imbalanced datasets [9]. Moreover, the impact of bias is directly proportional to the number of variables included in multivariate analyses [31]. Even when samples are balanced, in fields such as palaeoanthropology obtaining large sample sizes is often difficult, and thus the predictive capacity of discriminant models may fall significantly.

Data Augmentation
Resampling techniques in traditional statistics have had great success in providing more robust methods for test statistic approximations and p-value calculations. Tests requiring permutations, as well as more computationally efficient Monte Carlo simulations, have been standard procedure in statistical practice for over half a century [32,33]. Their versatility under both parametric and non-parametric assumptions makes handling imbalanced and skewed data much more reliable, while proving less sensitive to smaller samples [34]. Nevertheless, a critical issue when considering small sample sizes is an "insufficiency of information density" able to correctly provide a general overview of the population's distribution [35]. This issue becomes apparent when trying to classify new individuals. With insufficient knowledge of the true coverage of a domain, the interpretation of new information is much more difficult. In data science this phenomenon is usually known as overfitting in classification algorithms [24].
One statistical technique frequently used to overcome this issue is resampling with replacement, known as bootstrapping. Bootstrapping duplicates the data multiple times, creating a virtual population from a distribution sample [36,37]. As opposed to resampling techniques without replacement (e.g. permutation, cross-validation, jackknife), bootstrap procedures are efficient in inferential tasks, helping to simulate the general nature of the population. Nevertheless, neither of these resampling procedures, in truth, simulates new information. While they may be useful for inflating the dataset and providing enough information for a model to adjust its weights, overfitting is likely, as the space between data points can still be considered "uncharted territory".
In response, data scientists and specialists in AIAs propose the use of synthetically produced new data to overcome these problems [38]. While the use of synthetic "fake" data has drawn some scepticism from scientists, numerous experiments in predictive modelling have empirically shown how these synthetic datasets not only reduce overfitting, but actually produce an increase in accuracy [39]. This is achieved by creating new data that is "meaningful" to the real distribution, adapting the data that is already available [40]. These advances have had a major impact on scientific disciplines dedicated to computational learning, especially in the case of highly complex applications for computer vision [24]. One of the key AIAs responsible for this success is the Generative Adversarial Network (GAN).
GANs were originally presented as an unsupervised AIA capable of creating new data based on the training data provided [41]. In less than a decade, GANs have been efficiently incorporated into a wide variety of applications, especially in the fields of computer vision and image processing. A GAN consists of two neural networks trained simultaneously. The first model, known as the Generator, is trained to produce synthetic information which the second model, the Discriminator, evaluates for authenticity. The two models work in competition (i.e. adversarially), with the generator working to produce data that the discriminator is unable to classify as synthetic. The final product is a generator model capable of producing completely new data that is indistinguishable from the real training set. With the additional advantage of a neural network's non-linear internal configuration, GANs are highly efficient in mapping out any type of probability distribution.

Materials and Methods
This study presents an experimental protocol used to evaluate and assess different types of GANs for augmenting GM datasets. Through experimenting with different architectures, configurations and training strategies, this study aims to propose an optimal architecture for augmenting data of this type. In order to evaluate these results, both descriptive statistics and equivalency testing have been used.

Datasets
Experiments included within this study were performed on a total of three GM datasets. These datasets originated from experimental archaeology samples in taphonomy. Considering that the objective of this study is to observe the effects of generative learning for GM data augmentation, the origin of these datasets was considered unimportant. The reason behind this lies in how algorithms are trained using input data, irrespective of whether the raw landmark data were obtained from palaeoanthropological specimens, lithic tools or carnivore feeding samples. Additionally, each of the datasets was personally generated by the corresponding author, providing a means of controlling the origin of the information.
Each of the datasets consists of a mixture of manually placed landmarks (Type I, II or III) as well as some computational semilandmarks [3,4]. The three datasets include:
• Dataset 1 (DS1): canid tooth score dataset [15]. This dataset consists of 105 individuals from three different experimental carnivore feeding samples (labelled foxes, dogs and wolves). 3D models for data extraction were generated using a low-cost structured light surface scanner (David SLS-2). The topography of each 3D digital model was then used to extract 2D images where landmarks could be placed. Landmark data consist of a mixture of Type II and Type III 2D landmarks.
• Dataset 2 (DS2): scratch and graze trampling dataset [14]. This dataset consists of 60 individuals from two different experimental trampling mark samples (labelled scratches and grazes). Each of the elements under study was digitised employing a 3D Digital Microscope (HIROX KH-8700), using between 100x and 200x magnification. Collection of landmark data was then performed following a series of measurements that established a 3D coordinate system across the model. Landmark data consist of a mixture of Type II and Type III 3D landmarks.
• Dataset 3 (DS3): semi-landmark based tooth pit dataset [28]. This dataset consists of an adaptation of DS1 using 60 individuals from two carnivore feeding samples (labelled dogs and wolves). 3D models for data extraction were generated using a low-cost structured light surface scanner (David SLS-2). Landmark data consist of a mixture of 3D Type II landmarks and a mesh of semi-landmarks.
These three datasets were chosen considering the dimensionality of the corresponding feature space produced for GM analysis (ℝ¹⁴, ℝ³⁹ & ℝ⁶⁰ respectively). With each of these datasets presenting different dimensionalities, optimal GAN architectures could therefore be proposed so as to establish a standardised protocol, regardless of the target domain's ℝⁿ size.
These datasets were also chosen to observe the effect original sample size has on the accuracy of synthetic data.The latter was tested via minimum sample size calculations according to Cohen's d (power = 0.8, d = 0.8, α = 0.05, ratio = 1:1) [31].This established a minimum sample size for two-sample statistical comparisons of 26 individuals, rounded up to 30 for simplicity.In accordance with this calculation, experiments were performed by randomly sampling 30 real individuals and comparing them with 30 synthetic individuals.In datasets where larger samples were available, 60 real individuals were sampled and compared with 60 synthetic data points.
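The minimum sample size of 26 can be reproduced with a standard power calculation. The sketch below uses the normal approximation with a small-sample correction rather than whichever specific software the authors used:

```python
from math import ceil
from statistics import NormalDist

def min_n_per_group(d=0.8, power=0.8, alpha=0.05):
    """Approximate minimum n per group for a two-sided, two-sample
    t-test detecting effect size d (Cohen's d) at the given power.
    Normal approximation plus the z^2/4 small-sample correction."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_b = NormalDist().inv_cdf(power)           # 0.84 for power = 0.80
    n = 2 * ((z_a + z_b) / d) ** 2 + z_a ** 2 / 4
    return ceil(n)
```

With the parameters stated above (d = 0.8, power = 0.8, α = 0.05), this returns 26, matching the threshold that was then rounded up to 30.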

Baseline Geometric Morphometric Data Acquisition
Each of the datasets was prepared using traditional GM techniques, first performing a full Procrustes fit of landmark coordinates via GPA, followed by the extraction of multivariate features through PCA [7,8]. Considering that the objective here is to find the optimal algorithm for mapping out multidimensional distributions, differences in shape-size relationships were considered irrelevant. Analyses were therefore only performed using fully superimposed coordinates in shape feature space.
From here, PC scores were analysed, evaluating their dimensionality and the proportion of variance represented across each of the decomposed eigenvectors and their eigenvalues. Considering how the final eigenvalues begin to represent little or no variance within the landmark configuration, preference was given to those PC scores representing up to 95% of sample variance for statistical evaluations.
For the purpose of this study, GM pre-processing of samples was performed in the free statistical software R (https://www.r-project.org/, v.3.5.1, 64-bit).

Generative Adversarial Networks
A GAN is a Deep Learning (DL) architecture used for the synthesis of data via a generator model. GANs are fit to data using an unsupervised approach, where the generator is trained by competing with a discriminator that evaluates the authenticity of the synthetic data produced [24,41]. While the basic concept behind a GAN is relatively straightforward, the theory behind their configuration and training can be incredibly challenging [24,42,43,44,45].
To generate new data, the generator samples from a random Gaussian distribution (e.g. µ = 0, σ = 1), finding the best means of mapping this data onto the real sample domain. A fixed-length random vector is used as input, triggering the generative process. Once trained, this vector space can essentially be considered a compressed representation of the real data's distribution. This multidimensional vector space is most commonly referred to in DL literature as latent space [24,42].
The discriminator model takes as input the output of the generator. The discriminator can then be used to predict a class label (real or fake) for the generated data. In some cases, this model is referred to as a critic [46,47].
For the purpose of this study, multiple experiments were performed to define an optimal GAN architecture. These experiments followed standard DL protocol, finding the optimal neural network configurations by evaluating the effects of each hyperparameter on model performance. Summaries of the hyperparameters tested are included in Table 1. In addition to this, the extensive literature on "best practices" in GAN research and different heuristics for GAN hyperparameter selection were considered [42,43,45,48]. Among these, common "GAN hacks" were evaluated, including:
• Use of the Adam optimization algorithm (α = 0.0002, β1 = 0.5)
• Use of dropout in the generator with a probability threshold of 0.3
• Use of Leaky ReLU (slope = 0.2)
• Stacking hidden layers with increasing size in the generator and decreasing size in the discriminator.
For training, trials experimenting with the number of epochs and batch sizes were performed. The final values were chosen in accordance with the requirements of the model in order to reach acceptable stability.
While binary cross-entropy is typically the recommended loss function for training, this study experimented with alternatives, such as the Least Squares loss function (LSGAN) and two versions of the Wasserstein loss (WGAN). LSGAN was originally proposed as a means of overcoming small or vanishing gradients, which are frequently observed when using binary cross-entropy [44,46]. In LSGAN, the discriminator (D) attempts to minimise the loss (L) using the sum of squared differences between the predicted and expected values for real and fake data (eq. 1), while the generator (G) attempts to minimise this difference assuming the data is real (eq. 2):

L(D) = ½·E_x[(D(x) − 1)²] + ½·E_z[(D(G(z)))²]   (eq. 1)

L(G) = ½·E_z[(D(G(z)) − 1)²]   (eq. 2)

This results in a greater penalisation of larger errors (E denoting the expected value), which forces the model to update weights more frequently, therefore avoiding vanishing gradients [49]. WGAN, on the other hand, is based on the theory of the Earth Mover's distance [46], calculating the distance between two probability distributions in terms of the cost of converting one distribution into the other (eqs. 3 & 4):

L(D) = E_z[D(G(z))] − E_x[D(x)]   (eq. 3)

L(G) = −E_z[D(G(z))]   (eq. 4)

WGAN additionally uses weight constraints (a hypercube of [-0.01, 0.01]) to ensure that the discriminator is a 1-Lipschitz function. In certain cases, however, this has been reported to produce some undesired effects [47]. As an alternative, a proposed adaptation in the form of the gradient penalty WGAN (WGAN-GP) includes the same loss for the generator (eq. 4) but a modified discriminator loss (eq. 5) with no weight constraints [47,50]:

L(D) = E_z[D(G(z))] − E_x[D(x)] + λ·E_x̂[(‖∇x̂ D(x̂)‖₂ − 1)²]   (eq. 5)

where x̂ is sampled along straight lines between pairs of real and generated points. For both Wasserstein loss functions to work, the output of D requires a linear activation function. Finally, optimisation tests were performed using Adam (α = 0.0002, β1 = 0.5) and RMSprop (α = 0.00005) [44,47,51,52].
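As an illustration, the three loss families can be written directly against the discriminator's raw (linear) outputs. This NumPy sketch follows the standard published forms of the LSGAN, WGAN and WGAN-GP losses, with the gradient-penalty term expressed as a function of precomputed gradient norms:

```python
import numpy as np

def lsgan_losses(d_real, d_fake):
    """LSGAN losses: d_real / d_fake are the discriminator's raw
    outputs on batches of real and generated data."""
    loss_d = 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)
    loss_g = 0.5 * np.mean((d_fake - 1.0) ** 2)
    return loss_d, loss_g

def wgan_losses(d_real, d_fake):
    """WGAN critic/generator losses; the critic output is an
    unbounded score, not a probability."""
    loss_d = np.mean(d_fake) - np.mean(d_real)
    loss_g = -np.mean(d_fake)
    return loss_d, loss_g

def wgan_gp_critic_loss(d_real, d_fake, grad_norms, lam=10.0):
    """WGAN-GP critic loss: grad_norms are the L2 norms of the
    critic's gradients at points interpolated between real and
    fake samples."""
    gp = lam * np.mean((grad_norms - 1.0) ** 2)
    return np.mean(d_fake) - np.mean(d_real) + gp
```

In a training loop these values would be minimised by the respective model's optimiser (Adam or RMSprop, as tested above).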
GANs were trained on scaled PCA feature spaces with 64-bit values ranging between -1 and 1. This scaling procedure was performed to boost neural network performance and optimization by helping to reduce the size of weight updates [23]. For these experiments, GANs were trained on all data within each dataset, regardless of label. This approach was chosen to directly observe how GANs handle this type of input data before considering more complex applications including sample labels (see section 3.4).
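A minimal version of this scaling step might look as follows. A per-feature min-max mapping onto [-1, 1] is assumed here, since the paper does not specify the exact implementation; the inverse transform is kept so that synthetic individuals can be mapped back into PCA space:

```python
import numpy as np

def scale_to_tanh_range(scores):
    """Map each PCA feature (column) linearly onto [-1, 1],
    matching the output range of a tanh-activated generator.
    Returns the scaled data plus the per-column minima/maxima
    needed to invert the transform."""
    scores = np.asarray(scores, dtype=np.float64)
    lo = scores.min(axis=0)
    hi = scores.max(axis=0)
    scaled = 2.0 * (scores - lo) / (hi - lo) - 1.0
    return scaled, lo, hi

def unscale(scaled, lo, hi):
    """Invert the scaling to recover PCA-space coordinates."""
    return (np.asarray(scaled) + 1.0) / 2.0 * (hi - lo) + lo
```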

Conditional Generative Adversarial Networks
The final GAN trials adapted the optimally defined model of section 3.1 to Conditional GAN tasks (CGAN). A CGAN is an extension of the traditional GAN that incorporates class labels into the input, thus conditioning the generation process. Class labels are encoded and used as input alongside both the latent vector and the original vector in order for the GAN to learn targeted distributions within the dataset [53]. This can be done by using an embedding layer and concatenating the embedded information with the original input [54]. It is recommended that the embedding layer be kept as small as possible [45], with some of the original implementations of CGANs using an embedding layer with a size of only ≈5% of the original flattened generator's output. Because 5% of our largest dataset would still have been <1, experiments were performed with different sized embedding layers to find the optimal configuration. The best results were obtained using an embedding layer of size ⌈¼·n⌉, where n corresponds to the number of dimensions in ℝⁿ for each of the targeted feature spaces.
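The sizing rule and the conditioning mechanism can be sketched as follows. The fixed random lookup table used here is a stand-in for the trainable embedding layer, purely to show the shapes involved:

```python
import math
import numpy as np

def embedding_size(n_features):
    """Embedding size used here: ceil(n/4) for an R^n feature space."""
    return math.ceil(n_features / 4)

def conditional_input(latent, labels, n_classes, n_features, seed=0):
    """Concatenate each latent vector with a label embedding drawn
    from a lookup table (one row per class), mimicking an embedding
    layer followed by concatenation with the latent input."""
    rng = np.random.default_rng(seed)
    table = rng.normal(size=(n_classes, embedding_size(n_features)))
    return np.concatenate([np.asarray(latent), table[np.asarray(labels)]], axis=1)
```

For DS3 (ℝ⁶⁰) with an ℝ⁵⁰ latent space, each conditioned generator input would therefore have 50 + 15 = 65 values.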
For a comparison of GAN and CGAN performance, these models were used separately to augment DS3. This dataset was chosen considering it presents the most complex feature space to map, is the most balanced (when compared with DS1), and is the most difficult to study (seeing how DS2 presents the highest natural separability).

Synthetic Data Evaluation
The evaluation of GANs is a complex issue with little general agreement on suitable evaluation metrics [43]. Considering how most practitioners in GAN research work with computer vision applications, many papers use manual inspection of images to evaluate synthesized data [55]. For the synthesis of numeric data, this is evidently a very subjective means of evaluating information. Likewise, the majority of metrics used in the GAN literature almost exclusively focus on the evaluation of image data [54,56], which is of little value to the present study.
Multidimensional numeric data are incredibly difficult to visualise, meaning that precise human inspection of these data is impossible. To overcome this, a number of statistical metrics were adopted for GAN evaluation.
Firstly, the homogeneity of the GM data was tested. In most traditional cases, the elimination of size and preservation of allometry in GPA is known to normalize data [57]. Nevertheless, this assumption does not always hold true. The first logical step was therefore to evaluate distribution homogeneity and normality via multiple Shapiro tests. Synthetic distributions were then compared with the real data to assess the magnitude of differences and the significance of overlapping. For this, a "Two One-Sided" equivalency Test (TOST) was performed. TOST evaluates the magnitude of similarities between samples by using upper (εS) and lower (εI) equivalence bounds that can be established via Cohen's d. This assesses H0 and Ha using an α threshold of p < 0.05, with Ha indicating significant similarity among samples [31,58,59,60,61]. The test statistics used within TOST to assess these similarities were dependent on distribution normality. These varied between the traditional parametric method using Welch's t-statistic [62] and a trimmed non-parametric approach using Yuen's robust t-statistic [63,64]. To differentiate between the two, from this point onwards the non-parametric robust TOST will be referred to as rTOST.
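The parametric (Welch-based) TOST can be sketched as follows; this is a simplified illustration with symmetric bounds on the raw mean difference, and the robust rTOST variant with trimming is not shown:

```python
import numpy as np
from scipy import stats

def tost_welch(x, y, eps):
    """Two One-Sided Tests for equivalence of two sample means using
    Welch's t-statistic, with bounds [-eps, +eps] on the mean
    difference. Returns the TOST p-value; p < 0.05 supports
    equivalence (Ha)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    nx, ny = len(x), len(y)
    vx, vy = x.var(ddof=1), y.var(ddof=1)
    se = np.sqrt(vx / nx + vy / ny)
    # Welch-Satterthwaite degrees of freedom
    df = (vx / nx + vy / ny) ** 2 / (
        (vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    diff = x.mean() - y.mean()
    # One one-sided test against each bound; the larger p decides.
    p_lower = 1 - stats.t.cdf((diff + eps) / se, df)   # H0: diff <= -eps
    p_upper = stats.t.cdf((diff - eps) / se, df)       # H0: diff >= +eps
    return max(p_lower, p_upper)
```

Two samples drawn from the same distribution yield a small TOST p-value (equivalence supported), whereas a shifted sample yields a large one.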
More traditional univariate descriptive statistics were also employed. For distributions matching Gaussian properties, sample means and standard deviations were calculated. These were accompanied by calculations of sample skewness and kurtosis. For significantly non-Gaussian distributions, robust statistical metrics were used instead. In these cases, measurements of central tendency were established using the sample median (m), while deviations were calculated using the square root of the Biweight Midvariance (BWMV) (eqs. 6-9) [28,65,66]:

MAD = median(|xᵢ − m|)   (eq. 6)

uᵢ = (xᵢ − m) / (9·MAD)   (eq. 7)

aᵢ = 1 if |uᵢ| < 1, 0 otherwise   (eq. 8)

BWMV = n·Σ aᵢ(xᵢ − m)²(1 − uᵢ²)⁴ / (Σ aᵢ(1 − uᵢ²)(1 − 5uᵢ²))²   (eq. 9)

Robust skewness and kurtosis values were calculated using trimmed distributions. Trims were established using Interquartile Ranges (IR) [65], with confidence intervals of p = [0.05:0.95]. Both the IR range and the trimmed skewness and kurtosis values were reported.
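A direct NumPy implementation of the BWMV (eqs. 6-9 as referenced above) might look like:

```python
import numpy as np

def bwmv(x):
    """Biweight Midvariance: a robust analogue of the sample
    variance; its square root replaces the standard deviation
    for non-Gaussian distributions."""
    x = np.asarray(x, dtype=float)
    m = np.median(x)                        # sample median (m)
    mad = np.median(np.abs(x - m))          # median absolute deviation
    u = (x - m) / (9.0 * mad)               # standardised distances
    a = (np.abs(u) < 1).astype(float)       # indicator: down-weight tails
    num = len(x) * np.sum(a * (x - m) ** 2 * (1 - u ** 2) ** 4)
    den = np.sum(a * (1 - u ** 2) * (1 - 5 * u ** 2)) ** 2
    return num / den
```

For Gaussian data, the square root of the BWMV closely tracks the standard deviation, while remaining stable in the presence of outliers, which is precisely the property exploited for the non-Gaussian PC score distributions.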
Finally, wherever possible, correlations were calculated to compare the effect of hyperparameters on the quality of synthesized data. For homogeneous data, the parametric Pearson test was used [67], whereas inhomogeneous data were tested using the non-parametric Kendall τ rank-based test [68]. Considering that neural networks are stochastic in nature, these correlations were performed using data obtained from multiple training runs of each GAN to ensure a more robust calculation.

Results
All three datasets analysed present highly inhomogeneous multivariate distributions (p < 2.2e-16). Univariate comparisons (Table 2), however, present a mixture of both inhomogeneous and homogeneous distributions across PC1 and PC2, where the majority of variance is represented. GAN failure through mode collapse was frequently observed throughout most of the initial trials, characterised by an intense clustering of points with little to no variation in feature space (Fig. 1). Qualitatively, this type of failure is easily diagnosed by visual inspection of graphs. Quantitatively, mode collapse can be characterised by a dramatic decrease in variance, seen through deviation metrics.
To provide an example, Figure 1 presents the use of a vanilla GAN trained on DS2. At first, training can be seen to start well, with the closest (yet not optimal) approximation to the target domain's median (Fig. 1a). Nevertheless, little variation is present (BWMV of PC1 = 0.14, target BWMV = 0.38).
As training continues, the algorithm is unable to find the correct median, and performance is seen to deteriorate (Fig. 1b). This presents an exponential decrease in the variance of synthetic data (Fig. 1b PC1 BWMV = 0.02, Fig. 1c PC1 BWMV = 0.0002). Likewise, through mode collapse, the generator is unable to map the true normality of the distribution, generating increasingly normal data in PC1 (Shapiro w > 0.98, p > 0.56).
Replacing Leaky ReLU with tanh activation functions resulted in a significant improvement of generated sample medians (difference in median for Leaky ReLU = 0.78; tanh = 0.26), yet with little improvement in BWMV.
To overcome mode collapse, kernel initializers and batch normalization algorithms were incorporated into both the generator and the discriminator. Batch normalization was included before activation, presenting an increase in BWMV. Initializers required careful adjustment, with small standard deviation values resulting in mode collapse. Additional experiments found the discriminator to require a more intense initializer (σ = 0.1) than the generator (σ = 0.7), while optimal results were obtained using a random normal distribution. Such a configuration allows the generator more room to adjust its weights, finding the best way of reaching the target domain's median and absolute deviation while preventing the discriminator from learning too quickly.
Experiments adjusting hidden layer densities found symmetry between the generator and the discriminator to be unnecessary. The generator was seen to require more hidden layers in order to learn the distributions efficiently, while a larger density than the output in the last hidden layer also produced an increase in performance. The discriminator worked best with just two hidden layers.

Optimal Architectures
To optimally adjust these findings across all three datasets, the best GAN architecture that presented no mode collapse was obtained using three hidden layers in the generator and two hidden layers in the discriminator. The sizes of the hidden layers are conditioned by the size of the target feature space (Fig. 2). To use the example of the largest target domain (DS3 = ℝ⁶⁰), the generator is programmed so that the first hidden layer (Gh1) is a quarter of the size of the target vector (in this case Gh1 = 60·¼ = 15). If this calculation produces a decimal value (e.g. 59/4 = 14.75), the ceiling of this number is taken (⌈¼·59⌉ = 15). This is followed by a layer half the size of the target vector (Gh2 = ⌈½·60⌉ = 30). The generator's final hidden layer is the size of the vector plus one quarter (Gh3 = ⌈¼·60⌉ + 60 = 75). The discriminator, on the other hand, is composed of two hidden layers, with the first hidden layer being the same size as Gh2, and the second hidden layer equivalent to Gh1. Each hidden layer is followed by a batch normalisation algorithm before being activated using the tanh function. Tanh works best considering the target domain is scaled to values between -1 and 1. Other components of the algorithm include a dropout layer (p = 0.4) prior to the discriminator's output and random normal kernel initializers in both models (discriminator σ = 0.1, generator σ = 0.7).
Figure 2. The output of the generator and the input of the discriminator are represented by the number of dimensions n of the ℝⁿ target feature space. The latent vector (L) input for the generator must be adjusted according to the dimensionality of the target feature space. Hidden neurons (h) in layers hₙ have a density (d) that is also conditioned by the shape of the target distribution. Finally, the discriminator has an additional dropout layer (red) with a threshold of p = 0.4.
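The layer-sizing rule above can be expressed compactly. This sketch covers the sizing arithmetic only, not a full network definition:

```python
import math

def gan_layer_sizes(n):
    """Hidden-layer sizes for a target feature space R^n, following
    the rule described above: generator layers of ceil(n/4),
    ceil(n/2) and n + ceil(n/4); the discriminator mirrors the
    generator's second and first hidden layers."""
    gh1 = math.ceil(n / 4)
    gh2 = math.ceil(n / 2)
    gh3 = n + math.ceil(n / 4)
    return {"generator": (gh1, gh2, gh3), "discriminator": (gh2, gh1)}
```

For DS3 (n = 60) this yields generator layers (15, 30, 75) and discriminator layers (30, 15), as in the worked example above.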
The optimal batch size was found to be 16. This allowed the discriminator enough data to objectively evaluate performance and thus resulted in more efficient weight updates for the generator. The number of epochs, however, was highly dependent on the number of individuals used for training. Finally, the dimensionality of latent space was also found to be conditioned by the size of the target domain, as explained below.

Experiments with Dimensionality and Sample Size
Initial trials with latent space found ℝ⁵⁰ to produce the best results on average, especially in the case of DS2 (results of which have already been reported in Table 2). Nevertheless, interesting patterns emerged when experimenting with larger and smaller latent vector inputs. Starting with the case of the smallest target domain (DS1, Table 4), a significant negative correlation was detected when comparing rTOST p values with the size of the latent vector over numerous runs (Kendall's τ = -0.44, p = 0.001). This is also true when considering rTOST absolute difference values (τ = -0.41, p = 0.003). This correlation suggests that larger latent dimensionalities (ℝⁿ) work best with smaller target domains. When training on larger feature spaces (e.g. DS3, Table 5), correlations proved insignificant for both rTOST p values (Kendall's τ = -0.21, p = 0.13) and absolute difference calculations (τ = -0.29, p = 0.15). Nevertheless, while correlations remain insignificant, smaller latent vectors were seen to create more predictable and stable data (Fig. 3).
In most experiments, 400 epochs were considered enough for GANs to produce realistic data. Moreover, the best results of each GAN began appearing after approximately 100-130 epochs. Performance significantly decreased, however, when training on the same number of epochs using less data. To test this, subsets of each dataset were taken for experimentation (e.g. 30 out of 60 samples from DS2, Table 6). In all cases, significant correlations were detected, finding smaller datasets to need more training time in order to obtain optimal results (Pearson's r = 0.65, p = 0.0005). Likewise, LSGAN appeared to be the model least affected by dataset size, producing the most realistic distributions in each of the cases (Table 6). While training GANs using 400 epochs is still able to produce realistic data on small datasets, increasing this number to 1000 produces a significant improvement in results.

General GAN Performance
All three GANs are highly successful in replicating sample distributions, effectively augmenting each of the distributions without too much distortion (Figure 4, Tables S1-S6). While evaluating standalone synthetic data creates some confusion, seen in some deviations of synthetic central tendency values and IR intervals (Tables S1, S3 & S5), the true value of GANs is observed when considering the augmented sample as a whole (Tables S2, S4 & S6).
Figure 4. Plots of distributions both before (black) and after (red) GAN data augmentation, using the best performing models as described in Table 3. Descriptive statistics for each are included in Tables S2, S4 and S6.
In most cases, it can be seen how even the worst performing algorithms are able to maintain the central tendency of samples while boosting the variance represented. It is also important to highlight that, while in some cases the central tendency can be seen to deviate slightly from the original distribution (e.g. LSGAN on DS2, Table S2), this is normally only true of one PC score, and the difference remains insignificant (rTOST p < 0.05). Some algorithms are also seen to affect the normality of sample distributions, creating distortion that is reflected in increased sample skewness. Nevertheless, these distortions are still unable to modify the general magnitude of similarities between synthetic and real data.
The greatest value of GANs can therefore be seen in increases in overall sample variance without significant distortion of the real sample's distribution. Deviation values and IR intervals increase, representing more variability without significantly shifting central tendency and without generating outliers. This shows how each algorithm is able to essentially "fill in the gaps" of each distribution while staying true to the original domain.
If GAN performance is compared with more traditional augmentation procedures, such as the bootstrap, GANs can be seen to smooth out the distribution curve (Fig. 5), creating a more general and complete mapping of the target domain. Bootstrapping procedures, on the other hand, tend to exaggerate gaps in the distribution. This is mostly characterised by noticeable modifications to sample kurtosis while maintaining the general variation (Table 7).
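The contrast with bootstrapping can be seen directly: a bootstrap sample can only repeat values already observed, so gaps in the empirical distribution persist. The following is an illustrative numpy sketch, not the study's code.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
real = rng.normal(0, 1, 30)   # stand-in for a single PC score column

# Classic bootstrap: sampling with replacement from the observed values.
boot = rng.choice(real, size=100, replace=True)

# Because bootstrapping repeats observed values only, the regions
# between data points remain empty and shape statistics such as
# kurtosis can drift from the source sample.
print(stats.kurtosis(real), stats.kurtosis(boot))
```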

Conditional GAN Performance
CGAN presented limited success when augmenting datasets, with only the Wasserstein Gradient Penalty loss succeeding in overcoming mode collapse. Nevertheless, CGAN was still able to generate data with insignificant differences (Table 8), successfully augmenting the targeted datasets (Table 9 & Figure 6).
A closer look at CGAN performance, however, reveals that, while the magnitude of differences between synthetic and real data is insignificant, CGAN distorts the original distribution to a greater degree (Fig. 6). In both samples, CGAN deviates greatly from the target central tendency (Table 9) and appears to shift the general skew of the distribution (Fig. 6).
When GANs are instead used to augment each of the samples separately, the generated data is arguably truer to the original domain. This is not to say that CGANs are unable to augment feature spaces successfully; with the right configuration, CGANs are likely to reach results similar to those of traditional GANs. This, however, goes beyond the scope of the present study.

Discussion
Many algorithms require large amounts of data to extract information efficiently, a task which is particularly difficult when considering data derived from the fossil record. To confront this issue, this study presents a new integration of AIAs into the archaeological and palaeontological sciences.
Here, GANs have been shown to be a new and valuable tool for the modelling and augmentation of GM data. Moreover, these algorithms can be employed on a number of different types of datasets and applications, whether handling palaeoanthropological, biological, taphonomic or lithic specimens via GM landmark data.
To demonstrate this latter point, applying typical geometric morphometric classification techniques to the O'Higgins and Dryden [69] hominoid skull dataset, the present study is able to increase the balanced accuracy of traditional LDA by up to ca. 5% (Fig. 7a), with a significant increase in generalisation (Fig. 7b & c). For this demonstration, LDA was trained using a traditional approach [11], as well as an augmented approach based on Machine Teaching [40]. Applying Machine Teaching, with 100 realistic synthetic points per sample used for training and the original data reserved for testing, helps the generalisation process (Fig. 7b) while providing clearer boundaries for each of the sample domains (Fig. 7c).
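The Machine Teaching evaluation described above (train on synthetic, test on real) can be sketched as follows. The Gaussian blobs here are purely hypothetical stand-ins for GAN-generated and original PC scores, not the hominoid data.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

def blobs(n, centres, scale=0.5):
    """Generate n points per class around hypothetical class centres
    in a 2D PC space (illustrative stand-in data)."""
    X = np.vstack([rng.normal(c, scale, size=(n, 2)) for c in centres])
    y = np.repeat(np.arange(len(centres)), n)
    return X, y

centres = [(-2, 0), (2, 0), (0, 2)]        # three well-separated taxa
X_syn, y_syn = blobs(100, centres)         # "synthetic" training points
X_real, y_real = blobs(30, centres)        # "original" test specimens

# Machine Teaching-style split: fit LDA on synthetic, score on real data.
lda = LinearDiscriminantAnalysis().fit(X_syn, y_syn)
score = lda.score(X_real, y_real)
print(score)
```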
It is important to point out, however, that this is not the solution to all sample-size related issues, and a number of components have to be discussed before more advanced applications can be carried out.
Missing data and the limited availability of fossil finds are a major handicap in prehistoric research. This is increasingly relevant when considering older fossils, such as individuals of the Australopithecus, Paranthropus and early Homo genera. In a number of cases, for example, the representation of Australopithecine or Homo erectus/ergaster specimens may not even surpass 10 individuals [70-73, inter alia], while Homo sapiens specimens are abundant. In these cases, and in accordance with the data presented here, GANs would not be able to successfully augment the targeted minority distributions from scratch. Other options, however, could entail the use of pre-processing algorithms, such as variations of the Synthetic Minority Oversampling Technique (SMOTE & Borderline-SMOTE) [73][74][75] or the adapted Adaptive Synthetic Sampling (ADASYN) [76]. Both SMOTE and ADASYN are useful, easy-to-implement algorithms that augment minority samples in imbalanced datasets. SMOTE generates synthetic data in the feature spaces that join data points (e.g. according to k-nearest-neighbour theory), thus filling in regions of the target domain [73][74][75]. ADASYN takes this a step further by modelling sample distributions based on data density [76]. Both are valuable algorithms that have become popular in imbalanced learning tasks, generally improving predictive model generalisation. Nevertheless, their application should be approached conservatively.
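The interpolation idea behind SMOTE can be sketched in a few lines of numpy. This is a simplified, single-class version for illustration; production implementations such as imbalanced-learn add class handling and boundary cases.

```python
import numpy as np

def smote_like(X, n_new, k=5, seed=0):
    """SMOTE-style oversampling sketch: new points are drawn at random
    positions along the segments joining each sample to one of its
    k nearest neighbours (Euclidean distance on PC scores assumed)."""
    rng = np.random.default_rng(seed)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        d = np.linalg.norm(X - X[i], axis=1)
        nn = np.argsort(d)[1:k + 1]          # skip the point itself
        j = rng.choice(nn)
        lam = rng.random()                   # position along the segment
        new.append(X[i] + lam * (X[j] - X[i]))
    return np.array(new)

rng = np.random.default_rng(1)
X = rng.normal(0, 1, (20, 3))                # hypothetical minority sample
synthetic = smote_like(X, 50)
```

A conservative use would be to raise minority classes only to the suggested n = 30 threshold before any GAN training, rather than relying on interpolation for the full augmentation.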
Preliminary experiments within this study found that resampling via bootstrapping prior to the training of GANs resulted in severe mode collapse. This can be explained theoretically by the manner in which bootstrapping over-inflates the domain, highlighting very specific regions which the model then learns from. This results in overfitting, as the model repeatedly learns to map the same value multiple times, boosting the probability of mode collapse through an enhanced lack of variation in the original training set. Considering how SMOTE and ADASYN produce more "meaningful" data [40,73], these algorithms are more likely to aid the training process rather than produce the adverse effect. Nevertheless, overuse of SMOTE/ADASYN is likely to have an effect similar to bootstrapping, where linear regions of feature space between data points are enhanced while other regions are left empty.
Consequently, the current study proposes that a conservative use of SMOTE or ADASYN variants prior to the training of GANs may be able to boost performance on overly scarce datasets (e.g. the cases of [70][71][72][73]). This practice would augment minority samples to a suitable threshold (n = 30), preparing the dataset for more complex generative modelling and enabling improved generalisation of any predictive models used in subsequent analyses.
From a similar perspective, Bayesian inference algorithms such as Markov Chain Monte Carlo (MCMC) and Metropolis-Hastings samplers are also known to effectively model multiple types of probability distributions [77][78][79]. In some cases, it may be possible to use these approaches to sample from the probability distribution at hand and produce simulated information from the target distribution that would essentially be more realistic than simple bootstrap approaches. Further research into how these approaches may be applied could provide powerful insight into GAN alternatives for different types of numerical data in GM.
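As a minimal illustration of the Metropolis-Hastings idea, the sketch below draws samples from a normal density fitted to a hypothetical PC score. A real GM application would instead target a density estimated from the observed data.

```python
import numpy as np
from scipy import stats

def metropolis(logpdf, x0, n, step=0.5, seed=0):
    """Random-walk Metropolis sampler (illustrative sketch): propose a
    Gaussian step, accept with probability exp(lp_prop - lp)."""
    rng = np.random.default_rng(seed)
    x, lp, out = x0, logpdf(x0), []
    for _ in range(n):
        prop = x + rng.normal(0, step)
        lp_prop = logpdf(prop)
        if np.log(rng.random()) < lp_prop - lp:   # accept/reject
            x, lp = prop, lp_prop
        out.append(x)
    return np.array(out)

# Target: a normal density with hypothetical fitted parameters.
mu, sd = 0.0, 1.0
samples = metropolis(lambda v: stats.norm.logpdf(v, mu, sd), 0.0, 5000)
```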
In the general context of computational modelling, common criticisms of neural network applications in archaeology and palaeoanthropology argue that GM datasets are generally insufficient for the training of AIAs. This is based on the fact that most DL algorithms require much more data to avoid overfitting. From this point of view, why would training a GAN on so little data be any different? The present study shows that this is not an issue, considering how, with only 30 individuals, GANs are still able to produce highly realistic synthetic data (LSGAN rTOST p = 3.8e-22).
In common DL literature, state-of-the-art models are reported to obtain ≈80% accuracy when trained on thousands to millions of specimens [24]. It is important to consider, however, that in the majority of these cases AIAs are trained on images (i.e. computer vision applications). To provide an example, Karras et al. [48] present a GAN capable of producing hyper-realistic fake images of people's faces, building from a subset of the CelebA-HQ dataset using ≈30,000 images. Two main components must be considered in order to understand why such a large dataset is required for their model:
• Karras et al. [48] present a GAN capable of producing high resolution 1024x1024 pixel RGB images. In computer vision applications, each image is conceptualised as a multidimensional numeric matrix (i.e. a tensor). Each of the numbers within the tensor can essentially be considered a variable, resulting in a dataset of approximately 3 million variables per individual photo.
• In order to efficiently map out these 3 million numeric values, the featured GAN uses progressively growing convolutional layers (nº layers ≈ 60) with 23.1 million adjustable parameters.
• The present study, in contrast, uses feature spaces that have already undergone dimensionality reduction, derived from GM landmark data. In the case of the largest dataset, this results in a target vector of 60 variables to be generated. The present study additionally only uses fully connected layers with no convolutional filters, resulting in a model of <11,000 adjustable parameters.
A GAN targeting 3 million values with 23.1 million parameters would thus require a far larger dataset than one targeting 60 values with 11 thousand parameters, explaining why GAN convergence is still possible with just 30 specimens. Regardless of the mathematics behind DL theory, the statistical results presented here provide enough empirical evidence to argue the value of the proposed GAN with as little as 30 individuals. Nevertheless, even in cases where datasets are too scarce for GANs to be developed from scratch, pre-trained models can be adjusted to different domains via multiple DL techniques. This arguably opens up new possibilities for the incorporation of Transfer Learning into GM [24].
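The scale difference between the two settings is easy to verify with back-of-the-envelope arithmetic:

```python
# Values per training example in each setting.
image_values = 1024 * 1024 * 3   # RGB tensor in Karras et al. [48]
gm_values = 60                   # PC scores in the largest dataset here

print(image_values)              # 3,145,728: roughly 3 million variables
print(image_values // gm_values) # ≈52,000 times more target variables

# Adjustable parameters per model: 23.1M (convolutional) vs <11k (dense).
print(23_100_000 / 11_000)       # ≈2,100 times more parameters
```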
Finally, it is important to highlight that no absolute protocol can be established for generative modelling of any type. DL practitioners are usually required to adapt their model to the dataset at hand, using the best practices established in other studies as a baseline from which to work. Under this premise, recommendations for the augmentation of GM datasets using GANs can be listed as follows:
• Best results are obtained when scaling the target domain to values between -1 and 1.
• Hidden layer densities should be adjusted according to the number of dimensions within the target domain (Fig. 2). Tanh activation functions in both the generator and the discriminator are recommended.
• Dropout, batch normalization and kernel initializers (discriminator σ = 0.1, generator σ = 0.7) are recommended to regulate the learning process and avoid mode collapse.
• The Adam optimization algorithm is recommended when using Least-Square loss, while RMSProp is more efficient when using the Wasserstein (WGAN or WGAN-GP) function. A minimum batch size of 16 obtains the best results.
• LSGAN is recommended when training data is limited, increasing the number of epochs to at least 1000.
• WGAN and WGAN-GP work best on larger datasets, where approximately 400 epochs are usually enough to produce realistic data.
• The smaller the target domain, the larger the latent vector required for generator input.
• For conditional augmentation, optimal results are obtained by training GANs on each sample separately, rather than using CGANs.
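The first recommendation above (scaling the target domain to values between -1 and 1) amounts to a per-dimension min-max transform, sketched here in numpy. The helper name is our own, not from the study.

```python
import numpy as np

def scale_to_tanh_range(X):
    """Min-max scale each feature of the target domain to [-1, 1],
    matching the output range of a tanh generator."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return 2 * (X - lo) / (hi - lo) - 1

# Example: PC scores with arbitrary ranges.
rng = np.random.default_rng(0)
X = rng.normal(0, 5, size=(30, 4))
Xs = scale_to_tanh_range(X)
print(Xs.min(axis=0), Xs.max(axis=0))  # each column spans exactly [-1, 1]
```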

Conclusions
To the authors' knowledge, this is the first comparative study in DL and GM using GANs for high dimensional numeric simulations, further employing advanced descriptive statistical metrics for evaluation. While augmented data is by no means a substitute for real data, real-life DL practices and applications have shown "meaningful" synthetic data to significantly increase the confidence and power of statistical models. In many cases, this has even been seen to exceed human-level precision.
While GANs can be difficult to implement, the complexity of these AIAs should not be a cause for discouragement. There currently exists a wide range of literature and helpful guides dedicated to teaching scientists about AIA development, even for those with no background in mathematics or applied statistics. With platforms such as ScienceDirect (https://www.sciencedirect.com/) reporting, in 2019 alone, over 3000 papers including the term Deep Learning (in keywords, title or abstract), and ca. 6000 for Machine Learning, AI can be considered one of the most popular lines of research in modern science. This presents a promising future for applications in archaeological and palaeontological research.

Figure 2 .
Figure 2. Descriptive figure presenting the optimal GAN architecture for geometric morphometric data augmentation. Input (I) and output (O) neurons are represented in black, with bias (b) in green. The output of the generator and the input of the discriminator are represented by the n number of dimensions in the ℝ n dimensional target feature space. The latent vector (L) input for the generator must be adjusted according to the dimensionality of the target feature space. Hidden neurons (h) in layers h n have a density (d) that is also conditioned by the shape of the target distribution. Finally, the discriminator has an additional dropout layer (red) with a threshold of p = 0.4.


Figure 4 .
Figure 4. Plots of distributions both before (black) and after (red) GAN data augmentation, using the best performing models as described in Table 3. Descriptive statistics for each are included in Tables S2, S4 and S6.

Figure 5 .
Figure 5. Histograms and scatter plots of augmented DS3 using bootstrap and GAN.New points are marked in red.

Figure 7 .
Figure 7. Example of the augmentation (LSGAN x100) of O'Higgins and Dryden's hominoid dataset comparing gorilla (Gorilla gorilla), chimpanzee (Pan troglodytes) and orangutan (Pongo pygmaeus) skulls [69]. A - Augmented Principal Components Analysis with corresponding Linear Discriminant Analysis (LDA) balanced accuracy scores. B - Boxplots of original and augmented loss and confidence when making predictions; Receiver Operating Characteristic data is included to the right. C - Decision boundaries for LDA across PC1 (x-axis) and PC2 (y-axis).

Table 1 .
List of hyperparameters and settings tested during optimization of GAN model architectures.

Table 2 .
Summary of each dataset's target domain with univariate calculations of distribution normality in the top two PC scores.

Table 3 .
Best obtained absolute difference and p value calculations for robust equivalency testing of each of the synthesized distributions using different GANs.

Table 4 .
Best obtained absolute difference and p value calculations for robust equivalency testing. Values calculated comparing the original target distribution (DS1) with synthetic data generated by different GANs with different sized latent vectors as generator input.

Table 5 .
Best obtained absolute difference and p value calculations for robust equivalency testing. Values calculated comparing the original target distribution (DS3) with synthetic data generated by different GANs with different sized latent vectors as generator input.

Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 27 November 2020 doi:10.20944/preprints202011.0696.v1

Table 6 .
Best obtained absolute difference and p value calculations for robust equivalency testing after x number of epochs. Example of GANs trained on a subset of 30 individuals from DS2.

Table 7 .
Descriptive statistics comparing traditional bootstrapping procedures for numeric data augmentation with the best performing GAN on DS3. The dataset was augmented to size 100.

Table 8 .
Best obtained absolute difference and p value calculations for robust equivalency testing when comparing targeted generation of data using CGAN and GAN on DS3. Both CGAN and GAN were trained using Wasserstein Gradient Penalty loss.

Table 9 .
Descriptive statistics for augmented DS3 targeting specific label values. Numbers marked in bold indicate the synthetic data that obtained the most significant rTOST equivalency p values. Both CGAN and GAN were trained using Wasserstein Gradient Penalty loss.