1. Introduction
Evaluating deep learning models is essential for understanding the limitations of their performance. It is well known that neural networks, and to a greater extent deep neural networks, can require vast amounts of data to effectively learn a model of a system [
1]. Along with increasing the training dataset size by simply collecting more examples, other techniques are available that can potentially enhance a model’s performance, both during training and inference.
The objective of training deep neural networks for learning classification models is to subsequently perform the classification task on new data, which may include data that differ from, or are absent from, the dataset used to train the network. During training, a neural network learns a model of a given dataset that is representative of a system, for example, by performing many iterations (epochs), each consisting of a forward pass through a subset of the dataset followed by a backward pass, which uses backpropagation to calculate and propagate the gradients with respect to a loss function back through the network (other regimes exist). This allows the weights and biases of the neurons—known collectively as the learnable parameters—to be updated with the goal of minimising the calculated error (or loss) between the output and the ground truth [
2]. In contrast to these learnable parameters, there is a set of hyperparameters that are fixed at the time of training, and the values assigned to these can impact on the performance of a model, not least by also determining the number of learnable parameters. Hyperparameter tuning is the process of finding values for the hyperparameters that optimise the model [
3].
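To make the training procedure described above concrete, the following minimal sketch shows one epoch of supervised training in PyTorch; the model, data loader, and optimiser are hypothetical placeholders rather than the configuration used in this study.

```python
import torch.nn as nn

def train_one_epoch(model, loader, optimiser, device="cpu"):
    """One epoch: forward pass, loss calculation, backward pass, parameter update."""
    criterion = nn.CrossEntropyLoss()           # error between output and ground truth
    model.train()
    for inputs, targets in loader:              # iterate over mini-batches
        inputs, targets = inputs.to(device), targets.to(device)
        optimiser.zero_grad()                   # clear gradients from the previous step
        outputs = model(inputs)                 # forward pass
        loss = criterion(outputs, targets)      # classification loss
        loss.backward()                         # backpropagate gradients w.r.t. the loss
        optimiser.step()                        # update the learnable parameters
```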
Deep neural networks of sufficient complexity, when trained using supervised learning, are subject to the problem of overfitting, whereby they effectively begin to memorise the data they learn from. As a rule, a model that overfits a given dataset is less able to generalise to unseen data [
4]. Sensitivity to overfitting can be combatted by applying regularisation techniques such as model parameter regularisation (e.g., [
5,
6,
7]) and data augmentation [
8]. Regularisation techniques invariably have one or more associated tunable hyperparameters (including the choice of whether to use a given technique during training), and correctly tuning these hyperparameters is key to optimising the performance of deep learning models [
9], particularly when datasets are size limited.
Bergstra and Bengio [
10] showed that not all hyperparameters are equally important for all models, and demonstrated, both empirically and theoretically, that random search is a more efficient strategy than systematic grid or manual search for finding suitable hyperparameter values, at a fraction of the computational cost. Random search has since become an increasingly popular method for hyperparameter tuning, with modern machine learning platforms providing functionality to automate this process, including quasi-random search techniques for combining a multitude of different hyperparameter values, e.g., W&B Sweep [
11]. While random search increases the speed at which optimal (or near-optimal) combinations of hyperparameters can be found—which can be beneficial when there is, for example, a race to market—it does so largely at the expense of understanding the impact individual hyperparameters have on model performance. Traditional ablation studies, for example with manual hyperparameter selection, retain the advantage of being able to isolate any measurable contribution—positive or negative—from each modified hyperparameter, including combinations of hyperparameters, albeit at significantly increased computational cost [
10].
In this study, we measure the effect each hyperparameter, from a chosen set, has on the model performance of an encoder-only transformer, with a view to further optimising the task of classifying sets of isolated, dynamic signs from human pose estimation keypoints. To this end, we perform a large-scale ablation study in the form of a systematic manual search over selected hyperparameter intervals, as well as introducing further regularisation techniques, such as the shrinkage methods Lasso (L1 norm) [12] and Ridge (L2 norm) [13] parameter regularisation. We also apply augmentation techniques to our chosen sign language recognition dataset, which is small by deep learning standards. In addition to common augmentation techniques that include rotating, adding noise, and scaling the keypoints, we also introduce variability by manipulating the frames in the sequences in several ways. We extend the original benchmark study by Woods and Rana [
14], which was otherwise limited to an architecture search in the form of the number of transformer encoder layers and attention heads, by substantially increasing the hyperparameter search space as well as doubling the number of experiments conducted per configuration grouping.
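For reference, a minimal sketch of how the Lasso (L1) and Ridge (L2) shrinkage penalties can be added to a training loss is given below; the coefficients lambda_l1 and lambda_l2 are illustrative names rather than the values used in our experiments, and in practice the L2 term is often supplied through an optimiser's weight decay setting instead.

```python
def penalised_loss(base_loss, model, lambda_l1=0.0, lambda_l2=0.0):
    """Elastic-net style shrinkage: add L1 (Lasso) and L2 (Ridge) penalties."""
    l1 = sum(p.abs().sum() for p in model.parameters())   # sum of |w|
    l2 = sum(p.pow(2).sum() for p in model.parameters())  # sum of w^2
    return base_loss + lambda_l1 * l1 + lambda_l2 * l2
```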
Regularisation techniques used in deep learning tasks—in our case performing sign language recognition—take several forms. For convenience, we can split these into two distinct groups: model parameter regularisation, and data augmentation. Both groups have associated hyperparameters that may require tuning; however, some hyperparameters have what are deemed good default values, which have been discovered through previous research, and generally do not require any further tuning. We describe typical model parameter regularisation and data augmentation techniques below.
1.1. Model Parameter Regularisation
Model parameter regularisation encompasses both the number of learnable parameters through architectural aspects of a neural network, and restrictions imposed upon those learnable parameters. Regularising model parameters across the entire neural network can help mitigate overfitting by applying penalties during training that encourage learning a less complex model, for example, through having fewer learnable parameters, or by increasing the sparsity of those parameters by promoting fewer neuron activations [
15]. This can be thought of as training a neural network that models more of the broad characteristics rather than overly specific nuances of the dataset. Typically, if the model is too complex, it will begin to learn the noise implicit in the training dataset split to the detriment of generalised performance. The purpose of regularisation is to help achieve a balance between learning to model a dataset well while gaining—and subsequently retaining—the ability to make accurate predictions on unseen data that it cannot learn from. Here, we outline the hyperparameters associated with typical model parameter regularisation techniques.
The learning rate dictates how quickly the parameters are updated during optimisation, i.e., the backward pass, which in turn is controlled by a chosen optimisation algorithm, e.g., Adam [
16] or Stochastic Gradient Descent [
17], each having their own respective hyperparameters. With the addition of a learning rate scheduler (or multiple), of which there are many to choose from, the learning rate can evolve during training based on conditions that include the epoch number or the gradient of the loss value, and so on. A greater learning rate can increase the speed of convergence by taking larger steps towards an optimal configuration, but it can also reduce the potential to achieve an optimised set of parameters by effectively overshooting those that would otherwise most effectively minimise the loss. Conversely, too small a learning rate can lead to a much longer time to achieve convergence, which risks underfitting in the case where the maximum number of epochs is fixed and too small, but also increases the probability of settling on a local minimum in the loss landscape [
18]. Reducing the learning rate according to predefined rules is common for converging on optimal model parameters, but many learning rate schedules exist [
19].
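By way of illustration, the snippet below configures an Adam optimiser together with a scheduler that lowers the learning rate when a monitored validation loss plateaus; the specific values, and the train_one_epoch and evaluate routines, are hypothetical placeholders rather than the settings used in this study.

```python
import torch

optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)   # initial learning rate
# Reduce the learning rate by a factor of 10 when the validation loss
# has not improved for `patience` consecutive epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimiser, mode="min", factor=0.1, patience=10)

for epoch in range(200):
    train_one_epoch(model, train_loader, optimiser)   # hypothetical training routine
    val_loss = evaluate(model, val_loader)            # hypothetical validation routine
    scheduler.step(val_loss)                          # learning rate evolves with the loss
```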
Batch size is another hyperparameter that can have an effect on model performance [
20]. While it is typical to fix the batch size during training, it could in principle, as per the learning rate, also evolve according to other conditions, likely set by more hyperparameters—although this is generally not seen in practice. The type of batching can also be set as a hyperparameter. Taking the entire dataset split in a single batch can speed up convergence and provide more stable parameter updates, but it is more common to use mini-batching, especially in deep learning where datasets are often too big to hold in memory, which is also known to improve generalisation [
21]. A mini-batch variation is so-called random batching, where batches are created from randomly selected examples in a dataset split, only completing an epoch when every example has been seen at least once [
14].
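A minimal sketch of such a random batching scheme is given below, assuming examples are drawn with replacement and an epoch ends only once every index has been sampled at least once; the original implementation [14] may differ in detail.

```python
import random

def random_batches(num_examples, batch_size):
    """Yield batches of randomly sampled indices until every example
    in the dataset split has been seen at least once (one epoch)."""
    unseen = set(range(num_examples))
    while unseen:
        batch = [random.randrange(num_examples) for _ in range(batch_size)]
        unseen.difference_update(batch)
        yield batch
```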
The number of epochs a neural network is trained for is also tunable; too few can lead to underfitting, where the model has yet to achieve near-optimal performance on the training dataset split, and too many can lead to overfitting, where further training impairs generalised inference—both of which are undesirable outcomes. Early stopping, however, can provide some protection against overfitting caused by training for too many epochs. Again, this can come with associated hyperparameters, including a patience value or a minimum improvement in some performance metric, like loss or accuracy, before training is stopped. Taking model snapshots when a given metric is improved upon, again possibly determined by a configured hyperparameter, can ameliorate the problems caused by training for too long. Which method works best can depend on the deep learning task, or may even be discovered by trial and error.
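As an illustration, a simple early stopping helper with a patience hyperparameter and a minimum-improvement threshold might look as follows; the names and default values are illustrative only.

```python
class EarlyStopping:
    """Stop training when a monitored metric (e.g., validation loss) fails to
    improve by at least `min_delta` for `patience` consecutive epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience, self.min_delta = patience, min_delta
        self.best, self.bad_epochs = None, 0

    def should_stop(self, metric):
        if self.best is None or metric < self.best - self.min_delta:
            self.best, self.bad_epochs = metric, 0   # improvement: reset the counter
        else:
            self.bad_epochs += 1                     # no meaningful improvement
        return self.bad_epochs >= self.patience
```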
The number of hidden layers, e.g., in a fully connected neural network, is another tunable hyperparameter. Increased depth, along with choice of activation functions to introduce non-linearities to the model (e.g., ReLU [
22]), can enable more complex features to be modelled, but too many hidden layers can contribute to overfitting. Likewise, the number of neurons in a given layer is configurable and has a similar effect in regard to modelling more complex relations between inputs. Increasing the number of neurons can expand the feature space, whereas reducing the number of neurons—as per the encoder part of an autoencoder [
23]—can encourage a neural network to learn only the most important features of a model, which can help avoid modelling the training dataset too closely, to reduce overfitting.
Dropout is a technique that randomly sets a fraction of neuron activations output by a given layer to zero, as set by a hyperparameter [
24]. Dropout can promote learning redundant and more robust representations that do not rely heavily on individual or specific combinations of neuron activations, which can introduce a form of ensembling by effectively training multiple sub-networks within the neural network to which it is applied. The desired outcome is to regularise the network, so it is less able to learn a perfect model of the training dataset. As with many hyperparameters, the appropriate dropout probability is typically derived empirically [
25].
Other techniques include batch normalisation, which normalises the activations of a given layer to have zero mean and unit variance (computed over each mini-batch) to enhance training stability [
26], and label smoothing, which can reduce overconfidence in classification predictions [
27]. Both of these techniques can help regularise a model to reduce overfitting.
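To illustrate how these techniques are typically combined, the sketch below defines a small fully connected classifier with batch normalisation and dropout, paired with a label-smoothed cross-entropy loss; the layer sizes and probabilities are illustrative values, not our experimental configuration.

```python
import torch.nn as nn

classifier = nn.Sequential(
    nn.Linear(128, 256),
    nn.BatchNorm1d(256),   # normalise activations using per-mini-batch statistics
    nn.ReLU(),             # non-linearity
    nn.Dropout(p=0.1),     # randomly zero a fraction of activations during training
    nn.Linear(256, 100),   # e.g., 100 sign classes
)
# Label smoothing softens the one-hot targets to reduce overconfident predictions.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```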
Hyperparameters can also be architecture specific. For example, a convolutional neural network comprises many elements that include the number of convolutional layers, the number, size, and shape of the filters (kernels), the stride and padding values, and so on [
28], all of which can be tied to hyperparameters so the optimal combination can be searched for. Likewise, as demonstrated in our parent study [
14], a transformer encoder can have hyperparameters that determine the number of layers and attention heads, as well as the number of neurons in the fully connected layers, among others.
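For example, the architecture-level hyperparameters of an encoder-only transformer can be exposed as follows in PyTorch; the values shown are illustrative and do not correspond to the configuration from [14].

```python
import torch.nn as nn

d_model, n_heads, n_layers, d_ff = 128, 4, 3, 512   # tunable architecture hyperparameters

encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=n_heads, dim_feedforward=d_ff,
    dropout=0.1, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
classification_head = nn.Linear(d_model, 100)       # e.g., 100 sign classes
```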
1.2. Data Augmentation
In supervised deep learning, it is common practice to artificially increase the size of a given dataset by applying data augmentation techniques [
29]. This is perhaps most frequently utilised when training neural networks for image processing [
30,
31], but is also applied to other domains that include natural language processing (NLP) [
32], acoustic modelling [
33], time-series classification [
34], and relation classification [
35]. When a single labelled example is augmented, it is modified such that it produces one or more slightly different yet label-preserving, representative examples. Alternatively, new examples can be generated synthetically to also meet the criteria for a given label [
36,
37]. When applied appropriately, data augmentation is a powerful tool used to increase model robustness to variations and help reduce overfitting. Typical augmentation transformations applied to image data include rotating, scaling, flipping horizontally and vertically, skewing, warping, colour alteration (contrast, hue, temperature, etc.), erasing parts of images, random cropping, injecting noise, and even applying neural style transfer [
38], to name several. In our case, we artificially augment our chosen sign language dataset, which consists of human pose estimation keypoints, by applying a limited number of suitable transformations for the domain to examples from the training dataset split only.
It is possible for every data augmentation transformation to be controlled by at least one hyperparameter. For example, the range of angles by which images are rotated would be assigned a respective hyperparameter, as would the range of scaling that was to be applied. In addition to these hyperparameters, there exist more hyperparameters related to which data augmentation methods to use and when, including the mixing probabilities between raw and augmented data when creating batches during training. It is evident that the combination of model parameter regularisation and data augmentation leads to a hyperparameter space that has the potential to be extremely large for a given neural network, and finding optimal hyperparameter values can be a challenging task.
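As a simple illustration of how such hyperparameters interact, the sketch below applies two augmentations to a keypoint sequence, each governed by a magnitude hyperparameter and a probability of being applied at all; both the transformations and the values are hypothetical.

```python
import numpy as np

def augment(sequence, max_noise=0.01, max_scale=0.1, p_apply=0.5, rng=None):
    """Apply each augmentation with probability `p_apply`; the magnitudes are
    themselves hyperparameters. `sequence` has shape (frames, joints, 2)."""
    rng = rng or np.random.default_rng()
    if rng.random() < p_apply:   # noise injection
        sequence = sequence + rng.uniform(-max_noise, max_noise, sequence.shape)
    if rng.random() < p_apply:   # uniform scaling
        sequence = sequence * (1.0 + rng.uniform(-max_scale, max_scale))
    return sequence
```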
Beyond data augmentation, it is common practice to normalise the input values as a pre-processing step, which can come before and/or after augmentation. Normalisation can take many forms and is used to standardise a dataset, both aiding and accelerating model convergence [
39].
1.3. Related Work
We now consider related work, focusing on the variety of techniques used to regularise sign language recognition models in published research that makes use of the same base dataset of keypoints.
For their SPOTER model, Bohacek and Hruz [
40] apply randomised data augmentations to the keypoints during training, with the probability of an augmentation being applied set to 0.5. These augmentations include rotating all keypoints about the centre point by a randomly selected angle in the interval
; two types of perspective transformations, which are performed on all keypoints excluding those for the hands; sequential rotation of the arm keypoints; and adding Gaussian noise. The first transformation uniformly squeezes every frame of a given sequence by up to 15% of the original frame width, and the second performs a perspective transformation ostensibly to simulate the effect of camera tilt with respect to the subject. The sequential rotation of the arm joints is intended to mimic slight variances in a given sign, which the authors claim do not change the meaning of the signs. They subsequently normalise by scaling all keypoints (treating the hands and body keypoints separately) to lie in the interval
before shifting by
in the
x and
y planes. The authors fix the seeds of the random number generators, and as such, their training regime does not assess the impact random seed initialisation has on the variability of model performance. In addition, they do not utilise a learning rate scheduler, nor do they use weight decay with their chosen stochastic gradient descent optimiser. Their transformer model architecture is fixed at 6 encoder and decoder layers, each having 9 attention heads, with a hidden layer dimension of 108 neurons and feed-forward layers with 2048 neurons. They use a learned positional encoding for the encoder input, and a class query for the modified decoder input.
Following SPOTER [
40], Eunice et al.’s [
41] Sign2Pose model appears to have an almost identical architecture. There are some minor differences, which include slight variations to the augmentations applied (e.g., rotations being expanded from randomly selected angles in
in SPOTER to
in Sign2Pose) and the inclusion of weight decay with the
penalty hyperparameter set to
. There is an additional augmentation transformation not performed in the SPOTER study, however, in that keypoints are flipped horizontally with a probability of 0.5, which would be problematic for the signs
left and
right if they are present in the dataset splits used. Another minor difference is the number of so-called head units chosen to define the signing space, although the underlying mechanism is the same. The one striking difference, however, is the method by which frames that are deemed to be redundant are discarded, which appears to have a significant impact on model accuracy.
For their Pose-GRU model, Li et al. [
42] use 2 stacked gated recurrent unit (GRU) layers, regularised by configuring each GRU’s hidden layers with dimensions of 64, 64, 128, 128, which they state were derived empirically and do not alter throughout. They randomly select 50 frames from each example sequence, which provides some augmentation through randomisation, and perform classification on every frame within a sequence, pooling the outputs to derive a final prediction. They use cross-entropy loss and the Adam optimiser. In contrast, their graph convolution network (GCN) based model, Pose-TGCN, appears to apply no explicit regularisation techniques other than early stopping, which they apply to all training runs, stopping when the validation accuracy no longer increases or the number of elapsed epochs reaches 200.
Tunga et al. [
43] appear to extend Pose-TGCN by connecting it to a BERT transformer [
44], which they refer to as GCN-BERT. They also use cross-entropy loss and the Adam optimiser, regularised with the weight decay
penalty hyperparameter set to
, and train for 100 epochs. No other details are provided that document any other techniques employed to improve model performance.
1.4. Contributions
This article presents the first comprehensive, large-scale ablation study of an encoder-only transformer for modelling sign language using human pose estimation keypoint data. We quantify the effects that a variety of regularisation and augmentation techniques have on model performance. We show that fixing the random number generator seeds and repeating a single experiment with a hyperparameter value change is an inadequate method for universally determining the outcome of that hyperparameter change. We also provide strong evidence, in the case of this task and a dataset consisting of sparse values, to support the hypothesis that the size of the dataset, rather than any of the regularisation or augmentation techniques applied, is the limiting factor in improving the performance of our model for classifying isolated, dynamic signs.
1.5. Article Organisation
The remainder of this article is organised as follows:
Section 2 details the materials and methods used in this study, including comprehensive descriptions of the regularisation and augmentation techniques applied;
Section 3 lists the outcome of every experiment conducted;
Section 4 gives in-depth discussion and analysis of the results; and finally,
Section 5 provides conclusions to be drawn from the study.
4. Discussion
We begin by evaluating the two batching methods. As
Figure 1 shows, single-pass batching requires many more epochs before achieving a model accuracy that is comparable to random batching, at which point neither method is clearly superior. This is because models trained using single-pass batching underfit the data when trained for an insufficient number of epochs. Woods and Rana [
14] showed that, for this task, the majority of models trained using random batching converge at around 130 epochs, and we can expect all models trained using this method to converge by approximately 250 epochs. It is therefore possible that, at 200 epochs, some of the random batching experiments had not fully converged before training had been completed. Nevertheless, given enough iterations through the data for single-pass batching (e.g., ∼1000 epochs), both methods are comparable in terms of performance on the test set. Using the number of epochs as a metric for time to train can be misleading, however, because random batching, in practice, makes multiple iterations through the majority of the entire dataset per epoch, and in doing so, each epoch takes more time to complete than with the single-pass batching method.
Taking the results from single-pass batching at 1500 epochs and comparing with random batching at 200 epochs using an independent two-sample
t-test, we find the differences between the respective train and validation accuracy result groups are statistically significant at
and
, respectively, with both mean accuracies being greater on the random batch experiments. However, for the test accuracy, at
, we find no statistical significance between the two experiment groups, showing that both methods are comparable in test set performance. The full top-1 accuracy results for the batching method experiment analysis are listed in
Table A1 and
Table A2.
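A minimal sketch of the statistical comparison used above is shown below, based on SciPy's independent two-sample t-test; the accuracy values are illustrative placeholders rather than our measured results.

```python
from scipy import stats

# Top-1 test accuracies (illustrative values) from repeated runs of each method.
single_pass = [0.78, 0.79, 0.77, 0.80, 0.78, 0.79]
random_batch = [0.79, 0.80, 0.78, 0.79, 0.81, 0.80]

t_stat, p_value = stats.ttest_ind(single_pass, random_batch)
alpha = 0.05  # significance threshold
verdict = "significant" if p_value < alpha else "not significant"
print(f"t = {t_stat:.3f}, p = {p_value:.3f}: difference is {verdict}")
```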
Comparing the impact of batch size, we observe no appreciable effect on test set accuracy up to at least batch size
, after which performance drastically degrades (see
Figure 2). We did not test any batch sizes between
and
, so we cannot provide a more precise estimate for which model performance begins to degrade. When using random batching, larger batch sizes equate to fewer iterations through the data, which explains the observed reduction in accuracy once a threshold is met. As the batch size increases, random batching begins to approximate single pass, which we show to perform worse than random batching at lower epochs. We also observe a reduction in validation set performance for batch sizes
, indicating some dataset imbalance.
Reviewing the results from the learning rate experiment, we observe that, for this task, the learning rate that produces the best model performance is between approximately
and
, with a reduction in accuracy outside of that range up to the values measured.
Figure 3 clearly identifies the peak performance range of learning rate values, with
giving the best test set accuracy over the range of learning rates tested.
We analyse the effect that L1 and L2 parameter regularisation has on model performance by plotting the outcomes of their respective experiments separately. In both cases, the comparative model performance from applying no parameter regularisation is overlaid, which shows the mean top-1 performance for the train, validation, and test sets as dotted lines, with the respective calculated uncertainty making up the shaded areas. No shaded area is visible for the training set performance because its value is zero when evaluated to five significant figures. For convenience, we limit the y-axis range of both plots equally to exclude values that correspond with very low performance outcomes, as we do on other plots where performance is severely reduced as the result of a hyperparameter value.
Figure 4 shows the effect of applying elastic net regularisation over a range of values for the
parameter,
, with a fixed
parameter,
. The shaded areas indicate the mean top-1 accuracy minimum and maximum values from the calculated uncertainty for 16 experiments with no parameter regularisation being applied, which allows the effect of elastic net regularisation to be observed. By also referring to
Table 6, it is clear that values of
start to significantly negatively impact on model performance. Conversely, values of
show the models perform better than without any parameter regularisation being applied, with a mean top-1 accuracy gain of approximately 1.7% on the test set for the best-performing value
. The trend suggests values of
may offer further, albeit marginal, gains, but it must be noted that the error of these measurements does lie within the range of the baseline error, which in this case is a model with no parameter regularisation. At this point in the analysis, therefore, we cannot rule out the possibility that the observed gains are superficial. With the assumption that there is a real performance gain at very low values of
, it appears to be the case that, in conjunction with
parameter regularisation, this technique works best for this task when
is included somewhat sparingly.
Applying the independent two-sample
t-test to the results from the experiments with the best-performing value of
and those with no parameter regularisation (the control), we find no significant difference between the two groups for the train accuracy, with
. We do, however, find a significant difference in the validation and test accuracies, with
and
, respectively, with both group mean values being greater than the control group with no parameter regularisation. This strongly indicates that elastic net regularisation with a very low value for
improves model performance compared with no parameter regularisation. The full top-1 accuracy results for the elastic net regularisation and control experiments analysis are listed in
Table A3 and
Table A4.
Figure 5 shows the effect of applying
parameter regularisation alone. As with elastic net parameter regularisation, there is a clear degradation in model performance for larger related parameter values. The model is, however, less sensitive to larger values of
when compared with the magnitude of
because of the definition of the penalty imposed on the neuron weights (see
Section 2.3.4). There is a clear measured optimal value at
, which does not intersect with the baseline error (where the baseline, again, is a model with no parameter regularisation). This value of
provides an observed mean performance gain of approximately 2.1% over no parameter regularisation.
Performing an independent two-sample t-test on the results from the experiments with the best-performing value of and those with no parameter regularisation, we find—again, similar to elastic net regularisation—that there is no significant difference between the two groups for the train accuracy, with , but we do find the difference is significant for the validation and test accuracies, which have and , respectively. In both of these latter cases, the mean values are again greater than the control. We conclude that parameter regularisation improves model performance by a significant margin.
It would be instructive to perform
parameter regularisation experiments as a standalone group without elastic net regularisation. Despite this, the selected value for the
parameter,
, can be attributed to being a fortunate choice for the common baseline hyperparameters. The observed best combination of hyperparameters for the elastic net configuration is almost certainly caused by the contribution made by the
component over
, and we can test this statistically. This would explain why
parameter regularisation performs better than elastic net regularisation, and why elastic net regularisation performs best for extremely low values of
where the
contribution is significantly reduced. Comparing the outcome of the statistical tests performed on the elastic net and
parameter regularisation, with
and
, respectively, we observe that, statistically,
parameter regularisation alone produces a more performant model on the test set, although experimenting with different combinations of hyperparameter values would provide more confidence. The full top-1 accuracy results for the
parameter regularisation and control experiments analysis are listed in
Table A3 and
Table A5.
The effect that the number of neurons in the encoder feed-forward block layers has on model performance is surprising.
Figure 6 shows the test set performance appears to rise in line with the feed-forward layer dimension.
Figure 7 isolates the test set results and rescales them to give a clearer picture of what appears to be happening. We could expect an inflated layer dimension to increase overfitting on the training set, thereby impeding the ability of the model to generalise—especially so in the case of our model, which is clearly already overfitting—and conversely, we could reasonably expect a reduced layer dimension bottleneck to help mitigate overfitting, but this is not the outcome we observe with the experiments conducted. Model performance on the test set trends upwards with an increase in neurons up to the experimental limit of
, having plateaued at around
, with little performance difference in this range of dimensions. For convenience, this is marked in
Figure 7 with a green dashed line. Increasing both the number of neurons beyond our limit of
and the number of experiments per configuration would provide insight into the point at which the encoder feed-forward block layer dimension begins to harm performance by overfitting the training dataset split to the detriment of generalisation to the test set. One possible explanation for the observed behaviour is that overfitting on the training set benefits some classes in the test set that have very similar equivalent training set examples. If this is the case, then by increasingly memorising those training set class examples, the model more easily recognises very similar examples in the test set. It is worth noting that we do not see the same behaviour in the validation set, which could contain class examples with enough difference within each class when compared to the training set. This presents an opportunity for further analysis beyond the scope of this study, and would perhaps benefit from the expertise of someone proficient in ASL. What we find here is not dissimilar to the batch size experiment, where an effect that we would expect to harm performance on a more balanced dataset is not observed.
Figure 8 shows that applying dropout to the encoder appears to have little effect on model performance for
. There is a slight improvement in test set performance for the range
compared with lower values of
, but given the experimental uncertainty of the measurements, no optimal value up to
can be determined. Dropout probabilities greater than
, however, show a marked decline in performance, with the most severe reduction in accuracy at the upper limit
, which is unsurprising. Dropout also appears to reduce validation set performance across the spectrum. This is notable because dropout is generally considered advantageous to training deep neural networks [
24,
25]. One possible explanation is that whereas techniques like
parameter regularisation encourage a spread of activations with no heavy dependencies on specific neural pathways, other techniques, like dropout, encourage neural pathway redundancy, which may still favour strong (but redundant) activation pathways, and because
parameter regularisation is active by default throughout all dropout experiments, the two techniques could be in conflict when dropout is applied. Again, this presents another opportunity for a future study to measure the effect of applying a range of dropout probabilities against a range of
parameter regularisation values, including none. It is, of course, also likely that the small dataset size plays an important role in the effect dropout has for improving the model’s capacity to generalise to unseen data.
Dropout is applied to the training dataset split input embeddings as an analogue to neuron dropout in the encoder, but as can be seen in
Figure 9, it is clear that no amount of embedding dropout is beneficial for model performance on any of the dataset splits. This categorically rules out embedding dropout as a valid strategy to reduce overfitting.
Augmenting the training dataset split by adding noise has no clear measurable effect on model performance for levels of noise
. The values of
quoted correspond with the maximum absolute value of injected noise per keypoint in a batch. Random batching means that, per epoch, some batches will likely be included to which little or no noise has been applied.
Figure 10 shows that once the level of injected noise goes above a maximum value of approximately
, the amount of noisy data begins to overwhelm the less noisy data, which impedes the model’s ability to generalise to the test dataset split. The outcome of these experiments is in contrast to the study conducted by Bohacek and Hruz [
40], who quote a measured increase in accuracy on 100 classes from 62.79% to 63.18% by augmenting with added noise.
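A minimal sketch of this noise augmentation, assuming keypoint arrays of shape (frames, joints, 2) and uniform noise bounded by a maximum absolute value, is shown below.

```python
import numpy as np

def add_keypoint_noise(keypoints, max_noise, rng=None):
    """Inject uniform noise in [-max_noise, +max_noise] into every keypoint coordinate."""
    rng = rng or np.random.default_rng()
    noise = rng.uniform(-max_noise, max_noise, size=keypoints.shape)
    return keypoints + noise
```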
Augmenting the training dataset split by arbitrarily, but uniformly, rotating the keypoints about the origin, up to a maximum rotation value, appears to provide no real positive or negative effect on model test set performance, across the entire range of rotation values tested,
(see
Figure 11). Taking our best mean top-1 accuracy score from the rotation experiments, where
, and again comparing with the study performed by Bohacek and Hruz [
40], we see an increase of ∼0.43% from augmenting with rotation, whereas they report an increase in accuracy of 2.27%. It is possible that differences in implementation can cause some discrepancy, but with a measured experimental uncertainty of ∼0.56% relating to our best mean score, we conclude that we observe no measurable effect from augmenting with rotation.
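The rotation augmentation can be sketched as follows, assuming keypoints of shape (frames, joints, 2) rotated about the origin by a uniformly sampled angle; this is an illustration rather than our exact implementation.

```python
import numpy as np

def rotate_keypoints(keypoints, max_angle_deg, rng=None):
    """Rotate all keypoints about the origin by a random angle in
    [-max_angle_deg, +max_angle_deg] degrees."""
    rng = rng or np.random.default_rng()
    theta = np.radians(rng.uniform(-max_angle_deg, max_angle_deg))
    rotation = np.array([[np.cos(theta), -np.sin(theta)],
                         [np.sin(theta),  np.cos(theta)]])
    return keypoints @ rotation.T
```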
Analysing the three groups of scaling experiments together, we can see in
Figure 12,
Figure 13 and
Figure 14 that augmenting the training dataset split by scaling separately along the
x- and
y-axes, and together on both axes, over the selected values,
, offers no test set performance enhancement. With the exception of the extremes of the tested values
and
along the
x-axis, all results fall within the baseline uncertainty where no augmentation is applied. The uncertainties associated with the results from the experiments with
and
do, however, intersect, so no firm conclusion can be drawn. The rationale for this stems from the results for the superficially anomalous lowest value
; augmenting by a maximum scaling value of
means the difference between keypoints that receive no
x-axis perturbation and those that do is minuscule compared with those that, during random batching, receive a much larger range of perturbation values (e.g.,
) where the performance is somewhat increased compared with the baseline. The deviation from the mean at this lowest extreme is commensurate with the deviation at the upper extreme of
, and we therefore conclude that the variation observed shows no real impact on model performance across the entire range of
in all axes other than perhaps reducing the stability of the model.
The subtle effect of simulating increased speed of movement at various points, by dropping random frames per batch up to the configured maximum set by
, appears to have no effect at all on model performance over the range of hyperparameter values tested.
Figure 15 clearly shows no reduction in overfitting nor change in either validation or test set performance. We are somewhat surprised that this technique does not work as expected, and this is perhaps because the effect is too subtle. Increasing the range of values
can take may produce more positive results. Better still would be to remove frames intelligently, using real analysis of the motion of various keypoints throughout a given sign, thereby, for example, reducing the length of time a particular hold lasts, or speeding up the motion from rest to sign by differing amounts. But dramatically altering the sign sequences begins to encroach on synthetic sign dataset creation, and as previously stated, we recommend the involvement of (preferably Deaf) expert signers when manipulating signs beyond the basic techniques we have employed. We do not do so simply because we are not Deaf.
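For reference, a minimal sketch of the drop frames augmentation, assuming sequences of shape (frames, joints, 2) and a hyperparameter giving the maximum number of frames to remove, is shown below.

```python
import numpy as np

def drop_random_frames(sequence, max_drop, rng=None):
    """Remove up to `max_drop` randomly chosen frames, crudely simulating
    faster movement at the removed positions."""
    rng = rng or np.random.default_rng()
    n_drop = min(int(rng.integers(0, max_drop + 1)), len(sequence) - 1)
    drop_idx = rng.choice(len(sequence), size=n_drop, replace=False)
    keep = np.setdiff1d(np.arange(len(sequence)), drop_idx)  # preserves frame order
    return sequence[keep]
```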
The trim start augmentation is intended both to help remove some of the silence before a sign begins and to alter the time at which a sign does begin, such that more variety is introduced into the dataset; it is clear, however, that this augmentation becomes detrimental beyond a certain point, as
Figure 16 clearly illustrates. Every mean experiment result sits towards the lower bound of the uncertainty of the baseline, up until
, after which (at
) performance clearly degrades across all dataset splits. This is almost certainly because a significant proportion of signs are being truncated with a sufficiently high
. As with the drop frames experiment, performing this kind of augmentation more intelligently, with perhaps a mechanism to detect the start of the sign to ensure no essential frames are trimmed from the start of any given sequence, would likely yield better results. Likewise, extending the augmentation to detect the end of the sign would likely produce better results still [
46].
The offset augmentation techniques are very similar in that both delay the effective start of a given sign by inserting leading frames, truncating the sequences as required to keep their length constant. The only difference is that the offset copy augmentation inserts copies of the first frame data, whereas offset pad inserts empty frames. Referring to
Figure 17 and
Figure 18, it is clear that neither reduces overfitting to the benefit of the model’s ability to generalise to the validation and test sets, with the offset pad augmentation (which inserts blank frames) actively harming the model beyond
. The conclusion that we are able to draw here is that some representative data, even if static, is better than no data at the beginning of a sign sequence. Given that all sequences are padded to a fixed length with zero values for the keypoint coordinates, and the sequence lengths—and by extension, sequence masks—are not used in the encoder, it would be interesting to pad sequences to the maximum sequence length with static copies of the final frame rather than empty values and measure the outcome.
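Both offset variants can be sketched with a single helper, assuming sequences of shape (frames, joints, 2); mode selects between copying the first frame and inserting empty (zero) frames, and the names and values are illustrative only.

```python
import numpy as np

def offset_sequence(sequence, n_offset, mode="copy"):
    """Delay the effective start of a sign by inserting `n_offset` leading frames,
    then truncate so that the overall sequence length stays constant."""
    if mode == "copy":
        lead = np.repeat(sequence[:1], n_offset, axis=0)    # copies of the first frame
    else:  # "pad"
        lead = np.zeros((n_offset,) + sequence.shape[1:])   # blank frames
    return np.concatenate([lead, sequence], axis=0)[:len(sequence)]
```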
We have seen from the experiments conducted so far that randomised seeds produce a range of results for experiments that are otherwise identical in configuration, and we show that it is necessary to repeat experiments many times to produce a mean value with associated measurement uncertainty so that the impact a given hyperparameter has on outcome can be evaluated. If a particular hyperparameter value is expected to produce a positive or negative change in outcome, we should expect the same polarity in the outcome by repeating the same experiment over a range of fixed seeds. The consequence of this not being true is that changing the value of a hyperparameter and repeating a single experiment is no guaranteed indicator of the efficacy of that hyperparameter change, with either a fixed or randomised seed. More simply, the requirement for multiple experiments with different seeds does not go away.
Reviewing the results from the fixed-seed comparison on singular hyperparameters experiment, we find evidence that fixing seeds does not lead to single-experiment results that show a clear outcome from a single hyperparameter change. For the experiment that compares training runs using the same fixed seeds for
(the control) and
, we plot the change in outcome for the validation and test sets for each given fixed seed such that a positive value means an improved accuracy was observed. We omit the training set outcomes because the difference is negligible.
Figure 19 shows clearly that both the validation and test set accuracy scores can change considerably—and change polarity—when the exact same experiment is repeated with a different fixed seed. Even more notable here is that while
Table 7 shows the mean top-1 accuracy after 16 experiments has
performing better than
(the control), a naive interpretation of the results from this fixed-seed experiment would lead to the conclusion that an overall drop in performance is observed for
. Repeating the process for the experiment that compares training runs using the same fixed seeds for
(the control) and
, we see a similar result (see
Figure 20), with a spread of difference in outcome for both the validation and test set accuracy.
Figure 21 shows the difference in outcome for the experiment with
(the control) and
, but shows test set accuracy is reduced for each of the tested fixed seeds. Again,
Table 8 shows the mean top-1 accuracy for
after 16 experiments (each with a randomised seed) is marginally higher than
on the same block of experiments. Running single experiments with any of the fixed seeds in our tested range, in this case, would lead to the erroneous conclusion that choosing
would produce a worse test set accuracy than
, which we do not observe after 16 repeated experiments (see
Table 8).
We now analyse the results from the fixed-seed comparison on normalisation experiment (see
Table 18). Taking each group where the keypoints are normalised before augmentation versus renormalised after augmentation, we can perform an independent two-sample
t-test to determine whether the two distributions are comparable, i.e., whether no measurable difference exists between the two groups. Insufficient resources prevent us from increasing the sample size, but nevertheless, we find the rotation experiment with
shows no significant difference between the two groups across the training, validation, and test set accuracies, with
,
, and
, respectively. We find the same is true for the
x-y scaling experiment with
. With
,
, and
, for the training, validation, and test splits, there is no significant difference found. We conclude that for the experiments conducted on our models, it is unlikely that renormalising after augmentation offers any performance gain. This is useful because it means the extra computational overhead of renormalising every batch can be avoided in future experiments that use these same techniques.
The dataset used in this study is small by deep learning standards, at only 2663 examples in the 100 sign classes group.
Table 19 shows the impact that dataset size has on model performance by reducing the number of class examples by a ratio defined by the hyperparameter
, where
means no reduction in size and
would mean the number of class examples in each dataset split is halved. Reducing the dataset size shows a drop in top-1 test set accuracy proportional to the size, which is to be expected. To estimate the improvement that an increased dataset size could have on model performance, we can extrapolate to larger dataset sizes using regression analysis. As a conservative best-case estimate, we fit a linear trend to the full set of dataset reduction results and obtain the equation
with coefficient of determination
. For a worst-case estimate, we assume an increase in dataset size eventually asymptotes to a maximum accuracy value, at which the model is essentially saturated, and performance ceases to improve with increased dataset size. We therefore fit a polynomial of degree 2 and obtain the equation
with coefficient of determination
. Using these equations, we predict mean accuracy at increased dataset sizes up until the model reaches 100% accuracy on the linear fit, and until the model reaches a peak on the polynomial fit. These results are shown in
Figure 22, where the linear limit is found to be at a dataset size of 4624 example sequences, and for the polynomial fit, the peak is reached at 3900 example sequences. The linear and polynomial limits are marked with vertical dashed lines. We find the worst-case top-1 mean accuracy is approximately 83% given a sufficiently large dataset, which is a mean value improvement of approximately 3% on our best-performing model configuration. We therefore cautiously estimate that the dataset size would need to be doubled for this model to produce the best possible top-1 test set accuracy. Despite the analysis performed, we must also consider the change in balance that simply adding more example sequences would bring. If the dataset imbalance could be improved, so could our estimates and no doubt also our model performance. Notwithstanding these caveats, it is clearly demonstrated that the dataset size is the predominant limiting factor, given the correctness of our assumptions.
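A minimal sketch of this extrapolation, using numpy.polyfit for both the linear (best-case) and degree-2 (worst-case) fits, is shown below; the (dataset size, accuracy) pairs are illustrative placeholders, not our measured results.

```python
import numpy as np

# Illustrative (dataset size, mean top-1 accuracy) pairs, not our measured values.
sizes = np.array([1332, 1598, 1864, 2130, 2397, 2663])
accuracy = np.array([0.60, 0.66, 0.71, 0.75, 0.78, 0.80])

linear = np.polyfit(sizes, accuracy, deg=1)       # conservative best-case trend
quadratic = np.polyfit(sizes, accuracy, deg=2)    # worst case: saturating trend

larger_sizes = np.arange(2663, 5501, 100)         # extrapolate beyond the dataset
best_case = np.polyval(linear, larger_sizes)
worst_case = np.polyval(quadratic, larger_sizes)

# The vertex of the concave quadratic marks where accuracy stops improving.
peak_size = -quadratic[1] / (2 * quadratic[0])
print(f"Quadratic fit peaks at a dataset size of ~{peak_size:.0f} examples")
```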
Finally, we note an improved maximum test set accuracy score for 100 classes of
, beating the previous accuracy score of
[
14]. This result was observed in the fixed-seed control experiment group with
(see
Table 17), which is notable because, other than using a fixed-seed value, it utilises the common baseline hyperparameter configuration used throughout (see
Table 2). We have reproduced and updated the results table from Woods and Rana [
14] to include the best maximum top-1, top-5, and top-10 results for 100 classes from this ablation study (see
Table 20). We have also included the best mean top-1 accuracy score, taking the associated top-5 and top-10 accuracy scores from that same experiment group. The best mean top-1 accuracy score was observed in the feed-forward block layer dimension experiment group with
.
For some of the augmentation techniques that show no impact on test set performance, it would have been better to repeat these experiments while increasing the hyperparameter values until an effect is observed, positive or negative, or until the hyperparameter is exhausted (i.e., ). It is possible that some of the augmentation techniques may still work, e.g., drop frames or offset copy, but only with sufficiently high associated hyperparameter values.
With many variations in the same signs, we conjecture that the difference between some of these same-sign examples is greater than any basic augmentation technique can bridge. For example, the WLASL-alt examples for the ASL sign
show exhibit some variation that includes hand positions that are slightly to the left or right or to the front or at any intermediate angle. We hypothesise that the model would have to learn separate representations for distinct forms of the same signs, but proving this is beyond the scope of this study. The obvious conclusion is that—in accordance with our observations—minor perturbations like adding noise or rotating keypoints are insufficient compared with including more representative example sequences in the dataset that cover all sign variations, which would ultimately improve model performance. This is an important factor that differentiates image-based deep learning models [
42,
53] from sparse human pose estimation keypoint data-based models [
14,
40,
41,
42].
Performance gains made by optimising individual hyperparameters, which otherwise fall within the measured uncertainty across a sample of experiments, may have a stacking effect when combined with other optimised hyperparameters. Discovering these optimal combinations, however, would require a more thorough, exhaustive search, which, given the combinatorial explosion in the number of possible hyperparameter combinations, is impractical. This is where random hyperparameter search becomes the preferred choice [
10].