1. Introduction
Accurate sex prediction of archaeological skeletal remains is a fundamental step for reconstructing biological and demographic profiles of past humans. After an archaeological site is surveyed and excavated and unknown human remains are identified, documented, and recovered, the sex and age of deceased individuals are commonly estimated using macroscopic methods of the pelvis, skull, and teeth [
1,
2,
3]. However, because female and male biological maturation rates differ [
4,
5], sex misidentification can lead to data recording bias and depreciated interpretability. After sex has been macroscopically estimated and with the assistance of other biological and archaeological contextual information, the identities and lifeways of the deceased can be reconstructed in bioarchaeological contexts. However, traditional macroscopic sex estimation methods possess varying degrees of accuracy [
6,
7,
8,
9,
10,
11]. For example, the pelvis and cranium might provide conflicting sex estimation results even within the same individual. This process is further complicated by other aspects, particularly of age, as tooth crown calcification and eruption and bone epiphyseal fusion are useful until early adulthood when 3rd molars erupt and bony ossification centers fuse skeletal elements into their final, united shapes. Pelvic, cranial suture, and sternal rib end methods are used to predict age in individuals through later stages of adulthood, albeit with wider margins of error.
Craniometric dimensions are frequently used as proxies for genetic relatedness of past humans due to their potentially heritable nature and correlations with neutral and adaptive genetic variation and selection [
12,
13,
14,
15,
16,
17,
18,
19,
20]. In the absence of genetic information, these methods are used to approximate the genetic and evolutionary relationships of past humans [
21], thus making accurate sex classification an integral first step in the reconstruction of other biological and demographic parameters. Hence, further examinations of sex correlations with other lines of evidence such as burial location, material culture, musculoskeletal stress markers, health, diet, disease, trauma prevalence, and biological relatedness will be skewed if sex is first misclassified.
Machine learning is slowly gaining a foothold in bioarchaeology and forensic anthropology despite our discipline’s deep ties to statistics and computational research for investigation of large quantitative datasets. Cunningham’s [
22] pioneering machine learning social anthropological work for rule-based kinship structure detection set a high bar for anthropologists of all subdisciplines to aspire. However, her work remains largely unrecognized even though it exemplifies the types of problem-and-dataset-driven questions faced by bioarchaeologists. This discrepancy persists despite the promise for bioarchaeological machine learning applications for predicting sex, age, ancestry, body mass, and stature in forensic anthropology, radiography, and anatomy [
23,
24,
25,
26,
27,
28,
29,
30,
31]. Even less bioarchaeological research has focused on missing data imputation [
32].
Therefore, more examples are needed to better contextualize our methodological understandings of sex estimation techniques. This research is an extension of Muzzall et al. (2017) [
33], which improved sex prediction accuracy of the William W. Howells Worldwide Craniometric Dataset and provided another example of the strong potential for machine learning to assist in sex prediction in bioarchaeological contexts. Here, I use a generalized low rank model to impute large amounts of missing data for a stratified cross-validated supervised ensemble machine learning approach. This framework consists of eight algorithms total and is fit to cranial interlandmark and dental metric distances to predict binary sex from six pelvic and cranially estimated samples at Alfedena (600–400 BCE) and Campovalano (750–200 BCE and 9–11th Centuries CE) in central Italy.
Italy is home to one of the most colossal bioarchaeological contexts on Earth and represents humans’ deep history throughout the region. Its central Mediterranean location, deep temporal breadth, and geological and environmental diversities have been influential in shaping the genetic, morphological, and cultural histories of the region [
34,
35,
36,
37,
38,
39]. Humans here developed some of the richest and most divergent forms of social interaction through worship, architecture, iconography and writing, and empires that persisted for long periods of time and across the globe via trade, warfare, and colonization. Central Italy was a particular crossroads between Africa and Europe and the Near East and Iberia and was home to many chiefdoms and nation-states that contained both shared and varied forms of settlement patterns, social and burial organization, material cultures, mortuary behaviors, and skeletal-dental morphologies. As a result, Italy’s bioarchaeological record provides a space to experiment with new methodologies for sex prediction.
3. Results
Results indicate that ensemble machine learning has strong potential for sex prediction and yielded AUC-ROC values greater than 0.90 for the cranial metric data and ~0.74 for the dental metric data. Males are larger than females in all dimensions as shown by the Tukey boxplots in
Figure 1 and
Figure 2 although distributions for the sexes overlap considerably.
AUC-ROC performance for each algorithm along with their standard errors and confidence intervals are shown in
Table 5. The combined craniodental data had the highest AUC-ROC with 0.9722, followed by the combined cranial (0.9644), face (0.9426), vault (0.9116), base (0.9060), and dentition (0.7421). Expectedly, the mean of Y is the worst performing algorithm in all cases (AUC-ROC = 0.500 for each). The SuperLearner algorithm has the highest AUC-ROC for all six bony regions while ranger is a close second for the face, vault, base, cranial, and combined craniodental data. Logistic regression, lasso, and ranger are all close seconds for the dental data.
Additionally, the single best algorithm (or combination of algorithms)—the DiscreteSL—was the ranger random forest algorithm for all 20 cross-validation folds for the face, base, combined cranial data, and combined craniodental data. However, for the vault, ranger was the best performing algorithm 19 times and the decision tree algorithm once. For the dental data, logistic regression was the best performing algorithm 14 times, lasso 4 times, and ranger twice—this algorithmic confusion could be related to the considerably lower AUC-ROC for the dentition compared to any of the cranial data.
The SuperLearner weight distributions show which of the individual algorithms contributed most to the ensemble (
Table 6). For the combined craniodental data, lasso contributed a coefficient of 0.4522, indicating that it contributed this percentage to the SuperLearner ensemble. This was followed by lesser contributes from the ranger algorithm (0.1734), xgboost (0.1700), logistic regression (0.1319), and decision tree (0.0726). For cranial data, ranger contributed a coefficient of 0.4610, followed by lesser contributions from logistic regression (0.1940), lasso (0.1411), decision tree (0.1267), and xgboost (0.0772). Contributions to the face stem mostly from ranger (0.4634) and logistic regression (0.4193), for the vault from ranger (0.5004) and decision tree (0.3234), and for the base from ranger (0.8878). For the dentition, contributions stem mostly from logistic regression (0.5591) and ranger (0.3582).
4. Discussion
AUC-ROC of this SuperLearner ensemble machine learning framework demonstrates strong potential for cranial sex prediction of archaeological human skeletal remains in this particular central Italian context. An important potential contribution of this research is that it reframes the problem of sex estimation as a predictive one and does not rely on assumptions of p-values, traditional hypothesis testing, or causal inference approaches. Instead, the focus was on model performance, standard errors, and confidence intervals. Additionally, the goal here was not to optimize any algorithms for maximum predictive accuracy, but to instead provide a gentle overview of the process and to stimulate the reader into thinking about how this approach could be applied in their own research contexts. This method can also potentially be employed in the field to help resolve disagreements between experts or for indeterminate remains.
Results also support previous research that ensemble machine learning has strong potential for sex prediction in the bioarchaeological record [
33]. Although the actual ground truth (in the binary sense; sex and gender are more dynamic than this in reality) male/female sexes of the individuals included in this study were not known, results support previous research that indicates contrasts between male and female morphological and burial patterns in central Italy during the Iron Age [
39,
40,
41]. Among the three different cranial regions, the face had the highest AUC-ROC values, followed by the vault and base. This could provide further support for the utility of the face for population reconstruction despite its greater environmental plasticity compared to the base and vault due to sensory functions of sight, smell, and taste [
67].
Of particular interest were the general size differences between males and females. Despite their overlapping measurement distributions—and if the modeling process was strongly influenced by size alone—it would be reasonable to expect that the dentition would have had higher AUC-ROC values similar to those of the cranial data. Whether or not the antimeric substitution of left teeth for right teeth in the absence of a right-side tooth and/or the sheer amount of missingness influenced the much lower dental AUC-ROC is unknown. More cranial-dental comparisons are necessary to evaluate the reliability of the dentition in this framework.
The ensembles themselves can be strengthened by including a greater diversity of algorithms and customizing them with varying hyperparameters (pre-training settings) to find the most accurate and best performing tunings [
68]. Other considerations can be more thoroughly incorporated as well, such as different confusion matrix derivations to evaluate performance, such as precision and recall to further highlight class imbalance problems, balanced estimator constructions, false discovery rate, and F1 score. Negative log-likelihood could also be used as the optimizer instead of nonnegative least squares. Other algorithms and methods also might be more appropriate—only a few algorithms with default settings were incorporated in this project but many others can be included in the ensemble (e.g., Bayesian additive regression trees [
69]). Features could be screened to identify more interpretable models and custom algorithms can be included to the researcher’s exact specifications (see Kennedy, 2017 [
61] for the R walkthrough). Moreover, deep learning—a subdiscipline of machine learning that utilizes multi-layered artificial neural networks for modeling, predicting, inferring, and understanding data—might be even more useful [
70]. When dataset sizes and the number of algorithms exceed personal compute potential, the software packages for analyses mentioned in this research have instructions to be run in parallel across multiple cores on a single computer or across multiple machines in cluster or remote settings. Perhaps of great interest to the bioarchaeologist, variable importance information can be extracted from various algorithms to see which cranial and dental dimensions have the highest weights for sex classification.
It is critical to note that due to the antiquity of the samples included in this research, the ground truth sexes of the individuals included were estimated macroscopically using pelvic and skull traits. As a result, future researchers should consider implementing this or similar frameworks using known-sex reference skeletal collections from the Hamann-Todd Osteological Collection (housed at the Cleveland Museum of Natural History), the Robert J. Terry Anatomical Skeletal Collection (Smithsonian Institution, National Museum of Natural History), or the 21st Century Identified Skeletal Collection (University of Coimbra, Portugal). However, my goal was not to concretely establish this ensemble machine learning method in any dogmatic way, but to instead onboard the reader to the basic concepts and their application in bioarchaeology. This study is merely a demonstration of the methods and an advertisement of the potential for generalized low rank imputation and ensemble machine learning processes in bioarchaeological and forensic contexts. Known-sex references samples should be a prerequisite for confirmation of methods presented here, and larger sample sizes might also be important. Cadaver samples and skeletal collections such as those mentioned above would be particularly useful for these procedures. Furthermore, I encourage future researchers to examine the effects that different missing data handling methods (listwise deletion, mean, median, k-nearest neighbor, bootstrap, expectation-maximization, multiple imputation, GLRMs, etc.) have on error estimates in cases of sex prediction in the bioarchaeological record.
Ensemble machine learning techniques should be considered as part of the bioarchaeologist’s toolkit as an additional method for comparison to macroscopic interrogations of the skeleton and dentition that we rely upon for reconstruction of the biological profiles of past humans. These techniques can potentially assist not only in bioarchaeological reconstructions, but also in forensic applications for identification of missing persons and perhaps even to material, faunal, and floral assemblages as well as mortuary studies and settlement organization. Furthermore, GLRMs warrant further exploration and should be considered by bioarchaeologists as a potentially strong data preprocessing tool when faced with missing data and analytical techniques that require full datasets for computation. Social scientists in general would benefit from updating their instrumentation with cross-validated ensemble machine learning techniques when research requires an outcome to be predicted.