Efficiency of the Adjusted Binary Classification (ABC) Approach in Osteometric Sex Estimation: A Comparative Study of Different Linear Machine Learning Algorithms and Training Sample Sizes

Simple Summary This study adopts a dynamic methodology to explore challenges to the practical application of the adjusted binary classification (ABC) approach, which are related to the unmodifiable characteristics of data used in its development, such as intrasexual variation (sexual dimorphism) of variables and methodological factors such as the selected classification algorithm and sample size. The adequacy of a training dataset’s size was judged relative to the classification performance in an independent test set. Finding an optimal classifier was also addressed in this study, wherein the results demonstrate that both statistical modeling and machine learning techniques perform almost equally in the univariate models; however, differences are evident in the multivariate model due to the different number of variables included via the feature selection process, as well as the effect of inadequate training sample size relative to the test set. This approach is particularly useful when quick classification/prediction is required for making real-time forensic decisions. Abstract The adjusted binary classification (ABC) approach was proposed to assure that the binary classification model reaches a particular accuracy level. The present study evaluated the ABC for osteometric sex classification using multiple machine learning (ML) techniques: linear discriminant analysis (LDA), boosted generalized linear model (GLMB), support vector machine (SVM), and logistic regression (LR). We used 13 femoral measurements of 300 individuals from a modern Turkish population sample and split data into two sets: training (n = 240) and testing (n = 60). Then, the five best-performing measurements were selected for training univariate models, while pools of these variables were used for the multivariable models. ML classifier type did not affect the performance of unadjusted models. The accuracy of univariate models was 82–87%, while that of multivariate models was 89–90%. After applying ABC to the crossvalidation set, the accuracy and the positive and negative predictive values for uni- and multivariate models were ≥95%. Sex could be estimated for 28–75% of individuals using univariate models but with an obvious sexing bias, likely caused by different degrees of sexual dimorphism and between-group overlap. However, using multivariate models, we minimized the bias and properly classified 81–87% of individuals. A similar performance was also noted in the testing sample (except for FEB), with accuracies of 96–100%, and a proportion of classified individuals between 30% and 82% in univariate models, and between 90% and 91% in multivariate models. When considering different training sample sizes, we demonstrated that LR was the most sensitive with limited sample sizes (n < 150), while GLMB was the most stable classifier.


Introduction
The reconstruction of a skeletal biological profile of unidentified human remains encompasses four main parameters: sex, age at death, stature, and population affinity [1,2]. Usually, one of the first steps in that process is osteological identification of sex, as it remarkably reduces the number of possible matches and enables the application of other biological profiling methods that are sex-specific [3]. Sex-related markers in the human skeleton broadly reflect size sexual dimorphism (SSD), general robusticity of bones, and entheseal changes due to muscular bulk attachment and activity [4], but they can also be prominent in shape differences that mainly result from biological adaptations [2,4]. Therefore, two methodological approaches are commonly used for sex diagnosis: morphoscopic and osteometric approaches [2]. The osteometric and morphoscopic features can be successfully analyzed on pelvic and cranial bones [5][6][7][8][9], whereas osteometric methods are commonly most productive using long bones [10][11][12][13][14][15][16][17]. One or multiple skeletal elements or combinations of features enable the forensic deduction of sex depending on the degree of fragmentation and presence of a skeletal element [4,[16][17][18][19][20][21]. Undamaged pelvic bones are by far the most accurate bones available for sexing [5]. When the pelvic bones are missing, long bones are considered to have the most important role in sex estimation [22]. Femora were the focus of attention in a multitude of studies owing to their articulation with the pelvis, higher probability of preservation, and function in relation to bipedal motion [10][11][12][13][14].
Recent advances in biological sex identification techniques include the genetic determination of sex using DNA extracted from skeletal remains [1]. Although molecular methods are more accurate, this approach is not free of technical limitations, as it is time-consuming and requires expensive equipment. Moreover, the efficiency of high-quality DNA retrieval from such methods may be affected by environmental contamination or the taphonomic status of bones [1,2].
Osteological-based sex estimation methods have been scrutinized to achieve standardization of methodologies and provide high levels of certainty when evaluating biological sex for legal decision making [20][21][22][23][24]. In addition to the repeatability, greater objectivity, and lower cognitive bias afforded by osteometric analysis of specimens, the most important factor inherent to this method's validity is the classification accuracy. In some jurisdictions, the Daubert standard provides a set of criteria regarding the admissibility of expert witness testimony that includes the general acceptance of a method and demonstration of an estimable error rate [1,2,22]. In practice, methods of sex estimation are developing, and practitioners use the most accurate methodology available depending on the analyzed specimen and status of preservation [1,[18][19][20][21].
Many studies recommend that the development of sex estimation methods should be based upon (1) a collection of a representative population dataset for model training and validation, (2) availability of biologically dimorphic skeletal traits, (3) satisfying the prerequisite statistical assumptions of the classification algorithms, and (4) presenting the probability of the concluded sex to reflect the uncertainty of diagnosis [3,4,9,10,17,18]. From a probabilistic standpoint, qualitative and quantitative sexing methods never reach the desirable 100% accuracy level [3,21,22]. When these markers are used in a binary decision system, each individual with a posterior probability (pp) greater than 0.5 of belonging to a specific group (in our case, sex) is classified in that group. Such sharp distinctions are not supported in the human skeleton, because skeletal measurements for each sex can have any figure in a spectrum of data [2]. This results in fluctuations in the rate of classification errors, as well as in the degree of SSD among the variables and skeletal elements other than the pelvis. Moreover, the level of activity and the population affinity are recognizable Biology 2022, 11, 917 3 of 18 factors that influence the expression and pattern of sexual dimorphism [4]. Individuals on the extreme ends of phenotypic expression exhibit-on average-variable overall body size, producing an even larger amount of overlap between sexes [4], leading to a lower frequency of correctly classified individuals around the traditional dichotomizing point [10,16].
Several researchers investigated the pelvic and extra pelvic bones and tried to increase their accuracy by moving the posterior probability threshold for sex diagnosis up to 0.95, which resembles a magnitude of certitude that the given individual belongs to the given class [3,9,12,[15][16][17][18]. For results based on good statistical evidence, they identified an indeterminate sex interval in which the posterior probability for male and female diagnosis overlaps. Applying such strict fixed criteria in practice does not guarantee that all models will reach this particular level of accuracy [11] due to the unavailability of analyzable skeletal specimens recognized for their high performance, i.e., pelvis. Consequently, some classification models may be significantly constrained due to their insufficient accuracy levels and/or the limited number of features in the classified specimens [3,9,12,13,[15][16][17][18].
The selection of a posterior probability threshold slightly lower than 0.95 may be beneficial as proposed by several authors. Different studies have proposed that it must not be below 0.75 [8] or 0.80 [7,10,16,19]. The authors of these studies concluded that the threshold value to allocate sex should be cautiously tuned considering the posterior probability distributions and misclassification rates at each probability threshold. If the estimated posterior probabilities of individuals fall below these limits (i.e., <0.8 pp to >0.5), the threshold should be higher than this interval to maintain reasonable allocation accuracy [7]. In the case of higher posterior probabilities, the threshold might be lowered to increase the proportion of classified individuals without reducing the overall accuracy rate, albeit with unspecified error rates [7,8,25].
Jerković et al. [11,25] proposed the adjusted binary classification (ABC) method to investigate the individual posterior probability using any quantitative skeletal traits at which a particular (in presented cases 95%) level of global accuracy of sex estimates is achieved. This was accomplished by an algorithm written in the programming language R that excludes those individuals whose measurement values and/or prediction scores would classify them incorrectly (as the opposite sex). The results of their study also implied that the interval of undetermined individuals could be adjusted, such that the model could estimate the sex of a larger proportion of given specimens than a priori raising posterior probability levels to a fixed threshold while minimizing fluctuations in the model's performance [11].
In this work, we are motivated by three related goals considered in the evaluation of statistical modeling and machine learning (ML) algorithms using the ABC approach: (1) comparing the predictive performance of different yet commonly used models at the traditional decision threshold and ABC-selected thresholds, (2) investigating the possibility of increasing their classification performance through the selection of the optimal sexing model and detection of the stability of performance after applying the ABC method on the holdout sample, and (3) identifying the ML algorithm that is best-suited for small training datasets. The outcomes of interest used to evaluate the possibilities and restrictions of the algorithms were the proportion of classified individuals (number of cases legible to classification) and each one's overall accuracy, as well as cross-validated training data and test samples for each group.

Data Splitting
The initial sample (n = 300) was randomly split into a training and a testing subset with an equal proportion of male and female individuals. Accordingly, 80% of these data (n = 240) were assigned to the training set (120 males and 120 females), while the other 20% (n = 60) were assigned to the testing set (30 males and 30 females).

Validation Method
The discriminatory power of each variable and multivariable model was evaluated by cross-validation tests. We used the leave-group-out or Monte Carlo cross-validation algorithm (LGOCV or MCOCV). In this procedure, the initial training sample is randomly split into n calibration and validation subsamples. The calibration set is used for model development, while the validation sample is used to test the model's performance on data that were not used to develop the model [29]. We used 70% of the data for the calibration set, while 30% were used for validation. This procedure was iterated 50 times for each model, and classification performance metrics were calculated as the average of all iterations. This cross-validation technique was selected because it tends to produce estimates of the error rate with smaller variance than the leave-one-out methodology [30].

Sexual Dimorphism
We calculated basic descriptive statistical parameters for all variables for males and females, including the mean, standard deviation, and range. To examine sexual dimorphism, we used an independent-sample t-test. In the last step, we estimated the overlap between male and female measurements through the overlap in their kernel density estimates. The overlap was expressed using an index that indicates the percentage overlap between estimated kernel densities [38].

Feature Selection Univariate Feature Selection
From 13 variables, we selected the five best variables by ranking the features according to their importance using the varImp function on a cross-validated dataset. The variable importance was assessed by a receiver operating characteristic (ROC) curve analysis that was conducted for each predictor. In this procedure, different cutoffs were used to classify specimens into a group using the selected variable. In the next step, the algorithm estimates the sensitivity and specificity for each cutoff and computes the area under the ROC curve using the trapezoidal rule [31,41].

Multivariate Feature Selection
We selected the optimal set of features for the multivariate models on cross-validated data using the recursive feature elimination (RFE) algorithm from the mlbench package [37]. The RFE algorithm recursively removes the predictors and constructs classification mod-Biology 2022, 11, 917 5 of 18 els using the remaining variables. By considering classification accuracy, the algorithm estimates the optimal number of variables and the set of variables that most contribute to the classification [40,41]. In the present study, RFE was conducted for each classifier, considering all 13 variables and their combinations.

Fitting the Classification Models
We developed univariate sex classification models for five variables that showed the greatest importance according to ROC curve analysis and one multivariate model that included variables selected by the RFE algorithm. To develop models, we employed the train function inside the caret package and selected the methods from a list of available models.
LGOCV was incorporated with the trainControl function [31]. We generated posterior probabilities of class membership for each model, for both cross-validated and test datasets. In the present study, we used four linear classification techniques: logistic regression (LR), linear discriminant analysis (LDA), boosted generalized linear model (GLMB), and support vector machine with a linear kernel (SVM). LR and LDA were selected as they are the most common algorithms for sex/ancestry classification in biological and forensic anthropology [4,8,9], while SVM was chosen due to it being a broadly used ML model that also proved to be efficient in osteometric sex classification [42,43]. GLMB was employed as an adopted technique for increasing the accuracy of standard linear models [44].

Logistic Regression
Logistic regression examines the relationship of predictive variables with a binary outcome. It develops a linear model that predicts a transformation of the outcome variablethe logit function. The logit function, which is the natural log of the odds that a specimen belongs to one of the classes, is then used to calculate the posterior probability that the specimen belongs to the first or second class. Classification is usually conducted by assigning a specimen to a group with the highest probability [9,45]. We used the glm method (family = binomial) with no tuning parameters to develop the LR model.

Linear Discriminant Analysis
Linear discriminant analysis explores a linear combination of variables and creates a discriminant function that can discriminate mutually exclusive groups [46,47]. Although the LDA is today mainly considered a dimensionality reduction technique, it works efficiently as a linear classifier [47]. In the latter case, LDA uses a discriminant function score and classifies specimens into the group whose centroid is the closest to the score [46,47]. Posterior probabilities are calculated by the Bayes theorem. We used the LDA method for classification from package MASS (with no tuning parameters) to develop the LDA model [48].

Boosted Generalized Linear Model
The boosted generalized linear model is an adapted standard linear model designed to reduce bias using a boosting algorithm. This modeling technique trains data using the best turning parameter (mstop) in the validation sample and conducts variable selection. If early stopping is used, the effects can be shrunken toward zero. For each boosting iteration, the algorithm fits linear models for each column of the design matrix to the negative gradient vector utilizing the only best-fitting model in the update step [49][50][51]. To develop the GLMB model, we used the glmboost method for classification or regression using packages plyr and mboost with tuning parameters (number of boosting iterations and pruning). Class posterior probabilities were calculated by minimizing the negative binomial log-likelihood [51]. We used default settings for tuning, with no pruning, and the number of boosting iterations (mstop) was set at 50, 100, and 150.

Support Vector Machine with Linear Kernel
Support vector machine is also a method from the family generalized linear models for prediction and classification using a linear combination of variables [52]. In the case of classification, this algorithm searches for a hyperplane that best separates two classes by considering the "sum of the distances from the hyperplane to the closest positive and negative correctly classified samples" [46]. If the hyperplane can be located in the original data instead of higher-dimensional space, we implement an SVM with a linear kernel [52]. To develop the SVM model, we used the method svmLinear that performs classification or regression using package kernlab with tuning parameters (cost, C). The posterior probabilities were obtained using the modified Platt's method [9,53,54]. We used default settings for tuning, with parameter "C" held constant at a value of 1.

Classification Performance Metrics
In binary classification, there are only two classes, which are usually referred to as positive and negative [2,6,14]. This allows for two cases of misclassification: false negative (predicting negative when the actual class is positive) and false positive (predicting positive when the actual class is negative), and both are of equal importance when dealing with forensic contexts [2,13,14]. To evaluate classification models, we constructed confusion matrices and calculated the number of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). The male sex was marked as positive (P), while the female sex was labeled as a negative class (N). Using these parameters, we calculated the basic evaluation metrics in a cross-validated and testing set: accuracy, sensitivity, specificity, positive predicted value (PPV), and negative predicted value (NPV).
Additionally, we calculated the concordance index or c-index, which measures the discriminatory power of the model [55]. This score shows whether the model will provide a higher probability for the actual group in which a specimen belongs, among the randomly chosen pair of specimens, one belonging to the first and one to the second group. The values of the c-index range from 0.5 to 1, where a lower value indicates a model with no discriminatory ability, and the highest value shows perfect discrimination [15,16]. For each classification model, we also calculated the overlap degree between posterior probabilities of male and female specimens, with the same approach described in Section 2.4.1.

Adjusted Binary Classification (ABC) Algorithm
The adjusted binary classification algorithm was conducted to find posterior probability cutoffs that exclude specimens in the overlapping area and provide a classification accuracy, PPV, and NPV of at least 95% [25]. Using the R-code based on the algorithm from the study by Jerković et al. (2021), we calculated posterior probability thresholds for each classification model in the cross-validated dataset and computed the proportion of the specimens that the model could classify [25]. We used developed thresholds to classify the sex of specimens in the testing dataset, and we calculated the proportion of classifiable specimens and the classification metrics.

Assessing the Sample Size Effect
To examine the influence of the training set size on the classification performance in the independent (test) sample, we randomly generated different sample sizes and used them to develop classification models. The sample size conditions specified in this study included 19 samples, ranging from 10% (n = 24) to 100% (n = 240) of the original training sample, with an increment of 5%. For each sample size and classifier, we used the RFE algorithm to obtain the optimal number and combination of variables. Table 1 presents the descriptive statistics for each metric variable, overlap index, and t-test results in the training dataset (n = 240). Statistically significant differences were found between males and females in all variables (p < 0.001). The overlap index was highest for the diaphyseal width measurements MTD and MLD, moderate for the length measurements FML, FBL, and FTL, and low for the epiphyseal measurements. The most sexually dimorphic variable was FEB, followed by VHD, FVDN, and FNAL.  Table 2 depicts the average values of the overall accuracies, sensitivity, specificity, PPV, and NPV derived from the cross-validation analysis (LGOCV) for the LR, LDA, GLMB, and SVM models using the 0.5 posterior probability cutoff. The performance of the four modeling techniques was consistently comparable with differences in the overall accuracy rate by the decimal points. For the univariate models, FVND yielded the highest overall accuracies between 86.33% and 86.50%. FEB and VHD achieved accuracies between 84.36% and 84.89%, FBCB achieved accuracies between 83% and 83.94%, and FNAL achieved accuracies between 82.19% and 82.25%. The multivariate models yielded the highest accuracies (89.17-90.08%).

Classification Performances for Traditional pp Cutoff (0.5) in LGOCV Sample
None of the models reached 95% global accuracy or 95% PPV and NPV. According to the C-index, the discriminant power of the classifiers was perfect (greater than 0.90) for all the variables and classifiers. The variables selected as the best predictors for the four classification methods are presented in the footnote of Table 2. Table 3 shows the posterior probability cutoff values for males and females computed with the ABC algorithm to classify sex with the PPV/NPV ≥95% criteria. In univariate models, these values ranged from 0.625 to 0.930 (for the FEB and FBCB) in males and from 0.806 to 0.999 (for FBCB and FVND) in females, allowing us to overall classify between 28% and 87% of individuals. In the multivariate models, pp thresholds were remarkably lower and ranged from 0.639 to 0.734 (for the GLMB and LR) in males and from 0.778 to 0.867 (for LR and LDA) in females. The proportion of classifiable individuals increased to 81.94-85.58% with a small sexing bias between 0.7% for LR and 7.9% for GLMB.

Application of the ABC Approach in the Training Sample
The accuracy was above 95% independent of the considered variables and ML algorithm, but the female classification accuracy (specificity) showed more variability across the different femoral measurements and algorithms (Table 3). In contrast to univariate models, all multivariate models achieved stable classification results (around 95%) for PPV and NPV, as well as for sensitivity and specificity. Close-to-perfect c-index values (ranging between 0.975 to 0.985) were observed for all models across all studied variables in the simple and multiple variable equations, except for the VHD and FVND.  The classifiers produced different subsets of best predictors in terms of the number and type, but three features were common to all the selected attribute sets (FVDN, FEB, and FNAL). LR led to accurate results with much fewer variables than the other three models with greater group separation, i.e., more individuals were classified closer to 0 Biology 2022, 11, 917 9 of 18 and 1 probabilities. Both LDA and SVM produced the same subset of features. The GLMB algorithm showed the highest discriminatory performance in the training data for the univariate models after applying the ABC approach. More detailed performance indicators analyzed by the posterior probabilities are shown in Supplementary Figure S1. Table 4 reveals that all models produced stable comparable results in the test sample. All the algorithms demonstrated outstanding performance for this dataset using the c-index. For the multivariate model, LDA and SVM had the highest values with near-perfect c-index values of 0.990 and 0.991, respectively. The PPVs and NPVs peaked above 95% in almost all models, but the FBCB was just below this level. The number of individuals classified by all the univariate models was identical in all variables except for VHD in LDA, FVND in GLMB, and FNAL in LR and SVM. The proportion of classified individuals ranged from 36.66% to 90% in males and 0% to 93.33% in females. The highest proportion of classified individuals in both sexes was obtained using the multivariate function. In terms of model performance judged from the proportion of classified individuals, GLMB and SVM ranked the highest amongst the classifiers tested. The differences added by SVM were trivial in comparison to GLMB with regard to the other performance metrics. For example, SVM increased the proportion of classified males using the FNAL model by only 3.33%, corresponding to the only misclassified male as female, but this reduced the sensitivity and NPV to 92.3% and 95.24%, respectively. On the basis of the performance parameters mentioned above, the best classifier was GLMB, and the best single variable was FNAL (proximal end of the femur) followed by FBCB (distal end of the femur), as well as the multivariate model, which included a different set of variables according to the type of classifier. Case-wise classification results are visible in Supplementary Figure S2. Figure 1 demonstrates a moderate to high correlation (0.489 to 0.708) of the endpoint parameters (the pp level and the number of classifiable individuals) with the sample size. A negative correlation between the pp thresholds and the sample size was observed only in LR, meaning that increasing the sample size was associated with lowering the pp thresholds and increasing the number of classifiable individuals. Selecting a pp threshold was not shown to be dependent upon the sample size for LDA and SVM, but increasing the sample size was positively correlated with the number of classifiable individuals in the female group and, subsequently, the overall sample. GLMB was the only classier that showed stability in the number of classifiable females regardless of the sample size.  Figure 1 demonstrates a moderate to high correlation (0.489 to 0.7 parameters (the pp level and the number of classifiable individuals) w A negative correlation between the pp thresholds and the sample size in LR, meaning that increasing the sample size was associated wi thresholds and increasing the number of classifiable individuals. Selec was not shown to be dependent upon the sample size for LDA and S the sample size was positively correlated with the number of classif the female group and, subsequently, the overall sample. GLMB was t showed stability in the number of classifiable females regardless of th  Figure 2 presents the proportion of classified individuals in th models were constructed using different training sample sizes. The im small to larger sample sizes was the greatest on LR for both sex classifiers, the male group was the most affected, particularly with sam 84. Under a small sample size (n < 100), LR performed worse than LD the training sample (i.e., n = 48, 60, and 72, respectively). SVM and GLM results for the classifiable proportions (males, females, and overall), b slightly worse when the sample size was >100. Even though the LR pe the other algorithms with small datasets, as the number of cases inc showed a substantial performance improvement.  Figure 2 presents the proportion of classified individuals in the test sample when models were constructed using different training sample sizes. The impact of going from small to larger sample sizes was the greatest on LR for both sex groups. For other classifiers, the male group was the most affected, particularly with sample sizes of 96 and 84. Under a small sample size (n < 100), LR performed worse than LDA at 20% to 30% of the training sample (i.e., n = 48, 60, and 72, respectively). SVM and GLMB had comparable results for the classifiable proportions (males, females, and overall), but SVM performed slightly worse when the sample size was >100. Even though the LR performed worse than the other algorithms with small datasets, as the number of cases increased to >150, LR showed a substantial performance improvement. In Figure 3, we can observe a fluctuation in the average performance parameters of the various models as a function of sample size, particularly the LR model. The impact of sample size was consistent across the LDA and SVM models. The overall classification error was 10% to 100% for models with fewer than 100 specimens. It decreased to ~2% and 5% as the sample size increased to or above 150, with a sharp rise in the total and sexspecific number of classified individuals and stabilized performance measures (above 95%). The negligible differences were observed between the models when the sample size increased above 150. A sample size between 100 and 150 data points or more was convenient to reach the required level of PPV and NPV for the tested models.

Sample Size Effect
While the proportion of classified individuals increased, PPVs and NPVs showed more considerable variations, but all above the 95% cutoff limit. The exception occurred for SVM in 80% and 90% of the training sample (n = 192 and 216), where the accuracy, sensitivity, and NPV were <95%, although SVM achieved the highest rate of classifiable individuals at 80% of the training size. PPVs and NPVs for SVM reached 95% at a sample size of 10% to 15%, i.e., below 50 data points; however, beyond this threshold, PPV again decreased to 92%. A decrease in both NPV and total accuracy was evident in SVM due to the lowest sensitivity rates obtained with these sample sizes. Both GLMB and LDA provided similar accuracy rates under varying sample sizes. GLMB slightly outperformed LDA overall, notably in examples with sample sizes <100 (Figure 3). GLMB was superior to other algorithms for any volume of training data. Its accuracy started at 95% for 10% of the data and peaked at above 98% when the training data increased to the whole training dataset. None of the learning models other than GLMB could achieve acceptable results when using 35% of the dataset (n = 84), and they could not classify any of the males in the test sample. Supplementary material, Spreadsheet demonstrates the performance indicators at each sample size of the training data using the different classification techniques. Figure 2. Influence of the training sample size on the proportion of classified specimens in the testing sample. Several sample sizes of the training dataset are plotted against the proportion of classified individuals (overall and sex specific rates) in the test dataset (n = 60) using different ML models; each algorithm has its own panel.
In Figure 3, we can observe a fluctuation in the average performance parameters of the various models as a function of sample size, particularly the LR model. The impact of sample size was consistent across the LDA and SVM models. The overall classification error was 10% to 100% for models with fewer than 100 specimens. It decreased to~2% and 5% as the sample size increased to or above 150, with a sharp rise in the total and sex-specific number of classified individuals and stabilized performance measures (above 95%). The negligible differences were observed between the models when the sample size increased above 150. A sample size between 100 and 150 data points or more was convenient to reach the required level of PPV and NPV for the tested models.
While the proportion of classified individuals increased, PPVs and NPVs showed more considerable variations, but all above the 95% cutoff limit. The exception occurred for SVM in 80% and 90% of the training sample (n = 192 and 216), where the accuracy, sensitivity, and NPV were <95%, although SVM achieved the highest rate of classifiable individuals at 80% of the training size. PPVs and NPVs for SVM reached 95% at a sample size of 10% to 15%, i.e., below 50 data points; however, beyond this threshold, PPV again decreased to 92%. A decrease in both NPV and total accuracy was evident in SVM due to the lowest sensitivity rates obtained with these sample sizes. Both GLMB and LDA provided similar accuracy rates under varying sample sizes. GLMB slightly outperformed LDA overall, notably in examples with sample sizes <100 (Figure 3). GLMB was superior to other algorithms for any volume of training data. Its accuracy started at 95% for 10% of the data and peaked at above 98% when the training data increased to the whole training dataset. None of the learning models other than GLMB could achieve acceptable results when using 35% of the dataset (n = 84), and they could not classify any of the males in the test sample. Supplementary Material, Spreadsheet demonstrates the performance indicators at each sample size of the training data using the different classification techniques. Biology 2022, 11, x FOR PEER REVIEW 13 of 19 Figure 3. Influence of the training sample size on the classification performance parameters in the testing sample (n = 60). It should be noted that multivariate models were computed using the variable selection procedure. The interrupted line delineates the desired level of 95%.

Discussion
The presented results showed that the ABC approach is a reliable procedure to estimate sex using femoral measurements in the analyzed population sample with a very high accuracy level, PPV, and NPV (>95%). The overall classification models' performance was generally not dependent on the type of the ML algorithm applied when the full training sample size was employed. Sex bias was noted in the proportion of classified specimens, as well as in the classification performances in univariate models. When we applied multivariate models, the named bias was minimized, and sex could be estimated in more than 80% of individuals in cross-validation and 90% in the testing sample, with accuracy greater than 95%. The results demonstrate that performance using different machine learning methods within the ABC framework can be differently affected by the size of the training sample, which should be considered when developing osteometric sex classification standards. Forensic anthropologists could, therefore, apply the ABC method to construct sex estimation standards when the level of accuracy obtained by traditional thresholding is not sufficiently high for classification in forensic settings.

Sex Estimation Using the Default Cutoff Threshold of 0.5
All 13 femoral variables showed pronounced and statistically significant sexual dimorphism with more marked expression in joint surfaces than the shaft measurements, which is also a common observation in similar studies [10][11][12]14,15]. From these initial set of 13 variables selected, only five features were included in the univariate trait analysis due to their consistently high sexual dimorphism indicators, as demonstrated by the ROC curve. These variables showed a degree of accuracy in the CV sample of 82.3-87.4% for univariate and 91.6-92.9% for multivariate models. None of them met the accuracy criteria set by the present study (95%), which was the case in the most previous studies that achieved accuracies between 60% to 87.5% in the univariate models and from 84% to 92.5% in the multivariate models [10][11][12][13][14]26], even after employing ML methods [13,14].
The current study detected almost no or only slight differences (from 0.06% to 0.94%) between the ML classifiers (LR, LDA, GLMB, and SVM) when the total training sample with an equal sex ratio was analyzed at the 0.5 decision threshold. Similar findings were Figure 3. Influence of the training sample size on the classification performance parameters in the testing sample (n = 60). It should be noted that multivariate models were computed using the variable selection procedure. The interrupted line delineates the desired level of 95%.

Discussion
The presented results showed that the ABC approach is a reliable procedure to estimate sex using femoral measurements in the analyzed population sample with a very high accuracy level, PPV, and NPV (>95%). The overall classification models' performance was generally not dependent on the type of the ML algorithm applied when the full training sample size was employed. Sex bias was noted in the proportion of classified specimens, as well as in the classification performances in univariate models. When we applied multivariate models, the named bias was minimized, and sex could be estimated in more than 80% of individuals in cross-validation and 90% in the testing sample, with accuracy greater than 95%. The results demonstrate that performance using different machine learning methods within the ABC framework can be differently affected by the size of the training sample, which should be considered when developing osteometric sex classification standards. Forensic anthropologists could, therefore, apply the ABC method to construct sex estimation standards when the level of accuracy obtained by traditional thresholding is not sufficiently high for classification in forensic settings.

Sex Estimation Using the Default Cutoff Threshold of 0.5
All 13 femoral variables showed pronounced and statistically significant sexual dimorphism with more marked expression in joint surfaces than the shaft measurements, which is also a common observation in similar studies [10][11][12]14,15]. From these initial set of 13 variables selected, only five features were included in the univariate trait analysis due to their consistently high sexual dimorphism indicators, as demonstrated by the ROC curve. These variables showed a degree of accuracy in the CV sample of 82.3-87.4% for univariate and 91.6-92.9% for multivariate models. None of them met the accuracy criteria set by the present study (95%), which was the case in the most previous studies that achieved accuracies between 60% to 87.5% in the univariate models and from 84% to 92.5% in the multivariate models [10][11][12][13][14]26], even after employing ML methods [13,14].
The current study detected almost no or only slight differences (from 0.06% to 0.94%) between the ML classifiers (LR, LDA, GLMB, and SVM) when the total training sample with an equal sex ratio was analyzed at the 0.5 decision threshold. Similar findings were also noted previously by several studies [14,42,43]. For example, Curate et al. [14] did not detect significant differences in accuracy when estimating sex using femora with four classifiers LR, LDA, SVM, and reduced error pruning tree. Toneva et al. [43] showed a similar performance in sex estimation on cranial measurements when using LR and SVM, while Nikita and Nikitas [42] also confirmed that both LDA and SVM linear algorithms could be considered efficient when dealing with sex/ancestry classification using continuous variables.

Sex Estimation Using ABC Approach
We used the posterior probability scale with two different classification rules-one for each sex that accounts for the "zone of uncertainty" in each variable. From the initial accuracy rates that did not exceed 87% for univariate and 90% for multivariate models, we managed to raise accuracies to 95% or higher by applying the ABC algorithm. Such accuracy was achieved by all models in the CV sample, while only one univariate model (FEB) failed to meet that level. Losses in terms of the overall proportion of specimens left "indeterminate" ranges for univariate models between 24.8% and 72% in CV and 18.3% and 70% in testing sample. They were remarkably lower in the multivariate models and ranged from 13.4% and 19.3% in CV and 8.3% and 10% in the testing dataset. The variations in applied ML classifiers in the training and test samples were also minor, in terms of both the classification performance and the proportion of classified specimens. The overall results concurred with the original proposal of the ABC method, whereby, using LDA to classify sex from handprint measurements, the authors obtained a proportion of unclassified specimens of 23-71% for univariate and 12-13% for multivariate models with accuracies above 95% [25]. In contrast to the examined ABC approach, other studies that raised the pp thresholds (0.80, 0.90, and 0.95) to embrace the concept of the trichotomic approach left a large number of individuals unclassified [3,9,12,15,18,25].
The unclassified individuals in test samples of univariate and multivariate models ranged between 53.5-85.7% and 39.1-50.9% for the humerus [15], between 60.3-83.3% and 37.3-66.7% for the femur [12], and between 70% and 93% for the multivariate cranial models [9] at pp > 0.95 depending on the available skeletal element, the type and number of variables in the models, and the population sample under study.

The Training Sample Size Effect on the Performance of Multivariable Models
The number of observations in the training set and the classification rates were correlated, agreeing with the understanding that the models' generalizability improves with the feeding of more data points [56,57]. Unlike the almost equal performance of different ML classifiers in the full training sample size, remarkable differences were observed when we simulated different sample sizes of the training datasets to generate multivariate models. The most robust model for limited metric data was GLMB, followed by LDA and then SVM, while the least robust model was LR. In cases of a small training sample size, the calculated pp for both males and females was very high, approximately 1.0 (Supplemental Materials, Spreadsheet of the performance indicators at each sample size of the training data using the different classification techniques).
Our findings suggest that LR may not be the best method for developing sex classification standards with limited sample sizes, and that LDA could be a more robust classifier with small samples, in agreement with the findings of previous studies [58,59]. The performance of LR can be explained by its parameters calculated using the iterative maximum likelihood estimation (MLE), which requires a large sample size to obtain estimates. The performance instability of the LR with small sample sizes can be attributed to the absence of MLE of one or more coefficients of the explanatory variables which take values of plus or minus infinity, because of the "separation effect" associated with sampling artefacts, as well as the severely unequal rates of classifiable cases after applying the ABC approach [60].
Therefore, the study did not find elements to support the suggestions of Bartholdy et al. [16] who recommended the use of LR over LDA due to less stringent model assumptions and the potential to provide probabilities instead of dichotomous results. Moreover, LR and LDA methods only differ in the way they estimate the coefficients, but the same formula can be used to derive the pp for both LR and LDA models when the values of α (constant) and β (coefficient) are known [18,59]. With regard to the other ML methods, the complexity underlying their computation hinder the application of this formula [42]. The core design of the current study differed from that presented by Bartholdy et al. [16] in several experimental factors such as (1) the stratification of sample size from the small to the moderately large sample, (2) the higher number of features analyzed, and (3) the proportion of classified individuals recorded and compared at different cutoffs on the probability scale. Although their results indicated that both classifiers produced consistent accuracies, they concluded that LR is better than LDA and supported their conclusions with studies based on different data types (i.e., nonmetric morphological data with ordinal scores) which often violate the normal distribution assumption.
Among nonstandard algorithms, GLMB showed the most stable performance, almost independently of the sample size and when there were more variables than observations. This observation was also described by Tutz and Binder [44]. Considering GLMB's simplicity and performance, the traditional LDA does not rank much lower than GLMB and SVM [42]. SVM could be a practical ML alternative to LDA when an adequate sample size is available because the SVM method requires tuning of a single hyperparameter [13,42]. These results imply that the ABC approach could also be applicable with a relatively small sample size training set, which could be helpful not only in forensics, but also in bioarcheological contexts.

Study Limitations and Future Directions
Although the ABC approach showed overall excellent performance in the presented osteometric sex estimation case, we consider several shortfalls. When there are substantial differences between the posterior probability distributions in the training and testing datasets, the method's performance can be hampered despite fulfilling the equal prior assumptions in both datasets [21]. This can be illustrated with the case of SVM at the sample size (n = 192), where we achieved the highest number of classifiable individuals at their calculated pp thresholds, but with accuracy, sensitivity, and NPV below 95% (see also Supplementary Materials, Spreadsheet of the performance indicators at each sample size of the training data using the different classification techniques).
Comparison of the magnitude of sexual dimorphism among skeletal traits or worldwide populations has to consider possible differences in degrees of closeness to the opposite sex distribution, either because of different distances between mean values or, even in the case of equal distances, because of different extents to and densities of the intrasexual variability [61]. The overall performance of classifiers on the unseen data depends on the extent to which a test dataset represents the original distribution rather than its size because the ML model is constructed to best describe the distribution and structure pattern in the training data [57]. Differences in the probability distribution (covariate shift) [13,62] between the training and test samples should also be considered because it may lead to deterioration of the model performance on the independent population sample as the model is not pretrained on the degree of overlap between distributions of males and females in the new test sample [13].
One of the potential drawbacks is that the ABC adjustment of pp thresholds for males and females can be imbalanced in univariate models, leading to considerably different rates of classifiable individuals per sex (low or high). If the classified groups are extremely unequal in size, the misclassification rate for the smaller ones will be very high regardless of the classification method. The PPVs and NPVs were, thus, lower in models with lower classifiable instances rates and could increase more rapidly with increasing specificity than with increasing sensitivity. A greater overlap between groups results in a larger disparity in the pp thresholds with unequal rates of classifiable individuals, and a lower PPV and NPV can be expected [9,18,63,64]. However, the most important finding is that those "imbalances" in the number of classifiable individuals can be almost ruled out by considering multiple predictors [25].
We also consider that we employed the default tuning parameters because controversy still prevails regarding the effects of different conditions of hyperparameter tuning on classifier performance, as well as the different methods of variable selection [42]. The "statistical opportunism" effect can lead to overfitting, which is problematic because the same variables may not be selected uniformly by each classification technique and/or sample size, even those with the strongest sexual dimorphism [9]. Previous studies with different skeletal elements datasets (cranial and/or postcranial bones) did not agree on the classifier that ranked best in all the classification applications such as sex and ancestry estimation, e.g., the studies by Nikita and Nikitas [42] and Santos et al. [9]. Future research should focus on the difference between the ML techniques in calculating the posterior probabilities of data when there is an extreme subclass imbalance with different priors for forensic and archeological applications.
Lastly, it should be emphasized that the application of the ML methods to CT images by considering the manually taken linear measurements does not take full advantage of available computation methods. For example, modern approaches employ deep learning techniques and convolutional neural networks that can be adjusted to directly estimate sex from images without taking measurements or placing landmarks [64,65]. Thus, principles contained in the ABC approach can also be tested in more advanced contexts to additionally improve classification accuracy when required.

Conclusions
Compared to the traditional classification approach using fixed pp thresholds, i.e., 0.5 and 0.95, the ABC achieves remarkably higher accuracy in a relatively large proportion of classifiable individuals while adjusting the precision of the classification at a 95% level. The ABC approach is an "uncertainty-aware framework", allowing customized posterior probability computations to find an acceptable classification performance across different sample sizes while controlling for PPV and NPV, but not the sensitivity and specificity rates.
The results presented here reveal that, in sex classification, dataset size is not necessarily an obstacle to compute a high-performing model since the average performance of classifiers reached 100% on some small datasets and the generalizability of the model on test data depended on accurate estimates of the moments of the training sample distribution. We also studied the pp distribution of each variable, showing their effect on the performance of different traits regardless of the model used.
The efficiency and reliability of the ABC approach for estimating sex were most apparent when (1) the sample size was large enough with n > 100 (>150 for LR only), (2) the variables were sexually dimorphic with minimal overlap between male and female distributions, (3) the posterior probability cutoffs for each sex were approximately balanced, and (4) the multivariate models were used to overcome the imbalances in the classifiable proportions of individuals.
Supplementary Materials: The following supporting information can be downloaded at https: //www.mdpi.com/article/10.3390/biology11060917/s1: Spreadsheet of the performance indicators at each sample size of the training data using the different classification techniques. Figure S1, Comparison between the estimated posterior classification probability densities of the two sex groups in the training set; Figure S2

Informed Consent Statement: Not applicable.
Data Availability Statement: Turkish data were obtained from the PhD thesis of O. Gulhan and are available at Cranfield University, UK (Skeletal sexing standards of human remains in Turkey. Available online: https://dspace.lib.cranfield.ac.uk/bitstream/handle/1826/12272/O%20Gulhan% 20PhD.pdf?sequence=1 (accessed on 24 April 2020). The permission for use was secured from Prof. Andrew Shortland (personal communication) prior to data analysis.