Linear and Non-Linear Regression Methods for the Prediction of Lower Facial Measurements from Upper Facial Measurements

Accurate assessment and prediction of mandible shape are fundamental prerequisites for successful orthognathic surgery. Previous studies have predominantly used linear models to predict lower facial structures from facial landmarks or measurements, but the resulting prediction errors did not meet clinical tolerances. This paper compared non-linear models, namely a Multilayer Perceptron (MLP), a Mixture Density Network (MDN), and a Random Forest (RF) model, with a Linear Regression (LR) model in an attempt to improve prediction accuracy. The models were fitted to a dataset of measurements from 155 subjects. The test-set mean absolute errors (MAEs) for distance-based target features were 2.77 mm, 2.79 mm, 2.95 mm, and 2.91 mm for the MLP, MDN, RF, and LR models, respectively. Similarly, the MAEs for angle-based features were 3.09°, 3.11°, 3.07°, and 3.12°, respectively. All models performed comparably, with the neural-network-based methods producing marginally fewer errors outside of clinical specifications. Therefore, while non-linear methods have the potential to outperform linear models in the prediction of lower facial measurements from upper facial measurements, the current results suggest that further refinement is necessary prior to clinical use.


Introduction
Orthognathic surgery aims to address malocclusion or facial deformities in cases where the midface of a patient is anatomically correct [1]. A successful orthognathic treatment can notably improve a patient's quality of life and facial aesthetics [2,3]. This surgery typically does not entail changing the morphology but rather the relative positions of the maxilla, mandible, and cranium [4]. It is the clinician's responsibility to establish this original dentofacial shape, estimate the final position, and determine the actions required to achieve this position [5]. These assessments are typically based on the clinician's experience, physical models, and established population normative measurements derived from cephalometric analysis. This can be time-intensive and surgeon-dependent [1,5,6]. As a complementary method, patient-specific methods provide a data-driven technique to guide the surgery using the patient's own geometry [1].
A traditional approach to evaluating the harmony of patient geometry involves the harmony box [7]. This method determines normal ranges for critical cephalometric angles, such as SNA, from a many-to-one relationship using multiple cephalometric angles. In this method, the standard errors derived from a linear regression model constructed for a specific population are used to establish a table of floating normative values for that population. A clinician can then compare a patient's measurements to the table and determine if orthognathic treatment is necessary and, if so, which specific treatment should be considered. However, limitations include a small selection of variables, no consideration of multi-dimensional target relations, and a dependence on the singular target feature selection. To address this, Bingmer et al. [8] extended the harmony box approach with a multiple linear regression model. Here, they used a wider range of variables, multiple output variables, and data from a normal population. They demonstrated the viability of using the residual error values to determine the normality of unseen patient geometry. Using this model, if a patient's residual errors are large, it may indicate a non-harmonious relationship requiring orthognathic treatment.
In a broader application focused on the prediction of mandible measurements, Gillingham et al. [9] employed multiple linear regression models to predict missing mandible cephalometric measurements from remaining measurements that span the craniofacial complex. They compared their linear regression models against the mirroring of geometry across the symmetry plane and found that their linear regression model outperformed mirroring in 6 of 16 measurements. Cheng et al. [4] predicted reposition vectors based on cephalometric coordinates using linear ridge regression and a non-linear transformer-based neural network. They concluded that the neural network had better generalisation capability compared to their linear model. Wang et al. [1] developed a shape-based sparse coefficient regression model using cephalometric landmarks to predict mandibular and maxillary structure from midface geometry. These coefficients, based on the 3D locations of the landmarks, were then used to predict mandibular landmarks. Their model demonstrated good qualitative results with better normality than the true input data, suggesting more harmonious outputs. Similarly, Zhu et al. [10] applied both landmark- and measurement-based approaches, using linear regression models to predict landmarks on the maxilla based on data from the mandible and vice versa. They noted that their results did not achieve sufficiently low error metrics for clinical application.
Based on the preceding review of the most relevant literature [1,4,7–10], it appears that limited attention has been given to the prediction of mandible morphology solely from that of the cranium or maxilla, as described by Zhu et al. This may be beneficial for orthognathic applications where the lower facial shape may be inharmonious, or for mandible reconstruction where the mandible may be absent or only partially present. Furthermore, not all studies focused on reproducing cephalometric measurements [1,4,10], which the authors consider necessary to describe mandible shape variation for comparison with the literature. In addition, limited attention has been given to non-linear methods [4] in comparison with linear approaches [1,7–10]. Non-linear models may be able to better represent the poor linear relationship between lower and upper facial measurements [11], warranting further study. Finally, while not directly comparable, current predictions of the mandible from the morphology of the maxilla yield reconstruction accuracies ranging from approximately 2.9 to 7.9 mm [10], which may be insufficient for clinical applications assuming a target value of 2 mm [12].
Therefore, in this paper, regression models were constructed based on clinically relevant cephalometric measurements derived from CT scans of 155 healthy subjects. Specifically, this is the first study to compare multiple linear regression models against non-linear models for the prediction of cephalometric mandible measurements from upper facial measurements. The non-linear models included a random forest, a multi-variate multi-layer perceptron, and a multi-variate mixture density network. Finally, the ability of the models to meet clinical thresholds of 2 mm and 2 degrees was evaluated.
The following is an overview of the current paper's layout. Section 2 discusses the materials and methods. It describes the collection and processing of CT scans into relevant cephalometric measurements, followed by the background theory for the models and hyper-parameter optimisation and their implementation. Section 3 combines the results and their discussion. This section initially presents the hyper-parameter optimisation outputs and their insights. Subsequently, the optimised models are evaluated against each other. The results of the models are discussed in terms of adherence to clinical thresholds, an overall comparison of error metrics, the clinical interpretation of these results, and potential future work. The paper concludes with a summary of the findings and implications in Section 4.

Data Collection and Processing
A total of 627 DICOM (Digital Imaging and Communications in Medicine) format head and neck CT scans were obtained from the HNSCC (Head and Neck Squamous Cell Carcinoma) database [13]. This dataset is a publicly accessible repository, with all individuals anonymised and de-identified. In accordance with international standards and guidelines, ethical clearance for this project was granted by the Health Research Ethics Committee of Stellenbosch University, South Africa (Ethics Reference Number: X23/07/027). This database consists predominantly of older individuals diagnosed with HNSCC. Scans were included based on the following criteria: (1) an in-plane resolution of 0.488 mm and a slice thickness of 1.25 mm; (2) the absence of obvious bone pathology in the maxillofacial region; (3) the presence of all necessary cephalometric landmarks; (4) minimal scan artefacts; and (5) the presence of contrast agents. Notably, to allow for a larger sample size, the current work did not exclude subjects based on malocclusion class, dentition, or sex. The final dataset's demographics are summarised in Table 1. A file listing the unique identifiers of the selected subjects is available in the Supplementary Materials. All scans were processed using the open-source software 3D Slicer 5.2.2 [14] (https://www.slicer.org/). Scans were segmented by applying a density threshold to isolate osseous structures, followed by manual removal of scanning artefacts such as streaking. Targeted and global Gaussian smoothing filters were then applied to reduce the voxelated appearance and remove minor discontinuities before exporting the resultant masks to face-vertex meshes in STL format. Following segmentation, the STL files were imported into MATLAB, where a previously developed in-house script was used to place 3D landmarks on the mandible, maxilla, and cranium. Landmark definitions were based on those of Bayome et al. and Gillingham et al. [11,15] and are listed in Tables A1 and A2 in Appendix A.
Figure 1 illustrates the placement of the mandible landmarks, and Figure 2 similarly shows the placement of the cranium and maxilla landmarks.

The cephalometric measurements were defined following [9,15]. Additional upper facial measurements incorporating the Porion landmark were also defined. All measurements were extracted from the defined 3D landmark coordinates using a Python script. The mandible measurements are listed in Table 2, and the cranium and maxilla measurements are listed in Table 3. To predict lower facial measurements from upper facial measurements, all available upper facial measurements were used as input features. For the target lower facial measurements, facial symmetry was assumed by disregarding the right-sided measurements and keeping only the left-sided measurements. This resulted in a total of 13 input features (upper facial measurements) and 9 target output features (lower facial measurements), listed in Tables 3 and 2, respectively. To assess landmark placement repeatability, 10 scans were randomly selected for an intra-class correlation (ICC) test with two raters. Each rater annotated the 10 scans (i.e., the subjects) based on the defined landmarks. Consistency was assessed for multiple raters under an assumed two-way mixed-effects model. Accordingly, Equation (1) [16] was used to determine the score, ICC score, for each measurement. In this equation, MSBS represents the mean squared error between subjects, and MSE represents the total mean squared error. The reader is referred to the supporting information provided by Liljequist et al. [17] for further implementation details.
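As a concrete reference for Equation (1), a minimal sketch of the ICC computation is given below. This assumes the single-rater, consistency form of the two-way mixed-effects model (commonly denoted ICC(3,1)); the exact variant used should be checked against Liljequist et al. [17].

```python
import numpy as np

def icc_consistency(Y):
    """ICC(3,1): two-way mixed effects, single rater, consistency.

    Y is an (n_subjects x n_raters) matrix for one measurement,
    with each column holding one rater's annotations.
    """
    n, k = Y.shape
    grand = Y.mean()
    ss_total = np.sum((Y - grand) ** 2)
    ss_bs = k * np.sum((Y.mean(axis=1) - grand) ** 2)   # between subjects
    ss_br = n * np.sum((Y.mean(axis=0) - grand) ** 2)   # between raters
    ms_bs = ss_bs / (n - 1)
    ms_e = (ss_total - ss_bs - ss_br) / ((n - 1) * (k - 1))
    return (ms_bs - ms_e) / (ms_bs + (k - 1) * ms_e)

# Two raters whose annotations differ by a constant offset are perfectly
# consistent, so the consistency ICC equals 1.
ratings = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0], [4.0, 5.0]])
score = icc_consistency(ratings)
```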

The data were randomly split into training, validation, and test datasets with a 70-20-10 split. The training set comprised 108 individuals. The validation set, used for hyper-parameter optimisation and model selection, included 31 individuals. Finally, the test set for the final model comparison contained the remaining 16 individuals. To reduce the impact of the data's magnitude ranges and differing units, the training set was scaled with a min-max scaler, as shown in Equation (2). Here, x ij represents the ith subject's value for the jth measurement type. Similarly, x* ij refers to the same data point scaled with the min-max function. Finally, x j represents all values of the jth measurement.
The training data's scaling factor was also applied to all instances in the validation and test sets.
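The split-and-scale procedure above can be sketched as follows. This is a minimal illustration with random stand-in data, not the study's measurements; with 155 subjects, a 30% hold-out split followed by a 2:1 split of the hold-out reproduces the reported 108/31/16 partition, and the scaler is fitted on the training set only to avoid information leakage.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(155, 13))   # stand-in for the 13 upper facial measurements
y = rng.normal(size=(155, 9))    # stand-in for the 9 lower facial targets

# 70-20-10 split: peel off 30% (validation + test), then split that 2:1.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.30, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=1/3, random_state=0)

# Fit the min-max scaler (Equation (2)) on the training set only, then reuse
# its factors on the validation and test sets.
scaler = MinMaxScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)
```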

Model Construction
All models were implemented in the Python programming language. The Ordinary Least Squares Regression (OLS) and Random Forest (RF) models were each implemented as one model per output measurement using the scikit-learn package [18]. The Multi-Layer Perceptron (MLP) and Mixture Density Network (MDN) were implemented with multiple target outputs using the PyTorch package [19]. The neural networks were constructed to favour training logic simplicity and to determine whether standard implementations are sufficient. Therefore, constant learning rates were used, and no regularisation techniques were implemented. The models were trained for 4000 epochs. The hyper-parameters were optimised using Bayesian optimisation (BO) via the BoTorch implementation from the Ax platform [20,21]. Refer to Section 3.1 for details regarding specific hyper-parameters. The following is a brief overview of the construction and general theory behind the methods used.

Ordinary Least Squares Regression
Ordinary least squares (OLS) regression is a standard technique to determine the linear relationship between a set of input and output features. This technique is commonly applied in estimating mandibular morphology [4,7–10]. Moreover, the OLS model, known for its high bias (linear assumption) but low variance, may underfit complex relationships. However, it also ensures consistent performance across datasets and is robust against overfitting [22]. Therefore, given its widespread use in estimating mandibular morphology and its functionality as a simple regression model, the OLS model serves as a baseline against which the more complex models are compared.
In this project, an OLS model was constructed for each output measurement; therefore, the number of models constructed was equal to the number of output features, n outputs. Each OLS model adhered to the structure described below but predicted a different single output feature.
The OLS estimator, shown in Equation (3), is a set of coefficients that minimises the sum of squared vertical distances between the fitted line and the data points [23]. Note that the equations are adapted for matrix notation. Here, X is the input training data matrix of the general size n subjects × n inputs, X T is the transposed input data matrix, and y is a vector containing the true training dataset target features of size n subjects. For clarity, n subjects refers to the number of subjects in the dataset (training, validation, or testing), and n inputs refers to the number of input features. This convention is followed throughout this paper.
The resultant OLS estimator, β, with size n inputs, can be used along with the test input data, X test, to predict previously unseen target values from the testing dataset, denoted ŷ, as in Equation (4):

ŷ = X test β. (4)

The model inherently assumes a linear relationship, as it predicts the target feature through a linear combination of input features. However, it is possible to extend this technique so that it can capture some non-linear relationships. This can be achieved by applying a dimensional transform to the input data; the linear model is thereafter applied to this transformed space. One such method is the z-space feature transform, as represented in Equation (5) [24]. This transform constructs a new dataset, X′, by adding columns in which each input column of X is raised to successive powers from 1 up to k.
The transformed dataset, now of size n subjects × (k × n inputs + 1), is then similarly used to determine a transformed coefficient vector, β′. This is subsequently applied in Equation (4) to predict new data.
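The OLS estimator and the z-space transform can be sketched in a few lines of NumPy. This is an illustrative reconstruction of Equations (3)-(5), not the authors' code; the quadratic toy relationship is an assumption chosen to show the benefit of the transform.

```python
import numpy as np

def ols_fit(X, y):
    """Closed-form OLS estimator, Equation (3): beta = (X^T X)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def z_transform(X, k):
    """z-space feature transform, Equation (5): append powers 1..k of each
    input column, plus an intercept column of ones."""
    cols = [np.ones((X.shape[0], 1))] + [X ** p for p in range(1, k + 1)]
    return np.hstack(cols)

# Illustrative fit on a quadratic relationship that a plain linear model misses.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(100, 1))
y = 3.0 * X[:, 0] ** 2 + 0.5

beta = ols_fit(z_transform(X, k=2), y)   # [intercept, linear, quadratic]
y_hat = z_transform(X, k=2) @ beta       # Equation (4) on the transformed space
```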

Multi-Layer Perceptron
The multi-layer perceptron (MLP) is a relatively simple neural network architecture. If implemented with non-linear activation functions, it can approximate any continuous function [25], giving it low bias but high variance. Generally, this also makes it prone to overfitting. However, these issues can be mitigated by hyper-parameter optimisation (implicit regularisation) and various explicit regularisation techniques [22]. Despite its popularity across machine learning, the MLP model has seen less frequent usage in estimating mandible cephalometric measurements. Therefore, the MLP is included as a non-linear, representative basic neural network architecture for its potential in bias-variance control and for its relative absence in cephalometric measurement prediction. To this end, a single MLP was implemented to learn from multiple input measurements to predict multiple output measurements.
The MLP generates model outputs based on the provided weights and the input data through a forward pass. For a general case, this forward-pass propagation is shown in Equation (6) [23]. Here, the input measurements are represented by x with size n batch × n inputs, and the estimated output is represented by ŷ with size n batch × n outputs. The provided forward-pass equation is adapted to matrix notation and simplified with biases omitted. In addition, the current implementation assumes a constant neuron count, n neurons, per layer and a batch input of size n batch.
The network consists of an input layer with a weights matrix, W (1), sized n inputs × n neurons. This layer connects to m hidden layers, each characterised by an activation function, Φ, and a weights matrix, W, sized n neurons × n neurons. For regression tasks, the output layer utilises a linear activation function, Φ m, and a weights matrix, W, sized n neurons × n outputs. The activation functions between the hidden layers generally involve non-linear transformations. These functions enable the neural network to learn non-linear relationships. A visual representation of an MLP implementation is provided in Figure 3.
During training, the model evaluates its output using a loss function. In this implementation, the loss function is the root mean squared error, as defined by Equation (7). In this equation, y again represents the true values as per the training or validation dataset.
Using the backpropagation algorithm [26], the loss output is propagated through the neural network to compute the gradients of the loss with respect to the weights, W. The gradient information, along with an optimiser, is used to update the weights of the network. In the present study, Adaptive Moment Estimation (Adam) [27] was chosen as the optimiser. This optimisation process then iterates over the provided dataset in batches to minimise the loss function. Each epoch is regarded as a complete pass over the dataset. Within each epoch, the data are divided into batches. On each pass of a batch through the network, the weights are updated based on the calculated gradients.
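The forward pass of Equation (6) can be illustrated as follows. This sketch uses NumPy with a ReLU activation for clarity rather than the paper's PyTorch implementation, and, as in Equation (6), biases are omitted; the layer sizes and batch size are placeholders.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def mlp_forward(x, weights):
    """Forward pass of Equation (6): hidden layers with a non-linear
    activation, then a linear (identity) output layer; biases omitted."""
    h = x
    for W in weights[:-1]:
        h = relu(h @ W)          # hidden layers: Phi(h W)
    return h @ weights[-1]       # linear output layer

rng = np.random.default_rng(2)
n_inputs, n_neurons, n_outputs = 13, 32, 9
weights = [rng.normal(scale=0.1, size=(n_inputs, n_neurons)),   # W(1), input layer
           rng.normal(scale=0.1, size=(n_neurons, n_neurons)),  # hidden layer
           rng.normal(scale=0.1, size=(n_neurons, n_outputs))]  # output layer

x_batch = rng.normal(size=(4, n_inputs))   # n_batch = 4
y_hat = mlp_forward(x_batch, weights)      # shape (n_batch, n_outputs)
```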

Mixture Density Network
The mixture density network (MDN) [28] has a similar architecture to the MLP; however, the MDN extends the MLP by predicting multiple (n mixtures) distributions per target feature rather than only the expected value. This ability allows the MDN to model multi-modal behaviour [23], which may exist in skull morphology. Furthermore, it uniquely provides a measure of uncertainty (standard deviation) when applied to unseen data. Finally, to the authors' knowledge, no previous study has implemented an MDN for the prediction of cephalometric measurements. Therefore, in this paper, the MDN serves as a representation of a multi-modal extension of a neural-network-based model and as the first implementation of an MDN in estimating cephalometric measurements.
The MDN's feed-forward propagation equation is shown in Equation (8), where the output, θ, comprises means (µ) with size n batch × n outputs × n mixtures, standard deviations (σ) with size n batch × n outputs × n mixtures, and mixing coefficients (π) with size n batch × n mixtures. The network layout is also visually illustrated in Figure 4.
As outlined in Bishop [28], implementing this model requires constraints on the mixing coefficients and standard deviations. The mixing coefficients should sum to 1. This can be accomplished by applying the soft-max function to each mixing coefficient, π k, representing the weighting factor for the kth distribution, as shown in Equation (9). This function converts a set of unnormalised scores into a probability distribution. The second constraint is that the standard deviations must be strictly positive, which can be enforced by applying an exponential function, as in Equation (10). This transforms any real-valued inputs into positive values. Finally, the estimated mean value can be taken as in Equation (11).
In combining these parameters, a probability density function can be formed for K independent and isotropic Gaussian distributions conditioned on means, µ, and isotropic variances, σ 2. By using isotropic variances, the model assumes that the uncertainty is uniform across all output measurements for each component of the mixture. The related probability density function, shown in Equation (12), can be incorporated into the model's loss function. The negative log-likelihood function, Equation (13), is formed by taking the negative log of the probability density function.
In this equation, y n, with size n outputs, represents the true target measurements for the nth subject in the batch. Similarly, x n, with size n inputs, refers to the input measurements for the nth subject. Furthermore, I is the identity matrix. This negative log-likelihood function can then be minimised in the training loop. Similar to the MLP, the loss function and the backpropagation algorithm are employed to propagate the loss backward through the network. The weights were again optimised using the Adam gradient-descent-based optimiser [27], which updates the network weights, W, after each batch pass.
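The MDN output constraints and loss (Equations (9), (10), and (13)) can be sketched as follows. This NumPy illustration assumes a single isotropic σ per mixture component; in practice a log-sum-exp formulation would be preferable for numerical stability.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mdn_nll(alpha, mu, log_sigma, y):
    """Negative log-likelihood of a K-component isotropic Gaussian mixture.

    alpha:     raw mixing scores,     shape (n_batch, K)
    mu:        component means,       shape (n_batch, K, n_outputs)
    log_sigma: raw spread parameters, shape (n_batch, K)
    y:         true targets,          shape (n_batch, n_outputs)
    """
    pi = softmax(alpha)                            # Equation (9): sums to 1
    sigma = np.exp(log_sigma)                      # Equation (10): strictly positive
    d = y.shape[-1]
    sq = ((y[:, None, :] - mu) ** 2).sum(axis=-1)  # squared distance to each mean
    log_gauss = (-0.5 * sq / sigma**2
                 - d * np.log(sigma)
                 - 0.5 * d * np.log(2.0 * np.pi))  # log N(y | mu_k, sigma_k^2 I)
    mix = (pi * np.exp(log_gauss)).sum(axis=-1)    # mixture density, Equation (12)
    return -np.log(mix).mean()                     # Equation (13), batch average
```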

Random Forest
The random forest (RF) model for regression comprises an ensemble of multiple regression decision trees. RF models provide the capacity to learn non-linear relationships while also being computationally inexpensive and offering higher interpretability in the form of feature importances [29]. Importantly, tree-based models have different inductive biases compared to neural-network-based models. Therefore, the RF model is included as a representative tree-based and inexpensive non-linear model.
In this implementation, an RF model with multiple inputs is constructed for each output measurement. The decision tree implementation from scikit-learn is based on the classification and regression trees (CART) algorithm [30]. In CART-based regression trees, the dataset is recursively split by solving a minimisation problem aimed at reducing the overall loss function, denoted G, as shown in Equation (14). Here, Q m represents the subset of the data at node m, consisting of the input measurements X of size n subjects × n inputs and their respective target output measurements, y, of size n subjects. The splitting parameters are represented by θ. The data subsets, along with the parameters, govern the splitting criterion that minimises the total loss, G. This minimisation leads to data entries with similar target values being grouped together. In model training, the CART implementation employs binary splits at each node, dividing data entries into a left dataset, Q l m (θ), and a right dataset, Q r m (θ), based on the selected split value. For the input features, the algorithm evaluates all possible splits at node m to identify the one that maximises the reduction in loss. The loss is calculated between the target output and the split's mean output value for that measurement. The total loss is the weighted sum of the losses of each partition, as shown in Equation (15).
In this equation, the variables n l m and n r m represent the number of data entries in the left and right datasets, Q l m (θ) and Q r m (θ), as determined by the splitting of X and y at node m. This RF implementation used the mean squared error for its loss calculation. The related loss calculation for each data subset is shown in Equation (16).
In this equation, ŷ m represents the mean of the target measurement over the subjects in the selected subset. This equation quantifies the error as the mean squared deviation of the actual targets from the subset's mean. The aforementioned steps are recursively applied to all training data until a specified stopping criterion is met. Once the tree is fully developed, each node (or split) is conditioned on the training data, which segments the data into leaf end nodes. Each leaf node is associated with a ŷ m value.
In predicting unseen data, the tree determines the appropriate leaf node using the new input measurements and the previously conditioned nodes. The estimated measurement output value, ŷ, is assigned the ŷ m value associated with the determined leaf node.
For the current work, tree depth was selected as the stopping criterion. The resultant tree then forms part of the specified number of estimators in the random forest. A random forest regressor takes the average output of all its trees as the forest's output. Finally, scikit-learn's implementation also uses the concept of bagging, whereby each regression tree in the forest is trained on a unique set of bootstrapped samples of the training dataset. For more implementation details, see the scikit-learn [18] documentation.
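The per-output random forest setup can be sketched as follows, using random stand-in data. The tree depth and estimator count shown are placeholders, not the optimised values from Section 3.1.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X_train = rng.normal(size=(108, 13))   # stand-in upper facial inputs
Y_train = rng.normal(size=(108, 9))    # stand-in lower facial targets

# One single-output forest per target measurement, mirroring the per-output
# setup described above. max_depth is the stopping criterion; n_estimators
# is the number of bagged trees averaged to form the forest's output.
forests = [RandomForestRegressor(n_estimators=100, max_depth=5, random_state=0)
           .fit(X_train, Y_train[:, j]) for j in range(Y_train.shape[1])]

X_new = rng.normal(size=(16, 13))
Y_pred = np.column_stack([f.predict(X_new) for f in forests])
```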

Model Hyper-Parameter Optimisation 2.3.1. Grid-Search
A two-step approach to hyper-parameter optimisation was implemented. First, a grid search was used to provide insight into parameter effects and to subsequently decrease the search space. A grid search is an exhaustive search in which models are trained for the Cartesian product of sets of hyper-parameters. However, this approach is computationally expensive for high-dimensional search spaces and has a fixed resolution [31]. The latter is not ideal for the optimisation of continuous parameters such as a neural network's learning rate. Bergstra and Bengio [31] suggested that adaptive learning models and random searches can more effectively find optimal model hyper-parameters.
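A grid search over the Cartesian product of hyper-parameter sets can be sketched as follows. The grid values and the `train_and_validate` scoring function are illustrative placeholders, not the study's actual grid.

```python
import itertools

# Candidate values for each hyper-parameter (placeholders).
grid = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "n_neurons": [16, 32, 64],
    "n_layers": [1, 2, 3],
}

def train_and_validate(params):
    # Stand-in for fitting a model and returning its validation loss.
    return params["learning_rate"] * params["n_neurons"] / params["n_layers"]

# Exhaustively evaluate every combination and keep the best by validation loss.
best = min(
    (dict(zip(grid, values)) for values in itertools.product(*grid.values())),
    key=train_and_validate,
)
```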

Bayesian Optimisation
For the second step, Bayesian optimisation (BO) was implemented for adaptive hyper-parameter optimisation. The Bayesian search was applied to a reduced parameter space, as determined by the grid search, to find the optimal model parameters. For a detailed review of BO, refer to Shahriari et al. [32]. Here, a high-level overview of BO is provided, as implemented in the Ax platform [21] using the BoTorch model [20]. The Ax platform's mixed BO model integrates Sobol sequence sampling [33] to generate initial values, which are then used to establish a prior for BO.
BO is an optimisation algorithm with two main components, namely a surrogate and an acquisition function. The surrogate function is a less computationally expensive approximation of the objective function. This allows estimates of model performance for different hyper-parameter sets without direct evaluation of the objective function. One common implementation of the surrogate function is Gaussian Process Regression (GPR) [32]. GPR enables the model to estimate both the output value of a function and the uncertainty associated with that estimate. This uncertainty prediction is then used in the acquisition function.
The acquisition function determines which areas of the hyper-parameter space are expected to improve model performance. It does this by evaluating regions in the hyper-parameter space with high uncertainty or potential information utility. In the current work, the objective function would be the loss function. For this case, the expected improvement (EI) function, as implemented in Ax-platform, is shown in Equation (17):

EI(x) = E[max(f* − f(x), 0)] (17)
Here f is the surrogate function and f* is the smallest loss observed. The EI function determines the expected value of hyper-parameters providing the largest decrease in loss. The parameters with the highest expected improvement are then used in the computationally expensive objective function. The objective function then computes the true performance of these hyper-parameter sets. Subsequently, the results are used to update the GPR model. This process is repeated for a specified number of trials, allowing the model to dynamically adapt and converge to a solution.
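One BO iteration can be illustrated with a GPR surrogate and the closed-form expected improvement for minimisation. This is a simplified sketch, not the Ax/BoTorch implementation used in the study; the 1-D objective and the initial sample locations are hypothetical:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

# Hypothetical 1-D objective standing in for "validation loss vs. hyper-parameter".
def objective(x):
    return (x - 0.3) ** 2

# A few initial evaluations (the study used Sobol samples for this stage).
X_obs = np.array([[0.0], [0.5], [1.0]])
y_obs = objective(X_obs).ravel()

gpr = GaussianProcessRegressor().fit(X_obs, y_obs)  # the surrogate
f_star = y_obs.min()                                # smallest loss observed so far

# Closed-form EI for minimisation:
#   EI(x) = (f* - mu) * Phi(z) + sigma * phi(z),  z = (f* - mu) / sigma
X_cand = np.linspace(0, 1, 101).reshape(-1, 1)
mu, sigma = gpr.predict(X_cand, return_std=True)
sigma = np.maximum(sigma, 1e-12)                    # guard against zero std
z = (f_star - mu) / sigma
ei = (f_star - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# The candidate with the highest EI is evaluated with the true objective next,
# and the new (x, loss) pair is used to refit the surrogate.
x_next = X_cand[np.argmax(ei)]
```

In practice this loop repeats for a fixed trial budget, as described above, with the surrogate refitted after each true-objective evaluation.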

Model Choices and Hyper-Parameter Optimisation
All model selection choices in this section were based on the validation dataset. The base OLS model was compared against feature-transformed OLS models. All other models underwent hyper-parameter optimisation. Initially, a grid search was applied to the MLP. The resultant grid-search outputs were then used to reduce the hyper-parameter optimisation space. With this reduced hyper-parameter space, the MLP underwent Bayesian optimisation. The optimised MLP's parameters were transferred to the MDN. This was followed by a minor grid search on the MDN to determine a viable range of mixture coefficients. Finally, Bayesian optimisation was applied to the MDN with the reduced hyper-parameter space. Due to lower computational costs, the random forest was directly optimised with Bayesian optimisation over a large hyper-parameter space.

Linear Regression
The standard OLS linear regression (LR) was compared to OLS with z-space transforms. The transforms applied include k = 2 (LR_z1), k = 3 (LR_z2), and k = 4 (LR_z3). Results for distance and angle measurements are reported separately. For the distance-based features, the mean absolute error (MAE) for each model based on the validation set was 2.78 mm, 3.10 mm, 3.59 mm, and 5.83 mm for the LR, LR_z1, LR_z2, and LR_z3, respectively. Similarly, for angle-based features, the validation MAE values were 3.00°, 3.03°, 3.39°, and 5.93°, respectively. These results indicate that the base LR model, without transforms, had the lowest error rates. The differences in errors were more pronounced for distances than for angles. Given the clinical importance of maintaining errors within 2 mm or 2° in applications for virtual surgery preparation [12], the results were also assessed with error frequency diagrams. The frequencies of prediction errors for the base OLS and the transformed models are plotted for distances and angles in Figures 5a and 5b, respectively.

Figure 5a indicates that the base LR and the closely following LR_z1 model had the highest number of distance errors adhering to clinical requirements. The plot also demonstrates that the frequency of small absolute distance errors decreased as higher-dimensional transforms were applied. Conversely, Figure 5b reveals that this trend did not extend to angular measurements. Here, LR_z1 and LR_z2 had the highest number of errors below 2°, whereas the higher k = 4 (LR_z3) transform still performed the worst. This may indicate that the relationship between the input and the target angle features is more complex than that of the distances. Additionally, across both graphs, the transformed models showed a notably higher number of errors greater than 8 mm/°. These extreme values might explain the discrepancy between model performance in terms of MAE and the frequency of errors under 2 mm/°. A possible cause is overfitting: as the data increase in dimension, the models start overfitting the training data [34], which leads to worse validation performance. Given these results, the base LR model, which demonstrated the lowest MAE values, a comparable number of errors below 2 mm/°, and the fewest extreme values, was selected for further comparison.

Multi-Layer Perceptron
The MLP underwent a grid search to determine the effect of changing the hyper-parameters and to allow for a more focused Bayesian search. The grid-search hyper-parameters are shown in Table 4. To start analysing the grid search, the LeakyReLU activation function was chosen, and the batch size was set to 16. With these parameters, the learning rates for each neural network configuration are plotted against the final validation loss in Figure 6. The results indicate that the neural network configurations had the lowest average validation loss when the learning rate was 1 × 10−5. Therefore, this value was selected as the model's learning rate for further analysis. Next, the neural network configurations for different batch sizes are plotted against the final validation loss in Figure 7. This plot shows that all models had similar performance across batch sizes, with a batch size of 8 generally performing slightly better. As a result, the batch size was kept at 8 for subsequent steps. In the final stage of the grid search, using the previously determined parameters, Figure 8 displays the validation loss for various activation functions plotted against different architectural configurations. This figure demonstrates that the ELU activation function had the most stable performance across all configurations and the lowest single validation loss at a 32-32-32 configuration. This may be explained by ELU's ability to centre the activations, similar to batch normalisation, and its greater robustness to noise [35]. Furthermore, the data suggest that increasing the neuron count beyond 64 did not improve performance. Additionally, the plot did not reveal any notable performance differences between architectures with two or three layers. It has been suggested that neural network depth is associated with the ability to learn more complex relationships [36]. Therefore, if the non-linearity in the data were more significant relative to the noise, a trend of lower final validation loss would be expected for 3 layers compared to 2 layers.
Based on the grid-search results, the parameters listed in Table 5 form the reduced hyper-parameter search space for Bayesian optimisation (BO). With this search space, the Sobol-BO algorithm was applied for 200 trials. Of those 200, the Ax-platform determined that 6 Sobol sequencing trials were required to initially specify a prior. The parameter set with the lowest validation loss, trial 89, is also included in Table 5. In Figure 9, the BO trials are plotted against the final validation losses. Further hyper-parameter tuning within the current hyper-parameter space did not appear to be computationally practical, as increasing the trial count did not lead to a noticeable reduction in loss. It is also apparent that the optimisation algorithm quickly converged to a minimal value and that various parameter combinations produced similar loss outcomes. It is speculated that these minima are relatively shallow and broad.
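The selected 32-32-32 hidden configuration with ELU activations can be illustrated with a bare NumPy forward pass. The weights here are random placeholders and the input/output sizes are hypothetical; in the study, the network was trained with back-propagation on the measurement data:

```python
import numpy as np

def elu(x, alpha=1.0):
    # ELU: identity for x > 0, alpha * (exp(x) - 1) otherwise; it pushes mean
    # activations toward zero, similar in effect to batch normalisation.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

rng = np.random.default_rng(0)
n_in, n_out = 10, 5                 # e.g. upper-facial inputs -> lower-facial outputs
sizes = [n_in, 32, 32, 32, n_out]   # the 32-32-32 hidden configuration

weights = [rng.normal(scale=0.1, size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

def forward(x):
    for W, b in zip(weights[:-1], biases[:-1]):
        x = elu(x @ W + b)                    # hidden layers use ELU
    return x @ weights[-1] + biases[-1]       # linear output layer for regression

out = forward(rng.normal(size=(4, n_in)))     # batch of 4 illustrative inputs
```

The linear output layer is standard for regression targets such as distances and angles, since the predictions are unbounded real values.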

Mixture Density Network
The final grid-search parameters from the MLP were transferred to the MDN. With these parameters, a small grid search was completed to optimise the mixture count and batch size. The results of this grid search are shown in Figure 10. They suggest that the number of mixtures did not have an effect that was easily distinguishable from noise. However, changing the batch size had a significant impact, which prompted a full BO search.

The hyper-parameter search space and the final outputs of the Bayesian search (trial 184) are shown in Table 6. The BO trials against validation loss are plotted in Figure 11. The initial 6 trials were based on Sobol sequencing. The plot shows that convergence occurred within 25 trials. Similar to the MLP, the MDN also seemed to have a relatively flat loss landscape with possibly broad minima. This indicates that further hyper-parameter tuning within the current search space would likely not provide much value.
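The MDN's training objective, the negative log-likelihood of a Gaussian mixture whose parameters are emitted by the network, can be sketched as follows. The mixture parameters and targets below are illustrative values; in the actual model, the network outputs pi, mu, and sigma for each input:

```python
import numpy as np

def mdn_nll(pi, mu, sigma, y):
    """Negative log-likelihood of y under a 1-D Gaussian mixture.

    pi, mu, sigma: arrays of shape (n_samples, n_mixtures); each pi row sums to 1.
    y: targets of shape (n_samples,).
    """
    y = y[:, None]
    # Per-component Gaussian densities, weighted by the mixing coefficients.
    comp = np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return -np.log((pi * comp).sum(axis=1)).mean()

# Two mixture components per sample (hypothetical network outputs).
pi = np.array([[0.7, 0.3], [0.5, 0.5]])
mu = np.array([[1.0, 3.0], [2.0, 2.5]])
sigma = np.array([[0.5, 0.5], [0.4, 0.6]])
y = np.array([1.1, 2.2])

loss = mdn_nll(pi, mu, sigma, y)
```

Minimising this loss fits the full predictive distribution rather than a single point estimate, which is what later allows uncertainty estimates and sampling.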

Random Forest
For the final model, a grid search was not deemed necessary, as the random forest model is relatively computationally inexpensive. This allowed a larger, broader Bayesian search. The 500 BO trials plotted against validation loss are shown in Figure 12. The first 8 trials used Sobol sequencing. Similar to the previous applications of the BO algorithm to the MLP and MDN, the results suggest that the RF model was mostly insensitive to hyper-parameter changes. The hyper-parameter search space and the hyper-parameter set with the lowest RMSE (trial 4) are shown in Table 7. As trial 4 had the lowest RMSE value of all tested hyper-parameter sets, the final set was found during the initial Sobol sequencing.

Model Comparison
All results in this section were derived from the averaged outputs of 10-fold cross-validation. This method has shown similar or better results compared to leave-one-out cross-validation [36]. Furthermore, all error metrics are defined based on the difference between the estimated lower facial measurements and the true lower facial measurements, as measured in Section 2.1. In comparing the final models' performance on the test data, these errors were also evaluated in relation to the clinical thresholds.
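The averaging of 10-fold cross-validation outputs described above can be sketched with scikit-learn's KFold. The data here are synthetic toys (only the sample size of 155 matches the study); the study averaged per-feature errors across folds in the same fashion:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.normal(size=(155, 6))          # sample size matching the study's n = 155
y = X @ rng.normal(size=6) + rng.normal(scale=0.5, size=155)

# Fit on 9 folds, evaluate on the held-out fold, then average the fold errors.
fold_maes = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_maes.append(mean_absolute_error(y[test_idx], model.predict(X[test_idx])))

mean_mae = float(np.mean(fold_maes))   # the averaged cross-validated error
```

Averaging over folds reduces the variance of the error estimate relative to a single train/test split, which is the motivation cited above.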
The hyper-parameter optimisation, as per Section 3.1.1, was repeated for a modified dataset. The modified dataset, from here on referred to as the symmetry dataset, assumed upper facial symmetry by dropping all upper right-side facial measurements as input features. Note that with the initial dataset, only lower facial symmetry was assumed. This comparison provides insight into the effect of minor asymmetry and the number of input features on prediction accuracy. The resultant validation mean absolute error (MAE) is shown in Table 8. These results indicate that the inclusion of the right-side input features in the asymmetry dataset provides an overall improvement in the average performance across models. However, there is no noticeable impact on the linear regression model's ability to predict angles. This discrepancy may suggest that angular measurements have a higher complexity than distance measurements. Speculatively, the improvement in the non-linear methods may indicate that, in the presence of more data, these models still have residual learning capacity available to learn meaningful relationships, leading to lower prediction errors. In contrast, it is possible that, with the additional features, the linear regression model did not have the capacity to learn additional complex relationships in the angular target features. Assuming this trend holds in the test data, and given the improved average performance across models in the validation dataset, the analysis of the testing (hold-out) dataset proceeded with the asymmetrical dataset.
The mean absolute error (MAE), root-mean-square error (RMSE), and mean absolute percentage error (MAPE) were calculated for each feature and then averaged across features with similar measurement types, namely distances and angles. The test-set MAE values for the distance features with the MLP, MDN, RF, and LR models ranged respectively from 1.85 to 4.00 mm, 1.83 to 4.03 mm, 1.88 to 4.06 mm, and 1.79 to 4.14 mm. Similarly, for angular measurements, the MAEs ranged respectively from 2.27 to 3.75°, 2.25 to 3.75°, 2.08 to 3.78°, and 2.57 to 3.42°. The error metrics from the testing dataset are presented in Table 9. The calculated metrics for the distance measurements indicate comparable error values across models, with the MLP and MDN having slightly lower errors. In contrast, for angular measurements, accounting for noise, the MAE values showed almost no differences. Using the more outlier-sensitive RMSE, the results indicate that the non-linear methods have barely noticeably lower angle error scores. Additionally, the MAPE values for both measurement types did not show discernible differences between models. Among the models evaluated, while the neural network-based models achieved the lowest scores, none showed noticeably lower error values than any other. A direct comparison with other studies is not feasible due to variations in data dimensions, differences in features (landmarks and different cephalometric measurements), and input feature sets that span both the upper and lower face during prediction. Nevertheless, these studies may provide insight into expected error ranges and possible trends.
The study by Cheng [4], which focused on predicting mandible repositioning vectors, found that their transformer-based model (MAE: 1.34 mm), along with MLP (MAE: 1.36 mm), RF (MAE: 1.41 mm), and ridge regression (MAE: 1.37 mm) models, also demonstrated comparable performance. They concluded that their transformer had slightly better generalisation capability on the test data compared to the other models. The transformer model's MAE values ranged from 1.08 to 1.80 mm. In Gillingham's research [9], multiple linear regression models were used to predict mandibular cephalometric measurements from other mandibular and upper-facial measurements. Their RMSE values ranged from 1.74° to 5.01° and from 0.64 mm to 2.89 mm. It is important to note that RMSE typically provides estimates that are as conservative as or more conservative than those provided by MAE. Finally, Zhu [10] applied a landmark-based approach to predict the mandible from the maxilla and vice versa. In predicting the mandible, they found that their 2-dimensional linear regression performed best, providing average scalar position errors ranging from approximately 2.9 mm to 7.9 mm.
In relation to the current work, the lower errors from Gillingham's model may be attributed to the inclusion of mandible features with better correlation (compared to upper facial measurements) as input features. Similarly, Cheng's model also utilised both upper and lower facial cephalometric measurements as input features. However, as previously mentioned, Cheng predicted vectors containing landmark coordinates and not cephalometric measurements. Another difference between the studies is Cheng's larger sample size (n = 383) as opposed to the current work (n = 155), Gillingham's research (n = 100), and Zhu's sample size (n = 42). In addition, both Gillingham and Cheng applied a stricter exclusion protocol, excluding subjects based on malocclusion, likely resulting in more harmonious relationships between features.
As before, given the clinical requirements (2 mm/°) for virtual surgery planning, it is important to determine the models' performance in relation to this. Therefore, the absolute errors are also represented in frequency diagrams, with distances shown in Figure 13a and angles in Figure 13b. In descending order, for distance errors, the MLP, LR, MDN, and RF models had relative frequencies of errors within the threshold of 53.13%, 51.04%, 48.96%, and 44.79%, respectively. For angular measurements, the MLP, MDN, LR, and RF models had relative frequencies of errors within specification of 39.58%, 37.50%, 35.42%, and 33.33%, respectively. Notably, if a more lenient cut-off of 3 mm/° is used, the difference in cumulative frequency of errors between models diminishes. For distances, the MLP, MDN, LR, and RF models had relative frequencies of 69.79%, 69.79%, 66.67%, and 65.62%, respectively. For angles, the new relative error frequencies were 54.17%, 54.17%, 54.17%, and 50.00%, respectively.
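The relative frequencies of errors within a clinical tolerance reduce to a simple computation over the absolute errors. The error values below are hypothetical; the study's actual frequencies are quoted above:

```python
import numpy as np

def within_tolerance(abs_errors, threshold):
    """Fraction of absolute errors at or below a clinical threshold (e.g. 2 mm/deg)."""
    abs_errors = np.asarray(abs_errors)
    return float((abs_errors <= threshold).mean())

# Hypothetical absolute distance errors (mm) for one model on a test set.
errors = np.array([0.8, 1.5, 1.9, 2.4, 3.1, 1.2, 2.8, 0.5])

conservative = within_tolerance(errors, 2.0)  # 2 mm clinical requirement
lenient = within_tolerance(errors, 3.0)       # relaxed 3 mm cut-off
```

Comparing the two fractions shows how a model's apparent compliance changes with the chosen tolerance, mirroring the conservative/lenient comparison in the text.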
Across models, it is evident that the models have more difficulty learning the relationship of the input features to angular target features than to distance target features. In terms of model-specific performance, in both the lenient and conservative cases (3 and 2 mm/°, respectively), the RF model had the highest frequency of errors outside of specification. Conversely, the MLP had the most errors under the conservative threshold by a slight margin. The MLP and MDN outperform the other models under the lenient criteria. A limitation of the RF model is that it cannot extrapolate outside of its training data [37]. This inability may contribute to the RF model's high error frequency.
The model outputs can also be interpreted on a feature basis. Figure 14a,b show the distributions of signed validation errors for each model and feature. This again shows similar error ranges across models, with very few prominent differences. Generally, the signed mean and median values were consistent across both models and features. On a feature basis, the error distributions from the different models tended to have a similar skew. In terms of error outliers, these were predominantly caused by the same individuals and were similar across models.

Implications, Limitations, and Future Directions
In summary, this study provided a unique comparison between linear and non-linear methods for predicting lower facial cephalometric measurements from upper facial cephalometric measurements. Similarly, the combination of methods compared was also unique. An overview comparing key differences between the current work and related research is provided in Table 10. In addition, this study was the first to construct regression models based on cephalometric measurements derived from the publicly available HNSCC dataset. Lastly, to the authors' knowledge, a mixture density network has not previously been applied to cephalometric measurements.

From the model comparison in Section 3.2, it is apparent that, in predicting 1D lower facial measurements from 1D upper facial measurements, non-linear models provide similar or marginally better performance compared to traditional OLS linear models. Specifically, the deep learning-based models showed lower mean absolute errors and better adherence of errors to clinical specifications than the OLS and RF models. In addition, the analysis showed that all evaluated models frequently produced errors exceeding the clinical threshold of 2 mm/degree, indicating that refinement is required. Furthermore, the exhaustive hyper-parameter search in Section 3.1 revealed no noticeable difference between 2 and 3 layers for the neural network-based models. Although deeper networks are commonly associated with the capability to model more complex relationships [38], this finding suggests that the non-linear relationship might not be as prominent in the current dataset. This notion is supported by the observation that the OLS model performed only marginally worse than the non-linear models, indicating that the OLS model's inherent inductive bias of assumed linearity may be sufficient for this dataset. As a result, although the deep-learning methods slightly outperformed the linear regression model, the added effort involved in their training is difficult to justify based on prediction accuracy alone. However, the information density and noise levels of this study's chosen 1D measurements might be insufficient for the non-linear models examined to effectively learn relationships between the features. Additional data may better describe the relationship between lower and upper facial measurements.
Therefore, one potential limitation of this study is that the current sample size may be too small for the non-linear models to effectively learn meaningful non-linear relationships. Furthermore, the current study's exclusion protocol did not exclude individuals with malocclusion. Non-harmonious relationships may have led to poorer model specificity. It is speculated that model performance can be refined by increasing the training sample size and implementing a stricter exclusion protocol. Furthermore, the input and output features were based on landmarks placed manually by a single rater. This approach may introduce bias and noise into the measurements. To mitigate this, it has been proposed that duplicate measurements be taken and averaged [39] or that automatic landmarking algorithms be used [40]. In this project, to assess rater noise, ICC tests were used and found agreement scores ranging between 0.7 and 1.0, with an average of 0.95, indicating that landmark-placement repeatability ranged from acceptable (>0.7) to excellent (>0.9) [41]. In addition, to reduce variability in predictions, 10-fold cross-validation was employed on the hold-out set. Finally, the resulting models are only applicable to the demographics of the training set, as research has shown statistically significant differences in cephalometric measurements for different population groups [42].
In terms of clinical applications, in predicting lower facial measurements from upper facial measurements, the MLP was generally the best for both conservative and lenient tolerances (2 mm/° and 3 mm/°, respectively). However, for lenient tolerances, the differences between the MLP and MDN were marginal, and the MDN has the added advantage of providing an estimate of the prediction's uncertainty. In addition, Zhu et al. [10] recommended that models should ideally provide the clinician with a selection of estimated values. An MDN can fulfil this criterion by enabling the clinician to sample estimated measurements from the predicted distribution. The clinician can then select their preferred value based on their experience and the patient's requirements.
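Sampling candidate measurements from an MDN's predicted mixture, so that a clinician can choose among plausible values as Zhu et al. suggest, could look like the following sketch. The mixture parameters stand in for a hypothetical MDN output for one patient and one measurement:

```python
import numpy as np

def sample_mixture(pi, mu, sigma, n_samples, rng):
    """Draw candidate values from a 1-D Gaussian mixture (pi must sum to 1)."""
    # First pick a component per draw, then sample from that component's Gaussian.
    components = rng.choice(len(pi), size=n_samples, p=pi)
    return rng.normal(loc=mu[components], scale=sigma[components])

rng = np.random.default_rng(0)
# Hypothetical MDN output for one predicted mandible distance (mm).
pi = np.array([0.6, 0.4])
mu = np.array([42.0, 45.5])
sigma = np.array([1.0, 1.5])

candidates = sample_mixture(pi, mu, sigma, n_samples=5, rng=rng)
# The clinician selects a preferred value from these sampled candidates.
```

The spread of the sampled candidates also conveys the prediction's uncertainty, which a single point estimate from the other models cannot.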
For future work, these results indicate that non-linear relationships between upper and lower facial measurements may be present but are not significant or well-captured by the current selection of models. It is also possible that the non-linearity in the current dataset is dominated by noise. However, the non-linear relationship could become more apparent at higher dimensions. Shape models such as SSMs, which are essentially linear regression models that employ Principal Component Analysis to reduce extremely dense features to a limited parameter set, have shown improved performance compared to 1D linear regression models in predicting derived measurements [9]. Therefore, it is proposed that further investigation is required into utilising linear shape models and comparing them to non-linear shape models for the prediction of lower facial shape based on upper facial shape. Shape models can also be used to generate artificial data to provide 1D measurement-based models with additional data. For future work regarding 1D measurement models, further comparison should consider the use of regularisation methods for neural network-based models. In general applications of neural networks, evidence indicates that the application of regularisation "cocktails" may allow neural network-based methods to outperform state-of-the-art tabular data models [43]. It is possible that this applies to cephalometric measurement datasets. It has also been suggested that neural networks may have improved performance if uninformative features are removed [44]. Therefore, due to the suspected high multi-collinearity between input measurements in the current dataset, along with the ICC scores indicating that certain feature definitions may be more susceptible to measurement error or noise than others, it is recommended that feature selection be considered.

Conclusions
This study aimed to determine whether non-linear models outperform linear models in predicting lower facial measurements from upper facial measurements. Accordingly, linear regression, random forest, multilayer perceptron, and mixture density network models were implemented. The neural networks and the random forest underwent exhaustive grid and Bayesian searches for hyper-parameter optimisation, whereas feature-space transforms were applied to the linear regression. The validation and testing results indicated that all models performed at comparable levels; however, the neural network-based models had marginally more errors below clinically relevant thresholds (2 mm/°) and lower MAE values than the linear regression and random forest models. Therefore, based on the results of this study, an MDN is suggested due to its ability to produce estimates of uncertainty alongside predicted values. Nevertheless, future work is required to further improve prediction accuracy under conservative thresholds.

Figure 1. Mandible landmarks.

Cephalometric measurements were derived from the coordinates of the obtained landmarks, based on the work of Bayome et al. and Gillingham et al. [9,15]. Additional upper facial measurements incorporating the Porion landmark were also defined. All measurements were extracted from the defined 3D landmark coordinates using a Python script. The mandible measurements are listed in Table 2, and the cranium and maxilla measurements are listed in Table 3.

Figure 2. Cranium and maxilla landmarks.
Go-Sagittal: Distance from Gonion to sagittal plane
Co-Sagittal: Distance from Condylion to sagittal plane
Co-Sig: Distance from Condylion to Sigmoid notch
ID-Me: Distance from Infradentale to Menton
Go-Me: Distance from Gonion to Menton
Co-in-Co-out: Distance from the innermost condylar point to the outermost condylar point (also known as condylar width)
Me angle: Angle spanned by the right and left Gnathion-Me
Me-Go-Co: Angle spanned by Menton, Gonion, and Condylion
Co-Go-FH: Angle spanned by Condylion, Gonion, and the Frankfurt Horizontal plane
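A minimal sketch of how such distance and angle measurements could be derived from 3D landmark coordinates. The function names and coordinates below are illustrative, not taken from the paper's actual Python script.

```python
import numpy as np

def distance(p, q):
    """Euclidean distance between two 3D landmarks, e.g. Go-Me (mm)."""
    return float(np.linalg.norm(np.asarray(p, float) - np.asarray(q, float)))

def angle(a, b, c):
    """Angle at vertex b spanned by points a and c, in degrees
    (e.g. Me-Go-Co: Menton, Gonion, Condylion)."""
    u = np.asarray(a, float) - np.asarray(b, float)
    v = np.asarray(c, float) - np.asarray(b, float)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip guards against floating-point values just outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# Toy coordinates: a 3-4-5 triangle gives a 5 mm distance and a 90° angle.
d = distance((0, 0, 0), (3, 4, 0))
theta = angle((1, 0, 0), (0, 0, 0), (0, 1, 0))
```

Plane-based measurements (e.g. Go-Sagittal) would additionally require projecting the landmark onto the fitted sagittal or Frankfurt Horizontal plane.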

Figure 3. Neural network layout for an MLP with m hidden layers, each with n neurons per hidden layer. Not depicted in the diagram are the activation layers, neuron weights, and biases between neuron connections.
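A configuration like the one depicted in Figure 3 can be sketched with scikit-learn's `MLPRegressor`; the data shapes and the choice of m = 2 hidden layers with n = 64 neurons are placeholder values, not the tuned hyper-parameters from this study.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical shapes: 20 upper facial inputs, 9 lower facial targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 20))
Y = X[:, :9] * 0.5 + rng.normal(scale=0.1, size=(120, 9))

# Two hidden layers of 64 neurons each; one output per target feature.
mlp = MLPRegressor(hidden_layer_sizes=(64, 64), activation="relu",
                   max_iter=1000, random_state=0)
mlp.fit(X, Y)
preds = mlp.predict(X)
```

`MLPRegressor` handles multi-output regression directly, so a single network predicts all lower facial measurements at once rather than fitting one model per target.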

Figure 4.
Figure 4. Neural network layout for an MDN with n_mixtures (k) mixture distributions and m hidden layers, each with n_neurons neurons per hidden layer. Not depicted in the diagram are the activation layers, neuron weights, and biases between neuron connections. The plot on the right depicts an example of the predicted weighted distribution (bold line) for a target feature with n_mixtures = 3. The coloured lines represent the individual predicted distributions for that feature.
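The weighted distribution shown in Figure 4 is a convex combination of the component densities. A minimal sketch of evaluating it, with hypothetical mixture parameters:

```python
import numpy as np

def mixture_pdf(x, weights, means, stds):
    """Weighted Gaussian mixture density for one target feature."""
    x = np.asarray(x, dtype=float)[:, None]       # (n_points, 1)
    w = np.asarray(weights, dtype=float)          # (k,)
    mu = np.asarray(means, dtype=float)           # (k,)
    sd = np.asarray(stds, dtype=float)            # (k,)
    # Per-component normal densities, then weight and sum over components.
    comp = np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    return comp @ w

# Hypothetical n_mixtures = 3 prediction, evaluated on a grid.
grid = np.linspace(-10.0, 10.0, 2001)
pdf = mixture_pdf(grid, [0.5, 0.3, 0.2], [0.0, 1.0, -1.0], [1.0, 0.5, 0.8])
```

Because the weights sum to one, the combined curve is itself a valid probability density, which is what allows the clinician-facing sampling described in the discussion.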

Figure 5. Frequency of absolute errors for LR models based on predicted measurements from the validation dataset. The dashed line indicates the clinical threshold of 2 mm/°: (a) absolute errors for distances; (b) absolute errors for angles.

Figure 11. MDN BO: Bayesian optimisation showing trials against final validation loss. Note: two extreme outliers (trial 54 and trial 189) were removed for visualisation purposes.

3.1.4. Random Forest
For the final model, a grid search was not deemed necessary, as the random forest model is relatively computationally inexpensive. This allowed a larger, broader Bayesian search. The 500 BO trials plotted against validation loss are shown in Figure 12.

Figure 12. RF BO: Bayesian optimisation showing trials against RMSE.

Table 7. Hyper-parameter search space for RF.
Search space: Number of trees 1:2000; Max tree depth 1:2000; Min number of samples per leaf 1:50; Min number of samples per split 2:50.
Determined hyper-parameters: Number of trees 13; Max tree depth 680; Min number of samples per leaf 15; Min number of samples per split 23.

Figure 13. Frequency of errors for predicted testing dataset measurements. The dashed line indicates the clinical specification of 2 mm/°: (a) absolute errors for distances; (b) absolute errors for angles.

Figure 14. Error distribution per feature per model. The dashed red line indicates the clinical specification of 2 mm/°. The solid line in each box represents the median, whereas the dashed line represents the mean. (a) Signed error distribution for distances. (b) Signed error distribution for angles.

Table 1. Demographics of included subjects.

Table 2. Definitions of mandible measurements.

Table 3. Definitions of cranium and maxilla measurements.

Table 4. Hyper-parameter search space for the multilayer perceptron grid search.

Table 5. Hyper-parameter search space for the MLP Bayesian search.

Table 6. Hyper-parameter search space for the MDN Bayesian search.

Table 7. Hyper-parameter search space for RF.

Table 8. Change in validation loss with the addition of features. "Asymmetric" refers to the dataset with additional features.

Table 9. Average error metrics for the test (hold-out) dataset.

Table 10. An overview of similar research.

Table A2. Definitions of cranium and maxilla landmarks.
Lateral point on orbit rim located at the distinct change of curvature where the zygomatic bone meets the frontal bone
Orbitale: Most inferior point on lower orbit rim
Maxillary point: Most concave point on zygomatic process of the maxilla
Porion: Most superior point on the upper margin of the external auditory meatus
Nasion: Deepest point on concavity formed by nasal bone and frontal bone
Anterior nasal spine: Most anterior point of nasal floor projecting from maxilla
Prosthion: Most anterior point on lower maxilla located on sagittal plane
A point: Deepest point between ANS and Prosthion located on midsagittal plane
Sella: Most inferior point on sella turcica located on midsagittal plane
Posterior nasal spine: Most posterior point of nasal floor projecting from maxilla