Evaluating the Use of Machine Learning to Predict Expert-Driven Pareto-Navigated Calibrations for Personalised Automated Radiotherapy Planning

Abstract: Automated planning (AP) uses common protocols for all patients within a cancer site. This work investigated using machine learning to personalise AP protocols for fully individualised planning. A 'Pareto guided automated planning' (PGAP) solution was used to generate patient-specific AP protocols and gold standard Pareto navigated reference plans (MCO gs) for 40 prostate cancer patients. Anatomical features related to geometry were extracted and two ML approaches (clustering and regression) that predicted patient-specific planning goal weights were trained on patients 1–20. For validation, three plans were generated for patients 21–40 using a standard site-specific AP protocol based on averaged weights (PGAP std) and patient-specific AP protocols generated via regression (PGAP-ML reg) and clustering (PGAP-ML clus). The three methods were compared to MCO gs in terms of weighting factors and plan dose metrics. Results demonstrated that at the population level PGAP std, PGAP-ML reg and PGAP-ML clus provided excellent correspondence with MCO gs. Deviations were either not statistically significant (p ≥ 0.05) or of a small magnitude, with all coverage and hotspot dose metrics within 0.2 Gy of MCO gs and OAR metrics within 0.7% and 0.4 Gy for volume and dose metrics, respectively. When compared to PGAP std, patient-specific protocols offered minimal advantage for this cancer site, with both approaches highly congruent with MCO gs.


Introduction
Automated planning (AP) is fast becoming the state of the art in radiotherapy planning for intensity-modulated radiotherapy (IMRT) and volumetric-modulated arc therapy (VMAT) [1][2][3] and can be classified into one of two categories: knowledge-based planning (KBP) or rules-based planning (RBP). KBP uses statistical techniques [2,[4][5][6][7], trained on historical clinical datasets, to inform planning for novel cases through prediction of optimisation objectives [8], dose-volume histograms [9][10][11] or voxel-level dose [2]. RBP employs logic to converge on a solution: for example, lexicographic ordering, which optimises planning goals (PGs) in strict sequential order [12][13][14], and protocol-based automatic iterative optimisation (PBAIO), which uses algorithms to automatically adapt planning parameters during optimisation. Various PBAIO approaches have been developed, including scripts that manipulate dose-volume objectives by moving them a specified increment at the start of every new pass [15] or that modify weighting factors so objective values meet specified targets [16]. There are PBAIO scripts that record the iterative process during manual planning and use this to generate an AP algorithm [17], and commercially available Auto-Planning software that automatically generates new contours during optimisation to help meet clinical goals [18]. The majority of these AP techniques have been shown to produce plans non-inferior to manual planning and are used in clinical practice. Comprehensive reviews of all techniques are found in the literature [1,2,4].
The most clinically desirable plans are 'Pareto optimal'. That is, no dosimetric improvements can be made to a PG except at the detriment of another. The various AP methods therefore aim to converge upon this set. However, planning can be complex given PGs may conflict with one another and clinical desirability is dependent upon appropriate management of these trade-offs. Therefore, although the most clinically desirable plans are Pareto optimal, achieving Pareto optimality does not guarantee clinical desirability.
For KBP, trade-off balancing is automatically determined by the underlying clinical plans in the knowledge-base. For RBP, balancing must be explicitly defined in a process known as 'calibration'. Calibration is the process of balancing the relative priority of PGs such that they align with the oncologists' preferences. The dominant approach to RBP calibration is trial-and-error (TAE) [19][20][21], where AP parameters are iteratively updated until an acceptable solution for a given clinical site is obtained. The approach is time consuming, with improvements made only with respect to previously tried examples. It does not allow for the intuitive exploration of competing PGs and, as with manual planning, may yield solutions that are not fully congruent with oncologists' clinical preferences [22]. One way to manage the limitations of TAE is to use a KBP calibration approach where AP calibrations are derived from machine learning (ML) on historical clinical datasets [23,24]. This approach may be more efficient than TAE but will depend strongly on the knowledge base composition. A third approach is to utilise Pareto navigation techniques during the calibration process ('Pareto guided automated planning' or PGAP). This involves exploring a set of unique and systematically produced Pareto optimal solutions, each representing a differently balanced AP solution. Due to the number of solutions necessary for this to be effective, it can be resource intensive. Nevertheless, it is an a posteriori multicriteria optimisation (MCO) method allowing exploration of the trade-off relationships between PGs [22,25,26]. Recent work has demonstrated the utility of PGAP in yielding plans consistent with oncologists' preferences for prostate patients with and without elective nodal irradiation under conventional and extreme hypofractionation regimes [16,22,27].
Despite advances in available calibration methods, RBP calibration takes a 'one size fits all' approach, with a single AP protocol (or wishlist) used for all patients of a given clinical site. This assumes an AP calibration that achieves a clinically optimum dose distribution for one patient is optimal for all patients within that clinical site. The validity of a 'one size fits all' approach has not been explicitly explored in the literature, and there is evidence that points to site-specific RBP leading to sub-optimal or clinically unacceptable plans for a reasonably large proportion of cases. For lung stereotactic body radiotherapy, Vanderstraeten et al. observed that up to 24% of automated plans were considered clinically unacceptable without further tweaking [28]. For locally advanced nasopharyngeal carcinoma, Zhang et al. concluded that "automatic VMAT is not good enough to completely replace manual VMAT" [29]. Finally, through independent quality assurance of 229 prostate cancer patients planned using AP, Janssen et al. demonstrated that 17% of plans were suboptimal and could be improved [30]. This evidence highlights deficiencies in the 'one size fits all' approach and indicates that personalisation of AP protocols to individual patients may be required to ensure optimality.
In contrast, KBP utilises a fully individualised approach, with ML models using anatomy-based predictive factors to generate patient-specific optimisation objectives or dose distribution parameters. The predicted parameters are used to form static objective function inputs to a standard gradient descent optimisation. Whilst optimisations using this approach are inherently patient tailored, the relationship between anatomy and objectives/dose parameters is complex, with wide variances across a patient cohort. Accurate modelling is therefore challenging, generally requires large training datasets and can yield models with clinically relevant prediction errors [31]. Furthermore, the quality of the model is highly dependent on the optimality of the underlying training dataset [32], which is not guaranteed. In summary, modelling uncertainties for KBP and the 'one size fits all' approach for RBP mean current AP solutions may not yield optimal, patient-tailored plans. To address this problem we propose a hybrid AP solution in which KBP is utilised to predict patient-specific AP protocol parameters that act as an input for an already validated RBP solution. In this regard RBP is no longer reliant on a 'one size fits all' set of protocol parameters, but can instead utilise a protocol fully personalised to the individual patient. Application of KBP in this manner has the advantage that a validated RBP approach, by its nature, has suitably suppressed the relationship between anatomy and AP protocol parameters such that a single parameter set can yield acceptable plans across a treatment site. In this regard, the purpose of KBP is not to ensure RBP yields acceptable plans, but rather to further refine and individualise AP protocol parameters with the aim of fully personalising treatment plans. Importantly, with much of the variance already reduced through RBP, it is theorised that, unlike standalone KBP approaches, uncertainties in the KBP models in a hybrid solution will be of low clinical significance.
The purpose of this work was to develop and evaluate a novel KBP-RBP hybrid planning solution for prostate cancer using PGAP. This new methodology utilised ML to identify the relationships between anatomy and optimum patient-specific calibration parameters (determined via Pareto navigation) such that individualised AP protocols could be generated for novel patients. Recent studies illustrate the clinical relevance of incorporating geometric features in the AP process for robust optimisation [33] and the development of a hybrid approach in which geometric features are used as KBP inputs for calibration of an RBP system [34]. The KBP-RBP hybrid solution developed in this work considered advanced KBP techniques based on geometric features. It was trained on a representative dataset and validated on an independent set of novel patients. For validation the solution was compared against patient-specific expert-driven Pareto navigation (MCO gs), which is considered the gold standard, and a standard PGAP approach using a 'one size fits all' site-specific protocol (PGAP std). The evaluation aimed to answer: (i) does personalising protocols via ML improve plan quality compared to PGAP std and (ii) is there a significant difference between the PGAP approaches and MCO gs?

Overview
This work was completed with reference to the RATINGS framework [35] and builds on successful implementation of a PGAP std system, which uses Pareto navigation techniques to calibrate a PBAIO AP solution [16,36]. In this work, training and validation were performed using 'gold standard' training and validation datasets, where patient-specific PBAIO calibration parameters, alongside their corresponding plan and dose distribution (MCO gs), were generated for individual patients by an expert operator using the in-house PGAP solution's Pareto navigation interface.
Figure 1 presents an overview of the solution developed and evaluated in this work, with PGAP std provided as a reference. Predictive ML models are trained on a MCO gs calibrated dataset with the aim of identifying the relationships between anatomical features and patient-specific PBAIO calibration parameters. Once trained, predicted calibration parameters can be generated for novel patients and used to form the inputs for the PBAIO system with the aim of generating plans of equivalent quality to MCO gs. This method contrasts with PGAP std where all patients are planned with the same site-specific AP protocol.
Two ML techniques were employed: multivariate polynomial regression (PGAP-ML reg) and k-means clustering (PGAP-ML clus). The process followed a traditional ML model generation framework with validation on an independent dataset. MCO gs was used as the reference and ground truth in all modelling. In this work, PGAP std was defined by taking the mean gold standard calibration parameter values for each patient in the training dataset. PGAP std, PGAP-ML reg and PGAP-ML clus were validated against MCO gs using an independent set of patients.
Figure 1. Classic approaches define a single site-specific template applied to all treatment cases. In this work, an ML approach is defined that achieves a patient-specific planning template based on patient anatomy.

Patients
The full dataset for this study consisted of 40 prostate and seminal vesicle (PSV) patients, split evenly into a training dataset (patients 1–20) and a validation dataset (patients 21–40). The number of patients selected for training reflected numbers found in previous work related to RBP [8,16,36] and planning parameter prediction for PSV [37].
Computed tomography scans were in the head-first supine position with 3 mm slice thickness. Delineated ROIs included prostate, seminal vesicles, rectum, bladder and bowel delineated up to 2 cm superior of the prostate. Patients with non-standard areas of avoidance such as hip prostheses or hernias were excluded from the patient datasets, as well as patients with non-standard margins. Forty-five PSV patients were considered in total, of which five were excluded for not meeting the criteria: three having a non-standard area of avoidance and two having non-standard margins. Two PTVs were derived: (1) PTV60, defined as prostate expanded 5 mm isotropically (6 mm craniocaudally), and (2) PTV48, defined as prostate and base seminal vesicles expanded by 10 mm isotropically. PTV suffixes indicated the prescribed dose in Gy.
All plans in this study were generated within RayStation (RaySearch Laboratories, Stockholm, version 8B) using a single 360° VMAT arc. Patients were planned according to a 20-fraction simultaneous integrated boost technique with PGs derived from local clinical goals based on the UK PIVOTAL trial [38].

Planning System Overview
The AP system used in this study is the Experience-Driven plan Generation Engine by Velindre Cancer Centre (EdgeVcc). It is a PGAP system built on a PBAIO framework with a Pareto navigation calibration interface. It is written in Python version 2.7 and implemented in the RayStation TPS using its native scripting functionality. What follows is an overview of the system, focusing on the definition and calibration of the AP protocols that define the balancing of competing trade-offs during plan generation. A full description is provided by Wheeler et al. (2019) [16]. With this PGAP system, plan generation is dependent upon a base site-specific 'AutoPlan protocol' containing a set of PGs which define the plan. The AutoPlan protocol requires PGs be divided into three priority levels: P1, P2 and P3. Primary normal tissue PGs (P1) are the highest priority and ensure necessary sparing of tissue at increased risk of unacceptable toxicity when the dose received exceeds a certain level (e.g., serial organs such as the spinal cord). Target PGs (P2) ensure target volume dose objectives are met, including PTV coverage and hot spots. All other planning objectives are known as trade-off PGs (P3). Each PG is assigned a numeric weighting factor that the PBAIO AP solution will use to determine prioritisation of each objective during plan generation. Weighting factors are determined in one of two ways. Prioritisation of P1 and P2 is well defined for all patients and sites and is managed by algorithms where PGs are assigned a fixed weight, with P2 ROIs compromised in favour of P1 via ROI retraction to manage conflicts. Appropriate balancing of P3 PGs is not as well defined and requires calibration to derive suitable weighting factors.
Calibration is performed using the Pareto navigation interface. This allows for the exploration of different P3 trade-off options and is equivalent to an a posteriori MCO planning methodology. For calibration, a set of plans with differing P3 weighting factors is generated using the PBAIO framework. A qualified professional navigates the different options to select the optimum balancing of P3 weighting factors for a given patient. This process is performed using a sliding interface that uses linear interpolation of neighbouring Pareto plans to enable information in the TPS to update in real-time, including dose-volume histograms (DVHs), numerical information related to dose and 3D dose maps on CT scans. The associated P3 weighting factors of the chosen distribution are then stored in the AutoPlan protocol. In this study, these represent the gold standard set of PBAIO calibration parameters for the given patient and are used as the PBAIO input to generate MCO gs.
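The real-time slider behaviour described above can be illustrated with a simple linear blend between two neighbouring Pareto plans' dose metrics. A minimal sketch, assuming a per-metric linear interpolation; the function name and example D mean values (in Gy) are hypothetical:

```python
import numpy as np

def interpolate_metrics(plan_a, plan_b, t):
    """Linearly interpolate dose metrics between two neighbouring
    Pareto plans; t=0 returns plan A, t=1 returns plan B."""
    a = np.asarray(plan_a, dtype=float)
    b = np.asarray(plan_b, dtype=float)
    return (1.0 - t) * a + t * b

# Hypothetical rectum and bladder D_mean values (Gy) for two adjacent plans
midpoint = interpolate_metrics([22.0, 18.5], [19.0, 21.0], 0.5)
```

Moving the slider corresponds to varying t, with the TPS display updating from the interpolated values rather than re-optimising a plan.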

AutoPlan Protocol
The base AutoPlan protocol used in this study (presented in Tables 1-4) was based on a clinically approved and implemented solution for PSV. It was created in line with local practice, and similar PGs have been considered appropriate to manage dose distribution for this clinical site in other work [8,37]. The AutoPlan protocol contains seven P1 and P2 PGs which aim to control maximum bowel dose and PTV homogeneity within fixed tolerances. It also contains seven trade-off (P3) PGs: (1) average dose to the rectum, (2) average dose to the bladder, (3) PTV dose conformality, (4) maximum dose to the rectum, (5) intra-PTV dose fall-off, (6) maximum dose to the bladder and (7) medium-high dose to the bowel.

Generation of Ground Truth Dataset (MCO gs )
For each patient, an expert medical physicist generated a gold standard set of PBAIO calibration parameters using the Pareto navigation functionality to explore and select the optimum P3 weighting factors. For a given PG, typically five different weighting factors were sampled for navigation. When navigating multiple PGs, all weighting factor combinations were sampled; the total number of plans required therefore increased as an exponential function of the number of PGs [39]. Preliminary work showed PGs 1-3 exhibited the most notable trade-off relationships, with negligible influence on PGs 4-6. Therefore, navigation was performed in two stages to ensure reasonable computational times when generating Pareto sets, with PGs 1-3 and PGs 4-6 forming stages one and two, respectively. PG 7 (bowel V 36.0Gy and V 45.6Gy) was not navigated due to the minimal proximity of the associated OAR to PTV contours for the majority of patients, resulting in a negligible influence on the overall plan.
PGs 1-3 were navigated simultaneously whilst the latter four were held constant at the level defined in the clinically approved AutoPlan protocol (Tables 1-4). The observer navigated this set of PGs in three separate sessions, each at least one week apart, with the mean weighting factor taken as the final MCO gs values and stored in the patient-specific AutoPlan protocols. PGs 4-6 were then navigated in a similar way with PGs 1-3 held at their newly defined values. Following calibration, MCO gs plans were generated for each patient using their patient-specific AutoPlan protocols.
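The exponential growth in plan count follows directly from sampling every weighting factor combination. A short illustration; the sampled weight values are hypothetical and only the counts reflect the text (five samples per PG, three PGs navigated together in stage one):

```python
from itertools import product

# Five hypothetical weighting-factor samples per PG; the actual values
# used in the study are not specified here.
samples = [0.1, 0.3, 1.0, 3.0, 10.0]

# Stage one navigates PGs 1-3 together, so every combination of the
# sampled weights needs its own Pareto plan: 5**3 = 125 plans.
stage_one_plans = list(product(samples, repeat=3))
n_plans = len(stage_one_plans)
```

This 125-plan count is consistent with the stage-one generation time quoted in the sample size justification below.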

Sample Size Justification
The majority of KBP studies in the literature utilise historical datasets of previous clinical plans and therefore substantial training datasets can be curated with low effort. In contrast, the approach in this study required Pareto navigation on each training patient to define the ground truth dataset. Autonomous generation of the Pareto plans for a three-PG navigation took 31 h (125 plans each taking 15 min), with approximately five minutes of operator time required for navigation. Whilst Pareto plans could be generated for 3 patients concurrently on a single application server, the time required to generate the ground truth dataset was non-trivial. The size of the training dataset therefore had to balance the competing demands of model accuracy and practicality.
Boutilier et al. [40] presented evidence on the sample size requirements for KBP in prostate cancer. For DVH curve prediction using principal components and linear regression, 75 and 20 samples were required to minimise modelling errors for bladder and rectum DVHs, respectively. For objective function weight prediction, 150 patients were required for a k-nearest neighbour clustering methodology before a statistically insignificant difference from the benchmark was observed. However, only 10 were required for a logistic regression model. The large difference in dataset size requirements is because regression can exploit underlying distributions of the data (e.g., linear or logistic relations), whereas clustering cannot, as it is a non-parametric approach.
In our study, it is not objective function weights or DVH curves that were to be predicted, but rather patient-specific weighting factors for an already validated AP solution. In this regard, we hypothesised that the underlying variance of the data had already been substantially reduced through utilisation of a PBAIO framework. We therefore considered that sample size requirements would align with the lower of those proposed by Boutilier et al. (i.e., 10 < n < 20). A training dataset size of 20 patients was therefore selected for our application as an appropriate balance between model accuracy and practicality.

Modelling

Predictive Features
Geometric anatomical variables relating to ROIs were chosen as predictive factors in line with previous work [5,6,41,42]. Features included volumes of ROIs, distances between ROIs and other variables such as volume ratios. A summary of the selection can be found in Table 5. Over 100 features were initially extracted and data cleaning performed to ensure robustness during modelling [43], better modelling performance (reduction in type I and II errors) [44] and computational efficiency [45]. Data cleaning involved eliminating incomplete features, removing zero-variance features (e.g., all zeros) and removing those with low variance. No features were removed for missing data as the data were fully homogeneous, and no variables were considered low variance. Only variables with zero variance were removed, and all remaining variables did not differ from a standard normal distribution. Collinearity between variables can lead to modelling bias [46], so associations between features were also explored and a subset of the master set of features defined. For any two features with a Pearson correlation coefficient of 0.85 or higher, one of the two features was randomly removed. A value of 0.85 was considered a reasonable cut-off and is in line with other studies in the general ML literature [47][48][49]. Two feature datasets are therefore defined: the full set of cleaned features (FeatureDS1) and a subset of FeatureDS1 containing uncorrelated features (FeatureDS2).
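The correlation-based subset construction described above can be sketched as follows. This is a simplified stand-in for the study's pipeline, not its actual code; the function and variable names are illustrative:

```python
import numpy as np

def drop_correlated(X, names, threshold=0.85, seed=0):
    """Randomly remove one feature from every pair whose absolute Pearson
    correlation meets the threshold, mirroring the FeatureDS2 construction."""
    rng = np.random.default_rng(seed)
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = list(names)
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            a, b = names[i], names[j]
            if a in keep and b in keep and corr[i, j] >= threshold:
                keep.remove(a if rng.random() < 0.5 else b)
    cols = [names.index(n) for n in keep]
    return X[:, cols], keep

# Toy example: 'y' is exactly 2*'x', so one of that pair is dropped
X = np.array([[1.0, 2.0, 4.0], [2.0, 4.0, 1.0],
              [3.0, 6.0, 3.0], [4.0, 8.0, 2.0]])
X_sub, kept = drop_correlated(X, ["x", "y", "z"])
```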

Modelling Overview
For an overview of the modelling process, see the flowchart in Figure 2. ML solutions were built on the training dataset using FeatureDS1 and FeatureDS2. Code was written in Python 2.7 and packages used were sourced from the Scikit-learn version 1.0.2 library (SKlearn). Model formations for regression varied in terms of the feature dataset used (FeatureDS1 or FeatureDS2), the number of features in the model (up to five for raw and 20 for PCA features) and the degree of the regression equation (linear, quadratic or cubic fit). Cluster models could vary in the number of clusters defined and the feature dataset used to define them. Models for every different formation combination were built for comparison and final models chosen from among them using a leave-one-out cross validation (LOOCV) approach. In all cases, 'left-in' patient features were scaled to a mean of zero and standard deviation of 1 using the SKlearn StandardScaler package. This ensured consistency and uniformity during modelling. The left-out patient was scaled according to the left-in data before prediction. Two approaches were explored for each model type: (1) modelling with raw features (not reduced by Principal Components) and (2) modelling with Principal Components. The SKlearn Decomposition package was used for PCA transformation of FeatureDS1. Principal Component generation was performed on left-in patients and the same transformation applied to the left-out patient prior to prediction.
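The LOOCV procedure with left-in-only scaling can be sketched as below, using a toy linear model in place of the study's candidate formations; all data and the response are synthetic stand-ins:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                  # 20 training patients, 5 features
y = 2.0 * X[:, 0] + rng.normal(0.0, 0.1, 20)  # stand-in PG weighting factors

sq_errors = []
for train, test in LeaveOneOut().split(X):
    # Scaler is fitted on left-in patients only, then applied to both sets
    scaler = StandardScaler().fit(X[train])
    model = LinearRegression().fit(scaler.transform(X[train]), y[train])
    pred = model.predict(scaler.transform(X[test]))
    sq_errors.append(float((pred[0] - y[test][0]) ** 2))

loocv_mse = float(np.mean(sq_errors))  # quality score for this model formation
```

Fitting the scaler (and, where used, the PCA transform) only on the left-in patients avoids leaking the left-out patient's statistics into the model.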
It is not known a priori which models will be most appropriate for clinical use. Therefore, an evaluation of all candidate models was implemented using the mean squared error (MSE) between predicted values and MCO gs during LOOCV as a quality score. The model minimising MSE was selected. Once the model was identified, it was retrained using the entire training dataset (i.e., no patients left out) to create the final ML solution. This solution could then be used to generate patient-specific AutoPlan protocols for novel patients, with their features transformed with respect to the training dataset's scaling and PCA parameters (where applicable) before prediction, as was the case with the left-out patients during model selection. The patient-specific protocol was then used as the input for the PGAP solution for patient-specific AP (PGAP-ML reg or PGAP-ML clus).

Regression
Regression is a least-squares machine learning method that uses one or more independent continuous variables to define a continuous model whose parameters minimise the squared error with respect to the raw data. Two approaches were explored for regression modelling: (1) modelling using combinations of raw features within FeatureDS2 (reg-raw) and (2) forward selection using Principal Components generated using FeatureDS1 (reg-PCA). In all cases the same method was followed, with regressions built using the SKlearn Linear Model and Preprocessing algorithms. Linear and polynomial regression models were explored in line with the literature [5,37,50,51] and preliminary research. Modelling and prediction were performed for each PG individually.
As raw features are not ordinal, all possible combinations of features (feature sets) were considered in the reg-raw approach. To limit the search space, up to a maximum of five features were allowed within a feature set. With over 100,000 possible feature sets, a separate 'feature set selection' step was performed prior to model selection to identify the optimum feature set per model formation. The methodology involved identifying the feature set with the largest mean adjusted R² under each model formation. Although MSE could have been used to define this optimum feature set, it would increase computational demand due to the additional calculations required and was considered impractical.
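The size of this search space, and the adjusted R² score used to rank feature sets, can be reproduced directly. This sketch assumes the 27 FeatureDS2 features reported in the Results; the helper function is illustrative:

```python
from math import comb

# Number of candidate feature sets from 27 FeatureDS2 features, capped at
# five features per set: 27 + 351 + 2925 + 17550 + 80730 = 101583,
# matching the "over 100,000" quoted in the text.
n_sets = sum(comb(27, k) for k in range(1, 6))

def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 penalises additional predictors, so feature sets of
    different sizes can be compared fairly."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)
```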
As Principal Components are ordinal, in the reg-PCA approach the dataset was transformed to Principal Components and models generated using forward selection, i.e., the first Principal Component (PC1) was used for all one-feature models, PC1 and PC2 for all two-feature models and so on, up to the maximum of 20 features. For both approaches (reg-raw and reg-PCA), models explored were linear, quadratic and cubic, i.e., 15 model formations for reg-raw and 60 for reg-PCA. The performance of each model formation was assessed using LOOCV and one of these 75 model formations was chosen as the optimal formation.
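A minimal sketch of the reg-PCA forward selection loop, using synthetic data and five rather than 20 components for brevity. The SKlearn class names match those used in the study's stack; everything else (data, response, variable names) is illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 30))  # stand-in cleaned feature matrix (FeatureDS1)
y = rng.normal(size=20)        # stand-in weighting factors for one PG

# Components are ordinal, so forward selection simply appends PC_n
pcs = PCA(n_components=5).fit_transform(X)

formations = []
for n_pc in range(1, 6):
    for degree in (1, 2, 3):   # linear, quadratic, cubic fits
        design = PolynomialFeatures(degree).fit_transform(pcs[:, :n_pc])
        model = LinearRegression().fit(design, y)
        formations.append((n_pc, degree, model.score(design, y)))
```

With five components this enumerates 15 formations; extending `n_components` to 20 gives the 60 reg-PCA formations described above.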

Clustering
Clustering refers to any class of unsupervised machine learning methods that group data points together such that the differences between variables within a cluster are minimised and therefore smaller than the differences observed with data points outside of that cluster. K-means clustering is one such technique. Cluster centroids are defined based on the mean across the data points on each axis in a cluster.
K-means clustering was facilitated by the SKlearn Cluster package. The two approaches considered were: (1) clustering over FeatureDS2 (clus-raw) and (2) clustering over Principal Components of FeatureDS1 (clus-PCA). Training patients were clustered over all data available using a random initial state of 42 and 300 maximum iterations, with all possible values of K considered (i.e., 20 models). Validation patients were assigned to a cluster based on the cluster centroid that minimised the Euclidean distance. The mean weight over the training patients was considered the prediction weight for unseen patients assigned to that cluster.
To aid the analysis of cluster performance, two metrics were calculated for each model formation: (1) the sum of the squared differences between each point and its cluster centroid (SSE) and (2) a silhouette coefficient, a value between −1 and 1 that scores the goodness-of-fit of each formation based on average inter- and intra-cluster distances. SSE values close to zero and silhouette scores close to 1 indicate models that are well defined.
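A minimal sketch of the clustering workflow and its two diagnostics, using synthetic data. The random state of 42 and 300 maximum iterations match the text; all data are stand-ins, and k starts at 2 because the silhouette coefficient is undefined for a single cluster:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 4))  # stand-in scaled features, 20 training patients

results = {}
for k in range(2, 6):
    km = KMeans(n_clusters=k, random_state=42, max_iter=300, n_init=10).fit(X)
    # SSE (inertia_) and silhouette score for this formation
    results[k] = (km.inertia_, silhouette_score(X, km.labels_))

# A validation patient is assigned to the nearest centroid
new_patient = rng.normal(size=(1, 4))
label = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X).predict(new_patient)
```

The predicted weight for the new patient would then be the mean training weight of the cluster indexed by `label`.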

Validation and Statistical Analysis
All patients in the validation dataset were planned according to the four approaches: MCO gs, PGAP std, PGAP-ML reg and PGAP-ML clus. For the purposes of analysis, all weighting factors, which carry little intrinsic value on their own, were converted to relative weights (expressed as a percentage) through division by the summed weight of all PGs. For the validation cohort, the difference between the modelled relative PG weights and gold standard (MCO gs) relative PG weights was the primary metric used to assess model quality, with MSE additionally calculated to aid in the comparison with training results. Plans were dosimetrically compared against MCO gs using pairwise two-sided Wilcoxon signed-rank tests, with dose metrics of interest adapted from the UK PIVOTAL trial [38]. PTV homogeneity index (HI) and Paddick's conformality index (CI) were also calculated for the analysis [52]. Outliers were defined as values outside of the range [Q1 − (1.5 × IQR), Q3 + (1.5 × IQR)], where Q1, Q3 and IQR are quartile 1, quartile 3 and the inter-quartile range (Q3 − Q1), respectively.
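The relative-weight conversion and the outlier rule can be sketched as two small helpers; the function names are illustrative:

```python
import numpy as np

def relative_weights(weights):
    """Express raw PG weighting factors as percentages of the summed weight."""
    w = np.asarray(weights, dtype=float)
    return 100.0 * w / w.sum()

def iqr_outliers(values):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    v = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(v, [25, 75])
    iqr = q3 - q1
    return (v < q1 - 1.5 * iqr) | (v > q3 + 1.5 * iqr)
```

For example, raw weights of 2, 1 and 1 become relative weights of 50%, 25% and 25%, so weight vectors from different methods can be compared on a common scale.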

Predictive Features
FeatureDS1 contained 139 features: 23 volumetric, 14 spatial and 102 derived. The data therefore contained 139 columns (number of features) and 20 rows (number of patients) and, when transformed by PCA, reduced to 20 Principal Components. The first Principal Component accounted for 46.5% of the variance in FeatureDS1 and the first 11 accounted for over 95% of the combined variance.
Of the features in FeatureDS1, 27 were chosen for FeatureDS2: 11 volumetric, 5 spatial and 11 derived. Of the 112 features excluded from FeatureDS2, 45 were correlated with one of the kept features, 58 with two features and 9 with three features. For a comprehensive list of features in FeatureDS2, see Supplementary File S1.

Regression
See Table 6 for a summary of the LOOCV, associated feature sets and performance following training across all 20 training patients. PCA features were not found to minimise MSE for any PG, therefore all chosen models use raw features. Of the 27 features considered during modelling, 16 were among the final models. Of these 16, 8 were volumetric, 3 spatial and 5 derived. The spatial feature 'distance from the centre of PTV48 to the centre of the rectum' was used in four of the six PG models. Six features were each present in two PG models, including 'volume of the rectum', 'volume of PTV48', 'distance from the centre of the bladder to the centre of the rectum', 'rectum VIF PTV48', 'ratio of the bladder to the rectum' and 'slope between OV rectum,PTV48 0.2cm and OV rectum,PTV48 0.4cm'. The model formations were strong for PTV dose falloff, rectum D max and bladder D max, with mean adjusted R² greater than 0.990. The quality of the models was reduced, but still adequate, for rectum D mean, bladder D mean and PTV conformality, with R² between 0.835 and 0.907.

Clustering
See Table 7 for a summary of cluster model performance. A single cluster yielded the optimal model for one PG: intra-PTV dose falloff. Therefore, a single value was defined for all patients for this PG and feature datasets were not essential in generation of the predicted weight. Silhouette and SSE values suggest clusters had high degrees of dispersion within clusters and/or low variance between clusters. However, compared to regression, validation MSE values were overall more desirable. PCA feature types were selected for 3 out of 6 PGs.

See Table 8 for an overview of relative weight calibrations across the validation dataset (see Supplementary File S1 for an illustration). Statistically significant differences (at the 95% level) were observed between MCO gs and the three alternative methods (PGAP std, PGAP-ML reg and PGAP-ML clus) for three PGs: rectum D mean, rectum D max and PTV dose falloff. For PGAP std and PGAP-ML reg, significant differences were also observed for bowel V 36.0Gy and V 45.6Gy and the higher priority P1 and P2 goals (PG higher). Differences were generally small (<3.58%), with PGAP-ML clus closest to MCO gs on average, with differences <1.17%. Mean differences from MCO gs were also closest to zero for PGAP-ML clus for six of the eight PGs. PGAP-ML reg was the poorest performer overall, with deviations <2.49% and 3.57% for PTV conformality and PG higher, respectively. Figure 3 illustrates relative weight deviations from MCO gs at a per-patient level for all three methods. In general, per-patient deviations were considered small, with maximum deviations of 7.39%, 9.88% and 5.58% for PGAP std, PGAP-ML reg and PGAP-ML clus, respectively. PGAP-ML reg was considered the poorest performer of the three methods given the largest range and inter-quartile range differences from MCO gs in all cases. PGAP std and PGAP-ML clus were considered highly comparable.
In terms of outliers, patient 25 was the most noteworthy, with outlying values in six cases: bladder D max (PGAP-ML clus only), intra-PTV dose falloff (PGAP std and PGAP-ML clus) and rectum D max (all methods). MCO gs absolute weights for rectum and bladder D max and intra-PTV dose falloff were lower for patient 25 than for any patient in the training dataset, and this was considered the likely underlying cause. Patient 36 was also a notable outlier, with outlying values in five cases: bladder D mean and D max (PGAP-ML clus) and rectum D max (all methods). Patient 36 had the largest bladder volume in the validation set, with a volume 1.36 times the maximum volume in the training dataset. Patient 24 had outlying values for bladder D mean and intra-PTV dose falloff for PGAP-ML clus, which were attributed to OV bladder,PTV60 being the largest in the validation dataset (1.97 times the median training value) and the absolute value for PTV dose falloff lying outside the range defined by the training dataset (which bounds PGAP-ML clus weight predictions). Finally, outlying values were observed for patient 21 in three cases: bladder D mean (PGAP-ML clus) and rectum D mean (PGAP std and PGAP-ML clus). This patient had the smallest OV rectum,PTV60 but also had rectum and bladder D mean weights outside the training dataset range. Other patients with 1-2 outliers (patients 27, 30, 31, 38, 39 and 40) were, as above, identified as anatomically atypical or as cases where the validation MCO gs weights were out of range of the training values.

Dosimetry
See Table 9 for a dosimetric summary of MCO gs against the three AP solutions. Furthermore, see Figure 4 for an illustration of dosimetric differences from MCO gs for key dose-related metrics for each patient in the validation dataset and Figure 5 for example dose distributions for each solution. At a population level, all three methods provided excellent correspondence with MCO gs, with deviations either not statistically significant (at the 95% level) or of a small magnitude. For example, PTV coverage and hotspot metrics were within 0.28 Gy of MCO gs, and OAR objectives within 0.66% and 0.34 Gy for volume and dose metrics, respectively. The most noticeable statistically significant deviation was CI PTV48 for PGAP std, which improved on MCO gs by +0.01; however, this was not considered a clinically significant difference.
At a patient level, deviations in the performance of the three methods were observed, with PGAP std and PGAP-ML clus highly comparable and PGAP-ML reg the poorest performer. For PGAP-ML reg, the most notable deviations from MCO gs were for PTV48 D50%, CI PTV60, CI PTV48, rectum D mean and bladder D mean, with deviation ranges of [−1.05, 1.22] Gy, [−0.114, 0.0378], [−0.172, 0.0925], [−2.05, 1.93] Gy and [−1.52, 2.03] Gy, respectively. For PGAP std and PGAP-ML clus, bladder deviations were similar to PGAP-ML reg but substantially reduced for other metrics, with PTV48 D50%, CI PTV60, CI PTV48 and rectum D mean deviations less than ±0.60 Gy, ±0.02, ±0.06 and ±1.18 Gy, respectively. In general, for PGAP-ML clus and PGAP std, deviations from MCO gs across all patients were considered small and likely not of clinical significance. In terms of individual outliers, there was low correspondence with those identified in the weight analysis. In the weight analysis, patients 21, 24, 25 and 36 were identified as notable outliers, with a total of 16 outlier weights across the three techniques. In the dosimetric analysis, only 3 of these weights corresponded to dosimetric outliers: patient 24 bladder D mean for all techniques and patient 21 rectum D mean for PGAP-ML clus.

Discussion
In our previous work we developed a PGAP solution (built on a PBAIO framework) that utilised a single 'one size fits all' AP protocol for all patients in a given treatment site. The approach was evaluated against traditional trial-and-error (TAE) manual planning and considered non-inferior. This study builds upon that work in two key ways. Firstly, we introduced ML upstream of the PBAIO AP algorithm to develop a novel hybrid KBP-RBP planning approach, where ML is utilised to generate fully bespoke AP protocols for individual patients. Secondly, PGAP std, PGAP-ML clus and PGAP-ML reg were evaluated against a Pareto navigated gold standard, rather than traditional TAE manual planning, which is prone to sub-optimality [53]. In this regard, the efficacy of each automated approach could be comprehensively assessed.
Plans generated from this novel approach and plans generated via PGAP std were compared to a Pareto navigation gold standard (MCO gs). All approaches yielded plans acceptable for clinical use and, at a population level, demonstrated excellent congruence with MCO gs. At an individual patient level, PGAP-ML reg was considered the weakest solution, due to algorithms being influenced by anatomical outliers. Both PGAP std and PGAP-ML clus yielded very good agreement with MCO gs across all patients, with PGAP-ML clus considered marginally superior due to fewer extreme outliers.
The ML techniques used in this work are not new to radiotherapy planning. PCA [5], regression [5,50] and clustering [54] have all been used in KBP to make predictions based on anatomical features, with notable success. This work builds upon this knowledge in two ways. Firstly, previous ML implementations typically sought to generate a patient-specific input to a native treatment planning optimiser. In contrast, this novel approach aimed to generate patient-specific AP protocols to further personalise an already validated RBP solution. Secondly, we present a methodology to evaluate the performance of different model formations using a LOOCV decision framework, such that the optimal model for a given site can be selected. This allowed an automatic and unbiased choice among models comprising various feature sets, feature types and model types. This approach helps to resolve the challenge of defining an ML formation prior to training and allows bespoke architectures to be utilised for individual PGs, thus removing the requirement for a homogeneous ML approach, which may not be appropriate. Results of this study support this assertion, with different model formations selected during the LOOCV model selection process.
ML in this work relied on a dataset of numerical geometric information derived from delineated patient anatomy. Whilst this methodology is based on previous KBP work, inclusion of other features may improve the versatility and modelling accuracy of the developed approach. A promising method would be utilisation of neural-network-generated features, which have been implemented successfully for dose prediction [55,56]. Neural networks could be utilised to directly generate patient-specific AP protocols or used in a two-step approach to generate dosimetric features (rather than anatomical features) from which PG weights are derived [57]. However, as plan generation is a geometry-based optimisation problem, modelling wholly on anatomy-based features may hold intrinsic value, as such features can be interpreted and therefore reduce the risk of developing an automated planning 'black box'.
The largest variances in difference from MCO gs for both input parameters (weights) and output metrics (dose distribution) were observed for PGAP-ML reg. This is thought to be related to the size and composition of the training dataset not adequately representing the patient population. PGAP std and PGAP-ML clus were more robust to the limited dataset size, with small deviations from MCO gs observed for outlier patients. Given that regression allows predictions to be extrapolated beyond the bounds defined by the training dataset, the increased robustness of PGAP std and PGAP-ML clus compared to PGAP-ML reg is thought to be due to their prediction weights being bounded by the training data. For outlier patients, PGAP-ML reg could therefore lead to inconsistent or spurious predictions. As generating the ground-truth training data is time consuming, curation of a suitably large dataset for accurate regression modelling may be challenging, especially for busy radiotherapy clinics. Therefore, these results indicate PGAP-ML reg may not be the ML approach best suited to routine clinical application. Across the three methods, PGAP-ML clus was considered the most comparable to MCO gs, based on the number of significant differences observed following Wilcoxon testing, the magnitude of dose differences and the smaller number of outliers observed. However, the superiority of PGAP-ML clus over PGAP std was considered marginal. As PGAP std is equivalent to PGAP-ML clus when K = 1, these results indicate that for the majority of patients individualisation via clustering may not be necessary if a simple site-specific protocol based on an average weight is implemented. However, marginal improvements may be gained when using PGAP-ML clus for patients who are anatomical outliers, most likely for ROIs where large anatomical variances are common, such as bladder and patient outline ROIs.
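The contrast between unbounded regression extrapolation and bounded cluster-mean prediction can be illustrated with a toy example (synthetic data, scikit-learn assumed; not the paper's actual models). For a patient whose anatomy lies far outside the training range, the regression weight extrapolates out of the training weight range, whereas the cluster-mean weight cannot.

```python
# Toy illustration: regression extrapolates beyond the training weight range
# for an anatomical outlier, while a cluster-mean prediction stays bounded.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, (20, 1))     # one anatomical feature, 20 patients
w = 0.3 + 0.4 * X[:, 0]                # MCO-derived PG weights, range [0.3, 0.7]

reg = LinearRegression().fit(X, w)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_mean = np.array([w[km.labels_ == c].mean() for c in range(2)])

x_outlier = np.array([[2.5]])          # anatomy far outside the training range
w_reg = float(reg.predict(x_outlier)[0])                 # extrapolated weight
w_clus = float(cluster_mean[km.predict(x_outlier)[0]])   # bounded by training data
print(w_reg, w_clus)
```

Here `w_reg` exceeds the largest training weight, while `w_clus` is a mean of training weights and so always remains in range, mirroring the robustness argument above.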
A key strength of this study was that training and evaluation were performed with plans generated using an a posteriori multicriteria optimisation methodology (MCO gs), which we consider to be a gold standard in patient-specific plan generation [22]. This contrasts with the majority of KBP training approaches and AP comparative studies in the literature, which use manual plans generated with TAE [29,58,59]. Our ML models and study results are therefore not confounded by unwarranted variation or sub-optimality of plans within the training and validation datasets, which are known issues associated with TAE manual planning [53]. Across all three methodologies, at a population level there was excellent correspondence with MCO gs, with all volume and dose metrics within ±0.66% and ±0.34 Gy, respectively. In terms of trade-off balancing, PGAP-ML reg and PGAP std led to a marginal reduction in PTV48 D98% (0.17 and 0.28 Gy, respectively), resulting in a corresponding minor reduction in rectum V40.5Gy and V48.6Gy (0.3-0.4%). This was considered a clinically insignificant difference. No other trade-off differences were observed. In terms of individual patients, PGAP-ML clus and PGAP std yielded plans with high correlation to the gold standard MCO-generated comparator (MCO gs). The correlation was weaker for PGAP-ML reg, which, as discussed, was attributed to the small training dataset size. Results provide strong evidence that PGAP std (built on a PBAIO AP framework) generates individualised plans, even when a site-specific protocol is utilised. This is an important finding, not only in validating the use of PGAP std for prostate cancer, but also in providing evidence that a posteriori multicriteria optimisation yields minimal benefits over AP in terms of the individuation of patient plans. In terms of the utility of patient-specific protocols, whilst PGAP-ML clus and PGAP-ML reg did not yield marked improvements, anatomical variances were shown to be an important factor in the prediction of weights during training. For example, regression models yielded R 2 values > 0.83, with reasonable MSE during LOOCV. This suggests ML may yield improvements over PGAP std where larger anatomical variations cause the optimality of the PBAIO framework to break down, as has been demonstrated in the application of Pinnacle 3 Auto-Planning for lung [28] and nasopharynx [29], where poor quality planning was associated with anatomical outliers.
Whilst training and validating using MCO gs was a major strength of this work, due to the resource-intensive nature of generating these ground-truth plans the size of the training dataset was constrained to 20 patients. This represents a key weakness in the approach, resulting in weak associations between training and validation MSE and, as discussed, the poor performance of PGAP-ML reg for outlier patients where weights were generated via extrapolation. However, despite this weakness, agreement with MCO gs was very good across all methods. It was therefore considered that training and validating on small, high-quality datasets was preferable to using large, low-quality manually generated datasets, where variation in plan quality could lead to poor models and/or spurious validation results. To improve the efficacy of training on small datasets, a potential solution is to actively select a cohort of patients that suitably samples the extent of variation in the population (including outliers). This contrasts with the random selection approach taken in this work, which does not explicitly screen for outlier geometries to model on.
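One way such active cohort selection could be realised (an assumption on our part, not a method from this study) is greedy farthest-point selection on the anatomical feature vectors, which preferentially includes patients most dissimilar from those already chosen and thus captures outliers:

```python
# Sketch (illustrative assumption): greedy farthest-point selection of a
# training cohort that spans anatomical feature space, including outliers.
import numpy as np

def farthest_point_cohort(features, n_select, seed=0):
    """Greedily pick patients maximising distance to the selected set."""
    rng = np.random.default_rng(seed)
    idx = [int(rng.integers(len(features)))]       # arbitrary first patient
    d = np.linalg.norm(features - features[idx[0]], axis=1)
    while len(idx) < n_select:
        nxt = int(np.argmax(d))                    # most dissimilar remaining patient
        idx.append(nxt)
        d = np.minimum(d, np.linalg.norm(features - features[nxt], axis=1))
    return idx

# 38 typical patients plus two anatomical outliers (indices 38 and 39)
feats = np.vstack([np.random.default_rng(3).normal(0.0, 1.0, (38, 3)),
                   [[6.0, 6.0, 6.0], [-6.0, -6.0, -6.0]]])
cohort = farthest_point_cohort(feats, 5)
print(sorted(cohort))
```

Both synthetic outliers are picked within the first few selections, in contrast to random sampling, which would usually miss them in a small cohort.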
In terms of similar studies, the most relevant are those assessing the modelling performance of KBP solutions for prostate cancer. For DVH prediction using the commercial KBP system RapidPlan (Varian, Palo Alto), Cagni et al. [31] demonstrated that even when trained using a set of Pareto optimal plans, clinically relevant prediction errors were observed. Specifically, for rectum and bladder, errors in mean dose of up to 6 Gy (7.7% of the prescribed dose of 78 Gy) and 5 Gy (6.4% of 78 Gy) were observed, respectively. In our study, rectum and bladder mean dose errors were <2.0 Gy (3.3% of 60 Gy) across all three methods. In terms of KBP via objective weight prediction, Boutilier et al. [8] presented a dosimetric assessment of logistic regression and k-nearest neighbour models. Performance of the models was similar, with 95th percentile errors in volume dose metrics of 1.5% and 3.5% for bladder V88% and V68%, respectively, and 2% and 4.5% for rectum V88% and V68%, respectively. In our study, equivalent metrics were all ≤1.5% for both rectum and bladder. The performance of all three of our approaches is therefore considered very good in the context of previous work and highlights the effectiveness of the PBAIO framework in yielding bespoke plans, even without utilising ML for personalised protocols.
In this study, the absolute weights generated during MCO gs calibration were modelled, and each PG was considered individually with its own optimal model defined. This made performing regression and clustering straightforward and helped to identify anatomical features that are important considerations when optimising a given trade-off. An intuitive alternative approach may have been to use a multi-output ML technique, such as multi-output regression or deep learning, to predict not only PG weights but also relative PG weights. Such an approach is potentially more generalisable, as weights are strongly relative in plan optimisation. An additional improvement would be to replicate these results with larger patient datasets. This would lead to greater statistical power and minimise the discrepancies in model performance between the calibration and validation cohorts, which were observed for PGAP-ML reg. Inclusion of more expert observers could lead to a definition of MCO gs with even better congruence with clinical preferences. Finally, repeating the study on a more heterogeneous patient dataset (e.g., head and neck cancer) may yield substantially different results. In this study, MCO gs and PGAP std were highly […]

[Feature definitions (recovered table fragment): 'ROI1' VOF 'PTV1' — total volume of a region-of-interest (ROI1) above the most superior slice and below the most inferior computed tomography slice of a PTV (PTV1), measured in cm 3. 'ROI1' VIF 'PTV1' — volume of a region-of-interest (ROI1) within the most superior and most inferior computed tomography slices of a PTV (PTV1), measured in cm 3; e.g., rectum VIF PTV48 is the rectum volume within slices containing PTV48. Slope — rate of change, i.e., change in y ÷ change in x.]
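As a small illustration of the 'slope' derived feature used in the regression models (e.g., the slope between OV rectum,PTV48 0.2cm and OV rectum,PTV48 0.4cm), the computation is a simple rate of change between two overlap-volume measurements. The function below is our sketch; the interpretation of the 0.2 cm and 0.4 cm values as expansion distances is an assumption based on the feature names, and the example volumes are hypothetical.

```python
# Sketch of the 'slope' derived feature: rate of change between two
# overlap-volume (OV) measurements taken at different distances.
def ov_slope(ov_a_cm3, ov_b_cm3, dist_a_cm=0.2, dist_b_cm=0.4):
    """Change in overlap volume per cm (cm^3 per cm): Δy ÷ Δx."""
    return (ov_b_cm3 - ov_a_cm3) / (dist_b_cm - dist_a_cm)

# Hypothetical rectum overlap volumes at 0.2 cm and 0.4 cm
print(ov_slope(12.0, 18.0))
```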

Figure 1 .
Figure 1. An outline of how the KBP-RBP process defined in this work (bottom) differs from the more classic site-specific methods (top). Classic approaches define a single site-specific template applied to all treatment cases. In this work, an ML approach is defined that achieves a patient-specific planning template based on patient anatomy.

Figure 3 .
Figure 3. Plots showing relative weight difference from MCO gs for the validation dataset. Bar charts are ordered patient 21-40 and box plots represent the overall distributions.

Figure 4 .
Figure 4. Plots showing absolute difference from MCO gs. Distributions are across the validation dataset of key dose-related metrics for each of the three calibration techniques.

Table 2 .
Priority 2-Target Goals. Target represents percentage of PTV prescription dose.

Table 3 .
Target represents dose in Gy (i.e., D mean and D max) or percentage of ROI volume (i.e., V 36.0Gy and V 45.6Gy). Targets are automatically adjusted during optimisation (via the PBAIO algorithms) to ensure PGs are minimised. Therefore, initial values have negligible influence on the final plan but may decrease planning time if correctly defined.

Table 4 .
Priority 3-Trade-off Goals (Dose Fall Off). Dose Gradient represents the percentage of the overall treatment prescription.

Table 5 .
Summary of variables considered for FeatureDS1 and FeatureDS2. Features fall into three categories: volume related (volumetric), distance related (spatial) and derivations of volumetric and/or spatial features (derived). Variants are denoted where multiple features of their kind are generated.

Table 6 .
Summary of PGAP-ML reg model formations defined during the automated leave-one-out process.

Table 7 .
A summary of final PGAP-ML clus models defined using the training dataset.

Table 8 .
Summary of PG relative weights. Values are mean averages across the validation dataset ± one standard deviation. Boldface indicates statistically significant differences from MCO gs at the 95% level.

Table 9 .
Summary of key dose metrics. Values shown are mean ± 1 standard deviation. Statistical difference at the 95% level of significance is indicated by boldface.