Clinical Use of a Commercial Artificial Intelligence-Based Software for Autocontouring in Radiation Therapy: Geometric Performance and Dosimetric Impact

Simple Summary
Autocontouring driven by artificial intelligence can improve the workflow of radiotherapy by accelerating the contouring process. However, quality assurance of artificial intelligence-based tools is necessary to ensure safety and efficacy in clinical practice. In this study, we investigated the geometric accuracy of structure contours created by a commercial artificial intelligence-based autocontouring software using well-established metrics. In particular, the impact of adopting artificial intelligence-generated contours on the quality of the radiotherapy treatment plan was investigated. Our results show that the combination of automatically generated contours and careful review by a clinical radiation oncologist saves time without affecting the quality of the treatment plan. In conclusion, after quality checks that cover both geometric accuracy and dosimetric impact, contouring based on AI can be safely adopted in clinical practice.

Abstract
Purpose: When autocontouring based on artificial intelligence (AI) is used in the radiotherapy (RT) workflow, the contours are reviewed and, if necessary, adjusted by a radiation oncologist before an RT treatment plan is generated, with the purpose of improving dosimetry and reducing both interobserver variability and time for contouring. The purpose of this study was to evaluate the results of applying a commercial AI-based autocontouring system for RT, assessing both geometric accuracy and the influence on the optimized dose of automatically generated contours after review by a human operator.
Materials and Methods: A commercial autocontouring system was applied to a retrospective database of 40 patients, of which 20 were treated with radiotherapy for prostate cancer (PCa) and 20 for head and neck cancer (HNC). Contours resulting from AI were compared against AI contours reviewed by a human operator and against human-only contours using Dice similarity coefficient (DSC), Hausdorff distance (HD), and relative volume difference (RVD). Dosimetric indices such as Dmean, D0.03cc, and normalized plan quality metrics were used to compare dose distributions from RT plans generated from structure sets contoured by humans assisted by AI against plans from manual contours. The reduction in contouring time obtained by using automated tools was also assessed. A Wilcoxon rank sum test was computed to assess the significance of differences. Interobserver variability of the comparison of manual vs. AI-assisted contours was also assessed between two radiation oncologists for PCa.
Results: For PCa, AI-assisted segmentation showed good agreement with expert radiation oncologist structures, with average DSC among patients ≥ 0.7 for all structures, and minimal adjustment of structures by the radiation oncologist (DSC of adjusted versus AI structures ≥ 0.91). For HNC, results of the comparison between manual and AI contouring varied considerably (e.g., 0.77 for oral cavity and 0.11–0.13 for brachial plexus), but again, adjustment was generally minimal (DSC of adjusted against AI contours 0.97 for oral cavity, 0.92–0.93 for brachial plexus). The differences in dose for the target and organs at risk were not statistically significant between human and AI-assisted contours, with the only exceptions of D0.03cc to the anal canal and Dmean to the brachial plexus. The observed average differences in plan quality for PCa and HNC cases were 8% and 6.7%, respectively. The dose parameter changes due to interobserver variability in PCa were small, with the exception of the anal canal, where large dose variations were observed. The reduction in time required for contouring was 72% for PCa and 84% for HNC.
Conclusions: When an autocontouring system is used in combination with human review, the time of the RT workflow is significantly reduced without affecting dose distribution and plan quality.


Introduction
Radiation therapy (RT) is considered as an alternative to surgery for early-stage cancer, whereas locally advanced cancer is mostly treated in conjunction with surgery and systemic radiation therapies according to the patient's age and comorbidities [1][2][3]. Improper delineation of the target volume and organs at risk (OARs) can affect the quality of the dose distribution designed during planning of the RT treatment. As a consequence, inadequate target coverage or normal tissue sparing may occur, resulting in reduced tumor control or an increased probability of side effects [4]. Traditionally, tumor volumes and OARs are manually contoured by radiation oncologists. This is a laborious procedure that is subject to both intra- and interobserver variability [5]. In this scenario, automatic contouring methods can reduce the clinical workload and improve the reproducibility of RT. In the contouring workflow, the automatic contours serve as a starting point, which is reviewed and, if necessary, manually edited before being sent to the treatment planning system.
Atlas-based contouring [6], statistical models of shape and appearance [7], artificial intelligence-based methods [8], and hybrid strategies are a few examples of the automated contouring techniques that have been introduced and developed with promising outcomes. The spread of artificial intelligence (AI) is impacting the workflow of RT treatment in several scenarios [9], and AI-based autocontouring software has been developed and made available to oncologists to optimize the contouring process [10]. A question that arises is whether the automated contours are of sufficient quality for clinical use, which can be answered only after effective validation, that is, evaluation of accuracy and reliability. The existing literature indicates that contour evaluation is performed mostly at the geometric level [11][12][13] using common geometric metrics, including moment-based methods, overlap metrics, and distance-based measures [14]. However, geometric metrics alone do not necessarily reflect the actual clinical impact of the contour differences [11][12][13]. Treatment dosimetry, plan quality, and associated clinical decision-making processes are directly influenced by the accuracy of contoured regions, and the impact of geometric agreement on the dose domain and plan quality remains to be fully investigated [14][15][16][17].

Research Objectives
With the capability of automatically providing contours that can be used to generate clinically acceptable plans, commercial tools for automated segmentation can reduce treatment planning time substantially. The objective of this study was to investigate the accuracy of structure contours generated by commercial autocontouring software. We also investigated the dose distributions of the treatment plans generated from autocontoured structure sets.

Patient Data
After approval from the institutional review board of Centro di Riferimento Oncologico (CRO), 40 patients treated at CRO Aviano from September 2017 to June 2022 were selected retrospectively for this study. A total of 20 had been treated for prostate cancer (PCa) and 20 for head and neck cancer (HNC). The PCa patients' preparation before CT acquisition included full bladder and empty rectum. Patients with bilateral hip implants or a rectal spacer were not included in the study. Patients with HNC required no preparation but were immobilized with a thermoplastic mask. No contrast was administered to any patient before CT image acquisition.
Cancers 2023, 15, 5735
Patients' CT scans for treatment planning were acquired using a 90 cm wide bore Toshiba Aquilion 16 CT simulator with 5 mm slice thickness for PCa and 2 mm for HNC. Images were reconstructed using the FC13 reconstruction algorithm with a 256 × 256 matrix.

Contouring Workflows
Target volume delineations are governed by international guidelines and the recommendations of scientific associations. The contoured structures for PCa patients included the entire prostate and its capsule, which represent the clinical target volume (CTV), as well as the organs at risk. The planning target volume (PTV) was created by expanding the CTV by a 5 mm margin in all directions except posteriorly, where the margin was 3 mm. For HNC, organs at risk were contoured automatically, while the CTV was not automatically contoured.
Contoured structures, excluding the PTV, were generated for each patient using three methods, as follows:
- Manual contouring (C man). Contours were delineated by a radiation oncologist with at least ten years of experience, also using semiautomated tools like flood fill and interpolation, within the integrated ARIA and Eclipse TPS systems (version: 16.1; Varian Medical Systems, Inc., Palo Alto, CA, USA) [18] and following the institutional guidelines [19][20][21]. These contours were assumed as the ground truth structures.
- Fully automated contouring based on artificial intelligence (C AI). These were automatically created using a research version of the Limbus Contour software (version: 1.0.18; Limbus AI Inc., Regina, SK, Canada) [22]. Limbus Contour (LC) employs organ-specific deep convolutional neural network models based on a U-net architecture [23], which were trained on CT images from the Cancer Imaging Archive public database [24]. Following the creation of contours, LC applies a number of postprocessing techniques, including outlier removal, slice interpolation, z-plane cutoffs, and contour smoothing [23]. Contouring the structure set on a patient required up to 7 min on 3.
For better reproducibility, manual contouring was performed by one radiation oncologist for each treated site, and interobserver variability was measured between two operators for PCa (Section 2.9).

Treatment Planning and Delivery
The radiotherapy plans that had been previously delivered to the patients were assumed as the reference for dose comparison. Structure sets for these treatments were manually contoured by radiation oncologists (ROs) using the institutional protocol for Radiotherapy Oncology Group. Treatment planning was performed using dose prescription and constraints for planning shown in Tables 1 and 2.
PCa treatments were delivered using the volumetric modulated arc therapy (VMAT) technique with one or two 18 MV full coplanar arcs, a maximum dose rate of 600 MU/min, and a prescribed dose of 60 Gy in 20 fractions of 3 Gy each. HNC patients received intensity-modulated radiation therapy (IMRT) treatments with nine 6 MV photon beam fields, a maximum dose rate of 300 monitor units (MU) per minute, and a prescribed dose of 70.95 Gy in 33 fractions of 2.15 Gy each. These plans were generated using the Eclipse planning system (Varian Medical). Dose calculations were performed using the anisotropic analytical algorithm (AAA) with a grid resolution of 2.5 mm [25]. The treatment schedule consisted of 5 daily fractions per week. The treatments were administered using a Varian TrueBeam or Trilogy linear accelerator. A cone-beam computed tomography image was acquired at the beginning of each treatment session for image-guided RT [26]. To evaluate the dose difference due to autocontouring in the planning workflow, treatment plans for the structure set C AI,adj were generated following exactly the same planning and optimization procedure as for the clinically used plans. Plans were exported from the treatment planning system in DICOM RT format for analysis. DICOM files were transferred to a high-performance computer interface for analysis with in-house MATLAB scripts.

Qualitative Assessment of Automated Contouring
An experienced clinician assessed the C AI for each patient using a four-point Likert scale, shown in Table 3, to evaluate the automated contouring process qualitatively. As such a test aims at distinguishing between AI and a human operator, it is sometimes referred to as a Turing test [27,28].

Geometric Evaluation
For target and OAR structures, the C man contours and the C AI contours before and after the physician review (C AI,adj) were compared using the metrics described below. For comparing AI- versus human-generated contours, we used different types of geometric metrics that are based on the distance between surfaces, the size of overlapping volumes, and the difference in size [29].
Dice similarity coefficient (DSC) provides a measure of the volumetric overlap of two contours of a structure, with a score ranging from 0 (no overlap) to 1 (complete overlap) [30]. For two contoured volumes $A$ and $B$:
$$\mathrm{DSC}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}$$
Hausdorff distance (HD) is a bidirectional measure of distance between contour surfaces [30]. This metric calculates the distance to the closest point in both directions, from contour C man to contour C AI,adj and vice versa, to determine the largest surface-to-surface separation between the two contours.
Relative volume difference (RVD), also known as relative absolute volume difference, describes the size difference between the regions in terms of V man and V AI,adj, the absolute volumes corresponding to the C man and C AI,adj contours, respectively.
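The three geometric metrics can be illustrated with a minimal Python sketch. It is not the in-house MATLAB code used in the study: the function names, the representation of a contour as a set of voxel indices, and the choice of the manual volume as the RVD reference are assumptions made here for illustration.

```python
import math

def dice(a, b):
    """Dice similarity coefficient of two voxel sets: 2|A∩B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0  # two empty structures are treated as identical
    return 2.0 * len(a & b) / (len(a) + len(b))

def hausdorff(points_a, points_b):
    """Symmetric Hausdorff distance between two point lists (brute force)."""
    def directed(src, dst):
        # largest distance from a point in src to its closest point in dst
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(directed(points_a, points_b), directed(points_b, points_a))

def relative_volume_difference(v_ref, v_test):
    """Absolute volume difference relative to the reference volume (assumed form)."""
    return abs(v_test - v_ref) / v_ref

# Hypothetical toy structures on a single slice, in voxel units.
manual = {(x, y, 0) for x in range(4) for y in range(4)}       # 16 voxels
adjusted = {(x, y, 0) for x in range(1, 5) for y in range(4)}  # same shape, shifted by one voxel

print(dice(manual, adjusted))                         # 12 shared voxels: 2*12/32 = 0.75
print(hausdorff(sorted(manual), sorted(adjusted)))    # 1.0 (one voxel of separation)
print(relative_volume_difference(len(manual), len(adjusted)))  # 0.0, volumes are equal
```

The toy case shows why the metrics are reported together: a pure shift leaves the volume difference at zero while lowering the overlap score. For brevity the Hausdorff example passes all voxel centers; on clinical contours the distance would normally be evaluated on surface points scaled by the voxel spacing.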

Evaluation of Dose Differences
To assess the potential impact of AI on dosimetry, we calculated the difference in each dose index between the two plans, where (D X) man and (D X) AI,adj refer to the dose parameters for the C man and C AI,adj contours, respectively, and X represents a dose metric such as D min, D mean, or D 0.03cc. Doses to the organs at risk (OARs) were evaluated using D mean (mean dose) and D 0.03cc, the highest dose encompassing 0.03 cc [31].
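A minimal sketch of how such dose indices and their difference could be computed from the voxel doses of one structure is given below. The function names, the equal-voxel-volume assumption, and the sign and normalization of the difference (percent change of the AI-assisted value relative to the manual value) are assumptions for illustration; the exact expression used in the study is not reproduced in this text.

```python
import math

def d_mean(doses):
    """Mean dose over the voxels of a structure (equal voxel volumes assumed)."""
    return sum(doses) / len(doses)

def d_hottest_volume(doses, voxel_cc, volume_cc=0.03):
    """Minimum dose received by the hottest `volume_cc` of the structure."""
    # number of voxels needed to cover the requested volume (at least one)
    n = max(1, math.ceil(volume_cc / voxel_cc - 1e-9))
    n = min(n, len(doses))
    return sorted(doses, reverse=True)[n - 1]

def relative_difference(d_man, d_ai_adj):
    """Percent difference of the AI-assisted value from the manual value (assumed convention)."""
    return 100.0 * (d_ai_adj - d_man) / d_man

# Hypothetical voxel doses in Gy for one small structure, 0.01 cc per voxel.
doses = [10.0, 50.0, 40.0, 30.0, 20.0]
print(d_mean(doses))                            # 30.0
print(d_hottest_volume(doses, voxel_cc=0.01))   # hottest 3 voxels are 50, 40, 30 -> 30.0
print(relative_difference(40.0, 42.0))          # 5.0
```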
Homogeneity Index (HI) was utilized to evaluate the dose uniformity within the PTV. HI was assessed using the following formula [32]:
$$\mathrm{HI} = \frac{D_{2\%} - D_{98\%}}{D_{50\%}}$$
where $D_{2\%}$ (near-maximum dose), $D_{98\%}$ (near-minimum dose), and $D_{50\%}$ represent the minimum dose covering 2%, 98%, and 50% of the target volume, respectively. HI was compared between plans through the difference between $\mathrm{HI}_{\mathrm{man}}$ and $\mathrm{HI}_{\mathrm{AI,adj}}$, the HI values of the plans for the C man and C AI,adj contours, respectively.
Conformity Index (CI) was used to obtain a quantitative evaluation of the PTV coverage by the prescribed dose. CI was evaluated using the following equation [33]:
$$\mathrm{CI} = \frac{TV_{PV}^{2}}{V_{TV} \times V_{PTV}}$$
where $V_{TV}$ indicates the volume that receives 95% of the prescribed dose, $V_{PTV}$ represents the PTV volume, and $TV_{PV}$ is the PTV volume inside $V_{TV}$. CI was compared between the plans in the same way, through the difference between $\mathrm{CI}_{\mathrm{man}}$ and $\mathrm{CI}_{\mathrm{AI,adj}}$, the CI values of the plans for the C man and C AI,adj contours, respectively.
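As a worked illustration of the two target indices, the sketch below assumes the commonly used forms HI = (D2% − D98%) / D50% and CI = TV_PV² / (V_TV × V_PTV), which match the quantities named in the text. The function names and all numerical values are hypothetical and are not taken from the study.

```python
def homogeneity_index(d2, d98, d50):
    """HI from near-maximum (D2%), near-minimum (D98%) and median (D50%) dose."""
    return (d2 - d98) / d50

def conformity_index(tv_pv, v_tv, v_ptv):
    """CI from the PTV volume inside the 95% isodose (tv_pv),
    the 95% isodose volume (v_tv) and the PTV volume (v_ptv)."""
    return tv_pv ** 2 / (v_tv * v_ptv)

# Hypothetical prostate plan: doses in Gy, volumes in cc.
hi = homogeneity_index(d2=62.4, d98=58.2, d50=60.0)
ci = conformity_index(tv_pv=95.0, v_tv=100.0, v_ptv=100.0)
print(round(hi, 3))  # 0.07: lower values mean a more uniform target dose
print(round(ci, 4))  # 0.9025: 1.0 would be perfect conformity
```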

Normalized Plan Quality Metric
The plan quality metric (PQM) framework was designed to establish a standardized approach for assessing how well a particular treatment plan achieves specific dose-volume objectives, which serve as a hypothetical "virtual physician" [34]. A PQM scorecard is created that assigns, for every objective, a score based on how effectively the objective is achieved by a particular plan. To enable meaningful comparisons across our study cases, we utilized the normalized PQM (nPQM) score, which divides the PQM score by the peak score achievable by a plan for a given anatomical site ($\mathrm{PQM}_{\max}$) and scales it to a percentage. The formula used for the normalized plan quality metric was:
$$\mathrm{nPQM} = \frac{\mathrm{PQM}}{\mathrm{PQM}_{\max}} \times 100\%$$
The PQM scorecards used for analysis in this study are shown in Tables 4 and 5. To calculate the score for a particular objective, there are two different types of functions: threshold and linear score. The threshold score function awards no points if the objective is not achieved and the maximum number of points if it is. The linear score function makes use of two thresholds. Maximum points are awarded if the plan satisfies the constraint's "ideal threshold" and no points are assigned if it does not meet the constraint's "minimally acceptable threshold". If the value of the dose-volume statistic lies between the two thresholds, the number of points awarded is calculated by linear interpolation between them.
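The two scoring functions and the normalization can be sketched as follows. This is a minimal illustration under stated assumptions, not the scorecard implementation used in the study: the function names, the point values, and the example objectives are hypothetical and do not come from Tables 4 and 5.

```python
def threshold_score(value, limit, max_points, lower_is_better=True):
    """All-or-nothing score: full points if the objective is met, otherwise zero."""
    met = value <= limit if lower_is_better else value >= limit
    return float(max_points) if met else 0.0

def linear_score(value, ideal, acceptable, max_points):
    """Full points at the ideal threshold, none at the minimally acceptable
    threshold, linear interpolation in between. Works for both directions
    (ideal below or above acceptable)."""
    fraction = (acceptable - value) / (acceptable - ideal)
    fraction = min(1.0, max(0.0, fraction))  # clamp outside the two thresholds
    return max_points * fraction

def npqm(scores, max_scores):
    """Normalized PQM: achieved points over achievable points, as a percentage."""
    return 100.0 * sum(scores) / sum(max_scores)

# Hypothetical two-objective scorecard, 10 points each.
scores = [
    threshold_score(58.0, limit=60.0, max_points=10),               # met -> 10 points
    linear_score(25.0, ideal=20.0, acceptable=30.0, max_points=10), # halfway -> 5 points
]
print(npqm(scores, [10, 10]))  # 75.0
```

Clamping the interpolation fraction keeps a plan that overshoots the ideal threshold from earning more than the maximum points.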

Evaluation of Contouring Time
The time required for contouring was measured in order to estimate the gain in performance made possible by autocontouring. Since in clinical practice AI-generated contours must always be reviewed and, if necessary, modified by a radiation oncologist, we measured the reduction in contouring time for autocontouring followed by this review, relative to manual contouring.
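One plausible way to express the time saving is as a percentage of the manual contouring time. The sketch below assumes that form; the function name and the example times are hypothetical and are not values from Table 14.

```python
def time_saving_percent(t_manual_min, t_assisted_min):
    """Percent reduction in contouring time relative to manual contouring.
    `t_assisted_min` is assumed to include the oncologist's review and edits."""
    return 100.0 * (t_manual_min - t_assisted_min) / t_manual_min

# Hypothetical case: 30 min fully manual, 9 min for AI contours plus review.
print(time_saving_percent(30.0, 9.0))  # 70.0
```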

Interobserver Variability
The interobserver variability was assessed by comparing autogenerated structures reviewed and adjusted by two different operators. The geometric differences were calculated by assessing DSC, HD, and RVD between the AI-assisted contours of the two operators, C AI,adj1 and C AI,adj2, where C AI,adj1 is the contour generated by AI and adjusted by operator 1 and C AI,adj2 is the one adjusted by operator 2.
Dosimetric evaluation was performed using the same methods previously described. For instance, the interobserver variability for D min to an organ at risk was calculated from D min,1 and D min,2, the minimum doses to that organ at risk in the plans generated using AI contours adjusted by operators 1 and 2, respectively.

Data Analysis
For geometric and dosimetric evaluation, we developed an in-house script in MATLAB version R2021a (The MathWorks, Inc, Boston, MA, USA) [35] to compare structure sets and treatment plans for both the automatic and the manually edited contours, as shown in Figure 1. The Wilcoxon rank sum test was employed for the dosimetric comparisons, to determine whether there were any significant differences between the individual OARs in each arm in terms of the D min, D mean, and D 0.03cc doses based on the reference dose distribution. Significant differences for the PTV doses were assessed in terms of HI and CI. The significance level was α = 0.05 (95% confidence level).
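For readers who want to reproduce this kind of comparison outside MATLAB, the sketch below implements a two-sided Wilcoxon rank sum test in plain Python using the normal approximation and no tie correction of the variance. It is an illustrative stand-in and not the study's script: the function name and the sample values are hypothetical, and a statistics library would normally be preferred for small samples or many ties.

```python
import math

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank sum test (normal approximation).
    Returns the z statistic for sample x and the two-sided p-value."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        # extend j over a run of tied values
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    n1, n2 = len(x), len(y)
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    expected = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - expected) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return z, p

# Hypothetical mean doses (Gy) to one OAR under the two contouring workflows.
manual = [1.0, 2.0, 3.0, 4.0, 5.0]
assisted = [6.0, 7.0, 8.0, 9.0, 10.0]
z, p = rank_sum_test(manual, assisted)
print(z, p, p < 0.05)  # the samples do not overlap, so p falls below alpha = 0.05
```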

Qualitative Assessment of Automated Contouring
Results of the quality assessment for the PCa and HNC contours are shown in Figures 2 and 3, respectively.

Geometric Comparison
Tables 6 and 7 show the differences among structures contoured with different modalities. The highest average DSC values were observed for the bladder and rectum, followed by the anal canal and prostate. The values of the average HD were 4.19 mm, 2.85 mm, and 1.08 mm for prostate, bladder, and rectum, respectively. The values of the RVD showed the same trend: 0.08, 0.02, and 0.01 for prostate, bladder, and rectum, respectively. Figure 4 shows DSC, HD, and RVD scores for PCa cases. For PCa, large variabilities in terms of DSC and RVD were observed for the anal canal and both femoral heads in the comparison between C man and C AI,adj contouring. As for HD, a wide range of values was reported for the femoral heads.
Table 8 provides a complete list of DSC, HD, and RVD values for HNC contours. The brain, mandible, parotids, and thyroid showed a high level of agreement, with average DSC scores of 1.00, 0.98, 0.99, and 0.94, and average HD scores of 0.65 mm, 8.13 mm, 1.50 mm, and 9.58 mm, respectively, between the contours before (C AI) and after physician review (C AI,adj). RVD values were generally close to 0.
Figure 5 shows the geometric evaluation results of the C AI contour and the C man contour, both compared with C AI,adj contours. Autocontouring produced similar results for the brain, brainstem, mandible, and eyes (DSC > 0.83). For the brachial plexuses, parotids, cochlea, and submandibular glands, there was a significant difference between the C man contour and the C AI,adj contour, while other OARs had better performance in terms of DSC, HD, and RVD.
Table 9 summarizes the DSC, HD, and RVD values calculated between the C man and C AI,adj contours. The worst metrics were found in smaller structures such as lenses, while larger structures including the brain, mandible, eyes, and trachea showed a high level of correlation, with average DSC around 80% and lower HD and RVD values.

PTV Evaluation
Figure 6 shows the geometric and dosimetric comparison of the C man plan with the C AI,adj plan for the prostate PTV in terms of DSC and RVD scores and differences in HI and CI.

Dosimetric Comparison
Differences in D min, D mean, and D 0.03cc between manual and AI-assisted contours are shown in Figure 7. The quantitative results of the dosimetric comparison of the C man plans with the C AI,adj plans are summarized in Table 10. No significant dose differences were measured between manual and autocontour workflows, except for the anal canal in PCa cases.
Differences in D mean to OARs for HNC between C man and C AI,adj contours are shown in Figure 8a, where the esophagus exhibited relatively large variations in ∆D mean. D 0.03cc to the eyes and cochleas had a difference of a maximum of 13% between the C man and C AI,adj plans (Figure 8b), while other OARs showed <10% differences, except for the constraint of contralateral brachial plexus and brainstem. Each circle symbol represents a value outside the standard deviation.
The dosimetric parameters of the HNC patients are listed in Table 11. The largest differences were seen in D min and D mean for both brachial plexuses, with differences up to 82% and 35%, respectively, between the C man and C AI,adj pairs. The differences were relatively smaller for other OARs between the C man and C AI,adj contour plan pairs.
Differences between the achieved dosimetric parameters for PCa planning were not significant according to the Wilcoxon test, with the exception of D 0.03cc for the anal canal. Brachial plexuses showed significant differences in terms of D mean. The statistical analysis results are shown in Table 12.

nPQM Comparison
nPQM revealed that all the plans optimized from C AI,adj were considered equivalent to C man, with only a few plans deemed inferior to the clinical plan but clinically acceptable. Table 13 summarizes the difference in plan quality for all study sites.

Time Savings
Table 14 reports the average times required for contouring over all test subjects with different methods in absolute and percentage units of time savings.

Interobserver Variability
The qualitative test results showed no significant difference between the two observers. Time saving percentages varied among ROs (from 64% to 72%, corresponding to 16 to 19 min, for PCa). Only a 2% variation was observed in nPQM. Detailed geometric differences are shown in Table 15. As shown in Figure 9, the plans with C AI,adj resulted in anal canal coverage that largely differed from the manual contour plan. No significant geometric differences were found for DSC and RVD by comparison of both RO-reviewed contours with the C man contour. Table 16 tabulates the difference in dosimetric parameters for observer variability.

Discussion
Since automatic segmentation tools have become a more efficient alternative to expert manual segmentation, it is important that these applications undergo a thorough review, as the full responsibility for the use of AI falls to humans [36]. In particular, medical physicists are responsible for thorough quality assurance [37], and the radiation oncologist has clinical responsibility for the resulting contours. The purpose of this work was to explore the potential advantages of including an artificial intelligence-based autocontouring system in a clinical pathway in terms of time saving, contour generation accuracy, and the quality of the radiotherapy plans obtained from such reviewed structure sets. The analysis was performed on a dataset of 40 cancer patients equally distributed between PCa and HNC.
As the majority of reported evaluation metrics in the literature are based on geometric metrics [38], and usually evaluate autocontouring without human intervention, we compared both the geometric and dosimetric plan quality performance of the autocontouring software (version: 1.0.18; Limbus AI Inc., Regina, SK, Canada), after physician validation and adjustment, against manual contours.
The first results of this work clearly indicate that with the aid of an AI-based autocontouring system, 72% and 84% of contouring time can be saved for PCa and HNC cases, respectively. Further time savings are possible by implementing a fully integrated system that automatically detects the CT image by a predefined protocol and contours the structures, eliminating the manual export/import step. Moreover, the geometric accuracy reached by Limbus AI showed high compliance with the contours used in the clinical routine. The target and OARs of PCa patients were segmented with high geometric precision, with DSC between C man and C AI,adj ≥ 0.7. The anal canal contours had the largest differences, with an average DSC of 0.70 as well as a 30% difference in volume between the C man and C AI,adj contours.
In comparison to AI-based C AI,adj contours, most of the structures of HNC cases, including the brain, mandible, eyes, and optic nerves, had a high degree of geometric correlation (DSC > 0.98, HD < 3.32 mm, and RVD near 0). However, there were also structures with low DSC, such as the brachial plexus (DSC = 0.11–0.13), leading to a large variety of results, which is consistent with the previous literature [6,11]. The institutional recommendation to contour a larger larynx, for instance, may result in a poorer geometric correlation for this OAR. Moreover, for this study, the autocontouring software and the oncologist utilized only CT images without contrast enhancement for contouring and revision, while normally, ROs register MRI images to CT images for contouring the OARs.
In principle, the accuracy of contouring has a direct influence on plan optimization, and hence on the assessment and decision-making process for treatment plans. As a result, the focus of this study was to determine whether C AI,adj contours could provide dosimetric findings equivalent to those of C man contours when examined using dosimetric parameters. The prostate PTV conformity index showed nearly no change in dosimetric analysis; however, there was a 22% difference in HI. Although this study did not include automatic segmentation of target volumes, we examined the prostate PTV alone for observation. The modest dosimetric variation in the PTV might be attributed mostly to the expertise and different approaches to planning of various medical physicists.
The most notable dose difference in the dose-volume metrics of PCa OARs was in the anal canal for the C man vs. C AI,adj contour plan, whilst other OARs maintained almost the same dose distribution. Femurs showed a slightly higher mean dose, which might be attributed to volume variance in the femur segmentation. In terms of HNC cases, both brachial plexuses showed a greater divergence in the mean dose for the C AI,adj contours as compared to the C man contours. Otherwise, no significant differences in dose-volume metrics were found for those plans. Dosimetric disparities between the C man and C AI,adj contour plans, on the other hand, were minimal for organs such as the cochlea, parotids, and submandibular glands. Only for the brachial plexus were mean dose differences statistically significant; otherwise, the Wilcoxon rank sum tests failed to identify a significant difference in the achieved dosimetric parameters between these plan pairs, implying that the C AI,adj-generated plans perform similarly to the C man contour plans in the dose optimization and evaluation process for HNC planning.
The complex interplay between structure geometry and dose distribution is reflected in the discrepancy between geometric and dosimetric performance. In addition to geometric accuracy, spatial dose distribution and steepness of dose gradients also affect dosimetric performance. Even if there is a significant difference in the dosimetric metrics between the C man and C AI,adj contours for a structure located far away from the high-dose zone, their absolute dosimetric values may be too small to have an impact on plan assessment and decision making. Furthermore, depending on whether it extracts point or volume-based dosimetry, each dosimetric parameter (i.e., maximum, mean, or volume-based parameter) has a distinct dependence on and sensitivity to geometric change. For example, when the size of a structure varies in a high-dose gradient zone, the maximum dose may fluctuate more than the mean dose [4]. Overall, the complex interplay between structure geometry and dose distribution suggests that employing a commercial autosegmentation system that was not trained on local data necessitates further examination that includes both geometric and dosimetric analysis. This highlights the significance of adopting normalized plan quality metrics as a virtual physician that integrates both geometry and dosimetry assessment. The overall plan quality of PCa and HNC cases with the C AI,adj contour changed by 8.0% and 6.7%, respectively, when compared to the reference plan, which was in a relatively acceptable range.
Interobserver variability analysis was conducted for PCa cases, where the geometric and dosimetric data acquired using each of the studied delineations by two ROs and the manual one were analyzed. Time savings and acceptance of AI-driven contours were approximately the same for both ROs. Except for the anal canal contour, there was a good correlation of geometric metrics (DSC > 0.92, HD < 3.74 mm, and RVD < 0.04) between the two ROs. There was also a large dose variation (12% for D 0.03cc and 23% for D mean) for the anal canal, despite the fact that the dose parameters for other OARs were identically matched between ROs. The overall normalized plan quality variation was 2% between ROs, whereas the difference between the C man and C AI,adj contour plan was 3.2%, suggesting that a standard starting point of contouring can reduce interobserver variability.
We considered the manually delineated contours of CRO Aviano as the gold standard in this research. This is not to claim that manual delineation is "better" or "more accurate" than AI-based delineation. In our ongoing evaluation study, experts favored autosegmented contours over manual delineation for specific structures. Rather, manual delineation provides a clinically accepted and recognized contour quality that reflects clinical expertise and local institutional practice. Because the Limbus software (version 1.0.18; Limbus AI Inc., Regina, SK, Canada) was trained using universal structure sets, and there are always some variations in practice between institutions, software trained on local institutional datasets could lessen these discrepancies.
This study has some limitations. Although the cases selected for each anatomical district formed a homogeneous dataset, only a subgroup of the patients was evaluated for dosimetry. Although this subgroup clearly highlighted the disparity between geometric metrics and dosimetric performance, further research on a larger patient sample would help characterize the dosimetric performance for each individual structure. The contouring was carried out retrospectively on CT images without contrast enhancement and without registration of MRI and/or PET images, which is now strongly recommended for contouring not only treatment volumes but also OARs in some pathologies. For a more in-depth examination, a study registering CT autocontours with MRI and/or PET images might be a feasible option. To obtain a more complete picture of how the performance of the Limbus autocontouring system affects the contouring procedure, a comparison with other similar software should be performed. Finally, the 5 mm CT slice thickness used for the prostate patients, which is standard practice in our institution, is relatively large for prostate imaging [39]. A change in slice thickness from 5 to 3 mm has been shown to significantly affect only the volume of the bladder [40]. In any case, this should not affect the main conclusions of the present study, as the slice thickness was the same for AI and human contours throughout the comparison in the prostate patients. Despite its limitations, this study offers a proof-of-concept methodology for investigating the impact of including autocontouring software in the RT workflow.

Conclusions
In the contouring process, human review remains necessary because automatic segmentation is not fully reliable. Nonetheless, an approach that can speed up contouring in the vast majority of cases would be an improvement over present clinical practice.
The clinical acceptability and efficacy of the AI-driven approach depend on the structures to be segmented at each site and on the stringency of the clinical criteria, as demonstrated by the two cancer sites studied. The varying performance of C AI,adj contours across structure sets suggests a different approach, in which automatic segmentation is used to generate the subset of contours where AI consistently performs well, and clinical effort is reserved for the complementary subset, which may be more sensitive and subject to significantly larger error or variation.
Dose parameter analysis revealed that treatment plans optimized using AI-generated contours did not result in statistically significant differences when examined using normalized plan quality metrics. The results show that plans based on automatically generated contours do not overdose nearby OARs. However, no statistically significant link between geometric and dosimetric metrics was found. The outcomes of the dosimetric and interobserver variability analyses suggest that AI-based autocontouring may help to establish a standard starting point for contouring in radiation therapy.

Figure 2. Evaluation of AI-based contouring for PCa.

Figure 3. Physician assessment of AI-based contouring for HNC.

Figure 4. Geometric evaluation results: (a) DSC, (b) HD in mm, and (c) RVD, for C AI,adj contours in comparison with both C AI and C man contours of PCa cases. Each * represents a value.

Figure 6 shows the geometric and dosimetric comparison of the C man plan with the C AI,adj plan for the prostate PTV in terms of DSC and RVD scores and differences in HI and CI.

Figure 6. (a) Geometric evaluation by DSC and RVD and (b) dosimetric evaluation in terms of homogeneity index and conformity index of prostate PTV. Each circle symbol represents a value outside the standard deviation.

Figure 7. Dosimetric evaluation results: (a) relative difference in mean dose (D mean) and (b) relative difference in dose to 0.03 cc volume (D 0.03cc), for plans from C man contours in comparison with plans from C AI,adj contours of PCa cases. Each circle symbol represents a value outside the standard deviation.

Figure 8. Dosimetric evaluation results: (a) relative difference in mean dose (D mean) and (b) relative difference in dose to 0.03 cc volume (D 0.03cc), for plans from C man contours in comparison with plans from C AI,adj contours of HNC cases. Each circle symbol represents a value outside the standard deviation.

Figure 9. Interobserver variability: geometric evaluation results, (a) DSC, (b) HD in mm, and (c) RVD; and dosimetric evaluation results, (d) relative difference in mean dose (D mean) and (e) relative difference in dose to 0.03 cc volume (D 0.03cc). Each circle symbol represents a value outside the standard deviation.

Author Contributions: Conceptualization, S.M.H.H. and P.C.; methodology, S.M.H.H.; software, S.M.H.H. and G.P.; formal analysis, G.P. and M.A.; investigation, A.D. (Alessandra Donofrio) and F.M.; resources, G.F., A.C. and R.B.; writing-original draft preparation, S.M.H.H.; writing-review and editing, A.D. (Annalisa Drigo), R.S.R. and M.A.; supervision, P.C. All authors have read and agreed to the published version of the manuscript.

Funding: This work was supported by the Italian Ministry of Health (Ricerca Corrente) (no grant number provided). The authors would also like to acknowledge the ACC reti 2021-RCR WP12.

Institutional Review Board Statement: The studies involving human participants were reviewed and approved by Comitato Etico Unico Regionale-CEUR Friuli Venezia Giulia, Azienda Regionale di Coordinamento per la Salute (ARCS), via Pozzuolo n. 330-33100 Udine (palazzina B).

Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement: Data available on request due to privacy/ethical restrictions.

Table 3. Scoring values for qualitative assessment of AI-generated contours.
$h(C_{\mathrm{man}}, C_{\mathrm{AI,adj}})$ represents the Euclidean distance between voxels $a$ and $b$ belonging to the C man and C AI / C AI,adj contours, respectively, and the formula is:

$$h(C_{\mathrm{man}}, C_{\mathrm{AI,adj}}) = \max_{a \in C_{\mathrm{man}}} \min_{b \in C_{\mathrm{AI,adj}}} \lVert a - b \rVert$$
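As a minimal sketch of how the directed distance above and the other two geometric metrics could be computed, the following Python example works on contours represented as sets of voxel indices. It is an illustration under stated assumptions, not the implementation used in this study: the function names, the `box` helper, the toy contours, and the voxel spacing are hypothetical; the symmetric HD is taken as the larger of the two directed distances; and RVD is assumed here to be the signed volume difference relative to the reference contour.

```python
import math

def dice(a, b):
    """Dice similarity coefficient: 2|A n B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))

def relative_volume_difference(ref, test):
    """Assumed definition: (V_test - V_ref) / V_ref, in voxel counts."""
    return (len(test) - len(ref)) / len(ref)

def directed_hausdorff(a, b, spacing=(1.0, 1.0, 1.0)):
    """h(A, B) = max over a in A of min over b in B of ||a - b||.
    Brute force, O(|A| * |B|); fine for small toy contours only."""
    def dist(p, q):
        return math.sqrt(sum(((pi - qi) * s) ** 2
                             for pi, qi, s in zip(p, q, spacing)))
    return max(min(dist(p, q) for q in b) for p in a)

def hausdorff(a, b, spacing=(1.0, 1.0, 1.0)):
    """Symmetric Hausdorff distance: the larger directed distance."""
    return max(directed_hausdorff(a, b, spacing),
               directed_hausdorff(b, a, spacing))

def box(x_range, y_range, z_range):
    """Hypothetical helper: set of voxel indices filling a box."""
    return {(x, y, z) for x in x_range for y in y_range for z in z_range}

# Toy contours on a single slice (hypothetical, not patient data).
c_man = box(range(0, 4), range(0, 4), range(0, 1))     # 16 voxels
c_ai_adj = box(range(1, 5), range(0, 5), range(0, 1))  # 20 voxels

spacing_mm = (1.0, 1.0, 5.0)  # assumed voxel size in mm

print("DSC:", dice(c_man, c_ai_adj))
print("RVD:", relative_volume_difference(c_man, c_ai_adj))
print("HD (mm):", hausdorff(c_man, c_ai_adj, spacing_mm))
```

For clinical-size structures, a surface-based or spatially indexed implementation would be preferable to this brute-force search.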

Table 4. PQM of PCa treatment plans.

Table 5. PQM of HNC treatment plans.

Table 6. Summary of DSC, HD, and RVD values measured before and after physician adjustment of contours (C AI vs. C AI,adj) for PCa cases.

Table 7. Summary of geometric difference metrics measured between C man and C AI,adj contours for PCa cases.
Figure 4 shows DSC, HD, and RVD scores for PCa cases. For PCa, large variability in DSC and RVD was observed for the anal canal and both femoral heads when comparing C man and C AI,adj contours. For HD, a wide range of values was observed for the femoral heads.

Table 8. Summary of DSC, HD, and RVD values measured before and after physician adjustment of contours (C AI vs. C AI,adj) for HNC cases.

Table 9. Summary of geometric difference metrics measured between C man and C AI,adj contours for HNC cases.

Table 10. Relative differences in D mean and D 0.03cc values measured between C man and C AI,adj contours for PCa cases.

Table 11. Summary of relative differences in D min, D mean, and D 0.03cc values for C man and C AI,adj contours for HNC cases.

Table 12. Statistical test results for D min, D mean, and D 0.03cc values measured between the plans generated from C man and C AI,adj contours of PCa and HNC cases.

Table 13. Relative difference in normalized plan quality metric between treatment plans with C man and C AI,adj contours.

Table 14. Time savings using AI-assisted autocontouring for study sites.

Table 15. Interobserver variability in terms of DSC, HD, and RVD values measured between C AI,adj contours performed by two independent physicians for PCa cases.

Table 16. Summary of relative differences in D mean and D 0.03cc values measured between C AI,adj contours performed by two different radiation oncologists.
