Comparison of Two Auto-Contouring Systems for Head and Neck Organs at Risk to Institutional Reference Standard in Radiotherapy Planning

Bayley, Conrad; Rau, Allison; Lau, Harold; Banerjee, Robyn; Quon, Harvey; Tchistiakova, Ekaterina; Kirkby, Charles; Jutras, Jean-David

doi:10.3390/app16115681


by Conrad Bayley ¹,*, Allison Rau ¹, Harold Lau ¹, Robyn Banerjee ¹, Harvey Quon ¹, Ekaterina Tchistiakova ¹,², Charles Kirkby ²,³ and Jean-David Jutras ¹,²

¹ Department of Oncology, Arthur J.E. Child Cancer Centre, University of Calgary, Calgary, AB T2N 5G2, Canada

² Department of Physics and Astronomy, University of Calgary, Calgary, AB T2N 1N4, Canada

³ Department of Oncology, Jack Ady Cancer Centre, Lethbridge, AB T1K 0B3, Canada

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(11), 5681; https://doi.org/10.3390/app16115681 (registering DOI)

Submission received: 23 April 2026 / Revised: 17 May 2026 / Accepted: 18 May 2026 / Published: 5 June 2026


Featured Application

This article may help to inform the implementation of auto-contouring software into clinical practice, which has the potential to improve efficiency and consistency in radiotherapy planning.

Abstract

Background: Manually contouring the organs at risk (OARs) is a time-consuming process with significant variability. Multiple commercial options are available to streamline this process by utilizing artificial intelligence. This study compares the performance of two commercially available systems, MIM Contour ProtégéAI v4.0.0 (MIM) and Radformation/Limbus AI v1.7.0 (Limbus), in auto-contouring head and neck (H&N) OARs. Methods: We conducted a retrospective single-institution study with 20 patients who received curative-intent H&N radiation therapy. Seventeen OARs, bilateral where applicable, were manually contoured for each patient, which established a reference for comparison. Dice coefficient (DC), mean Hausdorff distance (HD_mean), and maximum Hausdorff distance (HD_max) were compared relative to the manual reference. Subjective contour quality was graded on a Likert scale. The Wilcoxon signed-rank test with the Benjamini–Hochberg false discovery rate procedure was used to compare outcomes. Results: Twenty patients were selected with a variety of H&N primary sites. The total number of OARs used for comparison was 480. Three quarters of patients were male, the mean age was 60.5 years (SD = 12.8), and the mean BMI was 23.53 kg/m² (range 16.9–42.4). MIM performed more consistently with the manual reference (adjusted p < 0.05) compared to Limbus when contouring the bilateral brachial plexus and mandible. Limbus performed more consistently than MIM when contouring the left lacrimal gland, larynx, optic chiasm, bilateral optic nerves, right parotid, and esophagus. The DC, HD_mean, and HD_max were not statistically different between MIM and Limbus for the remaining structures. Both AI algorithms had high performance on the mandible, eyes, thyroid, and salivary glands, and poor performance on the optic chiasm, lacrimal glands, and brachial plexuses. The subjective quality of the auto-contours from both systems was good, with most structures requiring no changes or only minor changes. The structures most likely to require revisions were the pharyngeal constrictors, larynx, and submandibular glands. Limbus achieved better subjective scores than MIM for four of seventeen structures. Conclusions: This comparison of two AI auto-contouring systems demonstrated performance consistent with the manual reference for most H&N OARs. Each system had strengths and weaknesses in auto-contouring specific OARs.

Keywords:

radiotherapy; contouring; head; neck; autosegmentation; autocontouring; treatment planning; artificial intelligence; organs at risk

1. Introduction

An ongoing barrier to efficient throughput in the field of radiation oncology is the time-consuming nature of radiotherapy planning, which requires manual delineation (contouring) of structures on CT imaging. Contouring is especially time-consuming in the head and neck (H&N) region, where there is a high density of organs at risk (OARs), which can lead to delays in treatment and significant inter-provider variability [1,2,3,4,5,6,7,8,9,10]. Contouring a single set of H&N OARs is estimated to take between 2 and 3 h per patient [11,12].

Atlases and artificial intelligence (AI) algorithms have been used to accelerate this process in recent years, with deep learning (DL) systems being the most accurate and robust method of auto-contouring (AC) [13,14]. However, clinical adoption of these systems has been slow as it is necessary to ensure AC maintains the quality and consistency that manual contouring achieves.

Atlas-based AC systems are trained on the appearance and general shape of anatomical structures and work by projecting and modifying a product shape onto a new image set via a process called deformable image registration [11,15,16,17]. DL uses artificial neural networks to mimic the learning of a human brain to process many inputs, integrate them, and produce an output. This technology has great promise in the field of radiation planning, especially in the H&N discipline [18,19,20,21], as one study implementing a predominant commercially available system found that efficiency gains in the workflow were highest in H&N tumour sites, with a 65% relative (80 min absolute) reduction in contouring time [22]. Studies, including a recent systematic review, have suggested that DL approaches are superior to historically used atlas-based AC approaches [13,23]. Studies have demonstrated that clinical implementation of DL results in fewer necessary contour edits and higher user satisfaction, as well as increased overall workflow efficiency, among dosimetrists and radiation oncologists compared to historical atlas-based systems [24,25,26].

Several companies have employed this technology in efforts to produce a new generation of AC software. Amongst these, two commercially available systems, MIM Contour ProtégéAI (MIM Software Inc., Cleveland, OH, USA) and Limbus Contour (Limbus-Radformation, New York, NY, USA), are becoming more widely used. Although some work has been done evaluating the individual software [13,27,28], there is a lack of published quantitative comparison of these two systems in H&N malignancies, which has the potential to provide significant time and resource savings [13,24]. A study comparing three commercially available AI AC vendors (including MIM-ProtegeAI+, Radformation AutoContour (formerly LimbusAI) and Siemens-DirectORGANS) was recently published; however, it did not focus on the H&N site in particular [29].

This study presents a quantitative and qualitative comparison of the performance of MIM and Limbus, in terms of consistency with manually contoured OARs as per institutional standard and CT-specific guidelines, to guide software implementation and utilization at radiation oncology institutions globally. The primary outcomes were Dice coefficient (DC), mean Hausdorff distance (HD_mean), and maximum Hausdorff distance (HD_max). The secondary outcome was a qualitative assessment of both contour suitability for clinical use and the need for manual changes.

2. Materials and Methods

2.1. Patient Selection

This study was approved by the Health Research Ethics Board of Alberta, ID HREBA.CC-23-0273. Twenty patients were selected at random and had to meet inclusion criteria, which were that they received radiation therapy, planned with a volumetric modulated arc technique, and delivered with curative-intent for a primary malignancy of the H&N. Patients were excluded if they did not finish their prescribed course of therapy, they were treated with palliative-intent (including those with metastases to the H&N), or they were treated with 3D conformal radiation therapy. Included patients had a variety of H&N tumour sites, operative statuses (to assess pre-versus post-operative OAR AC ability), BMIs (due to the relative difficulty manually delineating OARs without adequate fat planes), and sex.

2.2. Organs at Risk

We included 17 H&N OARs, bilateral where applicable, that are routinely auto-contoured in our clinic via a DL workflow based upon MIM ProtégéAI software. The OARs included were the brachial plexus, brainstem, esophagus, eye, lacrimal gland, larynx, lens, lips, mandible, optic chiasm, optic nerves, oral cavity/tongue, parotid glands, pharyngeal constrictor muscles, spinal cord, submandibular glands, and thyroid. Some routinely contoured structures were excluded because either one or both DL systems did not have the option to contour them at the time of this study (e.g., cochlea and temporal lobe not supported by MIM, and the temporomandibular joints not supported by either).
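
As a quick consistency check on the structure counts, the following minimal sketch tallies the contours per patient. The split into bilateral and single structures is an assumption inferred from how the Results refer to left and right sides; it is not taken from a table in this article, and the list names are hypothetical.

```python
# Laterality below is inferred from the Results text (an assumption for illustration).
BILATERAL = [
    "brachial plexus", "eye", "lacrimal gland", "lens",
    "optic nerve", "parotid gland", "submandibular gland",
]
SINGLE = [
    "brainstem", "esophagus", "larynx", "lips", "mandible", "optic chiasm",
    "oral cavity/tongue", "pharyngeal constrictor muscles", "spinal cord", "thyroid",
]


def structures_per_patient(bilateral, single):
    """Count contours per patient, with left and right sides counted separately."""
    return 2 * len(bilateral) + len(single)


n_types = len(BILATERAL) + len(SINGLE)
n_per_patient = structures_per_patient(BILATERAL, SINGLE)
n_total = 20 * n_per_patient  # 20 patients in the cohort
print(n_types, n_per_patient, n_total)  # 17 24 480
```

Under this reading, 17 OAR types correspond to 24 contours per patient, which matches the 480 OARs reported for the 20-patient cohort.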

2.3. Simulation and Manual Segmentation Process

Patients were scanned on a Philips BigBore CT scanner with the following H&N scan protocol: 120 kV, 2 mm slice thickness, 60 cm FOV, 0.813 pitch, 500 mAs/slice, and 53.1 mGy CTDIvol, with Orthopedic Metal Artefact Reduction (OMAR) enabled when a metal artefact was present. The CT simulation scans of included patients were exported into the local Varian Aria RO software, and the OAR sets, manually contoured by one of three experienced radiation oncologists, were given to a radiation oncology resident who completed the missing OARs and ensured compliance with the contouring guidelines. The finalized OAR sets were then edited by a senior H&N sub-specialized radiation oncologist before final approval. Emphasis was placed on precise demarcation of OARs based on CT-specific contouring consensus guidelines, which were consulted side by side during manual contouring and editing to ensure compliance [4]. None of the contouring physicians had access to the AI-generated contours, to ensure an unbiased approach. These manually generated OAR sets for the 20 patients established the manual reference standard for comparison with the auto-contours from each software system. Manual contours are subject to intra- and inter-observer variability, so we compared AI–human performance metrics to inter-human performance metrics already established in the literature [30,31].

2.4. Auto-Segmentation Process

Although several commercially available AC software systems exist, MIM (v4.0.0) and Limbus (v1.7.0) were chosen as they are used locally and regionally. The same 20 patients containing the 17 hidden manually contoured OARs were processed by MIM and Limbus separately, and subsequent data analysis was done within the MIM software. Both software systems were treated as “black boxes”, tested as is from the vendors, and were not trained on local patient datasets. This approach was chosen as it allowed for a true comparison of the commercially available system performance compared to institutional preference, and was also in accordance with local rules that prevent external models being trained on local patient data. Neither vendor had software access, provided technical support, contributed to data analysis, funded the study, or participated in manuscript preparation.

2.5. Geometric Accuracy

Primary outcomes were Dice coefficients (DCs), mean Hausdorff distances (HD_mean), and maximum Hausdorff distances (HD_max). The DC is the favoured method for comparing 2D and 3D structures in the literature [32], as it is a robust measure of structure overlap, which is mathematically calculated as 2 times the area of overlap divided by the total area of both structures,

DC = \frac{2\,|R \cap T|}{|R| + |T|} \qquad (1)

where R is the manual reference standard and T is the test structure contour. Hausdorff distance is mathematically calculated by measuring the distance from every point along the boundary of one structure’s surface to the nearest point on the boundary of the second structure’s surface, and then selecting either the mean of these distances (HD_mean) or the maximum distance (HD_max),

HD_{\mathrm{mean}} = \underset{t \in T}{\operatorname{mean}} \left\{ \min_{r \in R} \{ d(t, r) \} \right\} \qquad (2)

and

HD_{\mathrm{max}} = \max_{t \in T} \left\{ \min_{r \in R} \{ d(t, r) \} \right\} \qquad (3)

where t are the points of the test structure T, and r are the points of manual reference structure R. Outcomes were reported as symmetric distance, not directional.
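
To make Equations (1)–(3) concrete, the following minimal sketch evaluates them on small point sets in pure Python. It is not the MIM implementation. The function names are hypothetical, the points are assumed to be boundary samples in millimetres, and the symmetric HD_mean is assumed here to be the average of the two directed means; other symmetric conventions exist, and the one used by the analysis software is not stated in the text.

```python
from math import dist


def dice(ref, test):
    """Dice coefficient of two sets of voxel coordinates, as in Equation (1)."""
    if not ref and not test:
        return 1.0  # convention chosen here for two empty structures
    return 2 * len(ref & test) / (len(ref) + len(test))


def directed_distances(src, dst):
    """Distance from every point in src to its nearest point in dst."""
    return [min(dist(p, q) for q in dst) for p in src]


def hd_mean(ref, test):
    """Symmetric mean Hausdorff distance (assumed: average of both directed means)."""
    t_to_r = directed_distances(test, ref)
    r_to_t = directed_distances(ref, test)
    return 0.5 * (sum(t_to_r) / len(t_to_r) + sum(r_to_t) / len(r_to_t))


def hd_max(ref, test):
    """Symmetric maximum Hausdorff distance: the larger of the two directed maxima."""
    return max(max(directed_distances(test, ref)), max(directed_distances(ref, test)))


# Toy 2D example: the test contour has one point displaced by 2 mm.
R = {(0, 0), (0, 1), (1, 0), (1, 1)}
T = {(0, 0), (0, 1), (1, 0), (1, 3)}
print(dice(R, T), hd_mean(R, T), hd_max(R, T))
```

In this toy case, three of the four points coincide, so the DC is 0.75, while the single displaced point sets HD_max to 2 mm. This illustrates why HD_max is more sensitive to outliers than DC or HD_mean.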

2.6. Statistical Analysis

Both MIM- and Limbus-generated contours were retrospectively compared to the manual reference following export into MIM Maestro research software with scripted workflows. In the case of bilateral structures, each side was treated as a separate structure for the purposes of statistical analysis. DC, HD_mean, and HD_max were calculated to generate two datasets: one for MIM compared to the manual reference, and one for Limbus compared to the manual reference. Finally, the two primary outcome datasets were compared head-to-head to determine which software produced contours most similar to the manual reference contours for each OAR. Because several geometric metrics were expected to be non-normally distributed and certain measures such as HD_max are sensitive to outliers, paired comparisons between algorithms were performed using the Wilcoxon signed-rank test, with each OAR being analyzed separately. Adjusted p-values were calculated using the Benjamini–Hochberg false discovery rate (FDR) procedure to control for multiple comparisons across OARs and quantitative metrics. Statistical significance was defined as an FDR-adjusted p-value (i.e., q-value) of <0.05.
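
The Benjamini–Hochberg adjustment described above can be sketched in a few lines of pure Python. This is an illustrative sketch, not the analysis script used in this study: the function name and the example p-values are hypothetical, and the raw p-values are assumed to come from the per-OAR paired Wilcoxon signed-rank tests.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Hypothetical raw p-values for four comparisons (not data from this study).
raw_p = [0.01, 0.04, 0.03, 0.20]
q = benjamini_hochberg(raw_p)
significant = [value < 0.05 for value in q]
print([round(value, 4) for value in q], significant)
```

With these hypothetical inputs, only the first comparison remains below the 0.05 threshold after adjustment, although three of the four raw p-values were below 0.05.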

The secondary outcome was subjective contour quality. The MIM- and Limbus-generated contours were viewed independently by two H&N radiation oncologists, who were blinded to the manual reference and to which DL system had produced each contour, and graded on the Likert scale shown in Table 1. Both reviewers scored all structures from each system for all patients. Inter-rater agreement was assessed using linear weighted Cohen’s kappa for the paired ordinal Likert scores from the two reviewers. Differences were assessed using a paired Wilcoxon signed-rank test with Benjamini–Hochberg FDR correction for individual structures, and a patient-clustered ordinal GEE model was used for overall comparison between the AC systems.
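
For readers unfamiliar with the agreement statistic, the sketch below computes a linear weighted Cohen's kappa from two lists of ordinal scores. It is a minimal illustration, not the code used in this study: the function name, the three-level scale, and the example ratings are all hypothetical, and the number of levels in Table 1 is not assumed.

```python
def linear_weighted_kappa(rater_a, rater_b, categories):
    """Linear weighted Cohen's kappa for paired ordinal scores."""
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)
    # Observed joint proportions of (rater A score, rater B score).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    disagree_obs = 0.0
    disagree_exp = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)  # linear disagreement weight
            disagree_obs += weight * observed[i][j]
            disagree_exp += weight * row[i] * col[j]  # expected under independence
    if disagree_exp == 0.0:
        return float("nan")  # both raters used one identical category throughout
    return 1.0 - disagree_obs / disagree_exp


# Hypothetical ratings clustered at a score of 1 (not data from this study).
reviewer_1 = [1] * 17 + [2, 1, 2]
reviewer_2 = [1] * 17 + [1, 2, 2]
exact_agreement = sum(a == b for a, b in zip(reviewer_1, reviewer_2)) / len(reviewer_1)
kappa = linear_weighted_kappa(reviewer_1, reviewer_2, categories=[1, 2, 3])
print(exact_agreement, round(kappa, 3))
```

In this hypothetical example the reviewers agree exactly on 90% of scores, yet kappa is only about 0.44, because most of that agreement is expected by chance when nearly all scores fall in one category.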

3. Results

3.1. Sample Characteristics

The 20-patient cohort comprised the following primary tumour sites: 11 nasopharyngeal, four oropharyngeal, one laryngeal, one hypopharyngeal, one salivary gland, one oral cavity, and one unknown primary. Two patients received adjuvant RT for salivary and oral cavity primary tumours in accordance with international guidelines, and the remaining 18 received definitive RT. Three quarters of patients were male. The mean age was 60.5 years (SD 12.8), and the mean BMI was 23.53 kg/m² (range 16.9–42.4).

3.2. Primary Outcomes: Geometric Accuracy

The median DCs and corresponding IQRs for each of the 17 included OARs, bilateral where applicable, are displayed in Figure 1, with stars indicating a significant difference at the FDR-adjusted p-value (i.e., q-value) threshold of <0.05. A high DC was demonstrated consistently in OARs such as the mandible (MIM median 0.94, Limbus 0.91), eyes (MIM 0.89–0.91, Limbus 0.88–0.89), and thyroid (0.84 in both cases). Ten structures had statistically significant differences between the DCs of MIM and Limbus in terms of consistency with the manual reference contours. MIM contoured more similarly to the manual reference when contouring the brachial plexus bilaterally (q = 0.004 bilaterally) and the mandible (q = 0.007). Limbus performed more similarly to the manual reference when contouring the larynx (q = 0.002), optic nerves (q = 0.005 on the left, q = 0.016 on the right), left lacrimal gland (q = 0.012), right parotid gland (q = 0.016), left lens (q = 0.032), and esophagus (q = 0.036). In the listed unilateral structures, Limbus was consistent bilaterally, whereas outliers in MIM brought down the mean DC, leading to statistical significance unilaterally. Both systems contoured the optic chiasm and brachial plexus significantly differently than the manual reference standard.

The median HD_mean values are depicted in Figure 2, with the absolute differences in most values being less than 1.5 millimetres. A low HD_mean was found in OARs such as the lenses (MIM 0.40 mm, Limbus 0.46 mm, both bilaterally), mandible (MIM 0.34 mm, Limbus 0.50 mm), and eyes (MIM 0.61–0.75 mm, Limbus 0.70–0.74 mm). Nine structures had statistically significant differences between the median HD_mean of MIM and Limbus in terms of consistency with the manual reference contours. MIM contoured more similarly to the manual reference when contouring the brachial plexus bilaterally (q < 0.001) and the mandible (q < 0.001). Limbus performed more similarly to the manual reference when contouring the larynx (q < 0.001), optic chiasm (q = 0.045), optic nerves (q = 0.025 on the left, q = 0.045 on the right), right parotid gland (q = 0.025), and esophagus (q = 0.031).

The median HD_max values are shown in Figure 3, where HD_max is the single furthest distance in millimetres between the manual reference and the auto-contour. Both systems deviated from the manual reference by several centimetres for some structures, such as the brachial plexuses and larynx, while most structures were off by <1 centimetre. A low HD_max was found in OARs such as the lenses (MIM and Limbus both 2.1 mm), eyes (MIM 2.9–3.1 mm, Limbus 2.8–3.6 mm), and spinal cord (MIM 4.8 mm, Limbus 5.0 mm). Three structures had statistically significant differences between the median HD_max of MIM and Limbus. MIM contoured more similarly to the manual reference when contouring the brachial plexus bilaterally (q < 0.001 in both cases). Limbus performed more similarly to the manual reference when contouring the larynx (q < 0.001 with a greater than threefold absolute difference).

DC, HD_mean, and HD_max were not statistically different (i.e., q > 0.05) between MIM and Limbus for the following OARs: brainstem, eyes, right lacrimal gland, right lens, lips, oral cavity/tongue, left parotid, spinal cord, submandibular glands, and thyroid.

A summary of the results across the three quantitative primary outcomes is presented in Figure 4, and specific median, IQRs, p-values, and q-values are included in Tables S1–S3 in the Supplementary Material. Only three structures demonstrated consistently more similar contours to the manual reference across all three variables, which were the brachial plexus bilaterally (favouring MIM) and the larynx (favouring Limbus).

3.3. Secondary Outcome: Qualitative Assessment

The secondary outcome of qualitative contour quality, including the degree of manual edits, utilized the Likert scale in Table 1. Overall, the subjective quality of the auto-contours from both MIM and Limbus was good, with all structures having an average score between 1 (optimal for clinical use; no modifications necessary) and 2 (no errors, but changes warranted for clinical use) (Figure 5). There was high exact agreement between reviewers overall, within MIM contours, and within Limbus contours (81.4%, 77.1%, and 85.4%, respectively), with linear weighted Cohen’s kappas of 0.319 (95% CI 0.212–0.422), 0.325 (0.212–0.436), and 0.281 (0.125–0.403), respectively. This agreement is fair overall, though in the context of high raw agreement scores, the inter-rater agreement interpretation is limited due to clustering at a Likert score of 1.

Both systems contoured the following structures almost perfectly according to both reviewers: the eyes, lacrimal glands, and lenses. The structures with the worst subjective average scores across both systems were the pharyngeal constrictors (1.675), larynx (1.450), and submandibular glands (1.382). Limbus achieved significantly lower/better subjective scores than MIM for the larynx, lips, oral cavity, and pharyngeal constrictor muscles using a paired Wilcoxon signed-rank test with FDR correction (Table 2). Overall comparison between AC systems using a patient-clustered ordinal GEE model showed that Limbus had significantly lower odds of receiving a worse score than MIM after accounting for structure and reviewer (OR of 0.373, 95% CI 0.252–0.554, p < 0.001). Both systems were equally poor at identifying salivary gland anatomical variants, as there were cases of contouring a gland that was not present or resected (n = 5 for MIM, n = 3 for Limbus); contouring an incorrect structure (usually a pathologically involved lymph node) as a submandibular or parotid gland (n = 1 for Limbus); and not contouring a structure when present (n = 1 for Limbus).
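
As a plausibility check on the reported odds ratio, the sketch below back-calculates an approximate standard error, z statistic, and two-sided p-value from the OR and its 95% CI. This assumes the interval is a Wald interval that is symmetric on the log-odds scale, which is not stated in the text; the function name is hypothetical and the result is only an approximation of the GEE output.

```python
from math import erfc, log, sqrt


def wald_from_or_ci(odds_ratio, lower, upper, z_crit=1.959964):
    """Approximate SE, z, and two-sided p from an OR and its 95% CI (Wald assumption)."""
    se = (log(upper) - log(lower)) / (2 * z_crit)  # SE of log(OR)
    z = log(odds_ratio) / se
    p = erfc(abs(z) / sqrt(2))  # two-sided p-value from the standard normal
    return se, z, p


# Reported overall comparison: OR 0.373, 95% CI 0.252-0.554.
se, z, p = wald_from_or_ci(0.373, 0.252, 0.554)
print(round(se, 3), round(z, 2), p < 0.001)
```

Under this assumption, the z statistic is close to −4.9, which is consistent with the reported p < 0.001.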

4. Discussion

4.1. Discussion of Our Findings

Dice coefficients are the most common method for quantifying structure similarity in the medical physics and radiation oncology literature [32,33,34]. Our results revealed consistently low DCs for a select few auto-contoured structures, reflecting inconsistency with the manual reference. These included the brachial plexus bilaterally and the optic chiasm, which are some of the most technically challenging OARs to contour due to several factors. First and foremost, these two structures are relatively poorly visualized, as their inherent contrast with surrounding tissue is low compared to other included OARs such as the mandible, which lies at the other end of the contrast spectrum. The contrast between tissues, especially soft tissues of slightly different densities, is further reduced by using CT simulation scanners, such as our Philips system (see Methods for protocol), which have inherently lower resolution than diagnostic CT scanners; the latter are not used for contouring. Other factors that contribute to the lower DCs of the brachial plexuses and optic chiasm are the complexity and variability of the size and shape of the former, and the small size of the latter.

Our institutional experience was that, despite similarly low DCs, ROs still found utility in keeping the brachial plexus contours, whereas the optic chiasm auto-contours were not used. This likely aligned with common practice of contouring a proxy volume for the brachial plexus, which is, by definition, less precise and intended to delineate the possible location of the brachial plexus. It is possible that in the near future MRI fusion will aid the delineation of OARs, either manually or via AI systems, in accordance with existing contouring guidelines [35], which would especially benefit contouring structures such as the optic chiasm due to poor soft tissue distinction with CT.

An interesting finding in our data was the statistically significant difference in DCs for only one side of a bilateral structure, which occurred with the right parotid gland (MIM 0.81, Limbus 0.85, q = 0.016) and left lacrimal gland (MIM 0.47, Limbus 0.58, q = 0.012). When we explored this further, it became clear that in both cases Limbus contoured both sides with precisely the same mean DC relative to the manual reference. MIM, in contrast, had a few poorly contoured outliers in each case, which brought down the mean DC, leading to a statistically significant difference compared to Limbus on only one side of the bilateral structure. Given our limited sample size, it is unclear whether this represents a persistent problem.

Both AI systems had poor HD_mean values for the brachial plexus, which likely indicates a difference in DL training for contouring this structure compared to institutional practice and suggests the potential for different performance relative to preferred practice at different institutions. More interesting is the finding that Limbus had a markedly lower HD_mean for the optic chiasm and larynx than MIM. There are differing approaches to contouring the larynx in the literature [4,36,37,38], which may suggest a difference in how the Limbus and MIM models were trained, with the former likely following a method more similar to our institutional approach. In general, both systems had outstanding HD_mean values for the mandible and all ocular structures (eyes, lacrimal glands, lenses), suggesting that very few, if any, modifications would be necessary before approving these OARs in clinical practice.

Special attention should be paid to the remarkably lower laryngeal HD_max values generated from Limbus contours (MIM 29.0 mm, Limbus 8.4 mm, q < 0.001). The larynx tends to be an OAR that we spend more manual effort achieving precision in, so this finding was particularly interesting at our institution. However, as discussed above, this difference between AI systems likely comes down to the training method and adherence to different contouring approaches and therefore may be inconsistent across institutions, so individual assessments of similarity to institutional practice may be desirable. A similar finding arose for the brachial plexuses, which, like the larynx, are structures we spend more time generating or editing institutionally, with MIM having a slight edge in terms of HD_max (q = 0.003).

Taking all primary outcomes of our analysis into account together, we found that Limbus outperformed MIM for key OARs, including the larynx, optic nerves, optic chiasm, and esophagus. Although the mandible and optic chiasm were contoured more consistently by MIM and Limbus, respectively, the former’s difference is minimal, and the latter structure is so inconsistently contoured by both systems that it required manual recontouring in most cases.

Several cases exist where a statistical difference in terms of DC, HD_max, or HD_mean was detected, though a clinical difference is unlikely. The mandible was contoured so closely to the manual reference by both systems that both would likely be used clinically without edits, which is further supported by the Likert scores of both being almost entirely 1s. Another example is the difference detected in parotid gland scores, which is clinically due to outliers in one system as previously discussed, attributable in part to the low patient numbers and resulting under- or over-representation of certain anatomic anomalies. Finally, the optic chiasm and brachial plexuses are contoured so differently from institutional standard by both systems that the statistically significant difference between the two is a moot point, as clinically both tend to be completely recontoured in our institutional experience.

4.2. Our Institutional Experience

Our institution had an existing atlas-based AC workflow in place for H&N OARs, which we compared to our incoming DL-based system (MIM) at the time of commissioning; in practice, our final institutional clinical workflow combines the best of the atlas- and DL-based auto-contours. A 2020 study from the Netherlands also compared an existing atlas-based system with an incoming DL model, and found improved DCs in 19 of 22 OARs with the DL system in addition to efficiency gains. They also performed a qualitative assessment similar to that in the present study, except with the comparison taking place between atlas- and DL-based auto-contours, and found a preference for DL contours, which were “more precise” and more often confused with the manual reference [39]. One year prior, a different group in the Netherlands assessed a DL system compared to their institutional contours and found good mean DCs for most OARs, but interestingly, it had a low score for the brainstem (0.64) compared to our score of 0.82, illustrating either an improvement in technology or different manual contour consistencies, the latter being postulated in their analysis [28].

It is worth noting that when implementing AI AC systems at our centre, there tended to be a physical gap between some structures; for example, the esophagus and pharyngeal constrictors. This is of concern when evaluating maximum doses and could lead to serious clinical consequences if not recognized. In our clinical implementation, we added a post-processing step to manage this issue that we encourage other centres to consider. Modifications to consider include: adding a 1 mm radial expansion to the spinal cord, adding an 8 mm superior expansion on the spinal cord, and cropping the esophagus inferiorly to avoid mean dose calculations being influenced by the AI systems’ tendency to contour out of the relevant field. No post-processing was done in the present study to ensure fair comparison of the AC system outputs.

Another common and notable error we encountered was both systems erroneously contouring pathologic lymphadenopathy (MIM once, Limbus once), post-operative changes (MIM once, Limbus twice), or fat planes (MIM once, Limbus once) as an OAR, usually as a salivary gland. It is important to recognize and correct this error to avoid underdosing involved lymph nodes or delivering unnecessary doses to non-pathologic tissues. One study aims to improve the training of AI models one OAR at a time, specifically the salivary glands, to allow for more tailored improvements [40]. We believe this is especially important due to our institutional experience of having submandibular gland auto-contours include adjacent lymphadenopathy, which, if not corrected, could lead to reduced doses to the node, thereby increasing the chance of recurrence.

4.3. Comparison to Other Studies

In Table 3, we present mean DC, HD_mean, and HD_max values from four other studies reported for MIM, Limbus, and other AI algorithms, and we compare these values to those in our study. The means for these metrics were chosen for comparison with other studies instead of the medians, as means have been more commonly reported in the literature to date. Overall, our values are very consistent with the available literature, except in one particular study that reported a notably higher DC for Limbus than any other available study, possibly indicating a difference in methodology or the presence of bias [27].

The field of AI DL AC has been growing rapidly, and a 2023 systematic review and single-arm meta-analysis assessed 22 independent studies applying DL to H&N OAR contouring. The authors found similar DC trends to our data with certain OARs having very high DCs, such as the brainstem, spinal cord, mandible, and optical structures, across their sampled studies and our study. Interestingly, the maximum DC was 0.87 for the brainstem in their analysis, compared to 0.94 for the mandible in our study, possibly reflecting the time between analyses in which AI technology has improved markedly. Our study has the advantage of recency compared to those included in the systematic review, which bolsters the clinical applicability of our results in terms of close temporality. The authors concluded DL contouring for H&N OARs is highly accurate and has the potential to decrease the time spent in this time-consuming process [41].

One recent study by Johnson et al. compared two commercially available AC products (in this case Manteia AccuContour and MIM ProtégéAI) head-to-head [42]. Their methodology was similar to ours in terms of manual reference contour generation and the OARs assessed, as well as their primary outcomes, which also included DC and HD_mean. They reported very high DCs between MIM and the manual reference for most OARs (including the brainstem, esophagus, eyes, mandible, left optic nerve, oral cavity, parotids, and spinal cord), with our study having higher DCs for the lenses (0.76 bilaterally vs. 0.55 and 0.54, left and right, respectively) and their study having higher DCs for the larynx (0.65 vs. 0.52), optic chiasm (0.3 vs. 0.09), and right optic nerve (0.63 vs. 0.55). In terms of the HD_mean values, without taking SD into account, our study had slightly better scores for the right lens, mandible, spinal cord, and bilateral submandibular glands, whereas they recorded slightly better scores for the brainstem, esophagus, bilateral eyes, larynx, left lens, optic chiasm, bilateral optic nerves, oral cavity, and bilateral parotid glands. They did not assess the brachial plexuses, lacrimal glands, lips, pharyngeal constrictors, or thyroid, whereas our study did. Overall, there was an even split between scores of overlapping OARs, indicating similar methodologies and MIM-specific findings. They shared with our study the conclusion that both tested systems reduced variability and time, and that both systems demonstrated reasonable agreement in the same OARs (ocular structures, brainstem, spinal cord, oral cavity, esophagus, and mandible), though poor agreement for the optic chiasm, for reasons discussed above. Both studies support the use of DL systems for AC in radiation planning, and their almost identical methodologies have the potential to serve as a template workflow for other groups wishing to conduct head-to-head comparisons.

Using the Likert scale in Table 1 and assessments performed by sub-specialized H&N radiation oncologists with more than 10 years' experience, we showed that most structures needed either no edits or minor edits, with only a few structures consistently needing time-consuming edits (pharyngeal constrictors, larynx, and submandibular glands). A recent study by Goddard et al. used similar methodology to compare MIM, Limbus, and a third system. They contoured structure sets on half as many patients as we did and also used a Likert scale to grade subjective quality, finding slightly better assessments for Limbus than MIM (1.96 vs. 2.07, from a scale of 1–3.70). Our results and those from Goddard et al. are of value because they inform the reader of the practical aspects of implementing such systems, suggesting that, most of the time, little to no manual input is needed to correct or redo contours once they have been auto-contoured. However, our study has the advantage of a larger number of patients and a wider variety of OARs, which makes our results more statistically representative.

This project aimed to provide a rigorous quantitative comparison of the performance of commercially available software to improve the time utilization and efficiency of radiation oncologists globally. Our goal was to provide radiation oncology groups and centres around the world with information regarding the efficacy of the available software compared to manual contours, as well as grounds for potential implementation. With wider utilization of software such as MIM and Limbus, it is our hope that centres can optimize the utilization of radiation oncologists and dosimetrists, allowing for more availability and time for new patient consults, thereby decreasing wait lists and expediting treatment with resulting societal benefits.

4.4. Limitations and Future Directions

This study has several limitations. Only 20 unique patient cases were used for the comparison between AC software, although this is more than in comparable studies [27,29]. In addition, we included 17 OAR types, bilateral where applicable, which is a higher number of unique OARs than in comparable studies [27,29], and the total number of OARs used for comparison was 480 (24 structures per patient across 20 patients), which allows for robust statistical comparison. A potential limitation of this and other studies with so few patients, often due to logistical constraints, is that the sample may not fully reflect the breadth of anatomical variation in the wider population, which limits the generalizability of this and other works. Future analyses of AC systems should aim to include a higher number of patients.

This study was also potentially limited by having a single author have final say over the manual segmentation process after the final OAR set was individually approved by the other contouring physicians. We mitigated this limitation by rigorously following international consensus guidelines [4], and by having a senior H&N sub-specialized radiation oncologist modify and approve all contours. Future studies may alleviate this limitation by ensuring that all manually delineated OARs are approved by a panel of radiation oncologists unanimously.

Future directions in this field are plentiful due to the increasing availability and quality of commercial AC offerings, as well as the potential for improved efficiency and consistency in clinical practice. Future work should investigate the clinical usability of target volume auto-contours in the H&N, including lymph node levels and primary tumours. Future work should also measure actual time savings and the time spent editing ACs, which would further inform the clinical implementation of these tools; these metrics were not included in this study due to logistical challenges in using software between centres. Finally, future work should evaluate whether the observed contour differences affect dose–volume parameters at the RT planning phase. This was not assessed in the present study, which is a limitation, as it would be a very clinically meaningful metric in terms of safety.

5. Conclusions

This comparison of two commercially available AI auto-contouring software systems demonstrated performance consistent with the manual reference for most H&N OARs in terms of DC, HD_mean, and HD_max. Each system had strengths and weaknesses in auto-contouring specific OARs. Qualitative assessment demonstrated high clinical applicability for both algorithms, with Limbus performing closer to our institutional standard. Overall, both systems have sufficient geometric accuracy and qualitative quality to warrant further investigation of their clinical impact.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app16115681/s1, Table S1: Dice Coefficients for the evaluated structures in terms of median (IQR), by auto-contouring system, with raw p-values and adjusted p-values (q-values). Paired comparisons between algorithms were performed using the Wilcoxon signed-rank test, with each OAR being analyzed separately. Adjusted p-values were calculated using the Benjamini-Hochberg false discovery rate procedure to control for multiple comparisons across OARs and quantitative metrics. Bolded values are statistically significant after adjustment; Table S2: Mean Hausdorff Distances for the evaluated structures in terms of median (IQR), by auto-contouring system, with raw p-values and adjusted p-values (q-values). Paired comparisons between algorithms were performed using the Wilcoxon signed-rank test, with each OAR being analyzed separately. Adjusted p-values were calculated using the Benjamini-Hochberg false discovery rate procedure to control for multiple comparisons across OARs and quantitative metrics. Bolded values are statistically significant after adjustment; Table S3: Maximum Hausdorff Distances for the evaluated structures in terms of median (IQR), by auto-contouring system, with raw p-values and adjusted p-values (q-values). Paired comparisons between algorithms were performed using the Wilcoxon signed-rank test, with each OAR being analyzed separately. Adjusted p-values were calculated using the Benjamini-Hochberg false discovery rate procedure to control for multiple comparisons across OARs and quantitative metrics. Bolded values are statistically significant after adjustment.

Author Contributions

C.B.: conceptualization, project administration, investigation, data curation, formal analysis, visualization, writing—original draft; A.R.: data curation, writing—review and editing; H.L.: data curation, writing—review and editing; R.B.: data curation, writing—review and editing; H.Q.: data curation, writing—review and editing; E.T.: conceptualization, methodology, writing—review and editing; C.K.: conceptualization, methodology, resources, writing—review and editing; J.-D.J.: conceptualization, methodology, investigation, supervision, formal analysis, visualization, writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

This study was approved by the Health Research Ethics Board of Alberta, ID HREBA.CC-23-0273.

Informed Consent Statement

Patient consent was waived due to the retrospective nature of this study.

Data Availability Statement

Data for these analyses were collected with ethics approval from the University of Calgary with the understanding that they would not be made available to those beyond the study team. If data sharing is requested, a new proposal can be discussed and submitted to the University of Calgary. To make such a request, please contact the corresponding author at conrad.bayley@ahs.ca.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

OAR	Organ at Risk
MIM	MIM Contour ProtégéAI v4.0.0
Limbus	Radformation/Limbus AI v1.7.0
H&N	Head and Neck
RT	Radiation Therapy
DC	Dice Coefficient
HD_mean	Mean Hausdorff Distance
HD_max	Maximum Hausdorff Distance
DL	Deep Learning
AC	Auto-Contouring

References

1. Brouwer, C.L.; Steenbakkers, R.J.; Van Den Heuvel, E.; Duppen, J.C.; Navran, A.; Bijl, H.P.; Chouvalova, O.; Burlage, F.R.; Meertens, H.; Langendijk, J.A.; et al. 3D Variation in delineation of head and neck organs at risk. Radiat. Oncol. 2012, 7, 32.
2. Geets, X.; Daisne, J.-F.; Arcangeli, S.; Coche, E.; Poel, M.D.; Duprez, T.; Nardella, G.; Grégoire, V. Inter-observer variability in the delineation of pharyngo-laryngeal tumor, parotid glands and cervical spinal cord: Comparison between CT-scan and MRI. Radiother. Oncol. 2005, 77, 25–31.
3. Piras, A.; Boldrini, L.; Menna, S.; Venuti, V.; Pernice, G.; Franzese, C.; Angileri, T.; Daidone, A. Hypofractionated Radiotherapy in Head and Neck Cancer Elderly Patients: A Feasibility and Safety Systematic Review for the Clinician. Front. Oncol. 2021, 11, 761393.
4. Brouwer, C.L.; Steenbakkers, R.J.H.M.; Bourhis, J.; Budach, W.; Grau, C.; Grégoire, V.; Van Herk, M.; Lee, A.; Maingon, P.; Nutting, C.; et al. CT-based delineation of organs at risk in the head and neck region: DAHANCA, EORTC, GORTEC, HKNPCSG, NCIC CTG, NCRI, NRG Oncology and TROG consensus guidelines. Radiother. Oncol. 2015, 117, 83–90.
5. Kosmin, M.; Ledsam, J.; Romera-Paredes, B.; Mendes, R.; Moinuddin, S.; de Souza, D.; Gunn, L.; Kelly, C.; Hughes, C.O.; Karthikesalingam, A.; et al. Rapid advances in auto-segmentation of organs at risk and target volumes in head and neck cancer. Radiother. Oncol. 2019, 135, 130–140.
6. Oktay, O.; Nanavati, J.; Schwaighofer, A.; Carter, D.; Bristow, M.; Tanno, R.; Jena, R.; Barnett, G.; Noble, D.; Rimmer, Y.; et al. Evaluation of Deep Learning to Augment Image-Guided Radiotherapy for Head and Neck and Prostate Cancers. JAMA Netw. Open 2020, 3, e2027426.
7. Cardenas, C.E.; Beadle, B.M.; Garden, A.S.; Skinner, H.D.; Yang, J.; Rhee, D.J.; McCarroll, R.E.; Netherton, T.J.; Gay, S.S.; Zhang, L.; et al. Generating High-Quality Lymph Node Clinical Target Volumes for Head and Neck Cancer Radiation Therapy Using a Fully Automated Deep Learning-Based Approach. Int. J. Radiat. Oncol. Biol. Phys. 2021, 109, 801–812.
8. Hong, T.S.; Tomé, W.A.; Harari, P.M. Heterogeneity in head and neck IMRT target design and clinical practice. Radiother. Oncol. 2012, 103, 92–98.
9. Segedin, B.; Petric, P. Uncertainties in target volume delineation in radiotherapy—Are they relevant and what can we do about them? Radiol. Oncol. 2016, 50, 254–262.
10. Multi-Institutional Target Delineation in Oncology Group. Human–Computer Interaction in Radiotherapy Target Volume Delineation: A Prospective, Multi-institutional Comparison of User Input Devices. J. Digit. Imaging 2011, 24, 794–803.
11. Teguh, D.N.; Levendag, P.C.; Voet, P.W.J.; Al-Mamgani, A.; Han, X.; Wolf, T.K.; Hibbard, L.S.; Nowak, P.; Akhiat, H.; Dirkx, M.L.P.; et al. Clinical Validation of Atlas-Based Auto-Segmentation of Multiple Target Volumes and Normal Tissue (Swallowing/Mastication) Structures in the Head and Neck. Int. J. Radiat. Oncol. Biol. Phys. 2011, 81, 950–957.
12. La Macchia, M.; Fellin, F.; Amichetti, M.; Cianchetti, M.; Gianolini, S.; Paola, V.; Lomax, A.J.; Widesott, L. Systematic evaluation of three different commercial software solutions for automatic segmentation for adaptive therapy in head-and-neck, prostate and pleural cancer. Radiat. Oncol. 2012, 7, 160.
13. Ng, C.K.C.; Leung, V.W.S.; Hung, R.H.M. Clinical Evaluation of Deep Learning and Atlas-Based Auto-Contouring for Head and Neck Radiation Therapy. Appl. Sci. 2022, 12, 11681.
14. Zabel, W.J.; Conway, J.L.; Gladwish, A.; Skliarenko, J.; Didiodato, G.; Goorts-Matthews, L.; Michalak, A.; Reistetter, S.; King, J.; Nakonechny, K.; et al. Clinical Evaluation of Deep Learning and Atlas-Based Auto-Contouring of Bladder and Rectum for Prostate Radiation Therapy. Pract. Radiat. Oncol. 2021, 11, e80–e89.
15. Daisne, J.-F.; Blumhofer, A. Atlas-based automatic segmentation of head and neck organs at risk and nodal target volumes: A clinical validation. Radiat. Oncol. 2013, 8, 154.
16. Thomson, D.; Boylan, C.; Liptrot, T.; Aitkenhead, A.; Lee, L.; Yap, B.; Sykes, A.; Rowbottom, C.; Slevin, N. Evaluation of an automatic segmentation algorithm for definition of head and neck organs at risk. Radiat. Oncol. 2014, 9, 173.
17. Hoang Duc, A.K.; Eminowicz, G.; Mendes, R.; Wong, S.; McClelland, J.; Modat, M.; Cardoso, M.J.; Mendelson, A.F.; Veiga, C.; Kadir, T.; et al. Validation of clinical acceptability of an atlas-based segmentation algorithm for the delineation of organs at risk in head and neck cancer. Med. Phys. 2015, 42, 5027–5034.
18. Samarasinghe, G.; Jameson, M.; Vinod, S.; Field, M.; Dowling, J.; Sowmya, A.; Holloway, L. Deep learning for segmentation in radiation therapy planning: A review. J. Med. Imaging Radiat. Oncol. 2021, 65, 578–595.
19. Nikolov, S.; Blackwell, S.; Zverovitch, A.; Mendes, R.; Livne, M.; De Fauw, J.; Patel, Y.; Meyer, C.; Askham, H.; Romera-Paredes, B.; et al. Clinically Applicable Segmentation of Head and Neck Anatomy for Radiotherapy: Deep Learning Algorithm Development and Validation Study. J. Med. Internet Res. 2021, 23, e26151.
20. Zhong, Y.; Yang, Y.; Fang, Y.; Wang, J.; Hu, W. A Preliminary Experience of Implementing Deep-Learning Based Auto-Segmentation in Head and Neck Cancer: A Study on Real-World Clinical Cases. Front. Oncol. 2021, 11, 638197.
21. Maduro Bustos, L.A.; Sarkar, A.; Doyle, L.A.; Andreou, K.; Noonan, J.; Nurbagandova, D.; Shah, S.A.; Irabor, O.C.; Mourtada, F. Feasibility evaluation of novel AI-based deep-learning contouring algorithm for radiotherapy. J. Appl. Clin. Med. Phys. 2023, 24, e14090.
22. Radici, L.; Ferrario, S.; Borca, V.C.; Cante, D.; Paolini, M.; Piva, C.; Baratto, L.; Franco, P.; La Porta, M.R. Implementation of a Commercial Deep Learning-Based Auto Segmentation Software in Radiotherapy: Evaluation of Effectiveness and Impact on Workflow. Life 2022, 12, 2088.
23. Vrtovec, T.; Močnik, D.; Strojan, P.; Pernuš, F.; Ibragimov, B. Auto-segmentation of organs at risk for head and neck radiotherapy planning: From atlas-based to deep learning methods. Med. Phys. 2020, 47, e929–e950.
24. Wong, J.; Huang, V.; Wells, D.; Giambattista, J.; Giambattista, J.; Kolbeck, C.; Otto, K.; Saibishkumar, E.P.; Alexander, A. Implementation of deep learning-based auto-segmentation for radiotherapy planning structures: A workflow study at two cancer centers. Radiat. Oncol. 2021, 16, 101.
25. Gibbons, E.; Hoffmann, M.; Westhuyzen, J.; Hodgson, A.; Chick, B.; Last, A. Clinical evaluation of deep learning and atlas-based auto-segmentation for critical organs at risk in radiation therapy. J. Med. Radiat. Sci. 2023, 70, 15–25.
26. Yamauchi, R.; Itazawa, T.; Kobayashi, T.; Kashiyama, S.; Akimoto, H.; Mizuno, N.; Kawamori, J. Clinical evaluation of deep learning and atlas-based auto-segmentation for organs at risk delineation. Med. Dosim. 2024, 49, 167–176.
27. D’Aviero, A.; Re, A.; Catucci, F.; Piccari, D.; Votta, C.; Piro, D.; Piras, A.; Di Dio, C.; Iezzi, M.; Preziosi, F.; et al. Clinical Validation of a Deep-Learning Segmentation Software in Head and Neck: An Early Analysis in a Developing Radiation Oncology Center. Int. J. Environ. Res. Public Health 2022, 19, 9057.
28. Van Rooij, W.; Dahele, M.; Ribeiro Brandao, H.; Delaney, A.R.; Slotman, B.J.; Verbakel, W.F. Deep Learning-Based Delineation of Head and Neck Organs at Risk: Geometric and Dosimetric Evaluation. Int. J. Radiat. Oncol. Biol. Phys. 2019, 104, 677–684.
29. Goddard, L.; Velten, C.; Tang, J.; Skalina, K.A.; Boyd, R.; Martin, W.; Basavatia, A.; Garg, M.; Tomé, W.A. Evaluation of multiple-vendor AI autocontouring solutions. Radiat. Oncol. 2024, 19, 69.
30. Nielsen, C.P.; Lorenzen, E.L.; Jensen, K.; Eriksen, J.G.; Johansen, J.; Gyldenkerne, N.; Zukauskaite, R.; Kjellgren, M.; Maare, C.; Lønkvist, C.K.; et al. Interobserver variation in organs at risk contouring in head and neck cancer according to the DAHANCA guidelines. Radiother. Oncol. 2024, 197, 110337.
31. Van Der Veen, J.; Gulyban, A.; Willems, S.; Maes, F.; Nuyts, S. Interobserver variability in organ at risk delineation in head and neck cancer. Radiat. Oncol. 2021, 16, 120.
32. Taha, A.A.; Hanbury, A. Metrics for evaluating 3D medical image segmentation: Analysis, selection, and tool. BMC Med. Imaging 2015, 15, 29.
33. Sherer, M.V.; Lin, D.; Elguindi, S.; Duke, S.; Tan, L.-T.; Cacicedo, J.; Dahele, M.; Gillespie, E.F. Metrics to evaluate the performance of auto-segmentation for radiation treatment planning: A critical review. Radiother. Oncol. 2021, 160, 185–191.
34. Mackay, K.; Bernstein, D.; Glocker, B.; Kamnitsas, K.; Taylor, A. A Review of the Metrics Used to Assess Auto-Contouring Systems in Radiotherapy. Clin. Oncol. 2023, 35, 354–369.
35. Paczona, V.R.; Capala, M.E.; Deák-Karancsi, B.; Borzási, E.; Együd, Z.; Végváry, Z.; Kelemen, G.; Kószó, R.; Ruskó, L.; Ferenczi, L.; et al. Magnetic Resonance Imaging–Based Delineation of Organs at Risk in the Head and Neck Region. Adv. Radiat. Oncol. 2023, 8, 101042.
36. Freedman, L. A radiation oncologist’s guide to contouring the larynx. Pract. Radiat. Oncol. 2016, 6, 129–130.
37. Merlotti, A.; Alterio, D.; Vigna-Taglianti, R.; Muraglia, A.; Lastrucci, L.; Manzo, R.; Gambaro, G.; Caspiani, O.; Miccichè, F.; Deodato, F.; et al. Technical guidelines for head and neck cancer IMRT on behalf of the Italian association of radiation oncology—Head and neck working group. Radiat. Oncol. 2014, 9, 264.
38. Choi, M.; Refaat, T.; Lester, M.S.; Bacchus, I.; Rademaker, A.W.; Mittal, B.B. Development of a standardized method for contouring the larynx and its substructures. Radiat. Oncol. 2014, 9, 285.
39. van Dijk, L.V.; Van den Bosch, L.; Aljabar, P.; Peressutti, D.; Both, S.; Steenbakkers, R.J.H.M.; Langendijk, J.A.; Gooding, M.J.; Brouwer, C.L. Improving automatic delineation for head and neck organs at risk by Deep Learning Contouring. Radiother. Oncol. 2020, 142, 115–123.
40. Van Rooij, W.; Dahele, M.; Nijhuis, H.; Slotman, B.J.; Verbakel, W.F. Strategies to improve deep learning-based salivary gland segmentation. Radiat. Oncol. 2020, 15, 272.
41. Liu, P.; Sun, Y.; Zhao, X.; Yan, Y. Deep learning algorithm performance in contouring head and neck organs at risk: A systematic review and single-arm meta-analysis. Biomed. Eng. OnLine 2023, 22, 104.
42. Johnson, C.L.; Press, R.H.; Simone, C.B.; Shen, B.; Tsai, P.; Hu, L.; Yu, F.; Apinorasethkul, C.; Ackerman, C.; Zhai, H.; et al. Clinical validation of commercial deep-learning based auto-segmentation models for organs at risk in the head and neck region: A single institution study. Front. Oncol. 2024, 14, 1375096.

Figure 1. Median Dice coefficients and IQRs for 17 structures (24 where bilateral is applicable) for both MIM and Limbus relative to the manual reference. Higher numbers indicate more consistency with the manual reference, with a coefficient of 1.0 being perfect overlap. Asterisks indicate statistically significant differences according to the Wilcoxon signed-rank test and the Benjamini–Hochberg false discovery rate procedure.
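To make the overlap metric concrete, the following is a minimal sketch of a Dice coefficient calculation, assuming each contour is represented as a set of voxel index tuples. The function name `dice_coefficient` and the toy contours are hypothetical and are not taken from the software or data used in this study; treating two empty contours as perfect agreement is also an assumption.

```python
def dice_coefficient(reference_voxels, auto_voxels):
    """Dice coefficient between two contours given as collections of voxel indices.

    DC = 2 * |A ∩ B| / (|A| + |B|), ranging from 0 (no overlap) to 1 (perfect overlap).
    """
    reference = set(reference_voxels)
    auto = set(auto_voxels)
    if not reference and not auto:
        # Assumption: two empty contours are treated as perfect agreement.
        return 1.0
    overlap = len(reference & auto)
    return 2.0 * overlap / (len(reference) + len(auto))


# Toy 2D example (hypothetical data): a 4 x 4 "manual" contour and an
# "auto" contour of the same size shifted by one pixel along x.
manual_voxels = {(x, y) for x in range(0, 4) for y in range(4)}
auto_voxels = {(x, y) for x in range(1, 5) for y in range(4)}

# 12 of the 16 pixels overlap, so DC = 2 * 12 / (16 + 16) = 0.75.
print(dice_coefficient(manual_voxels, auto_voxels))
```

In this toy case a one-pixel shift of a small structure already lowers the DC to 0.75, which illustrates why small-volume OARs tend to score lower on overlap metrics than large ones for the same absolute displacement.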

Figure 2. Mean Hausdorff distances (medians with IQRs) for the 17 structures (24 where bilateral is applicable) for both MIM and Limbus relative to the manual reference. Lower numbers indicate more consistency with the manual reference, as they represent a smaller mean distance between contour perimeters. Asterisks indicate statistically significant differences according to the Wilcoxon signed-rank test and the Benjamini–Hochberg false discovery rate procedure.

Figure 3. Maximum Hausdorff distances (medians with IQRs) for the 17 structures (24 where bilateral is applicable) for both MIM and Limbus relative to the manual reference. Lower numbers indicate more consistency with the manual reference, as they represent a smaller maximum distance between contour perimeters. Asterisks indicate statistically significant differences according to the Wilcoxon signed-rank test and the Benjamini–Hochberg false discovery rate procedure.
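The two distance metrics can be illustrated with a short sketch, assuming each contour perimeter is sampled as a list of points in mm. The function names and sample points are hypothetical. The mean distance here is taken as the average of the two directed mean nearest-point distances, which is one common convention and an assumption on our part; commercial software may define HD_mean differently.

```python
import math


def directed_distances(source_points, target_points):
    """Distance from each source point to its nearest target point."""
    return [min(math.dist(p, q) for q in target_points) for p in source_points]


def hausdorff_metrics(reference_points, auto_points):
    """Return (hd_mean, hd_max) between two sampled contour perimeters.

    hd_max is the classical symmetric Hausdorff distance.
    hd_mean is the average of the two directed mean distances (assumed convention).
    """
    reference = list(reference_points)
    auto = list(auto_points)
    if not reference or not auto:
        raise ValueError("both contours must contain at least one point")
    ref_to_auto = directed_distances(reference, auto)
    auto_to_ref = directed_distances(auto, reference)
    hd_max = max(max(ref_to_auto), max(auto_to_ref))
    hd_mean = 0.5 * (
        sum(ref_to_auto) / len(ref_to_auto) + sum(auto_to_ref) / len(auto_to_ref)
    )
    return hd_mean, hd_max


# Toy example (hypothetical data, coordinates in mm): the auto-contour matches
# the reference except for one stray point 4 mm away.
reference_points = [(0.0, 0.0), (2.0, 0.0)]
auto_points = [(0.0, 0.0), (2.0, 0.0), (2.0, 4.0)]

hd_mean, hd_max = hausdorff_metrics(reference_points, auto_points)
print(hd_mean, hd_max)
```

In this example the single stray point sets the maximum distance to 4 mm while the mean distance stays below 1 mm, which shows why HD_max is sensitive to isolated outliers and why the two metrics can rank systems differently.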

Figure 4. Summary of the three quantitative primary outcomes (DC, HD_mean, and HD_max) via a Venn diagram. Overlapped regions indicate statistically significant differences in the overlapping variables, with the colour corresponding with the system that performed better.

Figure 5. Subjective score counts for each OAR contoured by (a) MIM and (b) Limbus. Raw values include the 20 scores from each reviewer to a maximum of 40.

Table 1. Likert scale system used to qualitatively score contours.

Subjective Assessment	Score
Optimal for clinical use; no modifications necessary	1
No errors, but changes warranted for clinical use	2
Contains minor errors; some time required to modify	3
Contains gross errors; complete revision necessary	4

Table 2. Paired comparison of subjective contour scores by auto-contouring system. Only structures significant after Benjamini–Hochberg FDR correction are shown.

Structure	N Paired	MIM Mean	Limbus Mean	Mean Difference (MIM−Limbus)	Wilcoxon p-Value	FDR q-Value
Larynx	40	1.80	1.10	0.70	<0.001	<0.001
Pharyngeal constrictors	40	1.93	1.43	0.50	<0.001	<0.001
Oral cavity	39	1.44	1.08	0.36	0.002	0.012
Lips	40	1.40	1.10	0.30	0.003	0.016

Abbreviations: FDR, false discovery rate. Scores were rated on a 1–4 Likert scale, where 1 represents the best subjective score and 4 the worst. Positive mean differences indicate higher (worse) scores for MIM relative to Limbus. p-values are from paired Wilcoxon signed-rank tests; q-values are Benjamini–Hochberg FDR-adjusted p-values.
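As an illustration of how q-values of this kind are obtained from raw p-values, the following is a minimal sketch of the Benjamini–Hochberg adjustment. The function name `benjamini_hochberg` and the example p-values are hypothetical; they are not the study's data, and the sketch does not reproduce the exact family of comparisons used by the authors.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in the input order.

    Each p-value is scaled by m / rank (rank 1 = smallest p), then a running
    minimum from the largest rank downwards keeps the q-values monotone.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    q_values = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        q_values[index] = running_min
    return q_values


# Hypothetical raw p-values from four paired comparisons.
raw_p = [0.01, 0.04, 0.03, 0.20]
q = benjamini_hochberg(raw_p)
print(q)
```

A comparison would then be reported as significant when its q-value falls below the chosen threshold (0.05 in this study), rather than when its raw p-value does.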

Table 3. Summary of mean DC, HD_mean, and HD_max (with standard deviations) for both MIM and Limbus software, as well as results previously cited in other studies for the same OARs. Bilateral OARs have been averaged between left/right sides where applicable. Values in bold indicate close agreement with our results.

Organ at Risk	Metrics	MIM (This Study)	Other MIM Studies	Limbus (This Study)	Other Limbus Studies	Other AI Algorithms
Brachial Plexuses	DC HD_mean HD_max	0.29 (0.07) 8.7 (2.3) 63.8 (8.8)	0.38 [3] 3.11 [3]	0.25 (0.09) 12.3 (3.8) 74.9 (11.2)	0.95 [1]
Brain Stem	DC HD_mean HD_max	0.83 (0.05) 1.3 (0.4) 8.6 (3.7)	0.72 [2], 0.82 [3], 0.81 [5] 3.2 [2], 1.2 [3] 17.4 [2]	0.81 (0.08) 1.4 (0.5) 8.2 (3.6)	0.96 [1], 0.73 [2] 3.3 [2] 16.3 [2]	0.67 [2], 0.85 [4] 3.6 [2] 19.4 [2], 6.7 [4]
Esophagus	DC HD_mean HD_max	0.80 (0.07) 1.3 (0.9) 10.2 (7.0)	0.67 [2], 0.70 [3], 0.75 [5] 5.5 [2], 1.0 [3] 35.9 [2]	0.82 (0.05) 0.9 (0.6) 7.6 (5.2)	0.70 [2] 2.8 [2] 22.5 [2]	0.77 [2] 1.2 [2] 11.6 [2]
Eyes	DC HD_mean HD_max	0.88 (0.06) 0.8 (0.4) 3.2 (0.8)	0.89 [2], 0.87 [3], 0.89 [5] 0.7 [2], 0.7 [3] 3.3 [2]	0.89 (0.03) 0.7 (0.2) 3.4 (1.2)	0.98 [1], 0.90 [2] 0.6 [2] 2.9 [2]	0.90 [2], 0.88 [4] 0.7 [2] 3.5 [2], 3.4 [4]
Lacrimal Glands	DC HD_mean HD_max	0.47 (0.16) 1.6 (0.9) 6.9 (2.9)	0.43 [3] 0.7 [3]	0.54 (0.15) 1.5 (1.0) 7.5 (3.8)
Larynx	DC HD_mean HD_max	0.52 (0.09) 6.0 (1.5) 29.1 (5.0)	0.52 [3], 0.65 [5] 3.1 [3]	0.80 (0.09) 1.5 (0.6) 8.0 (3.1)		0.43 [4] 27.2 [4]
Lens	DC HD_mean HD_max	0.76 (0.09) 0.4 (0.2) 2.0 (0.4)	0.62 [3], 0.55 [5] 0.6 [3]	0.70 (0.15) 0.8 (1.3) 2.0 (1.7)	0.96 [1]	0.74 [4] 2.1 [4]
Lips	DC HD_mean HD_max	0.69 (0.11) 2.0 (1.8) 14.2 (7.4)	0.37 [3] 5.3 [3]	0.67 (0.09) 1.9 (1.3) 14.0 (6.2)	0.96 [1]
Mandible	DC HD_mean HD_max	0.94 (0.03) 0.4 (0.2) 6.6 (2.8)	0.90 [2], 0.86 [3], 0.91 [5] 0.7 [2], 0.6 [3] 9.9 [2]	0.91 (0.02) 0.5 (0.1) 7.5 (2.7)	0.98 [1], 0.89 [2] 0.8 [2] 9.8 [2]	0.86 [2] 1.1 [2] 14.5 [2]
Optic Chiasm	DC HD_mean HD_max	0.09 (0.17) 7.8 (4.4) 17.6 (7.6)	0.13 [3], 0.30 [5] 2.5 [3]	0.22 (0.18) 3.3 (1.5) 11.5 (2.6)	0.56 [1]	0.35 [4] 7.2 [4]
Optic Nerves	DC HD_mean HD_max	0.54 (0.16) 1.7 (1.2) 10.0 (6.5)	0.53 [3], 0.58 [5] 0.8 [3]	0.66 (0.10) 1.1 (0.6) 8.2 (4.2)	0.89 [1]	0.66 [4] 6.0 [4]
Oral Cavity	DC HD_mean HD_max	0.80 (0.08) 3.0 (1.4) 12.6 (4.6)	0.77 [2], 0.76 [3], 0.82 [5] 3.8 [2], 3.2 [3] 18.5 [2]	0.80 (0.06) 2.8 (1.0) 13.6 (3.7)	0.94 [1], 0.72 [2] 4.4 [2] 18.9 [2]	0.74 [2], 0.75 [4] 4.4 [2] 21.4 [2], 24.3 [4]
Parotids	DC HD_mean HD_max	0.81 (0.07) 1.8 (0.6) 13.3 (5.4)	0.75 [2], 0.80 [3], 0.78 [5] 2.0 [2], 1.3 [3] 13.1 [2]	0.83 (0.06) 1.4 (0.6) 11.6 (4.6)	0.97 [1], 0.76 [2] 2.0 [2] 13.6 [2]	0.72 [2], 0.84 [4] 2.5 [2] 17.1 [2], 12.1 [4]
Pharynx Constr	DC HD_mean HD_max	0.60 (0.07) 2.4 (1.1) 21.0 (7.0)	0.45 [3] 2.1 [3]	0.62 (0.06) 1.9 (0.7) 16.4 (5.9)	0.82 [1]
Spinal Cord	DC HD_mean HD_max	0.79 (0.07) 1.0 (0.3) 5.6 (3.5)	0.75 [2], 0.65 [3], 0.79 [5] 3.6 [2], 0.7 [3] 18.3 [2]	0.78 (0.07) 1.0 (0.5) 5.5 (1.9)	0.95 [1], 0.76 [2] 1.0 [2] 6.0 [2]	0.68 [2], 0.53 [4] 1.6 [2] 7.2 [2], 196.7 [4]
Submand Glands	DC HD_mean HD_max	0.76 (0.18) 1.5 (1.0) 8.8 (4.2)	0.76 [3] 2.7 [2], 0.8 [3] 12.3 [2]	0.75 (0.16) 1.4 (0.9) 8.7 (4.6)	0.94 [1] 2.6 [2] 12.5 [2]	0.63 [2] 2.7 [2] 11.9 [2]
Thyroid	DC HD_mean HD_max	0.81 (0.06) 0.9 (0.2) 7.9 (2.5)	0.65 [3] 1.8 [3]	0.81 (0.06) 0.9 (0.3) 9.1 (3.5)	0.88 [1]

Sources: [1] D’Aviero et al. Int J Environ Res Public Health 2022, 19, 9057; [2] Goddard et al. Radiation Oncology 2024, 19:69; [3] Wan, Hanlin. MIM Software Inc. Cleveland, OH, USA (Contour ProtegeAI+ White Paper); [4] Ng et al. Appl. Sci. 2022, 12, 11681; and [5] Johnson et al. Front. Oncol. 2024, 14, 1375096.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

