Implementation of a Commercial Deep Learning-Based Auto Segmentation Software in Radiotherapy: Evaluation of Effectiveness and Impact on Workflow
Life 2022, 12, 2088

Proper delineation of both target volumes and organs at risk is a crucial step in the radiation therapy workflow. This process is normally carried out manually by medical doctors and is therefore time-consuming. To improve efficiency, auto-contouring methods have been proposed. We assessed a specific commercial software to investigate its impact on the radiotherapy workflow in four specific disease sites: head and neck, prostate, breast, and rectum. For the present study, we used a commercial deep learning-based auto-segmentation software, namely Limbus Contour (LC), Version 1.5.0 (Limbus AI Inc., Regina, SK, Canada). The software uses deep convolutional neural network models based on a U-net architecture, specific for each structure. Manual and automatic segmentation were compared on disease-specific organs at risk. Contouring time, geometrical performance (volume variation, Dice Similarity Coefficient or DSC, and center of mass shift), and dosimetric impact (DVH differences) were evaluated. With respect to time savings, the maximum advantage was seen in the setting of head and neck cancer, with a 65% time reduction. The average DSC was 0.72. The best agreement was found for lungs. Good results were also found for bladder, heart, and femoral heads. The most relevant dosimetric difference was in the rectal cancer case, where the mean bowel volume covered by the 45 Gy isodose was 10.4 cm³ for manual contouring and 289.4 cm³ for automatic segmentation. Automatic contouring significantly reduced the time required for the procedure, simplifying the workflow and reducing interobserver variability. Its implementation improved the radiation therapy workflow in our department.


Introduction
Radiation therapy (RT) is an important treatment option in the management of cancer. It aims at delivering a high radiation dose to target cancer cells to ensure clinically required tumor control probability and concomitantly spare the nearby healthy tissues to prevent acute RT-related toxicity and late effects.
Accurate contouring of Clinical Target Volumes (CTVs) and Organs at Risk (OARs) is important for treatment planning and delivery. Generally, the segmentation of tumor regions and normal tissues is performed manually by the clinical staff on the planning computed tomography (CT) images. This approach is time-consuming, prone to a high degree of inter- and intra-observer variability, and represents a bottleneck in the planning workflow [1].
To improve the efficiency of this process, auto-contouring methods have been proposed. One of the most popular approaches is atlas-based segmentation [2,3]. However, contouring algorithms based on deep-learning techniques are being increasingly used, showing better results than atlas-based approaches [4,5]. The purpose of our study was to investigate the clinical implementation in our institution of a specific deep learning-based auto contour commercial software to assess the impact on the radiotherapy workflow in four specific disease sites: head and neck, prostate, breast and rectum.

Deep Learning Auto-Segmentation
A commercial deep learning-based auto-segmentation software, Limbus Contour (LC), Version 1.5.0 (Limbus AI Inc., Regina, SK, Canada), was recently introduced in our institution. It uses deep convolutional neural network models based on a U-net architecture, with a specific model for each structure. The models are trained on public datasets [6,7,8,9,10] as well as on datasets obtained through institutional data agreements [11,12,13,14,15]. The number of scans in the training set varies from model to model: each model is trained on hundreds or thousands of scans. Models were trained using TensorFlow; typical image augmentation and regularization techniques were applied. Each model is validated internally by Limbus AI by comparing the model output on a set of test scans to expert human contours on the same scans. The models have also been evaluated in published studies that investigate qualitative and quantitative accuracy and time savings [1,16,17,18].
LC obtains information related to the acquisition protocol by reading the DICOM metadata of the CT images. The corresponding auto-segmentation model is then automatically used to create auto-segmented contours that are exported alongside the CT images to the treatment planning software to be eventually edited and then validated by the clinicians.

Patient Selection
Four disease sites were selected for the present study, namely Head and Neck (H&N), prostate, rectum, and breast cancer. We focused on these four settings, considering their high frequency and important impact on the radiotherapy workflow. For each type of treatment, three patients treated in our center were selected.
For H&N, we chose oropharyngeal cancer to guarantee the standardization of OARs contouring. Patients eligible for the study received radical radiotherapy. The prescription dose was 70 Gy delivered in 35 fractions for the curative setting. The prostate setting consisted of patients who received exclusive radiotherapy on prostate gland and seminal vesicles. A moderate hypofractionated schedule was proposed: 70 Gy on prostate gland and 63 Gy on seminal vesicles in 28 fractions, delivered with a simultaneous integrated boost. For rectal cancer, patients offered pre-operative RT were considered. The prescription dose was 50 Gy on the gross tumor volume and positive nodes, and 45 Gy on the elective volumes, in 25 fractions. Finally, patients with left-sided breast cancer who underwent conservative surgery were selected. In this case, the prescription dose was 45 Gy for whole breast irradiation and 50 Gy on the tumor bed, given with a concomitant boost, in 20 fractions.
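As a quick arithmetic check on the schedules above, the dose per fraction can be derived from the total dose and the number of fractions. The sketch below is illustrative only: the `SCHEDULES` list and the `dose_per_fraction` helper are hypothetical names introduced here, and the values are simply those quoted in the text.

```python
# Prescription schedules as quoted in the text:
# (site, target, total dose in Gy, number of fractions).
SCHEDULES = [
    ("H&N", "prescription dose", 70.0, 35),
    ("prostate", "prostate gland", 70.0, 28),
    ("prostate", "seminal vesicles", 63.0, 28),
    ("rectum", "gross tumor volume and positive nodes", 50.0, 25),
    ("rectum", "elective volumes", 45.0, 25),
    ("breast", "whole breast", 45.0, 20),
    ("breast", "tumor bed", 50.0, 20),
]


def dose_per_fraction(total_dose_gy, n_fractions):
    """Return the dose delivered in each fraction, in Gy."""
    if n_fractions <= 0:
        raise ValueError("the number of fractions must be positive")
    return total_dose_gy / n_fractions


for site, target, total, n in SCHEDULES:
    print(f"{site:9s} {target:38s} {dose_per_fraction(total, n):.2f} Gy/fraction")
```

With these numbers, the two dose levels within each simultaneous or concomitant boost schedule share the same number of fractions and differ only in dose per fraction.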

Technical Setup
Each patient underwent a planning CT scan in the supine position. To limit patient displacement during treatment, immobilization devices were used: a thermoplastic mask for H&N, a knee wedge and foot lock for prostate and rectum treatments, and a breast board for breast cases. Planning CT images were acquired with a Canon Aquilion LB V6.3 series scanner (Canon Medical Systems Corporation, Ōtawara, Japan) at a tube voltage of 120 kVp. The slice thickness was 3 mm for H&N cancer and 5 mm for the other disease sites. The in-plane pixel size was 1 mm × 1 mm for all acquisitions.
The same CT acquisitions were contoured by LC, and the images, together with the DICOM RT Structure Set file, were then sent to the Treatment Planning System (TPS) Eclipse (Version 15.6, Varian Medical Systems, a Siemens Healthineers Company, Palo Alto, CA, USA). The LC structure set was then duplicated in the TPS. One copy was reviewed by the responsible radiation oncologist (RO), who modified the contours where necessary; the second copy was left unchanged.
For H&N cancer, fifteen OARs were contoured (brainstem, brachial plexuses, spinal cord, inner ears, parotid glands, thyroid, mandible, oral cavity, larynx, lungs and esophagus). For prostate cases, five structures were considered (bladder, femoral heads, rectum and penile bulb). For rectal cancer, four OARs were included (femoral heads, bladder, and bowel, contoured as abdominal cavity). Finally, for breast cancer, four structures were contoured (contralateral breast, heart, and both lungs).
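The structure counts above can be made explicit with a small lookup table. This is a sketch, not the study's actual configuration: the `OARS` dictionary and the `paired` helper are hypothetical, and expanding paired organs into left and right entries is an assumption made here because it reproduces the stated counts (15, 5, 4 and 4).

```python
def paired(name):
    """Expand a paired organ into left/right entries (assumption, see lead-in)."""
    return [f"{name} (left)", f"{name} (right)"]


# OARs per disease site, as listed in the text.
OARS = {
    "H&N": [
        "brainstem", *paired("brachial plexus"), "spinal cord",
        *paired("inner ear"), *paired("parotid gland"), "thyroid",
        "mandible", "oral cavity", "larynx", *paired("lung"), "esophagus",
    ],
    "prostate": ["bladder", *paired("femoral head"), "rectum", "penile bulb"],
    "rectum": [*paired("femoral head"), "bladder", "bowel (abdominal cavity)"],
    "breast": ["contralateral breast", "heart", *paired("lung")],
}

PATIENTS_PER_SITE = 3  # three patients per disease site, as stated in the text

counts = {site: len(names) for site, names in OARS.items()}
total_oars = sum(counts.values())
total_contours = total_oars * PATIENTS_PER_SITE

print(counts)
print(f"{total_oars} OARs, {total_contours} contours")
```

Under this reading, the per-site counts sum to 28 OARs, and three patients per site give 84 contours.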

Contouring Time
We recorded the time spent performing manual contouring for each CT scan. The time required for LC to generate the OARs on a consumer-grade system (3.1 GHz Intel Core i7, 8 GB memory) was also measured, as was the time spent by the ROs to review and, if necessary, edit the LC contours. The overall duration of contouring with LC (LC contouring plus RO review) was compared to the time required for manual contouring, which was used as the reference. In this way, the absolute and relative time differences between the two contouring methods were obtained.
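The comparison described above reduces to a short calculation. The following is a minimal sketch: the function name `time_saving` and the timings in the usage example are hypothetical and are not measurements from this study.

```python
def time_saving(manual_min, lc_generation_min, review_min):
    """Compare LC-assisted contouring with manual contouring (the reference).

    Returns (absolute saving in minutes, relative saving in percent).
    """
    if manual_min <= 0:
        raise ValueError("the manual contouring time must be positive")
    lc_total = lc_generation_min + review_min  # LC contouring plus RO review
    absolute = manual_min - lc_total
    relative = 100.0 * absolute / manual_min
    return absolute, relative


# Hypothetical timings for one CT scan (illustration only, not study data).
absolute, relative = time_saving(manual_min=60.0, lc_generation_min=5.0, review_min=25.0)
print(f"saving: {absolute:.0f} min ({relative:.0f}%)")
```

A negative result would indicate that reviewing and editing the automatic contours took longer than contouring manually.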

Geometrical Analysis
The manually contoured structures (MC) were compared with those generated by LC by means of three indicators: volume variation, Dice Similarity Coefficient (DSC) and shift of the center of mass. For structures with a volume greater than 15 cm³, the percentage volume variation was considered. For smaller structures, the absolute change in volume was analyzed instead, because small changes in the volume of a small structure produce large percentage variations that are not informative.
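The 15 cm³ rule can be sketched as a single function. Two choices below are assumptions not stated in the text: the manual volume is used both as the reference and to decide which side of the threshold a structure falls on, and the variation is returned with its sign. The function name and example volumes are hypothetical.

```python
VOLUME_THRESHOLD_CM3 = 15.0  # threshold quoted in the text


def volume_variation(manual_cm3, lc_cm3, threshold_cm3=VOLUME_THRESHOLD_CM3):
    """Return (variation, unit) of the LC volume relative to the manual one.

    Structures larger than the threshold get a percentage variation;
    smaller structures get the difference in cm3.
    """
    if manual_cm3 <= 0:
        raise ValueError("the manual volume must be positive")
    difference = lc_cm3 - manual_cm3
    if manual_cm3 > threshold_cm3:
        return 100.0 * difference / manual_cm3, "%"
    return difference, "cm3"


# Hypothetical volumes (illustration only).
print(volume_variation(100.0, 110.0))  # large structure, reported in percent
print(volume_variation(2.0, 2.5))      # small structure, reported in cm3
```

For the small structure in this example, a 0.5 cm³ difference would correspond to a 25% variation, which shows why percentages are less informative below the threshold.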
DSC [23] is a measure of the overlap of two volumes. Its value lies between 0 and 1, where 0 indicates no overlap and 1 indicates complete overlap. If X and Y are the two volumes to be compared, the coefficient DSC (X|Y) is defined as DSC (X|Y) = 2|X∩Y|/(|X| + |Y|). Finally, starting from the coordinates of the center of mass of each structure in the latero-lateral (X), cranio-caudal (Y) and antero-posterior (Z) directions, the displacement of the center of mass between the manual and auto-segmented contours was evaluated. All the parameters were obtained from the statistics tool of the contouring module of the Eclipse TPS.
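In the study these quantities were read from the Eclipse statistics tool. The sketch below only illustrates the definitions on voxelized structures, assuming each structure is represented as a set of voxel indices ordered (X, Y, Z) and that all voxels have equal weight. The helper names and the two box-shaped test structures are hypothetical.

```python
import math


def dice(x, y):
    """DSC = 2|X∩Y| / (|X| + |Y|) for two sets of voxel indices."""
    if not x and not y:
        raise ValueError("DSC is undefined for two empty structures")
    return 2.0 * len(x & y) / (len(x) + len(y))


def center_of_mass(voxels, spacing_mm):
    """Mean voxel position in mm along each axis (uniform voxel weights)."""
    if not voxels:
        raise ValueError("the structure is empty")
    n = len(voxels)
    return tuple(
        spacing * sum(v[axis] for v in voxels) / n
        for axis, spacing in enumerate(spacing_mm)
    )


def com_shift(manual, auto, spacing_mm):
    """Return (per-axis displacement in mm, 3D displacement in mm)."""
    a = center_of_mass(manual, spacing_mm)
    b = center_of_mass(auto, spacing_mm)
    delta = tuple(q - p for p, q in zip(a, b))
    return delta, math.sqrt(sum(d * d for d in delta))


def box(x_range, y_range, z_range):
    """Hypothetical box-shaped structure used only for this example."""
    return {(i, j, k) for i in x_range for j in y_range for k in z_range}


# Voxel spacing: 1 mm in plane (X, Z) and 5 mm slices along Y (cranio-caudal).
spacing = (1.0, 5.0, 1.0)
manual = box(range(0, 10), range(0, 4), range(0, 10))
auto = box(range(2, 12), range(0, 4), range(0, 10))  # shifted by 2 voxels in X

delta, distance = com_shift(manual, auto, spacing)
print(f"DSC = {dice(manual, auto):.2f}")
print(f"shift per axis (mm) = {delta}, 3D shift = {distance:.1f} mm")
```

In this example the two boxes have equal volume but share only 80% of their voxels, so the volume variation is zero while the DSC and the center of mass shift both reveal the mismatch.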

Dosimetric Analysis
A dosimetric analysis was performed to evaluate the effects of unsupervised use of LC on the assessment of dose distribution.
The original treatment plan, optimized and clinically approved with the manually contoured volumes, was recalculated on the LC contoured structure-set using the AAA algorithm (version 15.6.06) of Eclipse TPS, the same as the original plan.
The differences in the Dose Volume Histograms (DVH) between the two structure sets were then evaluated and plans were compared using the metrics reported in Table 1.
For serial organs, metrics associated with the maximum dose were used, while for parallel organs the mean dose or the dose to large volumes was considered.
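The kinds of DVH metrics mentioned above can be illustrated on a list of per-voxel doses. This is a sketch under stated assumptions: the dose is assumed to be sampled on the CT voxel grid (1 mm × 1 mm in plane, 5 mm slices), whereas a TPS typically uses its own dose grid; the function names and the dose values are hypothetical and do not reproduce Table 1.

```python
# Voxel volume from the CT geometry quoted in the text (1 x 1 x 5 mm3 = 0.005 cm3).
VOXEL_VOLUME_CM3 = (1.0 * 1.0 * 5.0) / 1000.0


def d_max(doses_gy):
    """Maximum dose in the structure (typical metric for serial organs)."""
    return max(doses_gy)


def d_mean(doses_gy):
    """Mean dose in the structure (typical metric for parallel organs)."""
    return sum(doses_gy) / len(doses_gy)


def volume_at_dose(doses_gy, threshold_gy, voxel_volume_cm3=VOXEL_VOLUME_CM3):
    """Absolute volume (cm3) receiving at least threshold_gy."""
    n_voxels = sum(1 for d in doses_gy if d >= threshold_gy)
    return n_voxels * voxel_volume_cm3


# Hypothetical per-voxel doses inside one structure (illustration only).
doses = [10.0] * 600 + [30.0] * 300 + [46.0] * 100

print(f"Dmax  = {d_max(doses):.1f} Gy")
print(f"Dmean = {d_mean(doses):.1f} Gy")
print(f"V45Gy = {volume_at_dose(doses, 45.0):.2f} cm3")
```

Because the metrics are evaluated only over the voxels inside a structure, the same dose distribution yields different values when the contour changes, which is what the recalculation on the LC structure set measures.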

Contouring Time
The absolute and percentage variations of the contouring times are shown in Figure 1. The maximum time saving, both absolute and relative, was obtained for the H&N setting (80 min and 65%, respectively). The minimum changes, both absolute and relative, were found for rectum (3 min and 17%, respectively). Similar variations were found for prostate treatments, while breast cases showed intermediate values.

Geometrical Analysis
Figure 2 shows the average percentage variations in volume for structures with a volume greater than 15 cm³. The associated uncertainty is expressed in terms of ±1 standard deviation. The OAR with the minimum variation (1%) is lung; the structures with the greatest percentage variation are bowel and oral cavity, with mean percentage variations of 65% and 32%, respectively.
The absolute volume variations for structures with a volume smaller than 15 cm³ are reported in Figure 3. The associated uncertainty is expressed in terms of ±1 standard deviation. All the structures show values close to or less than 1 cm³. Figure 4 shows the average Dice Index for the analyzed structures, with the associated uncertainty expressed as ±1 standard deviation. The lowest DSC value is 0.39 for the penile bulb. The best results were found for lungs, characterized by a Dice Index of 0.99. Furthermore, a good agreement was found for bladder, heart, and femoral heads, with values greater than or close to 0.9. Considering all structures, the average DSC is 0.72.
The absolute value of the three-dimensional displacement of the center of mass is represented, for all the structures, in Figure 5. The lowest values were found for lungs, with values close to 0. The greatest displacement occurred for bowel, with a value equal to 2.4 cm. In Figure 6 the absolute values of the displacements in each direction are reported for bowel.

Dosimetric Analysis
In Table 1, the metrics used for the dosimetric comparison of the treatment plans are reported. The most relevant difference was found in the bowel for rectal cancer treatments: the mean volume covered by the 45 Gy isodose was 10.4 cm³ for the MC structures versus 289.4 cm³ for the LC ones.

Discussion
The present study explores the effects of commercial deep-learning based software for auto-contouring on the clinical workflow of a radiation oncology department at a tertiary cancer hospital. In particular, the focus was on timesaving and on the accuracy of the contoured structures.
To accurately assess the time reduction, we evaluated the clinical settings having the highest impact on the workflow in our radiotherapy department. In addition, for each disease site we focused on, all the OARs included in the clinical routine were considered.
The performance of LC has already been analyzed by other authors, who investigated multiobserver variability [1], qualitative evaluations by expert ROs [21] and specific evaluations for lung SBRT [22,23]. Furthermore, Zabel et al. [16] compared the manual contouring workflow with LC and with an additional atlas-based automatic contouring algorithm for bladder and rectum contouring. Finally, a recent study by D'Aviero et al. evaluated the geometric accuracy of the contours, limited to the H&N region [24]. The present study includes 28 OARs across 4 disease sites, with three patients per site, resulting in a total of 84 contours analyzed. To the best of our knowledge, there are no data available in the literature on such a comprehensive list of OARs and diseases. Furthermore, this study investigates the entire radiotherapy workflow, focusing on the geometrical accuracy, time savings and dosimetric implications of LC implementation in a radiotherapy department.
The potential time saving is greater in anatomical regions characterized by a greater number and complexity of OARs. Our data are similar to those reported in the literature. As an example, in the setting of lung cancer, Lustberg et al. [2] showed an average time saving of 61% compared to existing clinical practice and 22% compared to the use of atlas-based contours. Wong et al. [1] also found remarkable decreases in contouring time, although for H&N the absolute time reduction is not comparable to ours due to the smaller number of structures contoured by Wong et al.
LC provides good results, as no gross contouring errors were found. This performance is reflected in the average Dice Index of 0.72, which can be considered acceptable in clinical practice [21]. However, some OARs have characteristics that deserve to be discussed.
As can be seen from the center of mass and geometrical analyses, manual contouring and automatic segmentation of the bowel differ. LC considers the entire abdominal cavity as bowel, extending the caudal limit to include the whole inferior abdomen regardless of the presence of intestinal loops. In manual contouring, however, bowel was contoured as the abdominal cavity with a caudal limit defined by the presence of intestinal loops [14,25,26]. These differences explain the dosimetric variation observed.
Regarding the oral cavity, the differences are due to different contouring approaches; similar results were found by Zhong et al. [27]. LC considers the extended oral cavity, as suggested by the Contouring Head and Neck OARs Guidelines [15], including the oral tongue and the anterior portion of the oropharynx. In manual contouring, the latter was instead excluded from the oral cavity OAR, since it is part of the PTV.
The low DSC for the penile bulb is an expected finding, as the anatomical markers and the soft tissue contrast needed to identify it are generally lacking on CT. To better identify the penile bulb and reduce the large contouring variability, some authors have stressed the importance of performing an MRI, or a CT scan with contrast in the urethra [14].
Regarding the brachial plexuses, institutional practice is not to extend the contour of the brachial plexus laterally as far as the thoracic wall, because for oropharyngeal tumors the dose to the axillary trunk of the brachial plexus is negligible [28]. This choice reflects the need, in manual contouring, to reach a compromise between contouring time and the usefulness of the resulting contour. However, this tradeoff is not necessary in the case of automatic contouring.
A disagreement in the cranial limit of the plexuses was also found. In manual contouring, the brachial plexuses are delineated starting from the spinal nerves exiting the neural foramina, from the C4-C5 (C5 nerve roots) to the T1-T2 (T1 nerve roots) level. In LC, the cranial limit of the brachial plexus is often higher, such as C2-C3, probably due to the position of the neck. These issues explain the differences between our DSC values for the brachial plexuses (about 0.7) and those found by D'Aviero et al. [24] (about 0.95).
Regarding the parotid glands, no significant changes in geometric parameters were found. However, there is a non-negligible variation in the dosimetric indicator. Although the shape and position of the parotid glands are similar in manual contouring and LC, minimal differences can drastically affect dosimetric parameters because the parotid glands lie close to the PTV, in a region of steep dose gradients. These results are similar to those reported by Nelms et al. [29].
Good results were found for lungs, femoral heads and bladder. DSC values for these OARs were similar to those found by Wong et al. for bladder, femoral heads [21] and lungs [22]. Furthermore, the bladder DSC value reported by Zabel et al. [16] (0.97) confirms our result.
A limitation of the study is the low number of patients analyzed for each setting. However, the analysis considers all OARs involved in the clinical workflow for the considered anatomical regions. This allows for a comprehensive assessment of the impact of LC on radiotherapy routine and considers all the steps of the radiotherapy planning process, from contouring to dosimetric consequences of the unsupervised use of LC. A complete description of LC impact on radiotherapy routine can provide useful information.
As a novelty, this study provides quantitative evidence of the time savings achieved by LC use. These values are realistic, thanks to the number of contoured structures. Furthermore, it is possible to identify the anatomical sites which most benefit from LC. Dosimetric evaluation shows that, although DVH differences are not significant in most cases, LC contoured structures must always be supervised by an expert contourer. Otherwise, especially in regions near high dose gradients, there may be relevant dosimetric variations.

Conclusions
Although an accurate visual review by an expert clinician is still required, LC can significantly reduce the time required for contouring and simplify the workflow leading to treatment planning. Its implementation can also reduce interobserver variability and improve the interpretation of radiological anatomy. Furthermore, LC can support staff training and the continuous assessment of clinical contouring and structure segmentation.