Clinical Evaluation of Deep Learning and Atlas-Based Auto-Contouring for Head and Neck Radiation Therapy

Featured Application: Deep learning (DL) auto-contouring, rather than atlas-based auto-contouring or manual contouring, should be used for anatomy segmentation in head and neck radiation therapy to reduce contouring time, and commercial DL auto-contouring tools should be further trained with local hospital datasets to enhance their geometric accuracy.

Abstract: Various commercial auto-contouring solutions have emerged over the past few years to address the labor-intensiveness and the inter- and intra-operator variabilities of traditional manual anatomy contouring for head and neck (H&N) radiation therapy (RT). The purpose of this study is to compare the clinical performance of the RaySearch Laboratories deep learning (DL) and atlas-based auto-contouring tools for organs at risk (OARs) segmentation in H&N RT, with manual contouring as the reference. Forty-five H&N computed tomography datasets were contoured by the DL and atlas-based auto-contouring tools for 16 OARs, and the time required for the segmentation was measured. Dice similarity coefficient (DSC), Hausdorff distance (HD) and HD 95th-percentile (HD95) were used to evaluate the geometric accuracy of the OARs contoured by the two tools. A paired-samples t-test was employed to compare the mean DSC, HD, HD95 and contouring time values of the two groups. The DL auto-contouring approach achieved more consistent performance in OARs segmentation than the atlas-based approach, resulting in a statistically significant reduction of the time required for the whole segmentation process by 40% (p < 0.001). The DL auto-contouring had statistically significantly higher mean DSC and lower HD and HD95 values (p < 0.001–0.009) for 10 out of 16 OARs. This study shows that the RaySearch Laboratories DL auto-contouring tool has significantly better clinical performance than its atlas-based counterpart.


Introduction
Nowadays, intensity-modulated radiation therapy (IMRT) is an important cancer treatment option because it can reduce the toxic effects associated with radiation therapy (RT). One of the essential requirements for IMRT is anatomy contouring [1–3]. Traditionally, this requires radiation therapists and/or oncologists to manually identify and contour tumors (clinical target volumes (CTVs)) and normal tissues (organs at risk (OARs)). Labor-intensiveness and inter- and intra-operator variabilities are two major issues of manual contouring [1–5]. Various commercial auto-contouring solutions have emerged over the past few years to address these issues. Atlas-based and deep learning (DL) approaches are used to develop these auto-contouring solutions [3–11]. The atlas-based method automatically registers reference patient contours (gold standard/ground truth) to a new patient by deforming the reference contours to match the new patient's anatomical structures. This approach can be subdivided into three categories, namely single atlas (using one reference dataset), multi-atlas (using multiple datasets) and hybrid (based on the outcome of statistical analysis of multiple gold standards) [3,4,12]. Over the past five years, the use of DL in medical imaging has become popular [13–15]. Commercial companies such as Manteia Medical Technologies Co. (Xiamen, China) [5], Mirada Medical Limited (Oxford, UK) [7] and Carina Medical LLC (Lexington, KY, USA) [10] have applied the widely used deep convolutional neural network (DCNN) architecture to develop their auto-contouring models (software). DL auto-contouring model development involves providing training datasets with ground truth contours to the model so that it automatically learns features of tumors and normal tissues (including those not known to humans). With sufficient training, the model becomes capable of automatically searching for and locating these features to achieve auto-contouring for new patients. Hence, it is believed that the DL auto-contouring approach performs better than the atlas-based method because of its capability of identifying unknown but relevant features for more accurate contouring [1–3,5,7,10–12,16–21].
It is well known that head and neck cancer segmentation is a challenging task because computed tomography (CT) images, which have limited soft tissue contrast, are commonly used in the RT planning process [1,2,22–26]. This results in the greatest inter- and intra-operator variabilities, and manual contouring in head and neck RT takes about 2–3 h [2,22–25]. Few studies have compared the performance of commercial atlas-based and DL auto-contouring software packages with manual contouring as the reference [5,6,10,27]. The investigated atlas-based and DL auto-contouring software pairs included Maestro 6.6.5 (MIM Software Inc., Cleveland, OH, USA) versus AccuContour (Manteia Medical Technologies Co., Xiamen, China) [5], among others [6,10,27]. Those studies consistently showed that DL auto-contouring was more accurate and required less time for segmentation of OARs [5,6,10,27]. Given these promising results, it appears worthwhile to investigate the potential of other unexplored DL auto-contouring software packages such as the RaySearch Laboratories AB RayStation 12A RSL Head and Neck CT 2.0.0.47 DL auto-contouring model. The purpose of this study is to compare the clinical performance of the RaySearch Laboratories AB RayStation DL auto-contouring model, RSL Head and Neck CT 2.0.0.47, and its atlas-based auto-contouring software, ANACONDA, for OARs segmentation in head and neck RT, with manual contouring as the reference. It is hypothesized that the DL auto-contouring model is more accurate and requires less contouring time than the atlas-based auto-contouring tool.

Study Design and Imaging Data
This was a retrospective study with methods based on similar studies comparing the performance of atlas-based and DL auto-contouring for OARs segmentation in head and neck RT [5,6,10,27]. Planning CT datasets of 45 head and neck cancer patients who had RT treatments between 2018 and 2021 at Pamela Youde Nethersole Eastern Hospital in Hong Kong Special Administrative Region were retrospectively collected from the Eclipse treatment planning system (Varian Medical Systems, Palo Alto, CA, USA). Patient inclusion criteria were: (1) histologically proven head and neck cancer and (2) radical radiotherapy received. Patients with pre-radiotherapy surgery were excluded. Patient consent was waived due to the retrospective nature of the study. The collected CT datasets were acquired with the following parameters: slice thickness of 2 mm, 400 mAs, 120 kV and field of view of 600 mm, in accordance with the routine protocol of Pamela Youde Nethersole Eastern Hospital [27]. Figure 1 shows an overview of the study design.

Manual Contouring
Manual contouring for the 45 collected CT datasets was conducted on the Eclipse treatment planning system by a radiation therapist with more than 10 years of experience in head and neck RT planning, as per international consensus delineation guidelines [28]. Contoured OARs included brainstem, larynx, optic chiasm, oral cavity, pituitary, spinal cord, two cochleae, two eyes, two lenses, two optic nerves and two parotid glands [27]. The manual contours were subsequently reviewed and modified by a radiation oncologist specializing in head and neck RT for eventual clinical use, and were hence considered the ground-truth contours [10].

Atlas-Based and DL Auto-Contouring
The 45 CT datasets were exported to the RaySearch Laboratories AB RayStation 12A treatment planning system (with Intel Xeon W-10885M central processing unit (Santa Clara, CA, USA), 64 GB random access memory and NVIDIA Quadro RTX 5000 16 GB graphics card (Santa Clara, CA, USA)) at Pamela Youde Nethersole Eastern Hospital. Its ANACONDA and RSL Head and Neck CT 2.0.0.47 tools were used for the atlas-based and DL auto-contouring, respectively. ANACONDA used the hybrid approach for auto-contouring. When a new image set was to be contoured, multiple best-matching atlases from a pool of 40 nasopharyngeal cancer datasets within the hospital database were first determined by rigid image registration. Contours from these best-matching atlases were subsequently mapped to the new image set using deformable registration, and the mapped contours were eventually merged into the segmentation outcome via a fusion algorithm [10]. The DL auto-contouring involved a pre-trained three-dimensional (3D) U-net CNN model that classified each voxel of the datasets as either a specific tissue (i.e., an OAR) or non-specific tissue to generate labelled (segmented) datasets as the outcome. Only the pre-trained DL auto-contouring model, without any fine-tuning, was used for the DL auto-contouring. Details of its 3D U-net CNN architecture and of the model training, validation and testing are available elsewhere [29].
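The final fusion step of a multi-atlas pipeline can be illustrated with a minimal sketch. The fusion algorithm used by ANACONDA is not specified here, so per-voxel majority voting is used purely as an assumed stand-in; the function name `majority_vote` and the toy masks are hypothetical.

```python
def majority_vote(masks):
    """Fuse binary masks (equal-length sequences of 0/1) voxel by voxel.

    A voxel is labelled 1 when strictly more than half of the masks label it 1.
    Majority voting is an illustrative assumption, not the vendor's algorithm.
    """
    if not masks:
        raise ValueError("at least one mask is required")
    n = len(masks)
    length = len(masks[0])
    if any(len(m) != length for m in masks):
        raise ValueError("all masks must have the same number of voxels")
    return [1 if 2 * sum(votes) > n else 0 for votes in zip(*masks)]


# Three hypothetical atlas contours after deformable mapping,
# flattened to six voxels each (1 = inside the OAR).
propagated = [
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 0],
]
fused = majority_vote(propagated)
print(fused)
```

Voxels on which the mapped atlases disagree are resolved by the vote, which is one reason fused multi-atlas contours tend to be smoother than any single mapped atlas.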

Geometric Accuracy and Contouring Time Evaluation
Three parameters were employed to evaluate the geometric accuracy of the 16 OARs contoured by the atlas-based and DL auto-contouring tools, with the manual contours as the reference: Dice similarity coefficient (DSC), which quantifies the degree of overlap between two contours (Contour_A and Contour_B); Hausdorff distance (HD), defined as the largest 3D distance from a point on one contour to the nearest point on the other contour; and HD 95th-percentile (HD95). Equations (1)-(3) were used for DSC, HD and HD95 calculations, respectively [5,6,10,27].
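Equations (1)-(3) are not reproduced in this text. As a sketch only, the conventional definitions of the three metrics are given below in the order DSC, HD and HD95, on the assumption that they correspond to the forms used in this study. The notation is illustrative: $A$ and $B$ denote Contour_A and Contour_B, $|\cdot|$ is the contour volume (voxel count), $d(a,b)$ is the Euclidean distance between points $a$ and $b$, and $P_{95}$ is the 95th percentile taken over the indicated points. Some implementations define HD95 slightly differently (for example, by pooling both sets of directed distances).

```latex
\begin{align}
\mathrm{DSC}(A,B) &= \frac{2\,|A \cap B|}{|A| + |B|} \\
\mathrm{HD}(A,B) &= \max\Bigl\{ \max_{a \in A}\,\min_{b \in B} d(a,b),\;
                               \max_{b \in B}\,\min_{a \in A} d(a,b) \Bigr\} \\
\mathrm{HD95}(A,B) &= \max\Bigl\{ \operatorname*{P_{95}}_{a \in A}\,\min_{b \in B} d(a,b),\;
                                 \operatorname*{P_{95}}_{b \in B}\,\min_{a \in A} d(a,b) \Bigr\}
\end{align}
```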
For the contouring time evaluation, the time required by the atlas-based and DL auto-contouring tools for the auto-contouring processes was recorded. Additionally, the same radiation therapist involved in the previous manual contouring process reviewed and edited the 16 OARs contoured by the two auto-contouring tools as per the clinical protocol of the hospital. The review and editing time was recorded as well [27]. No dataset with patient personal information was taken from the hospital.
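The geometric accuracy metrics described in this section can be sketched in a few lines of pure Python. This is an illustration under stated assumptions rather than the software used in the study: contours are represented as sets of voxel coordinates, distances are in voxel units (they would need scaling by the voxel spacing to obtain mm), HD95 is taken as the larger of the two directed 95th percentiles, and all function names and toy contours are hypothetical.

```python
import math


def dice(a, b):
    """Dice similarity coefficient between two sets of voxel coordinates."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention assumed here: two empty contours agree
    return 2.0 * len(a & b) / (len(a) + len(b))


def directed_distances(src, dst):
    """For each point in src, Euclidean distance to its nearest point in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def percentile(values, q):
    """q-th percentile (0-100), linear interpolation between order statistics."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def hausdorff(a, b, q=100.0):
    """Symmetric Hausdorff distance; q=95 gives HD95 as defined above."""
    a, b = list(a), list(b)
    return max(percentile(directed_distances(a, b), q),
               percentile(directed_distances(b, a), q))


# Toy example: a 4 x 4 reference contour and an automatic contour shifted by
# one voxel, with and without a single stray voxel far from the structure.
reference = {(x, y, 0) for x in range(4) for y in range(4)}
shifted = {(x + 1, y, 0) for x in range(4) for y in range(4)}
with_outlier = shifted | {(10, 0, 0)}

print(dice(reference, shifted), hausdorff(reference, shifted))
print(hausdorff(reference, with_outlier), hausdorff(reference, with_outlier, q=95))
```

In the toy example the single stray voxel raises HD from 1 to 7 voxels, whereas HD95 rises far less, which illustrates why HD95 is less sensitive to outliers than HD.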

Statistical Analysis
SPSS Statistics 28 (International Business Machines Corporation, Armonk, NY, USA) was used for statistical analysis. A paired-samples t-test was employed to compare the mean DSC, HD, HD95, auto-contouring time, review and editing time, and total segmentation process time values of the DL and atlas-based auto-contouring groups. A p-value less than 0.05 represented statistical significance [5,6,10].
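For readers unfamiliar with the paired-samples t-test, the statistic can be sketched as follows. The analysis in this study was run in SPSS; the snippet is only an illustration of the underlying formula, the function name `paired_t` is hypothetical, and the DSC values are made up for the example and are not study data. Converting the statistic to a p-value requires the t distribution with the returned degrees of freedom, which is available in standard statistics packages.

```python
import math
import statistics


def paired_t(x, y):
    """Return (t statistic, degrees of freedom) for paired samples x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    diffs = [a - b for a, b in zip(x, y)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_d == 0:
        raise ValueError("differences have zero variance; t is undefined")
    n = len(diffs)
    return mean_d / (sd_d / math.sqrt(n)), n - 1


# Hypothetical per-patient DSC values for one OAR (illustrative only).
dsc_dl = [0.80, 0.85, 0.90, 0.75]
dsc_atlas = [0.70, 0.80, 0.80, 0.70]
t_stat, dof = paired_t(dsc_dl, dsc_atlas)
print(round(t_stat, 3), dof)
```

Pairing matters because both tools contoured the same 45 datasets, so the test operates on per-patient differences rather than on two independent groups.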

Geometric Accuracy
The geometric accuracy of the DL auto-contouring approach was higher than that of the atlas-based method overall. Figure 2 shows that the DL auto-contouring performed more consistently, with smaller DSC, HD and HD95 variations for nearly all OARs when compared with the atlas-based auto-contouring. Additionally, its geometric accuracy (in terms of higher DSC and lower HD and HD95 values) was better for 10 out of 16 OARs. These 10 OARs were the brainstem, left and right eyes, left and right lenses, left and right optic nerves, left and right parotid glands, and pituitary. Except for the DSC values of the left eye, the DL auto-contouring also had statistically significantly higher mean DSC and lower HD and HD95 values for these 10 OARs (Table 2). Nonetheless, Table 2 also illustrates that both the atlas-based and DL auto-contouring approaches had very poor/poor accuracy (DSC < 0.50/HD95 ≥ 6 mm) in contouring the larynx, optic chiasm, oral cavity and spinal cord. In addition, the DL auto-contouring only had mean DSC values of 0.31-0.35 for contouring the left and right cochleae. Figures 3 and 4 show two examples of OARs segmentation by the three approaches, namely manual contouring (gold standard), DL and atlas-based auto-contouring.

Contouring Time Evaluation
The DL auto-contouring approach required about 40% less time on average to complete the segmentation of all OARs for a dataset. This resulted from 55% and 35% less time required for the initial auto-contouring and the subsequent review and editing processes, respectively. Table 3 illustrates the mean time required for the atlas-based and DL auto-contouring processes. The DL approach required statistically significantly less time for auto-contouring and for review and editing, and hence for the whole process.

Discussion
To the best of our knowledge, this was the first study to compare the performance of the RaySearch Laboratories AB RayStation DL auto-contouring model, RSL Head and Neck CT 2.0.0.47, and its atlas-based auto-contouring software, ANACONDA, for OARs segmentation in head and neck RT. This study's results show that the RSL Head and Neck CT 2.0.0.47 DL auto-contouring approach could achieve more consistent performance in OARs segmentation than the ANACONDA atlas-based approach (Figure 2). These findings are in line with similar studies comparing the INTContour DL auto-contouring approach with the Maestro 6.9.6 and ANACONDA atlas-based approaches for head and neck RT [6,10]. This consistent performance resulted in less time required for manual correction because the required adjustments became more straightforward and could be completed rapidly. Hence, this contributed to the statistically significant reduction of the review and editing time and of the time required for the whole segmentation process by 35% and 40%, respectively (Table 3). In Li et al.'s [6] study, the INTContour DL and Maestro 6.9.6 atlas-based approaches required 120 s and 600 s to contour 4 head and neck OARs, respectively. Although this indicates that their DL approach was able to reduce the total contouring time by 80%, the number of OARs involved in their study was only one-fourth of the number in this study. Hence, the total contouring time required for the RSL Head and Neck CT 2.0.0.47 DL auto-contouring approach appears comparable to that in their study. These findings are particularly important for addressing the major issues of manual contouring, namely inter- and intra-operator variabilities and labor-intensiveness in head and neck RT [1–5].
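The arithmetic behind the comparison with Li et al. [6] can be made explicit with a short sketch. The 120 s, 600 s and 4-OAR figures are those quoted above; the per-OAR normalization assumes, as a simplification introduced here, that contouring time scales roughly linearly with the number of OARs, and the function name is hypothetical.

```python
def percent_reduction(baseline, new):
    """Relative reduction of `new` with respect to `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - new) / baseline


# Figures quoted in the text for Li et al. [6]: 4 OARs contoured.
atlas_seconds, dl_seconds, n_oars = 600, 120, 4

reduction = percent_reduction(atlas_seconds, dl_seconds)
dl_per_oar = dl_seconds / n_oars        # simplifying linear-scaling assumption
atlas_per_oar = atlas_seconds / n_oars  # same assumption

print(reduction, dl_per_oar, atlas_per_oar)
```

Normalizing by the number of OARs is what allows total times from studies with 4 and 16 OARs to be set side by side, although it ignores differences in structure size and complexity.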
To further address the manual contouring issues, high geometric accuracy of contours is essential. This study's results show that the RSL Head and Neck CT 2.0.0.47 DL auto-contouring had statistically significantly higher mean DSC and lower HD and HD95 values for 10 out of 16 OARs, namely the brainstem, left and right eyes, left and right lenses, left and right optic nerves, left and right parotid glands, and pituitary, when compared with those of the ANACONDA atlas-based approach (Table 2). However, the highest mean DSC value achieved by the RSL Head and Neck CT 2.0.0.47 DL auto-contouring approach was only 0.88 for the right eye and the lowest was 0.31 for the right cochlea. In Wang et al.'s [5] study comparing the performance of the Maestro 6.6.5 atlas-based tool and the pre-trained and trained AccuContour DL auto-contouring tools for head and neck OARs contouring, the highest mean DSC value achieved by their pre-trained DL tool was above 0.9, but four (left and right temporomandibular joints and left and right optic nerves) out of 14 OARs contoured by their pre-trained DL model had mean DSC below 0.6. In contrast, their Maestro 6.6.5 atlas-based auto-contouring tool performed better for these four OARs. In this study, five (left and right cochleae, larynx, optic chiasm and spinal cord) out of 16 OARs contoured by the pre-trained RSL Head and Neck CT 2.0.0.47 DL model had mean DSC below 0.6. Except for the larynx, the ANACONDA atlas-based approach had higher mean DSC values for these OARs. The DSC findings of this study appear comparable to those of the corresponding tools in Wang et al.'s study [5].
According  respectively) were also found. Although DSC and HD/HD95 are all parameters for evaluating the geometric accuracy of contours, DSC is an indicator of shape similarity while HD/HD95 illustrate location similarity. HD95 is generally considered a better descriptor of location similarity than HD because it excludes outliers [6,27]. Similar discrepancies between the DSC and HD values were also found in Wang et al.'s [5] study. For example, their pre-built DL model had mean DSC values of about 0.8 (good) for the oral cavity and left and right parotid glands, but the corresponding mean HD values were more than 17 mm (poor). To address these discrepancies and improve the geometric accuracy of the DL auto-contouring tool, Wang et al. [5] used 120 datasets from their local hospital database to fine-tune the DL model. In this way, the DL model could learn the local contouring protocols and become more capable of accurately contouring the OARs as per the hospital's requirements. After further training, their AccuContour DL model was able to achieve DSC values above 0.7 for all 14 OARs. This highlights that fine-tuning of the commercial DL model is key to improving its geometric accuracy. In this study, discrepancies between the local hospital contouring protocol and the DL auto-contouring tool were also found. For example, Figure 4 shows the notable difference between the spinal cord caudal limits of the manual contouring protocol and the DL approach. It is expected that further training of the RSL Head and Neck CT 2.0.0.47 DL model would help to reduce these discrepancies and improve its accuracy [29].
This study had several major limitations. Only 45 nasopharyngeal cancer datasets were used for the atlas-based and DL auto-contouring tool evaluation. However, the number of evaluation datasets in this study was more than double the dataset numbers of the similar studies by Wang et al. [5] (20 datasets) and Li et al. [6] (22 datasets). Additionally, this study covered 16 OARs, which was greater than the numbers of OARs (4–14) in other similar studies [5,6,10]. Although only one radiation therapist with more than 10 years of experience in head and neck RT planning was involved in the contouring time evaluation, the international consensus delineation guidelines were used for reviewing and editing the contours to minimize the inter- and intra-operator variabilities [5,28]. Besides, this study only evaluated one DL auto-contouring model and did not investigate the dosimetric impact of contouring accuracy, but these arrangements were consistent with the other similar studies [5,6,10,27].

Conclusions
This study compared the clinical performance of the RaySearch Laboratories AB RayStation DL auto-contouring model, RSL Head and Neck CT 2.0.0.47, and its atlas-based auto-contouring software, ANACONDA, for the segmentation of 16 OARs in head and neck RT, with manual contouring as the reference. Its results show that the DL auto-contouring model was more accurate and required less contouring time than the atlas-based auto-contouring tool. The DL auto-contouring approach could achieve more consistent performance in OARs segmentation than the atlas-based approach, resulting in a statistically significant reduction of the time required for the whole segmentation process by 40%. Additionally, the DL auto-contouring had statistically significantly higher mean DSC and lower HD and HD95 values for 10 out of 16 OARs, namely the brainstem, left and right eyes, left and right lenses, left and right optic nerves, left and right parotid glands, and pituitary. These outcomes are particularly important for addressing the major issues of manual contouring, namely inter- and intra-operator variabilities and labor-intensiveness in head and neck RT. However, for future studies, more radiation oncologists and a greater variety of head and neck cancer datasets should be used to evaluate the performance (in terms of contouring time, geometric accuracy and associated dosimetric impact) of multiple atlas-based, pre-trained and trained DL auto-contouring tools in contouring a greater number of OARs. The performance of the RaySearch Laboratories DL and atlas-based auto-contouring tools in contouring both CTVs and OARs for other cancer types, such as breast cancer, should be compared as well.