1. Introduction
As evidenced by the World Health Organization (WHO) [1], the burden of cancer has been growing over the years and is predicted to increase by 77% from 2022 to 2050. It is estimated that about 1 in 5 people will develop cancer over their lifetime, and that approximately 1 in 9 men and 1 in 12 women will die from the disease. Radiotherapy (RT) has been recognized as an essential element of an effective cancer care program worldwide [2], whether delivered with palliative or curative intent, as it can prevent recurrence in the treated area or provide symptom relief when lesions have metastasized or grown out of control [3]. Cancer treatment often requires a multidisciplinary approach, leading to the use of RT in combination with surgical intervention and chemotherapy [4].
The need for patient-specific quality assurance (PSQA), driven by technological advancements in RT, has long been central to the allocation of departmental resources. However, in high-throughput centers, the growing number of patients and the increasing use of Volumetric Modulated Arc Therapy (VMAT) over simpler techniques further amplify the impact of PSQA on the routine workflow of RT departments.
In recent years, artificial intelligence, machine learning, and deep learning have been widely employed to manage and analyze large quantities of data, improving the efficacy of procedures through a more comprehensive, data-informed approach. Deep neural networks (DNNs) and convolutional neural networks (CNNs) have been used to predict various aspects of treatment, from diagnosis and the intermediate steps necessary to describe images, to the prediction of treatment response and prognosis.
Having algorithms dedicated to the processing of large numbers of features has led to the development of methods that can generate descriptors from all kinds of data. In the field of medical imaging, big-data techniques for generating features fall under the umbrella term of radiomics [5]; when applied to dose distributions, the resulting features are called dosiomics. The features described in [5] capture first-order statistics, as well as more complex texture properties, of the images from which they originate. Using a shared set of descriptors aims to standardize and harmonize the reporting and investigation of the same kind of images, while offering a more straightforward interpretation than the deep features that each DNN builds during the training phase.
The use of dosiomic features in training DNNs and CNNs may lead the deep features to look at different properties of the images, possibly leading to a more complete understanding of the training data.
In this work we have developed a data-driven approach based on Deep Learning (DL) algorithms to aid in identifying the VMAT RT plans that should be checked, thereby optimizing the use of resources by reducing the number of unnecessary PSQA measurements.
To the best of our knowledge, at the time of writing, this is the first work to apply a 3D convolutional network to the absorbed dose distribution in conjunction with dosiomic features and to evaluate differences in performance related to the use of the flattening filter.
In the following sections, after having briefly described the data available and the neural networks used for the analysis, the results will be presented using confusion matrices and decision curves with increasing levels of pragmatism. Finally, the results will be discussed and compared with the available literature.
2. Materials and Methods
2.1. Input Dataset
The LINACs available to the department are all from Elekta, two VersaHD and one Synergy (Elekta AB, Stockholm, Sweden). The emission spectra of the three LINACs are energy-matched, so they can be considered interchangeable from a treatment delivery standpoint.
VMAT treatment plans were generated in the Pinnacle3 Treatment Planning System (TPS) using the auto-planning engine, employing both flattening filter (FF) and flattening filter-free (FFF) photon beams. Dose calculation was performed in Pinnacle3 version 16.4.3 with the collapsed cone convolution (CCC) algorithm, on a dose grid with isotropic voxels whose side was most frequently 3 mm and occasionally 2 mm. The instrument used for dosimetry was the PTW Octavius 1500 matrix (PTW-Freiburg, Freiburg, Germany) described in [6].
A total of 1802 dose distributions computed inside the PSQA phantom, acquired from 2021 to 2024 at the IRCCS Azienda Ospedaliero-Universitaria di Bologna, were available for this study. All of them had an associated report with the results of the PSQA in terms of γPassingRate and Γ-Index.
The distribution of the γPassingRate over the entire dataset, shown in Figure 1, presented a median of 97.6%, an average of 97.2%, and a minimum of 84.7%.
The threshold to identify safe deliverability of the treatment plans was chosen according to [7] which, in the absence of center-specific values, suggests using γPassingRate > 95% when the Γ-Index is computed with a Distance To Agreement (DTA) of 2 mm and a Dose Difference (DD) of 3%, hereafter indicated as γ3%,2 mm.
A combined Matlab and R-script was developed to collect the γPassingRate for each patient and plan.
2.2. Model Development and Validation
A general overview of the analysis performed in the study can be seen in Figure 2.
The input data available to the models were:
- I. All the 4470 dosiomic features;
- II. The normalized and thresholded 3D dose distribution, originally computed by the Pinnacle3 TPS in the geometry used during the PSQA measurement (i.e., the Octavius phantom).
Before developing the model, the dataset was split to use 80% of the data in the training phase, reserving the remaining 20% for final validation. During the training phase, the dataset was further randomly subdivided at each epoch into a training set (80%) and a testing set (20%), the latter used to monitor performance during each epoch.
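A minimal sketch of this partitioning scheme is shown below; the index-based bookkeeping and all names are illustrative rather than taken from the study code.

```python
# Minimal sketch of the splitting scheme (dataset size and names are illustrative).
import numpy as np

rng = np.random.default_rng(seed=42)
n_plans = 839                      # plans available after the pairing step below
indices = rng.permutation(n_plans)

# Fixed 80/20 hold-out: the last 20% is reserved for the final validation.
cut = int(0.80 * n_plans)
development_idx, validation_idx = indices[:cut], indices[cut:]

def per_epoch_split(dev_idx: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Random 80/20 re-split of the development data, repeated at every epoch."""
    shuffled = rng.permutation(dev_idx)
    c = int(0.80 * len(shuffled))
    return shuffled[:c], shuffled[c:]

train_idx, test_idx = per_epoch_split(development_idx)
```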
Two models were developed and investigated, which were called hybrid because both architectures were composed of a fully connected branch, dedicated to the analysis of dosiomic features, and a CNN branch, devoted to the analysis of the 3D dose distributions.
The single branches, with their respective input data, were trained and tested separately to verify that the simultaneous usage of both inputs led to an improvement in performance. To ease the presentation of the best results obtained in our research, the results of this analysis are shown in the Supplementary Material.
A Python script was developed to automatically pair the γPassingRate from the report with the dose distribution, including only cases in which the Γ-Index was computed with DTA = 2 mm and DD = 3% and in which there were no ambiguities in the plan-to-report association, such as in cases of second treatments or re-planning. This process reduced the number of available dose distributions to 839, which were stored as pairs of DICOM files comprising both RTDose and RTPlan.
Since only dose distributions were used as input, and in order to reduce the time necessary to access these files in memory, all RTDose files were converted to NRRD format using a bash script that calls plastimatch [8].
In the conversion phase, to obtain a homogeneous dataset, all dose distributions were normalized to the maximum dose and thresholded to include only the values > 25% of DMax. This was done to obtain input images similar to the conditions used in the computation of the Γ-Index, which was performed with global normalization and a 20% dose threshold.
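As an illustration, a minimal sketch of this normalization and thresholding step is given below, assuming the RTDose volume has already been loaded into a NumPy array (e.g., from the NRRD file); the function and variable names are illustrative.

```python
# Minimal sketch: normalize to the maximum dose and zero-out low-dose voxels.
import numpy as np

def normalize_and_threshold(dose: np.ndarray, cutoff: float = 0.25) -> np.ndarray:
    """Normalize to Dmax and keep only voxels receiving more than cutoff*Dmax."""
    normalized = dose / dose.max()
    normalized[normalized < cutoff] = 0.0
    return normalized
```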
Since the extraction of dosiomic features requires the definition of regions of interest (ROIs), after the conversion, five isodose levels were chosen and automatically segmented to prepare for feature extraction. The chosen isodoses were 25%, 50%, 75%, 90%, and 95%. These levels were chosen to engineer texture features describing different regions: isodoses up to 75% are expected to give a general overview of the entire plan, while the 90% and 95% isodoses were chosen because they contain the steepest gradients, have shapes similar to the actual target, and are the regions in which planning objectives are most active, giving a large contribution to plan modulation.
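A minimal sketch of how such isodose ROIs can be derived from the normalized dose is shown below, with each mask collecting the voxels receiving at least the given fraction of the maximum dose; the function name is illustrative.

```python
# Minimal sketch of the isodose ROI definition used for feature extraction.
import numpy as np

ISODOSE_LEVELS = (0.25, 0.50, 0.75, 0.90, 0.95)

def isodose_masks(normalized_dose: np.ndarray) -> dict[float, np.ndarray]:
    """Return one binary mask per isodose level of the normalized dose."""
    return {level: normalized_dose >= level for level in ISODOSE_LEVELS}
```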
The absorbed dose distributions were further preprocessed during feature extraction to obtain a wavelet decomposition and the gradient features of the image, using pyradiomics version 3.0.1 [9] on Python 3.11.5 [10].
Overall, each isodose contour was used to extract 894 radiomic features, leading to a total of 4470 (= 894 × 5) dosiomic features for a single dose distribution. Regardless of the architecture (both are reported in Figure 3), the models were trained to output a pass/fail classification using γPassingRate ≤ 97% as the benchmark. This threshold was deliberately selected to prompt the algorithm to detect treatment plans deviating from the median behavior, rather than exclusively flagging those that unequivocally require replanning. An additional advantage of this choice is the creation of a more balanced dataset, albeit at the cost of occasionally recommending PSQA procedures that may ultimately prove unnecessary.
The first model, identified as Hybrid-1, had a pooling layer after each convolutional layer; its architecture, reported in Figure 3a, had the L2 regularization weight set to 0.08, a dropout fraction of 25%, and a total of 9.9 million trainable parameters.
The second model, identified as Hybrid-2, had roughly the same number of trainable parameters but two convolutional layers before each pooling layer; its structure, reported in Figure 3b, had the L2 regularization weight left at its default value and a dropout fraction of 33%.
All the convolutional networks were trained from scratch, instead of using pre-trained network weights, because most of the available networks were trained on 2D images, either for object detection or on medical images, which were deemed too distant from dose distributions.
Overfitting was pre-emptively addressed in all networks using dropout layers and L2 regularization of both weights and biases, while the risk of exploding and vanishing gradients, as well as the stabilization and speed-up of training, was addressed with batch normalization layers [11]. All dropout layers were set to a 0.3 dropout fraction, and all kernels in the convolutional layers were sized as 3 × 3 × 3 tensors with default unitary stride and "same" padding, leaving the dimensions of the distribution unchanged before and after a convolution.
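To illustrate how such a two-branch hybrid model can be assembled with the TensorFlow/Keras functional API, a minimal sketch follows; the number of layers, filters, and dense units are placeholders and do not reproduce the exact Hybrid-1 or Hybrid-2 configurations of Figure 3.

```python
# Minimal sketch of a hybrid CNN + fully connected architecture (illustrative sizes).
import tensorflow as tf
from tensorflow.keras import layers, regularizers

reg = regularizers.l2(0.08)  # L2 weight quoted for Hybrid-1 in the text

# CNN branch: 3D dose distribution resampled to 94 x 94 x 94 voxels.
dose_in = layers.Input(shape=(94, 94, 94, 1), name="dose")
x = dose_in
for filters in (16, 32, 64):
    x = layers.Conv3D(filters, kernel_size=3, padding="same",
                      kernel_regularizer=reg, bias_regularizer=reg)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.MaxPooling3D(pool_size=2)(x)
x = layers.Flatten()(x)
x = layers.Dropout(0.3)(x)

# Fully connected branch: the 4470 dosiomic features.
feat_in = layers.Input(shape=(4470,), name="dosiomics")
y = layers.Dense(256, activation="relu",
                 kernel_regularizer=reg, bias_regularizer=reg)(feat_in)
y = layers.BatchNormalization()(y)
y = layers.Dropout(0.3)(y)

# Fusion of the two branches and binary pass/fail output (gammaPassingRate <= 97%).
z = layers.Concatenate()([x, y])
z = layers.Dense(128, activation="relu")(z)
out = layers.Dense(1, activation="sigmoid", name="fail_probability")(z)

model = tf.keras.Model(inputs=[dose_in, feat_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(name="auc")])
```

The concatenation layer is where the dose-volume and dosiomic information streams are fused before the final classification head.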
Finally, during the training phase, the learning rates of all models were adjusted following a reduce-on-plateau policy, giving the models the possibility to make finer adjustments around local minima along the optimization curve. The learning rate was reduced by a factor of 0.25 after 35 successive epochs without improvement in the monitored performance metric. To allow the changes in learning rate to take effect, an early stopping policy was defined so that all models stopped once the performance remained on a plateau for more than 75 successive epochs, hence allowing for at least two reductions in learning rate.
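A minimal sketch of these training policies using standard Keras callbacks is shown below; the monitored metric name and the use of restore_best_weights are assumptions, not details stated in the text.

```python
# Minimal sketch of the reduce-on-plateau and early-stopping policies.
import tensorflow as tf

callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_auc", mode="max",
                                         factor=0.25, patience=35),
    tf.keras.callbacks.EarlyStopping(monitor="val_auc", mode="max",
                                     patience=75, restore_best_weights=True),
]

# model.fit(train_ds, validation_data=test_ds, epochs=500, callbacks=callbacks)
```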
The training process was very time-consuming, mostly due to the lack of ad hoc computational resources. Both hybrid models required from 30 min to 4 h for each training epoch and reached convergence to a suitable performance on the validation dataset after about 150 epochs. For this reason, a more in-depth search for optimal architecture and hyper-parameters was not possible and will be the object of future works.
To reduce the memory required to load the dose distributions, while maximizing the amount of information in the volumes, some preprocessing steps were implemented when loading the data into memory using the TensorFlow Dataset API [12].
Starting with the normalized and thresholded absorbed dose distributions, the following preprocessing steps were applied:
Each distribution was cropped to fit within a cube of side length 27 cm, centered on the 25% isodose. The cube size was determined according to the maximum measurable field of the phantom, under standard conditions.
All distributions were subsequently resampled to matrices of size 94 × 94 × 94. This resolution was selected to minimize the number of resampling operations required across the training set, thereby limiting the introduction of noise arising from artificial alterations of the original dose distributions. These choices, however, limited the range of possible absorbed dose distributions that could be evaluated by the network to those that fit within the described cube. Performance may also vary when the voxel size of the dose grid used in the TPS differs from 3 mm (e.g., 2 mm).
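A minimal sketch of the cropping and resampling steps is given below; centering the cube on the centroid of the ≥ 25% isodose region is one possible interpretation of "centered on the 25% isodose", and all names are illustrative.

```python
# Minimal sketch: crop a 27 cm cube around the 25% isodose, then resample to 94^3.
import numpy as np
from scipy.ndimage import zoom

CUBE_SIDE_MM = 270.0   # 27 cm, maximum measurable field of the phantom
TARGET_SHAPE = (94, 94, 94)

def crop_and_resample(dose: np.ndarray, spacing_mm: float) -> np.ndarray:
    # Center the crop on the centroid of the >= 25% isodose region.
    idx = np.argwhere(dose >= 0.25)
    center = idx.mean(axis=0).round().astype(int)
    half = int(round(CUBE_SIDE_MM / (2.0 * spacing_mm)))
    lo = np.maximum(center - half, 0)
    hi = np.minimum(center + half, np.array(dose.shape))
    cube = dose[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]
    # Resample the cropped cube to the fixed 94 x 94 x 94 input size.
    factors = [t / s for t, s in zip(TARGET_SHAPE, cube.shape)]
    return zoom(cube, factors, order=1)
```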
The dosiomic features were extracted from the normalized images as well as from their elaborations with a wavelet filter and a gradient filter, as defined in [9], to emphasize texture and gradients in the dose distributions. The bin width for the discretization of the images, prior to feature extraction, was set to 5% in order to observe dose variations of that same order of magnitude.
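A minimal sketch of the extraction step with pyradiomics is given below; since the dose is normalized to its maximum, the 5% bin width is expressed here as binWidth = 0.05, which is an interpretation rather than a documented setting of the study, and the function name is illustrative.

```python
# Minimal sketch of dosiomic feature extraction with pyradiomics for one isodose ROI.
import numpy as np
import SimpleITK as sitk
from radiomics import featureextractor

extractor = featureextractor.RadiomicsFeatureExtractor(binWidth=0.05)
extractor.enableAllFeatures()
extractor.enableImageTypeByName("Wavelet")   # wavelet-filtered images
extractor.enableImageTypeByName("Gradient")  # gradient-filtered images

def extract_isodose_features(normalized_dose: np.ndarray, level: float) -> dict:
    """Extract dosiomic features inside one isodose ROI (e.g., level=0.90)."""
    dose_img = sitk.GetImageFromArray(normalized_dose.astype("float32"))
    mask_img = sitk.GetImageFromArray((normalized_dose >= level).astype("uint8"))
    return dict(extractor.execute(dose_img, mask_img))
```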
Finally, before being fed to the networks, all dosiomic features were normalized by subtracting the mean and dividing by the standard deviation across training samples.
The investigated models were developed in Python using TensorFlow version 2.10 [12].
The performance of the models was evaluated, on the testing dataset, using confusion matrices as well as decision curves [13,14], which conveniently summarize the performance by plotting either the net benefit or the net reduction in interventions for each strategy.
To properly use these curves, it is necessary to define an acceptable workload range, expressed as a threshold probability, i.e., the maximum number of QAs the medical physics unit is reasonably willing to perform in order to find one error. A cautious physicist may feel that performing 15–20 QAs to find one error in the machine is an acceptable use of resources. On the other hand, a physicist with great confidence in the available machines, due to other QA procedures in place, may feel that each PSQA has a large impact on the clinical practice by absorbing too many resources from those required for patients; hence, they may deem that performing even 5–10 QAs to find one error is an acceptable compromise. The first physicist would have a threshold probability of 5–6% (1/20–1/15), while the second would have a threshold probability of 10–20% (1/10–1/5); for this reason, the range of interest in which our models had to be evaluated spans roughly from 5% to 20% threshold probability.
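For reference, a minimal sketch of the two plotted quantities is given below, following the standard decision-curve definitions of net benefit and net reduction in interventions rather than the plotting code used in the study; labels equal to 1 indicate plans needing measurement and y_flagged holds the binary model decisions.

```python
# Minimal sketch of the decision-curve quantities (standard definitions).
import numpy as np

def net_benefit(y_true: np.ndarray, y_flagged: np.ndarray, pt: float) -> float:
    """Net benefit of flagging plans for PSQA at threshold probability pt."""
    n = len(y_true)
    tp = np.sum((y_flagged == 1) & (y_true == 1))
    fp = np.sum((y_flagged == 1) & (y_true == 0))
    return tp / n - fp / n * pt / (1.0 - pt)

def net_reduction(y_true: np.ndarray, y_flagged: np.ndarray, pt: float) -> float:
    """Net reduction in interventions (per plan) versus measuring every plan."""
    nb_model = net_benefit(y_true, y_flagged, pt)
    nb_all = net_benefit(y_true, np.ones_like(y_true), pt)
    return (nb_model - nb_all) / (pt / (1.0 - pt))
```

Plotting these two quantities against a grid of threshold probabilities reproduces the kind of curves discussed in the following sections.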
2.3. Independent Validation of Models
Once trained and tested, the entire pipeline was validated prospectively on data acquired routinely in clinical practice. An additional 201 PSQAs were measured, and the performance was evaluated with the same criteria used in the validation phase.
4. Discussion
This work shows the applicability of DL approaches to the absorbed doses stored in RTDose files to predict the need to perform the QA procedure for VMAT plans generated using FFF or FF beams.
The best model performance was obtained with the Hybrid-1 model, which reached a specificity of 0.74 and a sensitivity of 0.64, followed by the Hybrid-2 model with a specificity and sensitivity of 0.72 and 0.52, respectively. Both models represent a relevant improvement over the current detection ability of the random sampling approach used at our institute.
Regarding the Hybrid-2 architecture, considering plans employing the 6 MV FFF beam, and in reference to Figure 7c, the current practice of performing the QA for all plans leads to performing 83% of QAs unnecessarily to identify the 17% of plans that need checking. The model would reduce the total QA workload for FFF plans from 100% to 37%, of which 12% would be needed and 25% unnecessary. Considering the plans with the flattening filter, in reference to Figure 7d, the current practice leads to checking, on average, 11% of plans. Over the total number of FF plans, this leads to performing unnecessary QAs in 10% of cases, while just 1% of QAs would lead to the identification of challenging plans. Using this model would mean performing 38% of QAs, of which 5% would be necessary and 33% unnecessary. This would mean improving the detection of plans needing attention by a factor of five, while increasing the number of unnecessary QAs by a factor of three.
Regarding the Hybrid-1 architecture, considering plans employing the 6 MV FFF beam, and in reference to Figure 8c, the current practice of performing the QA for all plans leads to performing 83% of QAs unnecessarily to identify the 17% of plans that need checking. The model would reduce the total QA workload for FFF plans from 100% to 37%, of which 14% would be needed and 23% would be unnecessary. Considering the plans with the flattening filter, in reference to Figure 8d, the current practice leads to checking, on average, 11% of plans. Over the total number of FF plans, this leads to performing unnecessary QAs in 10% of cases, while just 1% of QAs would lead to the identification of challenging plans. Using the model would mean performing 44% of QAs, of which 7% would be necessary and 37% unnecessary. This would mean improving the detection of plans needing attention by a factor of seven, while increasing the number of unnecessary QAs by a factor of three.
Both hybrid models would lead to an increase in the total number of QAs performed; however, the additional QAs would be made up exclusively of plans under the attention limit suggested by [7].
When considering Figure 9 and Figure 10, the net benefit is the percentage of true positives gained without increasing the number of false positives, i.e., the percentage of overall QAs that each model would correctly identify as in need of measurement without adding unnecessary workload. The threshold probability represents a sliding scale that starts, on the left, from the feeling that performing all QAs is necessary because missing an error is too dangerous, and ends, on the right, with being willing to perform fewer QAs because they are too demanding on the clinical routine.
With these considerations in mind, Figure 9a shows that a team that is really concerned with possible errors would choose the all-PSQAs strategy. As the weight of QAs on the routine increases, a team willing to perform no more than 20 PSQAs to find a single error (0.05 threshold probability) would choose the Hybrid-1 model. For a team that would perform no more than 10 PSQAs to find a single error (0.1 threshold probability), Hybrid-2 would be the preferred model. Finally, for cases in which PSQAs are too expensive and performing more than 4–5 QAs (0.20–0.25 threshold probability) to find a single error becomes unacceptable, the winning strategy becomes performing no PSQA at all. Both hybrid models, when employed in their respective regions of relevance, would lead to a net benefit of between 5% and 10%.
On the other hand, Figure 9b shows how many unnecessary PSQAs would be avoided by using each strategy. The Hybrid-1 model is preferred up to threshold probabilities of around 10% and would lead to a reduction in unnecessary workload of 30%. When PSQAs are more demanding, i.e., up to a threshold probability of around 20%, using the Hybrid-2 model would lead to a 40% reduction in unnecessary procedures. If performing only four or five PSQAs to find one error is already an unacceptable workload (threshold probability > 0.20–0.25), then the winning strategy to reduce workload is simply to perform no PSQAs (blue line).
In all cases, randomly choosing QAs, i.e., the yellow lines of Figure 9a,b, is the worst available strategy, both in the attempt to reduce the workload and with the intention of improving correct identification.
In the case of FFF plans, i.e., Figure 10a,b, the Hybrid-1 structure is always the optimal choice, leading to around a 10% improvement in net benefit and around a 40% reduction in unnecessary workload.
On the other hand, when considering FF plans, i.e., Figure 10c,d, up to a 10% threshold probability Hybrid-1 and Hybrid-2 lead to an equivalent net benefit and reduction in interventions. In cases in which the unnecessary workload of PSQAs is more relevant (thresholds > 10%), Figure 10c,d shows that the Hybrid-2 model would be better both in net benefit and in reduction in interventions. In all cases in which QAs are too demanding, the winning strategy is clearly to perform no PSQAs at all. Finally, a random sampling strategy is confirmed to be the worst option available in all cases.
To the authors' knowledge, this is the first study that created a model based on the 3D absorbed dose distribution and that evaluated its performance in differentiating between plans with and without the flattening filter. It is worth mentioning that all the architectures presented in this work, apart from retraining the weights and changing the activation of the last layer, can be converted to predict γ3%,2 mm as a raw number. The regression task was initially explored; however, it is not shown in this work because it produced a model with a mean absolute error of 1.5–2% which, given the width of the distributions observed in the available data, would be a performance equivalent to always predicting the average γPassingRate and would lead to an underestimation of the risk of PSQA failure.
Miao et al. [15] used a similar CNN structure to perform a regression analysis on 2D dose distributions computed in the reference PSQA geometry with different phantom and dose calculation algorithms. The reported mean absolute errors are 1.1%, 1.9%, 1.7%, and 2.3% for γ3%,3 mm, γ3%,2 mm, γ2%,3 mm, and γ2%,2 mm, respectively, which is in line with our initial results on the regression task. Unfortunately, the true and false positive rates are not reported, hence no direct comparison can be made.
Lambri et al. [13] trained a DL model based only on complexity features from 5622 VMAT plans, which was evaluated with an action limit of 95% in a multicentric setting, obtaining specificities of 0.90, 0.90, and 0.68 and sensitivities of 0.61, 0.25, and 0.55, respectively, in the three institutions where it was tested. A direct comparison with their results can be made using the same action limit as [13]: Hybrid-1 reaches a sensitivity and specificity of 0.65 and 0.76, while the Hybrid-2 model reaches a sensitivity and specificity of 0.66 and 0.62, respectively.
Huang et al. [14] reported a CNN trained on a dataset of planar dose distributions extracted with the Pylinac library from 556 single fields of intensity-modulated radiation therapy (IMRT) treatments. In predicting γ3%,2 mm they obtained a mean absolute error of 2% whereas, in the classification of plans with γPassingRate < 85%, they reported a 57% sensitivity and a 100% specificity. In their testing set, fewer than 9% of plans were below their defined action limits, which complicates the comparison and could reduce the robustness of the estimated specificity and sensitivity. It should also be noted that, according to [15], plans with γPassingRate < 90% are unfit for clinical delivery, hence rarely seen in our routine measurements.
Li et al. [16] processed the multileaf aperture images of 761 VMAT plans using a 3D-MResNet to predict the γPassingRate at 3%/2 mm and 2%/2 mm. The mean absolute errors of the model for the test set were 1.63% and 2.38% at 3%/2 mm and 2%/2 mm, respectively.
In a classification task with the benchmark chosen with the aid of a cost-sensitive error rate, Li et al. [15] reported, for γ3%,2 mm, a sensitivity of 13/78 ≃ 17%.
Yang et al. [17] and Cui et al. [18] commissioned and deployed a hybrid StarGAN architecture in a multicentric setting involving eight different institutions. Limited to the γ3%,2 mm criterion, they report specificities of 70–75% and sensitivities of 90–92%, with a 2.45% RMSE in the regression task. However, their input was limited to plan complexity features, making a direct comparison of performances rather difficult.
As noted in [13], the results are inherently dependent on the specific linear accelerator and detector, which limits the possibility of deriving universally valid conclusions in a single-center setting. Although the model was trained on dose distributions generated by a single TPS, with a specific machine model and detector, its structure and learned representations provide a solid foundation. With appropriate adaptation or fine-tuning, the model could be extended to centers employing different machines, TPSs, or detectors. This aspect will be the focus of future studies aimed at assessing the model's generalizability and clinical applicability in multicenter scenarios.
There are various possible pathways for future developments. One of the interesting prospects for this approach could be to add a third branch to the network to analyze the RTPlan files. Despite the complexities in properly unpacking the information in RTPlan files, and in designing a network able to analyze said information, it is reasonable to expect an improvement in performance. Given the differences in the distributions of γ3%,2 mm across anatomical districts, a further improvement could be obtained by adding clinical information about the plan to the model input. Another interesting direction would be to implement a network that produces multiple outputs for various DD and DTA criteria in the Γ-Index computation. This structure would have the advantage of providing a more comprehensive overview of the plan; however, its implementation would require a large effort in data collection and storage, as well as routinely performing multiple tests at each PSQA event.
A limitation of this study lies in the data partitioning strategy, which relied on per-epoch random 80/20 splits. This approach may introduce generalization bias if multiple plans from the same patient, or plans with nearly identical parameters, are distributed across training and test sets. Future implementations employing fixed patient- or course-level partitions, together with a temporal (year-wise) hold-out informed by PSQA device characteristics and linac performance data, are expected to better assess robustness against potential machine variations or workflow drift, thereby improving the model’s external validity and reproducibility.
Moreover, all the preprocessing steps were aimed at harmonizing the dose distributions and at making them comparable regardless of dose shape, prescription dose, and the grid used for the computation inside the TPS. However, these choices limit the possible sizes of absorbed dose distributions acceptable by the models and are strictly dependent on the maximum sensitive volume of the detector used which, in this work, was dictated by the OCTAVIUS 1500 matrix to be a cylinder of 27 cm in diameter and height. A first example of absorbed dose distributions unseen in training, due to the available devices, would be those in which a single target extends more than 27 cm, or in which lesions treated simultaneously are separated by more than 27 cm. Another possible limitation could be encountered in analyzing doses computed on a grid with a voxel size different from 3 mm (e.g., 1 or 5 mm). Upon implementation, each center should choose how to handle these limit cases according to the typical cohorts encountered in its clinical practice.
Finally, the analysis pipeline has not yet been optimized; hence, the speed and performance of training can still be improved, both by employing ad hoc computational resources and by optimizing the code. Reducing the training times would make re-training the network less expensive, thereby mitigating the risk of failing to capture slow drifts in machine performance due to aging, or sharp changes after maintenance, whose detection is otherwise guaranteed by the other facets of the QA program.