A Time-Series Approach for Machine Learning-Based Patient-Specific Quality Assurance of Radiosurgery Plans

Buzzi, Simone; Mancosu, Pietro; Bresolin, Andrea; Gallo, Pasqualina; La Fauci, Francesco; Lobefalo, Francesca; Paganini, Lucia; Pelizzoli, Marco; Reggiori, Giacomo; Franzese, Ciro; Tomatis, Stefano; Scorsetti, Marta; Lenardi, Cristina; Lambri, Nicola

doi:10.3390/bioengineering12080897

by Simone Buzzi ¹,², Pietro Mancosu ¹,³,*, Andrea Bresolin ¹, Pasqualina Gallo ¹, Francesco La Fauci ¹, Francesca Lobefalo ¹, Lucia Paganini ¹, Marco Pelizzoli ¹, Giacomo Reggiori ¹, Ciro Franzese ¹,³, Stefano Tomatis ¹, Marta Scorsetti ¹,³, Cristina Lenardi ²,⁴ and Nicola Lambri ¹,⁵

¹ Radiotherapy and Radiosurgery Department, IRCCS Humanitas Research Hospital, Via Manzoni 56, Rozzano, 20089 Milan, Italy

² Dipartimento di Fisica “Aldo Pontremoli”, Università degli Studi di Milano, 20133 Milan, Italy

³ Department of Biomedical Sciences, Humanitas University, Via Rita Levi Montalcini 4, Pieve Emanuele, 20072 Milan, Italy

⁴ Milano Division, National Institute for Nuclear Physics, 20133 Milan, Italy

⁵ Scuola di Specializzazione di Fisica Medica, Università degli Studi di Milano, 20133 Milan, Italy

* Author to whom correspondence should be addressed.

Bioengineering 2025, 12(8), 897; https://doi.org/10.3390/bioengineering12080897

Submission received: 29 July 2025 / Revised: 15 August 2025 / Accepted: 18 August 2025 / Published: 21 August 2025

(This article belongs to the Special Issue Radiation Imaging and Therapy for Biomedical Engineering)

Abstract

Stereotactic radiosurgery (SRS) for multiple brain metastases can be delivered with a single isocenter and non-coplanar arcs, achieving highly conformal dose distributions at the cost of extreme modulation of treatment machine parameters. As a result, SRS plans are at a higher risk of patient-specific quality assurance (PSQA) failure compared to standard treatments. This study aimed to develop a machine-learning (ML) model to predict the PSQA outcome (gamma passing rate, GPR) of SRS plans. Five hundred and ninety-two consecutive patients treated between 2020 and 2024 were selected. GPR analyses were performed using a 3%/1 mm criterion and a 95% action limit for each arc. Fifteen plan complexity metrics were used as input features to predict the GPR of an arc. A stratified and a time-series approach were employed to split the data into training (1555 arcs), validation (389 arcs), and test (486 arcs) sets. The ML model achieved a mean absolute error of 2.6% on the test set, with a 0.83% median residual value (measured/predicted). Lower values of the measured GPR tended to be overestimated. Sensitivity and specificity were 93% and 56%, respectively. ML models for virtual QA of SRS can be integrated into clinical practice, facilitating more efficient PSQA approaches.

Keywords:

machine learning; patient-specific QA; stereotactic radiosurgery; HyperArc; radiotherapy

1. Introduction

Intensity-modulated radiotherapy (IMRT) and volumetric modulated arc therapy (VMAT) are inverse treatment planning techniques widely used for their ability to deliver highly conformal dose distributions, capable of maximizing tumor targeting while sparing healthy tissue [1]. This is especially true for stereotactic radiosurgery (SRS), which involves the delivery of highly conformal dose distributions with steep dose gradients to brain lesions, administered in a single or a few high-dose fractions. For this reason, measurement-based patient-specific quality assurance (PSQA) is mandatory to verify that SRS treatments can be delivered as intended [2,3].

Typically, the measured and calculated dose distributions are compared both for dose difference and physical distance using the gamma agreement index (γ) [4]. The PSQA result is summarized by the gamma passing rate (GPR), i.e., the percentage of points where γ ≤ 1. The AAPM TG-218 recommends, for general IMRT and VMAT treatments, a 3%/2 mm gamma criterion, using global normalization, with a 10% dose threshold and a 90% action limit [5]. The same report recommends stricter, albeit unspecified, criteria for SRS.
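As a minimal illustration of the GPR definition above, the following Python sketch computes the passing rate from an already computed map of γ values. The function name `gamma_passing_rate`, the array names, and the way the low-dose threshold is applied (relative to the maximum of the dose map) are assumptions made for illustration; this is not the implementation used by any commercial software.

```python
import numpy as np

def gamma_passing_rate(gamma_map, dose_map, threshold_fraction=0.10):
    """Percentage of evaluated points with gamma <= 1.

    Points whose dose is below `threshold_fraction` of the maximum dose
    are excluded from the evaluation (assumed low-dose threshold handling).
    """
    gamma_map = np.asarray(gamma_map, dtype=float)
    dose_map = np.asarray(dose_map, dtype=float)
    evaluated = dose_map >= threshold_fraction * dose_map.max()
    if not evaluated.any():
        raise ValueError("no points above the dose threshold")
    return 100.0 * np.mean(gamma_map[evaluated] <= 1.0)

# Toy example with made-up values: one point falls below the 10% threshold,
# and two of the three remaining points have gamma <= 1.
dose = np.array([[1.0, 0.5], [0.05, 0.8]])
gamma = np.array([[0.4, 1.2], [3.0, 0.9]])
gpr = gamma_passing_rate(gamma, dose)
```

In this toy case, the GPR is two passing points out of three evaluated points, i.e., about 66.7%.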

Measurement-based PSQA is inherently a time-consuming and resource-intensive process. In addition, PSQA failures can lead to increased workload and delays in patient treatment due to the need for replanning. In the search for alternatives, several studies have explored the use of complexity metrics to establish a relationship with PSQA outcomes [6,7,8,9,10]. Plan complexity attempts to quantify the uncertainties related to approximations in the dose calculation algorithms of a treatment planning system (TPS) and the mechanical limitations of a linear accelerator (linac), which could affect the results of PSQA. However, no final consensus has been reached, as results show strong correlations only in some cases, while also being dependent on the QA device, treatment technique, and linac used [10,11]. More recently, machine-learning (ML) [12,13,14,15,16,17,18] and deep-learning (DL) models [19,20,21] have been investigated to predict PSQA outcomes from complexity metrics. Overall, these studies varied greatly in terms of treatment sites, analysis methods, input features, and algorithms used. Moreover, specific applications to SRS treatments remain unexplored.

In this study, we considered a dataset of multiple target single-isocenter SRS plans, which are more subject to PSQA failures than standard treatments due to their inherent higher complexity. A virtual QA method specific to these cases could potentially reduce a department’s QA workload by identifying plans at risk of failure before measurement. To this aim, a tree-based ML model was trained and tested employing a time-series split to assess the model’s applicability under similar conditions to clinical practice.

2. Materials and Methods

2.1. Dataset

Five hundred ninety-two (592) consecutive SRS patients treated between December 2020 and December 2024 were selected from the internal database of our Institute, for a total of 2430 arcs. The RT plans were optimized using HyperArc (v15; Varian Medical Systems, Palo Alto, CA, USA), a VMAT optimization algorithm specifically developed for treating multiple brain lesions simultaneously with a single isocenter. HyperArc achieves dose delivery accuracy and conformity through the use of non-coplanar arcs and extreme modulation of MLC leaf positions [22]. A Varian Edge linac was employed, equipped with a Varian High Definition 120 MLC (Varian Medical Systems, Palo Alto, CA, USA). For each bank, the MLC features 32 inner leaves with a width of 2.5 mm and 28 outer leaves with a width of 5 mm.

PSQA measurements were performed using an Electronic Portal Imaging Device (EPID) mounted on the linac. The EPID was a Varian PortalVision aS1200 Imager (Varian Medical Systems, Palo Alto, CA, USA), which features an amorphous silicon (a-Si) detector panel with a size of 43 × 43 cm², corresponding to a pixel matrix of 1280 × 1280 with a resolution of 0.336 mm. The gamma analyses were performed with Portal Dosimetry (v15; Varian Medical Systems, Palo Alto, CA, USA). This software computes the GPR by comparing the dose distribution calculated by the TPS to the integral delivered dose measured with the EPID. The conversion between dose distribution and portal dose image was performed by the Portal Dosimetry Image Prediction (v15; Varian Medical Systems, Palo Alto, CA, USA) algorithm. The parameters used to compute the GPR were 3%/1 mm, with a 10% dose threshold and global normalization in absolute dose. These criteria were based on the department’s QA program for SRS treatments [23]. The monthly quality assurance procedure applied to the portal imager consisted of dark field and flood field checks, followed by normalization using a 10 × 10 cm² calibration field.

Eighteen candidate input features were initially considered: two static parameters, six dynamic parameters, and ten complexity metrics, as reported in Table 1. These were extracted from DICOM RT plan files using software written in MATLAB (R2021b) [24]. To mitigate the impact of multicollinearity on feature importance algorithms, Spearman’s correlation coefficients (r_s) between features were calculated. Hierarchical clustering with Ward’s linkage was then applied to a distance matrix derived as 1 − |r_s|. A stringent distance threshold of 0.1 was applied to the resulting dendrogram. This threshold, interpreted on the scale of Ward’s linkage criterion, was selected with the objective of grouping features with an absolute Spearman’s correlation (|r_s|) of 0.9 or higher. One representative feature was retained per cluster, leading to a selection of fifteen features. The selected features are reported in italics in Table 1.
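The feature selection described above can be sketched in Python with SciPy. This is only an illustrative sketch on synthetic data, not the authors' code: the function name `select_representatives`, the feature names, and the rule of keeping the first feature encountered in each cluster are assumptions, since the text does not state how the representative feature was chosen.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

def select_representatives(X, names, threshold=0.1):
    """Keep one feature per cluster of highly rank-correlated features."""
    rho, _ = spearmanr(X)                 # feature-by-feature correlation matrix
    dist = 1.0 - np.abs(rho)              # distance = 1 - |r_s|
    dist = (dist + dist.T) / 2.0          # remove tiny numerical asymmetries
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="ward")
    labels = fcluster(Z, t=threshold, criterion="distance")
    kept = {}
    for name, label in zip(names, labels):
        kept.setdefault(label, name)      # assumption: first feature represents the cluster
    return list(kept.values()), labels

# Synthetic example: "a_copy" is almost a duplicate of "a"; "c" and "d" are independent.
rng = np.random.default_rng(0)
a = rng.normal(size=300)
X = np.column_stack([
    a,
    a + 0.01 * rng.normal(size=300),
    rng.normal(size=300),
    rng.normal(size=300),
])
names = ["a", "a_copy", "c", "d"]
selected, labels = select_representatives(X, names)
```

Here the two nearly identical features fall into the same cluster and only one of them is retained, while the independent features are kept.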

2.2. Model Selection

A random forest model was trained to predict the GPR of an arc based on its complexity. During early experiments, the model exhibited difficulty in predicting low GPR values, which were underrepresented in the training set. To address this, sample weights were assigned to each training instance, prioritizing these low GPR values according to the following relation:

$w_i = 100.1 - \mathrm{GPR}_i, \quad \text{where } \mathrm{GPR}_i \in \text{training set}.$

(1)
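A minimal sketch of how the weights of Equation (1) could be passed to a random forest is given below, assuming scikit-learn and synthetic data. The offset of 100.1 keeps every weight strictly positive even for an arc with a GPR of 100%, which is our reading of the formula rather than a statement from the text; the feature matrix, GPR values, and model settings are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def gpr_sample_weights(gpr_train):
    """Equation (1): w_i = 100.1 - GPR_i, so lower GPR values get larger weights."""
    return 100.1 - np.asarray(gpr_train, dtype=float)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 3))                                # placeholder features
y_train = np.clip(100 - rng.exponential(2.0, size=50), 80, 100)   # synthetic GPR (%)

weights = gpr_sample_weights(y_train)
model = RandomForestRegressor(n_estimators=50, random_state=0)    # placeholder settings
model.fit(X_train, y_train, sample_weight=weights)
```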

The ML pipeline implemented in this study aimed to mimic real-world scenarios where models are evaluated on future data. The approach is outlined in Figure 1 and consists of the following steps:

The original dataset was first divided using a time-series approach, with the most recent 20% of data assigned to the test set. This proportion was selected to represent approximately 6 months of clinical PSQA workload, enabling a realistic assessment of the model’s prospective performance. The remaining 80%—consisting of the oldest data—was further split into training and validation sets using a stratified approach on GPR with an 80-20 ratio. As a result, the training, validation, and test sets contained 1555, 389, and 486 arcs, respectively.
A randomized search with 1000 iterations and 5-fold cross-validation was performed to select the optimal combination of hyperparameters. The mean absolute error (MAE), averaged across the cross-validation folds, was used as the evaluation metric. Features were rescaled using percentile statistics, which are robust to outliers, as follows:

$f_{\mathrm{scaled}} = \frac{f - f_{50\%}}{f_{75\%} - f_{25\%}}$

(2)

To prevent data leakage, this scaling operation was performed independently within each training fold.
Using the optimal combination of hyperparameters (reported in the Supplementary Material, Table S1), the model was retrained on the whole training set.
The validation set was used for determining an optimal threshold limit (TL) to assess the classification performance of the model on the test set, which contained the most recent data.
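The steps above can be sketched with scikit-learn as follows. This is a hedged illustration on synthetic data, not the authors' pipeline: the GPR bin edges used for stratification, the hyperparameter ranges, and the small number of search iterations are assumptions (the study used 1000 iterations and the search space of Table S1). `RobustScaler` with its default settings centers on the median and scales by the interquartile range, matching Equation (2); placing it inside a `Pipeline` means it is re-fitted within each training fold.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
n_arcs = 400
X = rng.normal(size=(n_arcs, 5))                            # placeholder features, oldest arcs first
y = np.clip(100 - rng.exponential(3.0, n_arcs), 70, 100)    # synthetic GPR (%)

# Time-series split: the most recent 20% of arcs form the test set.
n_test = int(round(0.2 * n_arcs))
X_old, y_old = X[:-n_test], y[:-n_test]
X_test, y_test = X[-n_test:], y[-n_test:]

# Stratified 80-20 split of the older data on binned GPR (bin edges are an assumption).
gpr_bins = np.digitize(y_old, [90.0, 95.0, 98.0])
X_train, X_val, y_train, y_val = train_test_split(
    X_old, y_old, test_size=0.2, stratify=gpr_bins, random_state=0)

# Randomized search with 5-fold cross-validation, scored by MAE.
pipe = Pipeline([
    ("scale", RobustScaler()),
    ("rf", RandomForestRegressor(random_state=0)),
])
search = RandomizedSearchCV(
    pipe,
    param_distributions={                                   # hypothetical search space
        "rf__n_estimators": randint(20, 60),
        "rf__max_depth": randint(2, 8),
    },
    n_iter=5, cv=5, scoring="neg_mean_absolute_error", random_state=0)

# Sample weights from Equation (1); refit=True (default) retrains the best
# configuration on the whole training set.
search.fit(X_train, y_train, rf__sample_weight=100.1 - y_train)

cv_mae = -search.best_score_
test_mae = np.mean(np.abs(search.predict(X_test) - y_test))
```

Applied to the 2430 arcs of this study, the same proportions give 486 test arcs, 389 validation arcs, and 1555 training arcs.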

2.3. Model Assessment

The performance of the regression model on the test set was evaluated using the R², MAE, and absolute error statistics. R² measures the proportion of variance in the GPR explained by the model, where the best possible score is 1 and a value of 0 denotes a model that always predicts the average value of GPR.
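The interpretation of these two metrics can be checked on a small numerical example. The sketch below uses scikit-learn and made-up GPR values (not data from this study) to show that a baseline that always predicts the mean measured GPR obtains R² = 0, while the MAE is the average absolute deviation between measured and predicted values.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

y_true = np.array([99.0, 97.0, 92.0, 100.0])    # made-up measured GPR (%)
y_mean = np.full_like(y_true, y_true.mean())    # baseline: always predict the mean
y_pred = np.array([98.0, 97.5, 95.0, 99.0])     # made-up model predictions

r2_baseline = r2_score(y_true, y_mean)
r2_model = r2_score(y_true, y_pred)
mae_model = mean_absolute_error(y_true, y_pred)
```

For these values, the MAE is (1 + 0.5 + 3 + 1)/4 = 1.375% and R² is 1 − 11.25/38 ≈ 0.70.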

Arcs were divided into a positive class, with measured GPR < 95%, and a negative class, with measured GPR ≥ 95%. The decision to use a 95% action limit, recommended as the tolerance limit in AAPM guidelines, was based on the fact that AAPM’s recommendations are for RT in general, while stricter, albeit unspecified, criteria are recommended for SRS [5].

The classification performance of the model was assessed based on receiver-operating-characteristic (ROC) and precision–recall curves. These curves were obtained by computing the sensitivity/specificity and precision/recall for all possible action limits on the GPR predicted by the model. In this study, sensitivity (recall) and specificity measured the ability of the model to correctly identify “fail” (positive class) and “pass” (negative class) arcs, while precision measured the fraction of true “fails” identified among the predicted failures. These metrics were computed as follows:

$\mathrm{sensitivity} = \frac{\#\,\text{true fail}}{\#\,\text{true fail} + \#\,\text{false pass}},$

(3)

$\mathrm{specificity} = \frac{\#\,\text{true pass}}{\#\,\text{true pass} + \#\,\text{false fail}},$

(4)

$\mathrm{precision} = \frac{\#\,\text{true fail}}{\#\,\text{true fail} + \#\,\text{false fail}}.$

(5)

An optimal TL was determined from the validation set as a threshold ensuring at least 90% sensitivity and 50% specificity. Achieving 100% sensitivity would require a threshold that produces an unacceptably high false-positive rate, making the model impractical for routine use. This threshold was then applied to the test set to compute sensitivity, specificity, and precision.
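A possible way to derive such a threshold is sketched below in Python. The text does not specify how the TL was chosen among the thresholds satisfying both constraints, so the rule used here (scan candidate thresholds in 0.1% steps and return the lowest one meeting both constraints) is an assumption, as are the function names and the toy data. The sketch also assumes that both classes are present in the data.

```python
import numpy as np

ACTION_LIMIT = 95.0  # measured GPR below this value defines a "fail" (positive class)

def classification_metrics(gpr_measured, gpr_predicted, threshold):
    """Sensitivity, specificity, and precision (Equations (3)-(5)) when an arc
    is flagged as "fail" if its predicted GPR is below `threshold`."""
    fails = np.asarray(gpr_measured, dtype=float) < ACTION_LIMIT
    flagged = np.asarray(gpr_predicted, dtype=float) < threshold
    true_fail = np.sum(flagged & fails)
    false_pass = np.sum(~flagged & fails)
    true_pass = np.sum(~flagged & ~fails)
    false_fail = np.sum(flagged & ~fails)
    sensitivity = true_fail / (true_fail + false_pass)
    specificity = true_pass / (true_pass + false_fail)
    n_flagged = true_fail + false_fail
    precision = true_fail / n_flagged if n_flagged else float("nan")
    return sensitivity, specificity, precision

def select_threshold(gpr_measured, gpr_predicted, min_sens=0.90, min_spec=0.50):
    """Lowest candidate threshold meeting both constraints (assumed rule)."""
    for threshold in np.round(np.linspace(90.0, 100.0, 101), 1):
        sens, spec, _ = classification_metrics(gpr_measured, gpr_predicted, threshold)
        if sens >= min_sens and spec >= min_spec:
            return float(threshold)
    return None  # no candidate satisfies both constraints

# Toy validation data (made-up values, GPR in %)
measured = [99.0, 98.0, 97.0, 96.0, 94.0, 93.0, 99.5, 92.0]
predicted = [98.5, 97.5, 96.0, 97.2, 95.5, 96.45, 99.0, 94.0]
tl = select_threshold(measured, predicted)
sens, spec, prec = classification_metrics(measured, predicted, tl)
```

In this toy example, the selected threshold lies above the 95% action limit because the failing arcs are overestimated by the predictions, which mirrors the reason a TL higher than the clinical action limit can be needed to reach high sensitivity.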

2.4. Interpretability

Partial dependence plots (PDPs) were used as a post hoc method to explain the model output and interpret feature importance. A PDP shows the global relation between a feature and the target variable, marginalizing over the values of the other variables.
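The marginalization behind a PDP can be written in a few lines. The sketch below is a generic illustration on synthetic data, assuming scikit-learn for the model; the function name `partial_dependence_1d` and the synthetic threshold-like relation are assumptions and do not reproduce the features or results of this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def partial_dependence_1d(model, X, feature_idx, grid):
    """Average prediction over the dataset when one feature is set to each grid value."""
    X = np.asarray(X, dtype=float)
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value          # force the feature to the grid value
        averages.append(model.predict(X_mod).mean())
    return np.array(averages)

# Synthetic data: the target drops by 5 points when feature 0 is negative.
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(300, 3))
y = 98.0 - 5.0 * (X[:, 0] < 0) + 0.3 * rng.normal(size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
pd_values = partial_dependence_1d(model, X, feature_idx=0, grid=np.array([-1.0, 1.0]))
```

The two values in `pd_values` recover the threshold-like dependence built into the synthetic data: the average prediction is lower for the negative grid value than for the positive one.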

3. Results

The median GPR of the full dataset was 98%. The GPR statistics for different ranges are reported in Table 2. Approximately 75% of the arcs (1828) had a GPR ≥ 95%.

The correlation heatmap and dendrogram between feature pairs are reported in Figures S1 and S2 of the Supplementary Material. A strong correlation (r_s > 0.9) was found between the Q1 MLCGap and Median MLCGap, which describe similar quantities. Strong anti-correlation (r_s < −0.9) was observed between three pairs: BM and MCS, Median MLCGap and SAS10, and Q1 MLCGap and SAS10.

The model assessment is summarized in Table 3. We found similar MAEs obtained with cross-validation (2.5% ± 0.1%) and on the test set (2.6%). While 95% of data points fall within a 7.5% absolute error, the 98th percentile corresponds to a 10% absolute error. The R² value was found to be 0.27, while the ROC and precision–recall curve analyses resulted in an AUC of 0.85 and AP of 0.67. In particular, both were above the baseline values of a dummy classifier. Using the 97% TL obtained from the validation set, the sensitivity, specificity, and precision on the test set were 93%, 56%, and 50%, respectively.

Figure 2 shows the scatter plot and residual distribution of the model on the test set. The model overestimated the measured GPR below 95%, yielding mean and median errors of −4.2% and −3.8%, respectively, for GPR values below the 1st quartile (<95.1%). The median residual was 0.85%, with 1st and 3rd quartiles of −1.4% and 2.3%, respectively. The maximum residual was near 15%.

Figure 3 shows the ROC and precision–recall curves obtained on the test set. Considering the clinical action limit, 95%, for the predicted GPR as TL, the model sensitivity and specificity were 86% and 53%, respectively.

The PDPs for the Area, Median MLCGap, MeanRR, and BM, based on the test set, are shown in Figure 4. These are the main features showing a non-trivial dependence, while the remaining features are reported in Figure S3 of the Supplementary Material. The Area and Median MLCGap show similar patterns; smaller values are associated with lower predictions, while at increasing values, they do not show any influence on the outcome. The MeanRR exhibits a decrease in predicted values at low ranges, followed by two sharp increases as its values rise. BM shows a different trend, as it only begins to affect predictions significantly at higher values.

4. Discussion

In this study, an ML model was developed to investigate the potential of reducing the PSQA workload of SRS plans treating multiple brain metastases with single-isocenter non-coplanar fields. These cases are more subject to PSQA failures than standard treatments due to their inherent higher complexity. The dataset comprised 2430 SRS arcs from 592 consecutive patients. The data split used a hybrid approach combining time-series and stratified sampling, reflecting clinical scenarios where models are deployed on future data, as described by Chan et al. [29]. Using a 3%/1 mm criterion and a 95% action limit, the model achieved a test set MAE of 2.6%, an R² of 0.27, and an ROC-AUC of 0.85.

The Area, Median MLCGap, MeanRR, and BM proved the most impactful features according to PDPs. The first three in particular displayed a threshold-like behavior. The lower predictions observed for smaller values of the Area and Median MLCGap are likely associated with increased uncertainty in small-field dose calculations, arising from limitations in modeling radiation transmitted at the tip of the MLC leaves. A similar explanation may account for the behavior observed for BM > 0.9, which indicates beams composed of many small apertures that are spatially separated from each other. In contrast, MeanRR exhibited an opposite relationship to what might be expected, possibly suggesting the presence of interaction effects that are not captured by the PDP.

As alternatives to measurement-based PSQA, a few ML and DL methods have been explored in recent years [29,30,31]. Table S2 in the Supplementary Material summarizes some of the relevant studies and compares their main findings with the present work. Lam et al. [12] and Zhu et al. [18] presented the application of complexity metrics-based models for PSQA across various treatment sites, both employing the 2%/2 mm criterion. Although they reported a lower MAE, approximately 1%, the former’s dataset consisted exclusively of 182 IMRT plans with a measured GPR > 90%, and in the latter only 2% of the 213 plans had a GPR < 90%. Kusunoki et al. [16] used a 2%/2 mm criterion and developed a series of models utilizing 15 features to analyze a dataset of 356 VMAT plans for the head and neck. With a 99% lower control limit for prediction, they obtained 100% sensitivity and 75% specificity, but the measured GPR range was above 95.2%. Li et al. [13] analyzed 303 VMAT plans from gynecological and head and neck tumor locations and achieved a test MAE of 2.39%, with 100% sensitivity and 71% specificity, although only 2.1% of the plans did not meet the 3%/2 mm criterion with a 90% action limit. Hirashima et al. [14] used the same criterion and action limit to obtain an MAE of 3.1%, sensitivity of 64%, and specificity of 82%, on a dataset of 1255 VMAT plans from various treatment locations.

Among those including a higher number of outliers, Wall and Fontenot [15] analyzed 500 VMAT plans from various tumor locations to compare various ML models, with the best model achieving an MAE < 4% at 3%/3 mm. Han et al. [32] developed DL models using 13 plan metrics and a dataset of 201 VMAT plans from pelvis and head and neck tumor locations, with 10% of the plans failing to meet the 3%/2 mm criterion with a 90% action limit. They achieved 87.5% sensitivity and 71.7% specificity. Finally, in our previous study, we analyzed the largest single-institute dataset of VMAT plans to date [17], consisting of 5522 VMAT plans from various treatment sites. The GPR was computed with a 3%/1 mm criterion, and a 95% action limit with a 10% threshold was used. Employing the same set of features as this study, a test MAE of 2.3% was achieved. However, the data split did not consider the temporal dependence, and less than 17% of the arcs had a measured GPR < 95%.

Only Noblet et al. [33], to the best of our knowledge, have explicitly reported a data-splitting approach similar to ours. Their method involved an initial time-series split to separate the most recent data, followed by a random split of the remaining dataset. Their nine classifiers on a dataset of 1767 VMAT arcs from various treatment sites achieved an ROC-AUC of 0.88, with 52% sensitivity and 92% specificity.

By adopting a 97% TL on the test set, our model achieved a sensitivity of 93% and a specificity of 56%. Sensitivity was prioritized over specificity, as missing a true failure carries a greater potential risk than investigating a false alarm. Such a configuration supports the model’s potential role as a complementary tool to our department’s PSQA program, in which SRS plans are always verified. The model could serve as a pre-screening method to anticipate failing cases and reduce the workload burden of repetitive replanning and measurement. In particular, the flagged “fail” cases could be reoptimized with automatable procedures, which reduce plan complexity and improve the GPR [34]. Assuming 25% of failing cases, as observed in this study’s dataset, and that one reoptimization is performed in 10 min while replanning and measurement take 60 min, the maximum expected workload reduction is 56%. The detailed calculation is reported in Section S1 of the Supplementary Material.

According to our department’s PSQA program, SRS plans are verified on an arc basis. This ensures that potential delivery errors in single arcs are not obscured by the superposition of the other beam doses. Due to the extreme modulation of SRS arcs, if an arc fails the PSQA, then the plan is reoptimized. The model performance was derived from the GPR of each arc to reflect this practice, and the metrics reported in this work were used to obtain a global assessment. The achieved level of sensitivity implies that this method is not yet mature enough to replace the conventional PSQA process. Nonetheless, such a model could serve as an additional verification step without changing PSQA procedures.

Although the model was evaluated using a time-series split to assess prospective performance, changes over time in radiotherapy technology, patient characteristics, and clinical practices may still degrade its accuracy. Therefore, the model should be continuously monitored—e.g., by periodically evaluating its performance on a representative set of test plans and checking for prediction drift—and updated as needed to reflect alignment with current clinical practice [35].

It should be noted that EPIDs are perpendicular composite methods, in which the integrated image can mask certain delivery errors. Moreover, EPIDs are not ideal absolute dosimeters and require careful calibration due to their reduced thickness compared with the photon build-up region. A further potential limitation in non-coplanar arcs is that EPID-based PSQA does not fully represent the actual treatment delivery, as it does not account for couch rotation, unlike true composite methods.

Notably, studies including outliers have shown a tendency to overestimate low GPR values. This limitation was also observed in the current work, leading to a low R² score. Several strategies could be explored to improve model performance: increasing the number of samples with low GPR (e.g., by creating fictitious plans with extremely high complexity—although this may reduce the representativeness of the clinical dataset) and expanding the feature set by incorporating dosiomics features [14,32] or metrics derived from linac and EPID QA results. Nonetheless, ML models could provide indicators to anticipate potential failures and be integrated into clinical practice as support tools for an optimized PSQA program.

5. Conclusions

A hybrid stratified and time-series approach was applied to train an ML model for predicting PSQA outcomes in highly complex multi-target SRS treatments. The results suggest that plan parameters and complexity metrics provide valuable insights, offering promising potential for training models as supplementary tools for virtual QA, particularly in facilitating more efficient PSQA approaches.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/bioengineering12080897/s1: Table S1: Parameters searched in the randomized cross-validation and the resulting best combination; Table S2: Summary of recent studies which investigated the use of ML and DL models for PSQA; Figure S1: Correlation matrix of plan parameters and complexity metrics; Figure S2: Dendrogram of the hierarchical clustering on the Spearman rank–order correlations; Figure S3: PDP plots on the test set for the features showing a trivial dependence.

Author Contributions

Conceptualization, S.B., P.M. and N.L.; methodology, S.B. and N.L.; software, S.B. and N.L.; validation, P.M., A.B., P.G., F.L.F., F.L., L.P., M.P., G.R. and S.T.; formal analysis, S.B., P.M. and N.L.; resources, P.M., C.F., M.S. and C.L.; data curation, S.B., P.M. and N.L.; writing—original draft preparation, S.B.; writing—review and editing, P.M., A.B., P.G., F.L.F., F.L., L.P., M.P., G.R., C.F., S.T., M.S., C.L. and N.L.; visualization, S.B. and N.L.; supervision, P.M., C.F., M.S. and C.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

The authors would like to thank Victor Hernandez from Sant Joan de Reus University Hospital (Reus, Spain) and Jordi Saez from Hospital Clínic de Barcelona (Barcelona, Spain) for providing the MATLAB software.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AAPM	American Association of Physicists in Medicine
AUC	Area under the curve
DL	Deep learning
EPID	Electronic portal imaging device
GPR	Gamma passing rate
IMRT	Intensity-modulated radiotherapy
Linac	Linear accelerator
MAE	Mean absolute error
ML	Machine learning
PDP	Partial dependence plot
PSQA	Patient-specific quality assurance
ROC	Receiver operating characteristic
SRS	Stereotactic radiosurgery
TL	Threshold limit
TPS	Treatment planning system
VMAT	Volumetric modulated arc therapy

References

1. Teoh, M.; Clark, C.H.; Wood, K.; Whitaker, S.; Nisbet, A. Volumetric Modulated Arc Therapy: A Review of Current Literature and Clinical Use in Practice. Br. J. Radiol. 2011, 84, 967–996.
2. Dunn, L.; Tamborriello, A.; Subramanian, B.; Xu, X.; Ruruku, T.T. Assessing the Sensitivity and Suitability of a Range of Detectors for SIMT PSQA. J. Appl. Clin. Med. Phys. 2024, 25, e14343.
3. Xia, Y.; Adamson, J.; Zlateva, Y.; Giles, W. Application of TG-218 Action Limits to SRS and SBRT Pre-Treatment Patient Specific QA. J. Radiosurg. SBRT 2020, 7, 135–147.
4. Low, D.A. Gamma Dose Distribution Evaluation Tool. J. Phys. Conf. Ser. 2010, 250, 012071.
5. Miften, M.; Olch, A.; Mihailidis, D.; Moran, J.; Pawlicki, T.; Molineu, A.; Li, H.; Wijesooriya, K.; Shi, J.; Xia, P.; et al. Tolerance Limits and Methodologies for IMRT Measurement-Based Verification QA: Recommendations of AAPM Task Group No. 218. Med. Phys. 2018, 45, e53–e83.
6. McNiven, A.L.; Sharpe, M.B.; Purdie, T.G. A New Metric for Assessing IMRT Modulation Complexity and Plan Deliverability. Med. Phys. 2010, 37, 505–515.
7. Masi, L.; Doro, R.; Favuzza, V.; Cipressi, S.; Livi, L. Impact of Plan Parameters on the Dosimetric Accuracy of Volumetric Modulated Arc Therapy. Med. Phys. 2013, 40, 071718.
8. Park, J.M.; Wu, H.-G.; Kim, J.H.; Carlson, J.N.K.; Kim, K. The Effect of MLC Speed and Acceleration on the Plan Delivery Accuracy of VMAT. Br. J. Radiol. 2015, 88, 20140698.
9. Vieillevigne, L.; Khamphan, C.; Saez, J.; Hernandez, V. On the Need for Tuning the Dosimetric Leaf Gap for Stereotactic Treatment Plans in the Eclipse Treatment Planning System. J. Appl. Clin. Med. Phys. 2019, 20, 68–77.
10. Antoine, M.; Ralite, F.; Soustiel, C.; Marsac, T.; Sargos, P.; Cugny, A.; Caron, J. Use of Metrics to Quantify IMRT and VMAT Treatment Plan Complexity: A Systematic Review and Perspectives. Phys. Medica 2019, 64, 98–108.
11. Chiavassa, S.; Bessieres, I.; Edouard, M.; Mathot, M.; Moignier, A. Complexity Metrics for IMRT and VMAT Plans: A Review of Current Literature and Applications. Br. J. Radiol. 2019, 92, 20190270.
12. Lam, D.; Zhang, X.; Li, H.; Deshan, Y.; Schott, B.; Zhao, T.; Zhang, W.; Mutic, S.; Sun, B. Predicting Gamma Passing Rates for Portal Dosimetry-Based IMRT QA Using Machine Learning. Med. Phys. 2019, 46, 4666–4675.
13. Li, J.; Wang, L.; Zhang, X.; Liu, L.; Li, J.; Chan, M.F.; Sui, J.; Yang, R. Machine Learning for Patient-Specific Quality Assurance of VMAT: Prediction and Classification Accuracy. Int. J. Radiat. Oncol. Biol. Phys. 2019, 105, 893–902.
14. Hirashima, H.; Ono, T.; Nakamura, M.; Miyabe, Y.; Mukumoto, N.; Iramina, H.; Mizowaki, T. Improvement of Prediction and Classification Performance for Gamma Passing Rate by Using Plan Complexity and Dosiomics Features. Radiother. Oncol. 2020, 153, 250–257.
15. Wall, P.D.H.; Fontenot, J.D. Application and Comparison of Machine Learning Models for Predicting Quality Assurance Outcomes in Radiation Therapy Treatment Planning. Inform. Med. Unlocked 2020, 18, 100292.
16. Kusunoki, T.; Hatanaka, S.; Hariu, M.; Kusano, Y.; Yoshida, D.; Katoh, H.; Shimbo, M.; Takahashi, T. Evaluation of Prediction and Classification Performances in Different Machine Learning Models for Patient-Specific Quality Assurance of Head-And-Neck VMAT Plans. Med. Phys. 2022, 49, 727–741.
17. Lambri, N.; Hernandez, V.; Sáez, J.; Pelizzoli, M.; Parabicoli, S.; Tomatis, S.; Loiacono, D.; Scorsetti, M.; Mancosu, P. Multicentric Evaluation of a Machine Learning Model to Streamline the Radiotherapy Patient Specific Quality Assurance Process. Phys. Medica 2023, 110, 102593.
18. Zhu, H.; Zhu, Q.; Wang, Z.; Yang, B.; Zhang, W.; Qiu, J. Patient-Specific Quality Assurance Prediction Models Based on Machine Learning for Novel Dual-Layered MLC Linac. Med. Phys. 2023, 50, 1205–1214.
19. Tomori, S.; Kadoya, N.; Kajikawa, T.; Kimura, Y.; Narazaki, K.; Ochi, T.; Jingu, K. Systematic Method for a Deep Learning-Based Prediction Model for Gamma Evaluation in Patient-Specific Quality Assurance of Volumetric Modulated Arc Therapy. Med. Phys. 2021, 48, 1003–1018.
20. Wang, L.; Li, J.; Zhang, S.; Zhang, X.; Zhang, Q.; Chan, M.F.; Yang, R.; Sui, J. Multi-Task Autoencoder Based Classification-Regression Model for Patient-Specific VMAT QA. Phys. Med. Biol. 2020, 65, 235023.
21. Kimura, Y.; Kadoya, N.; Tomori, S.; Oku, Y.; Jingu, K. Error Detection Using a Convolutional Neural Network with Dose Difference Maps in Patient-Specific Quality Assurance for Volumetric Modulated Arc Therapy. Phys. Medica 2020, 73, 57–64.
22. Nicosia, L.; Figlia, V.; Mazzola, R.; Napoli, G.; Giaj-Levra, N.; Ricchetti, F.; Rigo, M.; Lunardi, G.; Tomasini, D.; Bonù, M.L.; et al. Repeated Stereotactic Radiosurgery (SRS) Using a Non-Coplanar Mono-Isocenter (HyperArc™) Technique versus Upfront Whole-Brain Radiotherapy (WBRT): A Matched-Pair Analysis. Clin. Exp. Metastasis 2020, 37, 77–83.
23. Lambri, N.; Dei, D.; Goretti, G.; Crespi, L.; Brioso, R.C.; Pelizzoli, M.; Parabicoli, S.; Bresolin, A.; Gallo, P.; La Fauci, F.; et al. Machine Learning and Lean Six Sigma for Targeted Patient-Specific Quality Assurance of Volumetric Modulated Arc Therapy Plans. Phys. Imaging Radiat. Oncol. 2024, 31, 100617.
24. Hernandez, V.; Saez, J.; Pasler, M.; Jurado-Bruggeman, D.; Jornet, N. Comparison of Complexity Metrics for Multi-Institutional Evaluations of Treatment Plans in Radiotherapy. Phys. Imaging Radiat. Oncol. 2018, 5, 37–43.
25. Crowe, S.B.; Kairn, T.; Kenny, J.; Knight, R.T.; Hill, B.; Langton, C.M.; Trapp, J.V. Treatment Plan Complexity Metrics for Predicting IMRT Pre-Treatment Quality Assurance Results. Australas. Phys. Eng. Sci. Med. 2014, 37, 475–482.
26. Park, J.M.; Park, S.-Y.; Kim, H.; Kim, J.H.; Carlson, J.; Ye, S.-J. Modulation Indices for Volumetric Modulated Arc Therapy. Phys. Med. Biol. 2014, 59, 7315–7340.
27. Du, W.; Cho, S.H.; Zhang, X.; Hoffman, K.E.; Kudchadker, R.J. Quantification of Beam Complexity in Intensity-Modulated Radiation Therapy Treatment Plans. Med. Phys. 2014, 41, 021716.
28. Younge, K.C.; Matuszak, M.M.; Moran, J.M.; McShan, D.L.; Fraass, B.A.; Roberts, D.A. Penalization of Aperture Complexity in Inversely Planned Volumetric Modulated Arc Therapy. Med. Phys. 2012, 39, 7160–7170.
29. Chan, M.F.; Witztum, A.; Valdes, G. Integration of AI and Machine Learning in Radiotherapy QA. Front. Artif. Intell. 2020, 3, 577620.
30. Osman, A.F.I.; Maalej, N.M. Applications of Machine and Deep Learning to Patient-Specific IMRT/VMAT Quality Assurance. J. Appl. Clin. Med. Phys. 2021, 22, 20–36.
31. Ono, T.; Iramina, H.; Hirashima, H.; Adachi, T.; Nakamura, M.; Mizowaki, T. Applications of Artificial Intelligence for Machine- and Patient-Specific Quality Assurance in Radiation Therapy: Current Status and Future Directions. J. Radiat. Res. 2024, 65, 421–432.
32. Han, C.; Zhang, J.; Yu, B.; Zheng, H.; Wu, Y.; Lin, Z.; Ning, B.; Yi, J.; Xie, C.; Jin, X. Integrating Plan Complexity and Dosiomics Features with Deep Learning in Patient-Specific Quality Assurance for Volumetric Modulated Arc Therapy. Radiat. Oncol. 2023, 18, 116.
33. Noblet, C.; Maunet, M.; Duthy, M.; Coste, F.; Moreau, M. A TPS Integrated Machine Learning Tool for Predicting Patient-Specific Quality Assurance Outcomes in Volumetric-Modulated Arc Therapy. Phys. Medica 2024, 118, 103208.
Lambri, N.; Zaccone, C.; Bianchi, M.; Bresolin, A.; Dei, D.; Gallo, P.; La Fauci, F.; Lobefalo, F.; Paganini, L.; Pelizzoli, M.; et al. Optimization of Replanning Processes for Volumetric Modulated Arc Therapy Plans at Risk of QA Failure Predicted by a Machine Learning Model. Appl. Sci. 2024, 14, 6103. [Google Scholar] [CrossRef]
Hurkmans, C.; Bibault, J.-E.; Brock, K.K.; Van Elmpt, W.; Feng, M.; David Fuller, C.; Jereczek-Fossa, B.A.; Korreman, S.; Landry, G.; Madesta, F.; et al. A Joint ESTRO and AAPM Guideline for Development, Clinical Validation and Reporting of Artificial Intelligence Models in Radiation Therapy. Radiother. Oncol. 2024, 197, 110345. [Google Scholar] [CrossRef] [PubMed]

Figure 1. Pipeline for model selection and assessment. Abbreviations: CV = cross-validation; TL = threshold limit.

Figure 2. Scatter plot (left) and residual distribution (right) of the measured and predicted GPRs. The dashed lines represent a perfect regression model, while the solid lines denote a baseline model always predicting the average GPR.

Figure 3. ROC and precision–recall curves on the test set. The dashed lines denote a dummy classifier always predicting the most frequent class (i.e., “pass”). Abbreviations: AP = average precision; ROC = receiver operating characteristic.

Figure 4. Partial dependence plots (PDPs) on the test set for the Area, Median MLCGap, MeanRR, and BM features.

Table 1. Input features used for model training. In italics are the fifteen features selected with hierarchical clustering.

#	Name	Description
1	Area	Field aperture area (mm²)
2	MUOverDosePerFraction	Monitor units normalized to the prescribed dose per fraction (MU/Gy)
3	MeanMLCSpeed	Mean speed of all in-field leaves (cm/s)
4	MLCSpeedModulation	Sum of MLC speed variations divided by total leaf travel (cm/s mm⁻¹)
5	MeanRR	Mean dose rate (MU/min)
6	RRModulation	Total dose rate variation divided by arc length (MU/min deg⁻¹)
7	MeanGS	Mean gantry speed (deg/s)
8	GSModulation	Total gantry speed variation divided by arc length (deg/s deg⁻¹)
9	Q1 MLCGap	First quartile of MLC gap size distribution (mm)
10	Median MLCGap	Median of MLC gap size distribution (mm)
11	SAS10 [25]	Small aperture score: the fraction of MLC gap sizes < 10 mm
12	MeanTGI [9]	Mean tongue and groove index: irregularity in beam aperture shapes
13	MCS [6]	Modulation complexity score: a combination of aperture area variability (AAV) and leaf sequence variability (LSV)
14	MITotal [26]	Modulation index total: combines MLC dynamics, gantry speed variability, and dose rate variability
15	BI [27]	Beam irregularity: measures the non-circularity of the MLC aperture
16	BM [27]	Beam modulation: indicates to what extent the beam is delivered through small apertures
17	EdgeMetric [28]	Ratio of MLC side length to aperture area (mm⁻¹)
18	LT/AL [7]	Average leaf travel distance divided by the arc length (mm/deg)
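
The gap-based entries of Table 1 (Q1 MLCGap, Median MLCGap, and SAS10) are summaries of the MLC gap size distribution of an arc. The following is a minimal sketch of how such features could be computed, not the authors' implementation. It assumes the gap sizes of all open leaf pairs over all control points are already available as a flat list in mm; the function names `small_aperture_score` and `gap_features` are hypothetical, and the quartile interpolation rule is an assumption, since the table does not specify one.

```python
from statistics import quantiles


def small_aperture_score(gaps_mm, threshold_mm=10.0):
    """Fraction of MLC gap sizes strictly below threshold_mm.

    With threshold_mm = 10 this corresponds to SAS10 as described in Table 1.
    """
    if not gaps_mm:
        raise ValueError("at least one gap size is required")
    return sum(g < threshold_mm for g in gaps_mm) / len(gaps_mm)


def gap_features(gaps_mm):
    """Summarise the gap size distribution of one arc (hypothetical helper)."""
    # Assumption: linear interpolation between data points ("inclusive" method).
    q1, q2, _q3 = quantiles(gaps_mm, n=4, method="inclusive")
    return {
        "Q1 MLCGap": q1,
        "Median MLCGap": q2,
        "SAS10": small_aperture_score(gaps_mm, 10.0),
    }


# Illustrative gap sizes in mm (made-up values, not data from the study).
example_gaps = [2, 4, 6, 8, 12, 20, 30, 40]
features = gap_features(example_gaps)
```

Whether gaps should be weighted (for example by monitor units per control point) or restricted to in-field leaves is not stated in the table, so the unweighted form above is only one possible reading.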

Table 2. Distribution of measured GPR for the full set. Abbreviations: GPR = gamma passing rate; Q1 = 1st quartile; and Q3 = 3rd quartile.

Interval (%)	Median GPR (%)	Q1–Q3 (%)	Number of Arcs
[95, 100]	99.1	97.7–99.8	1828
[90, 95)	93.1	91.8–94.1	400
[85, 90)	88.1	86.9–89.1	151
[80, 85)	83.1	82.1–84.4	51
Full set	98.3	95.1–99.6	2430

Table 3. Model assessment. Abbreviations: AbsErr = absolute error; MAE = mean absolute error; ROC-AUC = receiver operating characteristic area under curve; AP = average precision; and TL = threshold limit.

Metric	Value
Cross-validation MAE	2.5% ± 0.1%
Test MAE	2.6%
% arcs with AbsErr ≤ 3%, 5%, 10%	70%, 88%, 98%
75th, 90th, 95th, and 98th percentile of AbsErr	3.3%, 5.5%, 7.5%, 10%
R²	0.27 (baseline 0) ¹
ROC-AUC	0.85 (baseline 0.50)
AP	0.67 (baseline 0.32) ²
Sensitivity (97% TL)	93%
Specificity (97% TL)	56%
Precision (97% TL)	50%

¹ The baseline R² = 0 denotes a model that always predicts the average GPR.
² The baseline AP = 0.32 indicates a model that always predicts the positive class (“fail”).
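
The regression and classification metrics in Table 3 can be illustrated with a short sketch. It is not the authors' code and rests on two assumptions: an arc counts as a true “fail” (the positive class) when its measured GPR is below the 95% action limit, and it is flagged as a predicted fail when its predicted GPR is below the 97% threshold limit. All function names and sample values are hypothetical.

```python
def mean_absolute_error(measured, predicted):
    """MAE between measured and predicted GPR values (in %)."""
    return sum(abs(m - p) for m, p in zip(measured, predicted)) / len(measured)


def r_squared(measured, predicted):
    """Coefficient of determination; always predicting the mean gives 0."""
    mean = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


def fail_detection_metrics(measured, predicted,
                           action_limit=95.0, threshold_limit=97.0):
    """Sensitivity, specificity, and precision with "fail" as positive class."""
    tp = fp = tn = fn = 0
    for m, p in zip(measured, predicted):
        fails = m < action_limit        # assumption: ground truth from measurement
        flagged = p < threshold_limit   # assumption: TL applied to the prediction
        if fails and flagged:
            tp += 1
        elif fails:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
    }


# Made-up GPR values (%), for illustration only.
measured = [99.0, 98.0, 96.0, 94.0, 92.0, 88.0]
predicted = [99.0, 96.5, 97.5, 96.0, 97.2, 91.0]

mae = mean_absolute_error(measured, predicted)
metrics = fail_detection_metrics(measured, predicted)
baseline_r2 = r_squared(measured, [sum(measured) / len(measured)] * len(measured))
```

Setting the threshold limit above the action limit trades specificity for sensitivity: more arcs are flagged for measurement, so fewer true failures are missed, which is consistent with the high sensitivity and lower specificity reported in the table.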

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
