Article

Robustness of Machine Learning Predictions for Determining Whether Deep Inspiration Breath-Hold Is Required in Breast Cancer Radiation Therapy

1 Department of Oral and Maxillofacial Radiology, Graduate School of Medicine, Dentistry and Pharmaceutical Sciences, Okayama University, Okayama 700-8558, Japan
2 Department of Oral Medicine and Oral Surgery, Faculty of Dentistry, Jordan University of Science and Technology, Irbid 22110, Jordan
3 Radiological Technology, Graduate School of Health Sciences, Okayama University, Okayama 700-8558, Japan
4 Department of Radiology, Matsuyama Red Cross Hospital, Matsuyama 790-8524, Japan
5 Department of Health and Welfare Science, Graduate School of Health and Welfare Science, Okayama Prefectural University, Okayama 719-1197, Japan
6 Graduate School of Interdisciplinary Sciences and Engineering in Health Systems, Okayama University, Okayama 700-8558, Japan
7 Department of Oral Radiology, Faculty of Dentistry, Hasanuddin University, Sulawesi 90245, Indonesia
8 Department of Dentistry and Dental Surgery, College of Medicine and Health Sciences, An-Najah National University, Nablus 44839, Palestine
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Diagnostics 2025, 15(6), 668; https://doi.org/10.3390/diagnostics15060668
Submission received: 24 December 2024 / Revised: 31 January 2025 / Accepted: 6 March 2025 / Published: 10 March 2025
(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)

Abstract

Background/Objectives: Deep inspiration breath-hold (DIBH) is a commonly used technique to reduce the mean heart dose (MHD), which is critical for minimizing late cardiac side effects in breast cancer patients undergoing radiation therapy (RT). Although previous studies have explored the potential of machine learning (ML) to predict which patients might benefit from DIBH, none have rigorously assessed ML model performance across various MHD thresholds and parameter settings. This study aims to evaluate the robustness of ML models in predicting the need for DIBH across different clinical scenarios. Methods: Using data from 207 breast cancer patients treated with RT, we developed and tested ML models at three MHD cut-off values (240, 270, and 300 cGy), considering variations in the number of independent variables (three vs. six) and folds in the cross-validation (three, four, and five). Robustness was defined as achieving high F2 scores and low instability in predictive performance. Results: Our findings indicate that the decision tree (DT) model demonstrated consistently high robustness at 240 and 270 cGy, while the random forest model performed optimally at 300 cGy. At 240 cGy, a threshold critical to minimize late cardiac risks, the DT model exhibited stable predictive power, reducing the risk of overestimating DIBH necessity. Conclusions: These results suggest that the DT model, particularly at lower MHD thresholds, may be the most reliable for clinical applications. By providing a tool for targeted DIBH implementation, this model has the potential to enhance patient-specific treatment planning and improve clinical outcomes in RT.

1. Introduction

Radiation therapy (RT) is a vital part of breast cancer treatment [1,2]. Traditional RT methods often expose the heart and lungs to high doses of radiation, which can lead to long-term side effects, especially for those with left-sided breast cancer. These complications can significantly affect patient survival and are often observed within 10 years following treatment [3,4]. Previous studies on patients with breast cancer treated with RT have used the mean heart dose (MHD) as a measure of the radiation dose delivered to the heart [4,5,6,7]. Therefore, researchers now aim to reduce the MHD to improve overall survival rates.
The MHD cut-off value is usually set based on clinical guidelines and research findings [8]. Radiation oncologists and medical physicists work together to determine safe dose limits for critical organs such as the heart [9]. This is especially crucial in left-sided breast cancer because the heart lies so close to the treatment area, increasing the risk of cardiac complications. The deep inspiration breath-hold (DIBH) technique is commonly used to reduce the MHD in patients with left-sided breast cancer [10,11,12]. During DIBH, patients take a deep breath and hold it during the delivery of radiation. This inflates the lungs and displaces the heart from the treatment area, thereby lowering the heart’s radiation exposure.
To assess its benefits in RT for breast cancer, DIBH is often compared with free-breathing (FB) techniques without DIBH [13]. This comparison helps ensure that the treatment is as safe and effective as possible, but it can be costly and time-consuming for both patients and RT staff [14]. By using machine learning (ML) approaches, RT staff can analyze patient data to predict who will benefit most from DIBH [15,16]. This targeted approach means that only patients likely to see significant advantages will undergo the additional steps required for DIBH, leading to time and cost savings in both the short and long term.
In our previous publication, we showed that ML models could effectively predict the MHD using a specific cut-off value of 300 cGy [15]. However, in clinical practice, multiple cut-off values are often required to address different patients’ needs and different treatment protocols [8]. In this study, we aim to assess whether these ML models can consistently maintain a stable predictive performance [17] across various clinically relevant cut-off values. Additionally, we evaluate the robustness of the models [18,19] in terms of their ability to adapt to changes in the modeling process, such as variations in the number of independent variables or adjustments to the number of folds in cross-validation (CV). This evaluation is essential to determine how resilient these models are when exposed to different clinical scenarios and how they may influence the management of radiation therapy patients. To the best of our knowledge, no previously published studies have evaluated the robustness of MHD predictions using ML at different cut-off values.

2. Materials and Methods

2.1. Study Population

Our study comprised 207 patients diagnosed with left-sided breast cancer who underwent field-in-field (FIF) RT with FB at Okayama University Hospital between 2009 and 2016. These patients were selected from consecutive females with left-sided early-stage breast cancer. Exclusion criteria were simultaneous bilateral breast cancer, treatment with regional nodal irradiation, and treatment using hypo-fractionated irradiation. The patients received treatment at our facility using either the conventional FIF technique with one reference point or an innovative FIF approach employing two reference points (FIF-2RP) [20]. After partial breast resection, all patients received whole-breast irradiation at 200 cGy per fraction in 25 fractions, for a total of 5000 cGy. Eighty-eight patients received an additional 1000–1600 cGy boost to the tumor bed. The heart dose during the 5000 cGy irradiation was the subject of this study [15]. Prior to participation, patients provided written informed consent for RT and the use of their de-identified data for scientific analysis. This investigation adhered to the principles outlined in the Declaration of Helsinki, revised in 2013. Approval for utilizing de-identified post-radiation data was obtained from the Ethical Review Board of our institution (approval no. 2103-024).

2.2. Data Collection

In March 2021, we retrospectively collected patient data from the RT planning system following computed tomography (CT) simulations. Key parameters, including breast separation (SEP), chest wall thickness (CWT), and the MHD, were carefully documented. SEP and CWT were evaluated for each patient using single-slice CT images taken at the nipple level as shown in Figure S1. SEP was defined as the distance along the posterior edge of the tangent fields, while CWT represents the distance from the nipple surface to the lung, measured perpendicularly to SEP, as described in a previous study [15]. Additionally, we retrieved demographic and clinical information from each patient’s medical records, including their age and body mass index (BMI), the tumor location, and the specific RT technique employed. Table 1 summarizes the patient characteristics.

2.3. ML Models

In this study, we utilized Anaconda Python version 3.9, along with various Python libraries (Python Software Foundation, Wilmington, DE, USA), to develop and experiment with our ML models. A total of ten supervised ML models were employed to accurately classify patients into low- or high-MHD categories based on predefined cut-off values. The models included gradient boosting (GB), decision tree (DT), bagging, deep neural network (DNN), random forest (RF), K-nearest neighbor (KNN), support vector machine (SVM), naïve Bayes (NB), logistic regression (LR), and ridge classifier (RC) models. These models were used to identify relationships and dependencies between the dependent variable (MHD) and the independent variables (SEP, CWT, age, BMI, tumor location, and RT method), enabling the prediction of a high or low MHD based on patterns learned from the training dataset.
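For readers unfamiliar with these algorithm families, the ten classifiers correspond to standard scikit-learn estimators. The following is a minimal sketch of how they might be instantiated; the hyperparameters shown (defaults plus iteration limits) and the use of `MLPClassifier` for the DNN are our assumptions for illustration, since the study's tuned settings are provided in File S2.

```python
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# One estimator per algorithm family named in the text.  Hyperparameters
# are left near their defaults here; in the study they were tuned by
# grid search (see Section 2.4).
MODELS = {
    "GB": GradientBoostingClassifier(random_state=0),
    "DT": DecisionTreeClassifier(random_state=0),
    "Bagging": BaggingClassifier(random_state=0),
    "DNN": MLPClassifier(max_iter=1000, random_state=0),  # assumed implementation
    "RF": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(random_state=0),
    "NB": GaussianNB(),
    "LR": LogisticRegression(max_iter=1000, random_state=0),
    "RC": RidgeClassifier(random_state=0),
}
```

Each estimator exposes the same `fit`/`predict` interface, which makes it straightforward to loop over all ten models with identical training and evaluation code.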
Additionally, to address the class imbalance in the training data, we applied the synthetic minority over-sampling technique (SMOTE) in conjunction with the “imblearn” pipeline to increase the representation of underrepresented high- or low-MHD patients [21].

2.4. Model-Building Process

The model-building process involved exploring various configurations encompassing changes in the number of independent variables, classification cut-off values, and the number of folds used in the grid-search CV (GridSearchCV) process. Two primary configurations were considered: one utilizing three independent variables (SEP, CWT, and BMI) and the other incorporating six independent variables (SEP, CWT, BMI, age, tumor location, and RT method). Furthermore, the classification cut-off values of 240, 270, and 300 cGy were evaluated, alongside the number of folds in GridSearchCV (three, four, and five). The dataset at 240 cGy is summarized in File S1.
This comprehensive approach resulted in the creation of eighteen distinct sub-models for each ML model, each tailored to a specific setting.
A general overview of the model building process is provided in Figure 1. The first step involved randomly splitting the dataset into training and test sets in an 80:20 ratio. Due to the imbalanced nature of the dataset, a stratified split was used to ensure that the proportion of patients in each class (low and high MHD) was consistent across both the original dataset and the partitions. This led to an 80% representation of each class in the training set and 20% in the test set [22]. This approach was selected for its ability to preserve data integrity while effectively managing class imbalances within the dataset.
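The stratified 80:20 split described above can be reproduced with scikit-learn's `train_test_split` by passing the labels to `stratify`; the toy labels below mimic an imbalanced low/high-MHD distribution.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data with a 20% positive (high-MHD) class.
X = np.arange(200, dtype=float).reshape(-1, 1)
y = np.array([0] * 160 + [1] * 40)

# stratify=y keeps the class proportions identical in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)

print(y_tr.mean(), y_te.mean())  # both 0.2: class proportion preserved
```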
The next step involved fine-tuning the parameters of each model using the training dataset through a hyperparameter tuning process. With the primary goal of accurately identifying patients who might not require DIBH, our focus was on minimizing false negatives (i.e., patients incorrectly classified as having a low MHD). To achieve this, the models were trained using the F2 score as the primary performance metric within a GridSearchCV framework, as the F2 score places a greater emphasis on minimizing false negatives. In the next step, the models were built using the optimal hyperparameters determined from this tuning process.
In our study, hyperparameter tuning was conducted using repeated stratified K-fold CV (RSKCV), a technique employed to enhance the reliability of model performance evaluation [23]. RSKCV involves systematically partitioning the dataset into K folds while maintaining a consistent distribution of classes in each fold. This process is repeated multiple times to mitigate variability in performance estimates. Specifically, we utilized RSKCV within the GridSearchCV framework to evaluate various hyperparameter configurations. By repeatedly sampling and stratifying the data, RSKCV ensured a robust assessment of model performance, aiding in the selection of hyperparameters that generalize effectively to unseen data. This approach was pivotal in optimizing our models’ performance while minimizing the risk of overfitting [24].
Notably, to address potential biases introduced by synthetic high- or low-MHD patients generated through SMOTE, these synthetic instances were exclusively added into the training folds—not into the validation folds—using an “imblearn” pipeline. This measure ensured that the validation of our models relied solely on real data.
The code for hyperparameter tuning is shown in File S2.

2.5. Model Evaluation

To rigorously assess the models’ performance, an external evaluation was conducted using an independent test set of 42 patients who were entirely distinct from those involved in model training and construction. This external validation step ensured that the models’ effectiveness extended beyond the training data and more accurately reflected their real-world utility.
During this evaluation, the classification cut-off value was systematically varied to encompass clinically relevant thresholds: MHD ≥ 240 cGy; MHD ≥ 270 cGy; and MHD ≥ 300 cGy. This approach allowed for a nuanced analysis of the models’ performance across different levels of sensitivity and specificity, catering to diverse clinical needs and scenarios.
The primary metric used for assessing each model’s performance was the F2 score [15], chosen because it weighs recall more heavily than precision and thus prioritizes minimizing false negatives, a critical consideration in medical decision making. Significant differences in the F2 scores among the models were analyzed with the permutation test using R version 4.3.2 (R Core Team) and the “stats” package. Values of p < 0.05 were considered statistically significant. Model instability, defined as the difference between the minimum and maximum F2 scores, was assessed relative to the median instability value for each cut-off value. Models were categorized as having “high” or “low” instability if their instability exceeded or fell below the median value, respectively, for each cut-off value.
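The two quantities defined above reduce to a few lines of code: `fbeta_score` with `beta=2` gives the F2 score, and instability is the range of F2 scores across the six sub-models built per cut-off value. The labels and sub-model scores below are illustrative, not the study's results.

```python
from sklearn.metrics import fbeta_score

# F2 weights recall four times as heavily as precision (beta**2 = 4),
# so false negatives are penalized most.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]  # 1 false negative, 1 false positive
f2 = fbeta_score(y_true, y_pred, beta=2)  # precision = recall = 0.75 -> F2 = 0.75

# Instability as defined in the text: max minus min F2 across the six
# sub-models for one cut-off value (illustrative numbers).
sub_model_f2 = [0.70, 0.71, 0.69, 0.70, 0.72, 0.70]
instability = max(sub_model_f2) - min(sub_model_f2)
print(round(f2, 3), round(instability, 3))  # 0.75 0.03
```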
By leveraging this robust evaluation framework, we aimed to provide comprehensive insights into the models’ efficacy and generalizability, thereby bolstering confidence in their real-world deployment and clinical impact.
The code for the best performance results is shown in File S3.

2.6. Predicted DIBH

To accurately assess the differences between the predicted and real incidences of DIBH at different radiation doses, we conducted a comparative analysis. This involved creating a graph that showed both the predicted and actual percentages of patients needing DIBH in the test set.
For this analysis, we followed these steps:
1. We selected the best-performing model at each classification cut-off value.
2. Using Formula (1), we recorded the actual percentage of patients requiring DIBH (real DIBH) for the best-performing model:
Real DIBH = ((TP + FN)/(total patients in the test set)) × 100% (1)
where TP represents true positives and FN represents false negatives.
3. Using Formula (2), we calculated the predicted percentage of patients needing DIBH (predicted DIBH):
Predicted DIBH = ((TP + FP)/(total patients in the test set)) × 100% (2)
where FP represents false positives.
4. Finally, we plotted the real and predicted percentages of patients needing DIBH for each classification cut-off value to visualize and compare the discrepancies.
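Formulas (1) and (2) are simple confusion-matrix arithmetic, sketched below; the counts used are illustrative for a 42-patient test set, not the study's actual results.

```python
def dibh_percentages(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float]:
    """Real and predicted DIBH incidence (%) following Formulas (1) and (2)."""
    total = tp + fp + fn + tn
    real = (tp + fn) / total * 100.0       # patients truly above the cut-off
    predicted = (tp + fp) / total * 100.0  # patients the model flags for DIBH
    return real, predicted

# Illustrative confusion-matrix counts for a 42-patient test set.
real, predicted = dibh_percentages(tp=10, fp=4, fn=2, tn=26)
print(f"real {real:.1f}%, predicted {predicted:.1f}%")  # real 28.6%, predicted 33.3%
```

A predicted value above the real value, as in this toy example, corresponds to the overestimation of DIBH necessity discussed in Section 3.3.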

3. Results

3.1. Patient Characteristics

The characteristics of patients who were involved in this study are shown in Table 1.

3.2. Model Performance and Robustness

In this study, we created models by adjusting different factors such as the classification cut-off value, the number of independent variables, and the folds in CV. Table 2 shows the F2 scores and the predictive performance of the models under these different conditions. Additionally, Table 3 presents the results of pairwise permutation tests between models using different cut-off values. The numbers in Table 3 indicate the p-values of pairwise permutation tests. The robustness of the ML models was evaluated across different cut-off values (240, 270, and 300 cGy) based on the models’ median F2 scores and instability metrics.
The median instability values were 0.100, 0.121, and 0.255 for cut-off values of 240, 270, and 300 cGy, respectively.
At a cut-off value of 240 cGy, GB demonstrated superior performance, with the highest median F2 score of 0.846, but also exhibited the highest model instability of 0.454. DT showed consistent performance, with the second-highest median F2 score of 0.701, and low instability (0.038). Bagging had the third-highest median F2 score (0.683) but with high instability (0.174). Based on Table 3, no significant difference in the median F2 score was observed among GB, DT, and bagging. As a result, DT was the most robust model at the cut-off value of 240 cGy.
At 270 cGy, DT achieved the highest median F2 score, 0.795, and showed notable robustness, achieving the lowest model instability (0.018). GB achieved the second-highest median F2 score, 0.735, but the highest instability (0.823). Therefore, DT was the most robust model at the cut-off value of 270 cGy.
For the cut-off value of 300 cGy, bagging and KNN exhibited the highest (0.789) and third-highest (0.750) median F2 scores but with high instability. In contrast, RF showed the second-highest median F2 score of 0.756, with low instability (0.089). No significant differences in the median F2 score were observed among bagging, KNN, and RF. As a result, RF was the most robust model at the cut-off value of 300 cGy.

3.3. Comparison Between Predicted DIBH and Real DIBH

Figure 2 presents a comparison between the predicted and real percentages of patients requiring DIBH using the best-performing model at each classification cut-off value: DT at 240 and 270 cGy and RF at 300 cGy. This analysis reveals the discrepancies between the predicted and actual incidences of DIBH across different radiation doses.
The graph indicates that the models tended to overestimate DIBH incidence compared to actual patient data across all cut-off values. However, at 240 cGy, the model showed only a modest discrepancy of 9.5% between the predicted and actual DIBH incidences, whereas it showed a 31.0% discrepancy at both 270 and 300 cGy.

4. Discussion

In this study, we evaluated the robustness and stability of ML models for identifying patients who may not require DIBH across various classification cut-off values. Additionally, we examined the effect of altering the number of independent variables and the number of CV folds on model performance. We identified the most robust ML models as DT (at cut-off values of 240 and 270 cGy) and RF (at 300 cGy) based on their high median F2 scores and low instability. In contrast, GB was not considered robust due to its high instability, despite achieving a high median F2 score at 240 and 270 cGy.
The choice of ML models for predicting the need for DIBH in breast cancer RT is based on their ability to handle complex and multidimensional data. KNN and the DNN are particularly effective for modeling nonlinear relationships, such as those between the BMI and SEP in influencing heart dose [15], while LR is strong for binary classification tasks [25]. RF enhances stability and accuracy by leveraging an ensemble approach, improving predictions in complex datasets [26]. NB is well-suited for small datasets, offering reliable performance despite limited data availability [27].
The reliability and robustness of ML models are critical considerations in clinical settings [28]. These models are typically trained and optimized under specific conditions, including fixed parameter settings and consistent data distributions. However, their performance may vary when applied to new conditions or when key parameters are adjusted [29]. Variations in patient demographics, data quality, or disease prevalence can all influence model accuracy and reliability [30]. Bouthillier et al. highlighted challenges related to reproducibility in ML research, emphasizing the importance of standardized evaluation protocols to ensure robustness across diverse conditions [31]. Additionally, Goodfellow et al. demonstrated how slight changes in input data can significantly affect model predictions, underscoring the need for robust training methods [32].
To comprehensively assess model robustness, we evaluated model instability by calculating the range of F2 scores across six sub-models constructed for each cut-off value [17]. This measure underscores the sensitivity of ML models to parameter changes and emphasizes the necessity of rigorous evaluation to ensure consistent clinical performance.
In clinical practice, having stable and reliable ML models across different cut-off values is paramount [33]. Cut-off values often determine critical decision thresholds such as treatment recommendations. Instability in model performance at different cut-off values can lead to inconsistent clinical decisions, potentially compromising patient safety and treatment outcomes. Kamizaki et al. identified the DNN as the optimal algorithm for DIBH prediction, achieving an F2 score of 0.80 [34], while KNN was the best-performing model in one of our studies, with an F2 score of 0.67 [15]. However, our study emphasizes that model robustness under varying constraints is more clinically relevant than merely achieving the highest performance metrics. While the DNN in Kamizaki’s study demonstrated superior performance, our findings highlight the importance of stability, ensuring that consistent and dependable decision making is possible even under different clinical conditions. Healthcare professionals can trust predictions and recommendations made by ML systems using reliable models, regardless of minor variations in input parameters. By developing and validating models that demonstrate robustness across a range of cut-off values, we can enhance the dependability of ML applications in RT, ultimately improving patient management and health outcomes.
In our study, DT at 240 and 270 cGy and RF at 300 cGy emerged as the most robust ML models, consistently achieving high median F2 scores and low instability. This aligns with the literature, which highlights the robustness and consistency of DT [35] and RF [26] models in predictive modeling applications. In contrast, at 240 and 270 cGy, the GB model exhibited the highest instability among all models. This instability can be attributed to several factors inherent to GB algorithms. GB might be prone to overfitting, particularly in small or noisy datasets, which can result in fluctuations in performance when subjected to various changes in data or parameter settings [36]. The sequential nature of GB, which builds an ensemble of weak learners to correct errors incrementally, further contributes to its sensitivity to data variations [37].
In this research, we selected three cut-off values of 240, 270, and 300 cGy. Following the Quantitative Analyses of Normal Tissue Effects in the Clinic (QUANTEC) guidelines [7], using a high cut-off value such as 300 cGy, as previously reported [15], would result in fewer patients being treated with DIBH compared with using lower cut-off values such as 240 or 270 cGy. Consequently, the number of late cardiac side effects might increase for patients treated with the higher cut-off value, although these values should be selected based on several guidelines for breast cancer RT. Our evaluation of model performance revealed a discrepancy between real and predicted DIBH outcomes across the three cut-off values. This sensitivity underscores the critical impact of threshold selection on predictive accuracy. Throughout our analysis, the models consistently tended to overestimate the necessity of DIBH, potentially resulting in misclassifications. However, we found the 240 cGy cut-off value to be particularly promising for DIBH predictions. DT, our best-performing model at this cut-off value, achieved a high median F2 score (0.701) with low model instability (0.038). In particular, DT exhibited a minimal discrepancy of only 9.5% between real and predicted DIBH incidences at this cut-off value. These results underscore the 240 cGy threshold as the most accurate and suitable for clinical application.
A major limitation of our study is its retrospective design and the specific patient group included. Because our dataset originates from a single hospital, it may not represent all breast cancer patients and may carry selection biases. Additionally, we used a specific technique (FIF-2RP) that might not be used in other hospitals. Another limitation is the small, imbalanced dataset, which may constrain the robustness and generalizability of our models. To minimize biased performance estimates, we applied RSKCV exclusively to the training set and evaluated our models on a small, independent test set of unseen data. Nevertheless, computational constraints prevented us from employing bootstrapping with optimism correction, which could have provided a more robust assessment of model performance. While our study considered the MHD as the primary factor influencing the decision to use DIBH, clinical practice also considers omics features such as HER2 expression, which can influence decisions on the use of trastuzumab (Herceptin), a drug with associated cardiac risks. Notably, the CHECK HEART-BC study found that 8.5% of breast cancer patients developed cardiomyopathy, with the concurrent use of trastuzumab and radiotherapy identified as a significant risk factor for this adverse outcome [38]. Therefore, features beyond the MHD may also affect the decision to use DIBH in practice. Our study focused on variables obtainable from single-slice CT scans, including CWT and SEP, for their convenience in daily clinical practice. However, incorporating additional volumetric variables from multi-slice CT, such as heart volume in the field, lung volume changes, and maximum heart depth [39], might enhance model accuracy. More studies with larger, multi-institutional cohorts and a prospective design are needed to confirm our models’ reliability and clinical usefulness.

5. Conclusions

In summary, our study shows the importance of evaluating the robustness and reliability of ML models in predicting the need for DIBH in patients receiving RT for left-sided breast cancer. We found that DT and RF emerged as the top-performing models in our study, demonstrating a consistent and reliable performance across various conditions. Despite the limitations of our retrospective, single-institution study, our findings provide useful insights for improving ML models for clinical decision making.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/diagnostics15060668/s1. Figure S1: Single-slice CT parameters; File S1: Dataset at 240 cGy; File S2: Hyperparameter tuning (DT); File S3: Performance results (DT).

Author Contributions

Conceptualization, W.E.A.-H. and M.K.; methodology, W.E.A.-H. and M.K.; software, W.E.A.-H. and M.K.; validation, W.E.A.-H. and M.K.; formal analysis, W.E.A.-H. and M.K.; investigation, W.E.A.-H. and M.K.; resources, W.E.A.-H. and M.K.; data curation, W.E.A.-H., M.K., G.A.J., M.F., R.K., K.K., S.Y., Y.N., M.O., Y.T., K.S., I.S., M.B., N.T., M.H. and J.A.; writing—original draft preparation, W.E.A.-H. and M.K.; writing—review and editing, W.E.A.-H. and M.K.; visualization, W.E.A.-H. and M.K.; supervision, W.E.A.-H. and M.K.; project administration, W.E.A.-H. and M.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

This study was conducted in accordance with the Declaration of Helsinki and approved by the Institutional Review Board of Okayama University Hospital (2103-024, 19 February 2021).

Informed Consent Statement

Informed consent was obtained from all subjects involved in this study.

Data Availability Statement

The original contributions presented in this study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Darby, S.; McGale, P.; Correa, C.; Taylor, C.; Arriagada, R.; Clarke, M.; Cutter, D.; Davies, C.; Ewertz, M.; Godwin, J.; et al. Effect of radiotherapy after breast-conserving surgery on 10-year recurrence and 15-year breast cancer death: Meta-analysis of individual patient data for 10,801 women in 17 randomised trials. Lancet 2011, 378, 1707–1716. [Google Scholar] [CrossRef] [PubMed]
  2. Clarke, M.; Collins, R.; Darby, S.; Davies, C.; Elphinstone, P.; Evans, V.; Godwin, J.; Gray, R.; Hicks, C.; James, S.; et al. Effects of radiotherapy and of differences in the extent of surgery for early breast cancer on local recurrence and 15-year survival: An overview of the randomised trials. Lancet 2005, 366, 2087–2106. [Google Scholar] [CrossRef] [PubMed]
  3. Sardar, P.; Kundu, A.; Chatterjee, S.; Nohria, A.; Nairooz, R.; Bangalore, S.; Mukherjee, D.; Aronow, W.S.; Lavie, C.J. Long-term cardiovascular mortality after radiotherapy for breast cancer: A systematic review and meta-analysis. Clin. Cardiol. 2017, 40, 73–81. [Google Scholar] [CrossRef]
  4. Taylor, C.; Correa, C.; Duane, F.K.; Aznar, M.C.; Anderson, S.J.; Bergh, J.; Dodwell, D.; Ewertz, M.; Gray, R.; Jagsi, R.; et al. Estimating the risks of breast cancer radiotherapy: Evidence from modern radiation doses to the lungs and heart and from previous randomized trials. J. Clin. Oncol. 2017, 35, 1641–1649. [Google Scholar] [CrossRef]
  5. Drost, L.; Yee, C.; Lam, H.; Zhang, L.; Wronski, M.; McCann, C.; Lee, J.; Vesprini, D.; Leung, E.; Chow, E. A systematic review of heart dose in breast radiotherapy. Clin. Breast Cancer 2018, 18, e819–e824. [Google Scholar] [CrossRef]
  6. Jacob, S.; Camilleri, J.; Derreumaux, S.; Walker, V.; Lairez, O.; Lapeyre, M.; Bruguière, E.; Pathak, A.; Bernier, M.O.; Laurier, D.; et al. Is mean heart dose a relevant surrogate parameter of left ventricle and coronary arteries exposure during breast cancer radiotherapy: A dosimetric evaluation based on individually-determined radiation dose (BACCARAT study). Radiat. Oncol. 2019, 14, 29. [Google Scholar] [CrossRef] [PubMed]
  7. Beaton, L.; Bergman, A.; Nichol, A.; Aparicio, M.; Wong, G.; Gondara, L.; Speers, C.; Weir, L.; Davis, M.; Tyldesley, S. Cardiac death after breast radiotherapy and the QUANTEC cardiac guidelines. Clin. Transl. Radiat. Oncol. 2019, 19, 39–45. [Google Scholar] [CrossRef]
  8. Kirli Bolukbas, M.; Karaca, S.; Coskun, V. Cardiac protective techniques in left breast radiotherapy: Rapid selection criteria for routine clinical decision making. Eur. J. Med. Res. 2023, 28, 504. [Google Scholar] [CrossRef]
  9. McWilliam, A.; Khalifa, J.; Vasquez Osorio, E.; Banfill, K.; Abravan, A.; Faivre-Finn, C.; van Herk, M. Novel Methodology to Investigate the Effect of Radiation Dose to Heart Substructures on Overall Survival. Int. J. Radiat. Oncol. Biol. Phys. 2020, 108, 1073–1081. [Google Scholar] [CrossRef]
  10. Lu, Y.; Yang, D.; Zhang, X.; Teng, Y.; Yuan, W.; Zhang, Y.; He, R.; Tang, F.; Pang, J.; Han, B.; et al. Comparison of deep inspiration breath hold versus free breathing in radiotherapy for left sided breast cancer. Front. Oncol. 2022, 12, 845037. [Google Scholar] [CrossRef]
  11. Falco, M.; Masojć, B.; Macała, A.; Łukowiak, M.; Woźniak, P.; Malicki, J. Deep inspiration breath hold reduces the mean heart dose in left breast cancer radiotherapy. Radiol. Oncol. 2021, 55, 212–220. [Google Scholar] [CrossRef]
  12. Yamauchi, R.; Mizuno, N.; Itazawa, T.; Saitoh, H.; Kawamori, J. Dosimetric evaluation of deep inspiration breath hold for left-sided breast cancer: Analysis of patient-specific parameters related to heart dose reduction. J. Radiat. Res. 2020, 61, 447–456. [Google Scholar] [CrossRef] [PubMed]
  13. Gaál, S.; Kahán, Z.; Paczona, V.; Kószó, R.; Drencsényi, R.; Szabó, J.; Rónai, R.; Antal, T.; Deák, B.; Varga, Z. Deep-inspirational breath-hold (DIBH) technique in left-sided breast cancer: Various aspects of clinical utility. Radiat. Oncol. 2021, 16, 89. [Google Scholar] [CrossRef] [PubMed]
  14. Darapu, A.; Balakrishnan, R.; Sebastian, P.; Kather Hussain, M.R.; Ravindran, P.; John, S. Is the deep inspiration breath-hold technique superior to the free breathing technique in cardiac and lung sparing while treating both left-sided post-mastectomy chest wall and supraclavicular regions. Case Rep. Oncol. 2017, 10, 37–51. [Google Scholar] [CrossRef] [PubMed]
  15. Al-Hammad, W.E.; Kuroda, M.; Kamizaki, R.; Tekiki, N.; Ishizaka, H.; Kuroda, K.; Sugimoto, K.; Oita, M.; Tanabe, Y.; Barham, M.; et al. Mean heart dose prediction using parameters of single-slice computed tomography and body mass index: Machine learning approach for radiotherapy of left-sided breast cancer of Asian patients. Curr. Oncol. 2023, 30, 7412–7424. [Google Scholar] [CrossRef]
  16. Koide, Y.; Aoyama, T.; Shimizu, H.; Kitagawa, T.; Miyauchi, R.; Tachibana, H.; Kodaira, T. Development of deep learning chest X-ray model for cardiac dose prediction in left-sided breast cancer radiotherapy. Sci. Rep. 2022, 12, 13706. [Google Scholar] [CrossRef]
  17. West, D.; Mangiameli, P.; Rampal, R.; West, V. Ensemble strategies for a medical diagnostic decision support system: A breast cancer diagnosis application. Eur. J. Oper. Res. 2005, 162, 532–551. [Google Scholar] [CrossRef]
  18. Freiesleben, T.; Grote, T. Beyond generalization: A theory of robustness in machine learning. Synthese 2023, 202, 109. [Google Scholar] [CrossRef]
  19. Lloyd, E.A. Confirmation and robustness of climate models. Philos. Sci. 2010, 77, 971–984. [Google Scholar] [CrossRef]
  20. Tekiki, N.; Kuroda, M.; Ishizaka, H.; Khasawneh, A.; Barham, M.; Hamada, K.; Konishi, K.; Sugimoto, K.; Katsui, K.; Sugiyama, S.; et al. New field-in-field with two reference points method for whole breast radiotherapy: Dosimetric analysis and radiation-induced skin toxicities assessment. Mol. Clin. Oncol. 2021, 15, 193. [Google Scholar] [CrossRef]
21. Alghamdi, M.; Al-Mallah, M.; Keteyian, S.; Brawner, C.; Ehrman, J.; Sakr, S. Predicting diabetes mellitus using SMOTE and ensemble machine learning approach: The Henry Ford ExercIse Testing (FIT) project. PLoS ONE 2017, 12, e0179805. [Google Scholar] [CrossRef]
  22. Khushi, M.; Shaukat, K.; Alam, T.M.; Hameed, I.A.; Uddin, S.; Luo, S.; Yang, X.; Reyes, M.C. A comparative performance analysis of data resampling methods on imbalance medical data. IEEE Access 2021, 9, 109960–109975. [Google Scholar] [CrossRef]
  23. Kaliappan, J.; Bagepalli, A.R.; Almal, S.; Mishra, R.; Hu, Y.C.; Srinivasan, K. Impact of cross-validation on machine learning models for early detection of intrauterine fetal demise. Diagnostics 2023, 13, 1692. [Google Scholar] [CrossRef] [PubMed]
24. Charilaou, P.; Battat, R. Machine learning models and over-fitting considerations. World J. Gastroenterol. 2022, 28, 605–607. [Google Scholar] [CrossRef]
  25. Chhatwal, J.; Alagoz, O.; Lindstrom, M.J.; Kahn, C.E., Jr.; Shaffer, K.A.; Burnside, E.S. A logistic regression model based on the national mammography database format to aid breast cancer diagnosis. AJR Am. J. Roentgenol. 2009, 192, 1117–1127. [Google Scholar] [CrossRef]
26. Liaw, A.; Wiener, M. Classification and regression by randomForest. R News 2002, 2, 18–22. [Google Scholar]
27. Wickramasinghe, I.; Kalutarage, H. Naive Bayes: Applications, variations, and vulnerabilities: A review of literature with code snippets for implementation. Soft Comput. 2021, 25, 2277–2293. [Google Scholar] [CrossRef]
  28. Rodriguez, D.; Nayak, T.; Chen, Y.; Krishnan, R.; Huang, Y. On the role of deep learning model complexity in adversarial robustness for medical images. BMC Med. Inform. Decis. Mak. 2022, 22, 160. [Google Scholar] [CrossRef]
  29. Koçak, B.; Cuocolo, R.; dos Santos, D.P.; Stanzione, A.; Ugga, L. Must-have qualities of clinical research on artificial intelligence and machine learning. Balkan Med. J. 2023, 40, 3–12. [Google Scholar] [CrossRef]
  30. Campagner, A.; Famiglini, L.; Carobene, A.; Cabitza, F. Everything is varied: The surprising impact of instantial variation on ML reliability. Appl. Soft Comput. 2023, 146, 110644. [Google Scholar] [CrossRef]
  31. Bouthillier, X.; Laurent, C.; Vincent, P. Unreproducible research is reproducible. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Volume 97, pp. 725–734. Available online: https://proceedings.mlr.press/v97/bouthillier19a.html (accessed on 18 June 2024).
  32. Goodfellow, I.; Shlens, J.; Szegedy, C. Explaining and harnessing adversarial examples. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar] [CrossRef]
  33. Zhu, Y.; Wang, M.C. Obtaining optimal cutoff values for tree classifiers using multiple biomarkers. Biometrics 2022, 78, 128–140. [Google Scholar] [CrossRef] [PubMed]
  34. Kamizaki, R.; Kuroda, M.; Al-Hammad, W.E.; Tekiki, N.; Ishizaka, H.; Kuroda, K.; Sugimoto, K.; Oita, M.; Tanabe, Y.; Barham, M.; et al. Evaluation of the accuracy of heart dose prediction by machine learning for selecting patients not requiring deep inspiration breath-hold radiotherapy after breast cancer surgery. Exp. Ther. Med. 2023, 26, 536. [Google Scholar] [CrossRef] [PubMed]
35. Aluja-Banet, T.; Nafria, E. Stability and scalability in decision trees. Comput. Stat. 2003, 18, 505–520. [Google Scholar] [CrossRef]
  36. Natekin, A.; Knoll, A. Gradient boosting machines, a tutorial. Front. Neurorobot. 2013, 7, 21. [Google Scholar] [CrossRef]
  37. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
38. Terui, Y.; Nochioka, K.; Ota, H.; Tada, H.; Sato, H.; Miyata, S.; Toyoda, S.; Shimojima, M.; Izumiya, Y.; Kitai, T.; et al. CHECK HEART-BC. Risk prediction model of cardiotoxicity in breast cancer patients; the multicenter prospective CHECK HEART-BC (comprehensive heart imaging to evaluate cardiac damage linked with chemotherapy in breast). Eur. Heart J. 2023, 44 (Suppl. S2), ehad655.2680. [Google Scholar] [CrossRef]
  39. Ferini, G.; Valenti, V.; Viola, A.; Umana, G.E.; Martorana, E. A critical overview of predictors of heart sparing by deep-inspiration-breath-hold irradiation in left-sided breast cancer patients. Cancers 2022, 14, 3477. [Google Scholar] [CrossRef]
Figure 1. Overview of the criteria for building models. T: training dataset; t: test dataset; s: synthetic minority over-sampling set; ML: machine learning; CV: cross-validation; RSKCV: repeated stratified K-fold cross-validation; GridSearchCV: grid-search cross-validation.
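The workflow in Figure 1 balances the training data with a synthetic minority over-sampling step before cross-validated model building. As a rough illustration of the idea behind SMOTE, the toy function below interpolates synthetic points between a minority-class sample and one of its nearest minority neighbours; this is a minimal stdlib sketch, not the study's implementation, and in practice a library such as imbalanced-learn's `SMOTE` would be used inside the cross-validation loop so that synthetic samples are generated from training folds only.

```python
# Toy SMOTE-style oversampling: each synthetic point is a random linear
# interpolation between a minority sample and one of its k nearest minority
# neighbours. Illustrative sketch only; real pipelines use a library SMOTE.
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points from a list of minority-class vectors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        # k nearest minority neighbours of a (by squared Euclidean distance)
        neighbours = sorted((p for p in minority if p is not a),
                            key=lambda p: sum((x - y) ** 2 for x, y in zip(a, p)))[:k]
        b = rng.choice(neighbours)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic

minority = [(1.0, 2.0), (1.2, 1.8), (0.9, 2.2), (1.1, 2.1)]
new_pts = smote_like(minority, n_new=2)
print(len(new_pts))  # 2 synthetic samples
```

Because each synthetic point lies on the segment between two real minority samples, it always falls inside the minority class's bounding region, which is the property that makes the oversampled training set plausible.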
Figure 2. The discrepancies between the predicted and real incidences of deep inspiration breath-hold across different radiation doses. DIBH: deep inspiration breath-hold.
Table 1. Patient characteristics.
| Characteristic | Value |
|---|---|
| Age (mean ± SD, years) | 55.3 ± 11.1 |
| Body mass index (mean ± SD) | 22.9 ± 3.9 |
| Tumor location (%) | |
| · Upper-inner quadrant | 27.1% |
| · Lower-inner quadrant | 9.2% |
| · Upper-outer quadrant | 51.7% |
| · Lower-outer quadrant | 4.8% |
| · Central portion | 7.2% |
| Radiation method (%) | |
| · FIF-1RP | 33.8% |
| · FIF-2RP | 66.2% |
| Breast separation (mean ± SD, cm) | 18.8 ± 2.6 |
| Chest wall thickness (mean ± SD, cm) | 6.0 ± 1.2 |
| Mean heart dose (mean ± SD, cGy) | 251 ± 81 |
| · High | 106 *1, 74 *2, 43 *3 |
| · Low | 101 *1, 133 *2, 164 *3 |
SD: standard deviation. Body mass index was calculated as weight (kg)/height² (m²). Tumor location was determined according to the International Classification of Diseases for Oncology, Third Edition (ICD-O-3). The radiation method refers to the proportion of patients treated using the field-in-field technique with one reference point (FIF-1RP) or two reference points (FIF-2RP). Breast separation (cm) was measured as the distance along the posterior edge of the tangent fields at the nipple level, and chest wall thickness (cm) as the distance from the skin surface to the lung, also at the nipple level. A high mean heart dose was defined as a value equal to or greater than the designated cut-off; a low mean heart dose was any value below this threshold. *1: number of patients at the 240 cGy cut-off value. *2: number of patients at the 270 cGy cut-off value. *3: number of patients at the 300 cGy cut-off value.
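The high/low labelling rule stated in the footnote (high = mean heart dose at or above the cut-off) amounts to a one-line classifier; the dose values in the sketch below are illustrative, not patient data.

```python
# Binary labelling of mean heart dose (MHD) against a cut-off, per the Table 1
# footnote: "High" if MHD >= cut-off, otherwise "Low". Example doses are made up.
def classify_mhd(mhd_cgy, cutoff_cgy):
    return "High" if mhd_cgy >= cutoff_cgy else "Low"

for cutoff in (240, 270, 300):
    print(cutoff, [classify_mhd(d, cutoff) for d in (180, 251, 310)])
```

Note that a borderline patient (e.g. 251 cGy) flips from High to Low as the cut-off rises from 240 to 270 cGy, which is why the High/Low counts in Table 1 differ across the three cut-off values.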
Table 2. F2 scores and predictive performance of models.
| Cut-off value | # of variables | Folds | GB | DT | Bagging | DNN | RF | KNN | SVM | NB | LR | RC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 240 cGy | 3 variables | 3-fold | 0.846 | 0.701 | 0.707 | 0.601 | 0.607 | 0.528 | 0.560 | 0.528 | 0.528 | 0.485 |
| | | 4-fold | 0.846 | 0.701 | 0.707 | 0.560 | 0.663 | 0.528 | 0.560 | 0.528 | 0.528 | 0.485 |
| | | 5-fold | 0.392 | 0.701 | 0.714 | 0.636 | 0.607 | 0.607 | 0.560 | 0.528 | 0.485 | 0.485 |
| | 6 variables | 3-fold | 0.846 | 0.739 | 0.582 | 0.652 | 0.566 | 0.571 | 0.681 | 0.550 | 0.544 | 0.544 |
| | | 4-fold | 0.799 | 0.701 | 0.660 | 0.619 | 0.571 | 0.630 | 0.544 | 0.594 | 0.588 | 0.544 |
| | | 5-fold | 0.846 | 0.701 | 0.540 | 0.625 | 0.648 | 0.630 | 0.544 | 0.594 | 0.544 | 0.544 |
| | Median | | 0.846 | 0.701 | 0.683 | 0.622 | 0.607 | 0.589 | 0.560 | 0.539 | 0.536 | 0.514 |
| | Q1 | | 0.811 | 0.701 | 0.602 | 0.606 | 0.580 | 0.539 | 0.548 | 0.528 | 0.528 | 0.485 |
| | Q3 | | 0.846 | 0.701 | 0.707 | 0.633 | 0.638 | 0.624 | 0.560 | 0.583 | 0.544 | 0.544 |
| | IQR | | 0.035 | 0.000 | 0.106 | 0.028 | 0.058 | 0.085 | 0.012 | 0.055 | 0.016 | 0.059 |
| | Maximum | | 0.846 | 0.739 | 0.714 | 0.652 | 0.663 | 0.630 | 0.681 | 0.594 | 0.588 | 0.544 |
| | Minimum | | 0.392 | 0.701 | 0.540 | 0.560 | 0.566 | 0.528 | 0.544 | 0.528 | 0.485 | 0.485 |
| | Model instability | | 0.454 * | 0.038 | 0.174 * | 0.092 | 0.097 | 0.102 * | 0.137 * | 0.066 | 0.103 * | 0.059 |
| 270 cGy | 3 variables | 3-fold | 0.735 | 0.786 | 0.687 | 0.625 | 0.731 | 0.679 | 0.555 | 0.679 | 0.679 | 0.679 |
| | | 4-fold | 0.735 | 0.786 | 0.687 | 0.632 | 0.823 | 0.679 | 0.722 | 0.679 | 0.731 | 0.679 |
| | | 5-fold | 0.823 | 0.786 | 0.679 | 0.625 | 0.687 | 0.670 | 0.740 | 0.679 | 0.679 | 0.679 |
| | 6 variables | 3-fold | 0.735 | 0.804 | 0.625 | 0.523 | 0.687 | 0.705 | 0.714 | 0.639 | 0.609 | 0.555 |
| | | 4-fold | 0.000 | 0.804 | 0.740 | 0.641 | 0.687 | 0.654 | 0.632 | 0.639 | 0.609 | 0.555 |
| | | 5-fold | 0.804 | 0.804 | 0.687 | 0.555 | 0.687 | 0.639 | 0.647 | 0.639 | 0.555 | 0.555 |
| | Median | | 0.735 | 0.795 | 0.687 | 0.625 | 0.687 | 0.674 | 0.680 | 0.659 | 0.644 | 0.617 |
| | Q1 | | 0.735 | 0.786 | 0.681 | 0.573 | 0.687 | 0.658 | 0.636 | 0.639 | 0.609 | 0.555 |
| | Q3 | | 0.787 | 0.804 | 0.687 | 0.630 | 0.720 | 0.679 | 0.720 | 0.679 | 0.679 | 0.679 |
| | IQR | | 0.052 | 0.018 | 0.006 | 0.058 | 0.033 | 0.021 | 0.084 | 0.040 | 0.070 | 0.124 |
| | Maximum | | 0.823 | 0.804 | 0.740 | 0.641 | 0.823 | 0.705 | 0.740 | 0.679 | 0.731 | 0.679 |
| | Minimum | | 0.000 | 0.786 | 0.625 | 0.523 | 0.687 | 0.639 | 0.555 | 0.639 | 0.555 | 0.555 |
| | Model instability | | 0.823 * | 0.018 | 0.115 | 0.118 | 0.136 * | 0.066 | 0.185 * | 0.040 | 0.176 * | 0.124 * |
| 300 cGy | 3 variables | 3-fold | 0.603 | 0.725 | 0.789 | 0.689 | 0.737 | 0.762 | 0.762 | 0.714 | 0.714 | 0.714 |
| | | 4-fold | 0.762 | 0.725 | 0.789 | 0.789 | 0.775 | 0.775 | 0.762 | 0.714 | 0.714 | 0.714 |
| | | 5-fold | 0.576 | 0.306 | 0.510 | 0.689 | 0.775 | 0.775 | 0.750 | 0.714 | 0.701 | 0.714 |
| | 6 variables | 3-fold | 0.727 | 0.666 | 0.714 | 0.526 | 0.775 | 0.737 | 0.409 | 0.535 | 0.526 | 0.526 |
| | | 4-fold | 0.520 | 0.689 | 0.803 | 0.454 | 0.686 | 0.737 | 0.545 | 0.535 | 0.526 | 0.526 |
| | | 5-fold | 0.510 | 0.666 | 0.803 | 0.614 | 0.737 | 0.517 | 0.526 | 0.614 | 0.526 | 0.526 |
| | Median | | 0.590 | 0.678 | 0.789 | 0.652 | 0.756 | 0.750 | 0.648 | 0.664 | 0.614 | 0.620 |
| | Q1 | | 0.534 | 0.666 | 0.733 | 0.548 | 0.737 | 0.737 | 0.531 | 0.555 | 0.526 | 0.526 |
| | Q3 | | 0.696 | 0.716 | 0.800 | 0.689 | 0.775 | 0.772 | 0.759 | 0.714 | 0.711 | 0.714 |
| | IQR | | 0.162 | 0.050 | 0.067 | 0.141 | 0.038 | 0.035 | 0.228 | 0.159 | 0.185 | 0.188 |
| | Maximum | | 0.762 | 0.725 | 0.803 | 0.789 | 0.775 | 0.775 | 0.762 | 0.714 | 0.714 | 0.714 |
| | Minimum | | 0.510 | 0.306 | 0.510 | 0.454 | 0.686 | 0.517 | 0.409 | 0.535 | 0.526 | 0.526 |
| | Model instability | | 0.252 | 0.419 * | 0.293 * | 0.335 * | 0.089 | 0.258 * | 0.353 * | 0.179 | 0.188 | 0.188 |
# of variables: number of variables. GB: gradient boosting; DT: decision tree; DNN: deep neural network; RF: random forest; KNN: K-nearest neighbor; SVM: support vector machine; NB: naïve Bayes; LR: logistic regression; RC: ridge classifier. Q1: first quartile, represents the 25th percentile; Q3: third quartile, represents the 75th percentile; IQR: interquartile range, calculated as Q3–Q1. Model instability is the variation in model F2 scores under different conditions, calculated as the difference between the maximum and minimum values for each model. * indicates a “significantly high” instability value, surpassing the median instability values of the models for each respective cut-off. Blue values are the highest median F2 scores at each cut-off value. Red values are the highest instability values associated with the highest median F2 scores.
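The per-model summary statistics in Table 2 (median, Q1/Q3, IQR, and model instability = maximum − minimum of the six F2 scores) can be reproduced with the Python standard library. The sketch below uses the gradient boosting column at the 240 cGy cut-off and assumes quartiles are computed with linear interpolation (the `"inclusive"` method), which matches the published Q1 of 0.811.

```python
# Reduce one model's six F2 scores (2 variable sets x 3 fold settings) to the
# summary statistics reported in Table 2. Values: GB column, 240 cGy cut-off.
from statistics import median, quantiles

gb_240 = [0.846, 0.846, 0.392, 0.846, 0.799, 0.846]

q1, _, q3 = quantiles(gb_240, n=4, method="inclusive")  # linear interpolation
iqr = q3 - q1
instability = max(gb_240) - min(gb_240)  # "model instability" in Table 2

print(round(median(gb_240), 3))    # 0.846
print(round(q1, 3), round(q3, 3))  # 0.811 0.846
print(round(iqr, 3))               # 0.035
print(round(instability, 3))       # 0.454
```

The GB column illustrates why instability matters alongside the median: a single outlying fold setting (F2 = 0.392) leaves the median at 0.846 but drives the instability to 0.454, the value flagged with * in the table.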
Table 3. Pairwise permutation tests between models with various cut-off values.
Cut-off value = 240 cGy

| Model | GB | DT | Bagging | DNN | RF | KNN | SVM | NB | LR | RC |
|---|---|---|---|---|---|---|---|---|---|---|
| GB | N/A | | | | | | | | | |
| DT | 0.61 | N/A | | | | | | | | |
| Bagging | 0.206 | 0.121 | N/A | | | | | | | |
| DNN | 0.102 | 0.002 * | 0.292 | N/A | | | | | | |
| RF | 0.08 | 0.002 * | 0.26 | 0.807 | N/A | | | | | |
| KNN | 0.035 * | 0.002 * | 0.087 | 0.186 | 0.294 | N/A | | | | |
| SVM | 0.052 | 0.002 * | 0.061 | 0.132 | 0.225 | 0.82 | N/A | | | |
| NB | 0.015 * | 0.002 * | 0.032 * | 0.013 * | 0.026 * | 0.253 | 0.539 | N/A | | |
| LR | 0.015 * | 0.002 * | 0.015 * | 0.004 * | 0.006 * | 0.082 | 0.143 | 0.409 | N/A | |
| RC | 0.015 * | 0.002 * | 0.009 * | 0.002 * | 0.002 * | 0.024 * | 0.022 * | 0.069 | 0.364 | N/A |

Cut-off value = 270 cGy

| Model | GB | DT | Bagging | DNN | RF | KNN | SVM | NB | LR | RC |
|---|---|---|---|---|---|---|---|---|---|---|
| GB | N/A | | | | | | | | | |
| DT | 0.067 | N/A | | | | | | | | |
| Bagging | 0.994 | 0.002 * | N/A | | | | | | | |
| DNN | 0.905 | 0.002 * | 0.011 * | N/A | | | | | | |
| RF | 0.944 | 0.015 * | 0.349 | 0.002 * | N/A | | | | | |
| KNN | 1 | 0.002 * | 0.496 | 0.004 * | 0.058 | N/A | | | | |
| SVM | 0.994 | 0.002 * | 0.66 | 0.058 | 0.217 | 0.937 | N/A | | | |
| NB | 1 | 0.002 * | 0.197 | 0.009 * | 0.002 * | 0.372 | 0.76 | N/A | | |
| LR | 1 | 0.002 * | 0.195 | 0.223 | 0.048 * | 0.351 | 0.55 | 0.63 | N/A | |
| RC | 0.955 | 0.002 * | 0.054 | 0.649 | 0.002 * | 0.139 | 0.225 | 0.182 | 0.589 | N/A |

Cut-off value = 300 cGy

| Model | GB | DT | Bagging | DNN | RF | KNN | SVM | NB | LR | RC |
|---|---|---|---|---|---|---|---|---|---|---|
| GB | N/A | | | | | | | | | |
| DT | 0.887 | N/A | | | | | | | | |
| Bagging | 0.097 | 0.251 | N/A | | | | | | | |
| DNN | 0.883 | 0.985 | 0.132 | N/A | | | | | | |
| RF | 0.024 * | 0.013 * | 0.924 | 0.048 * | N/A | | | | | |
| KNN | 0.132 | 0.364 | 0.619 | 0.182 | 0.727 | N/A | | | | |
| SVM | 0.887 | 0.974 | 0.175 | 1 | 0.128 | 0.336 | N/A | | | |
| NB | 0.714 | 0.924 | 0.147 | 0.82 | 0.022 * | 0.149 | 0.903 | N/A | | |
| LR | 0.981 | 0.972 | 0.093 | 0.935 | 0.013 * | 0.069 | 0.97 | 0.589 | N/A | |
| RC | 0.948 | 0.981 | 0.095 | 0.952 | 0.022 * | 0.069 | 0.987 | 0.611 | | N/A |
GB: gradient boosting; DT: decision tree; DNN: deep neural network; RF: random forest; KNN: K-nearest neighbor; SVM: support vector machine; NB: naïve Bayes; LR: logistic regression; RC: ridge classifier. N/A: not applicable. The numbers indicate the p-values of pairwise permutation tests. * indicates p < 0.05 in permutation tests between each pairwise model.
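Table 3's pairwise permutation tests compare the F2 scores of two models obtained under the same six conditions. The exact resampling scheme is not restated here, so the sketch below assumes a paired sign-flip permutation test with an illustrative number of resamples and an add-one p-value correction; the two score vectors are the GB and RC columns at 240 cGy from Table 2.

```python
# Paired permutation test sketch: under the null hypothesis the two models are
# exchangeable, so scores within each matched condition pair can be swapped at
# random. The resample count (999) and sign-flip scheme are assumptions, not
# the study's documented procedure.
import random

def perm_test_paired(a, b, n_resamples=999, seed=0):
    rng = random.Random(seed)
    observed = sum(x - y for x, y in zip(a, b)) / len(a)
    hits = 0
    for _ in range(n_resamples):
        diff = 0.0
        for x, y in zip(a, b):
            if rng.random() < 0.5:  # randomly swap the pair
                x, y = y, x
            diff += x - y
        if abs(diff / len(a)) >= abs(observed):
            hits += 1
    return (hits + 1) / (n_resamples + 1)  # add-one corrected two-sided p

gb = [0.846, 0.846, 0.392, 0.846, 0.799, 0.846]  # GB F2 scores, 240 cGy
rc = [0.485, 0.485, 0.485, 0.544, 0.544, 0.544]  # RC F2 scores, 240 cGy
p = perm_test_paired(gb, rc)
print(p)
```

With only six paired conditions the permutation distribution is coarse (at most 2^6 = 64 distinct sign patterns), so small p-values such as the 0.002 entries in Table 3 imply a finer resampling scheme than pure sign-flipping; the sketch conveys the logic rather than the exact procedure.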

Al-Hammad, W.E.; Kuroda, M.; Al Jamal, G.; Fujikura, M.; Kamizaki, R.; Kuroda, K.; Yoshida, S.; Nakamura, Y.; Oita, M.; Tanabe, Y.; et al. Robustness of Machine Learning Predictions for Determining Whether Deep Inspiration Breath-Hold Is Required in Breast Cancer Radiation Therapy. Diagnostics 2025, 15, 668. https://doi.org/10.3390/diagnostics15060668