Radiation-Induced Hypothyroidism in Patients with Oropharyngeal Cancer Treated with IMRT: Independent and External Validation of Five Normal Tissue Complication Probability Models

Simple Summary: Hypothyroidism is a common complication of therapeutic irradiation in the neck area. Several dose-response models have been proposed to predict its occurrence based on clinical and radiomic features. We aimed to externally validate the results of five such models in a prospectively recruited cohort of 108 patients with oropharyngeal cancer. Two of the evaluated models, published by Rønjom et al. and by Boomsma et al., had satisfactory performance. Both models are based on mean thyroid dose and thyroid volume. The three remaining models, by Cella et al., Bakhshandeh et al. and Vogelius et al., performed significantly worse. Short-term change in the level of thyroid-stimulating hormone (TSH) after radiation therapy was not indicative of hypothyroidism development in the long term. We conclude that the models by Rønjom et al. and by Boomsma et al. are feasible for long-term prediction of hypothyroidism in oropharyngeal cancer survivors treated with intensity-modulated radiation therapy.

Abstract: We aimed to externally validate five normal tissue complication probability (NTCP) models for radiation-induced hypothyroidism (RIHT) in a prospectively recruited cohort of 108 patients with oropharyngeal cancer (OPC). NTCP scores were calculated using the original published formulas. Plasma thyrotropin (TSH) level was additionally assessed shortly after RT. After a median of 28 months of follow-up, thirty-one (28.7%) patients developed RIHT. Thyroid mean dose and thyroid volume were significant predictors of RIHT: the odds ratio was 1.11 (95% CI 1.03-1.19) for mean thyroid dose and 0.87 (95% CI 0.81-0.93) for thyroid volume in univariate analyses. Two of the evaluated NTCP models, published by Rønjom et al. and by Boomsma et al., had satisfactory performance, with accuracies of 0.87 (95% CI 0.79-0.93) and 0.84 (95% CI 0.76-0.91), respectively. The three remaining models, by Cella et al., Bakhshandeh et al. and Vogelius et al., performed significantly worse, overestimating the risk of RIHT in this patient cohort. A short-term TSH level change relative to baseline was not indicative of RIHT development in the follow-up (OR 0.96, 95% CI 0.65-1.42, p = 0.825). In conclusion, the models by Rønjom et al. and by Boomsma et al. demonstrated external validity and feasibility for long-term prediction of RIHT in survivors of OPC treated with Intensity-Modulated Radiation Therapy (IMRT).


Introduction
Radiation-induced hypothyroidism (RIHT) commonly develops in cancer survivors who receive radiation therapy (RT) for head and neck cancer (HNC). The median interval between RT and hypothyroidism is approximately 1.5 years; however, later toxicities are also observed [1]. Over a long follow-up (>10 years), more than 50% of patients experience RIHT [2][3][4].
Despite its clinical benefits, the use of Intensity-Modulated Radiation Therapy (IMRT) was one of the factors reported to increase the incidence of RIHT [5,6] in contrast to earlier planning techniques, i.e., 3-dimensional conformal radiotherapy. When the thyroid is not contoured properly, not delineated at all or not taken into account during the optimization process, the steep dose gradients between the tumor and surrounding normal tissues may result in overdosing in the areas where the gland is located [7]. This is of particular importance when cervical lymph node regions, lying in close proximity to the thyroid, are irradiated.
As the symptoms of hypothyroidism may be non-specific, appropriate monitoring of thyroid hormone levels in patients who have undergone RT in the head and neck region is of great importance. Inadequate levels of thyroid hormones negatively impact the patients' quality of life [8], their morbidity and mortality [5,9-11]; even subclinical hypothyroidism (elevated TSH with normal T3 and T4) contributes significantly to increased cardiovascular mortality [12]. The symptoms of hypothyroidism and the risks it incurs may be reversed with thyroid hormone replacement, which mandates routine follow-up of thyroid function and consideration of the thyroid gland as an organ at risk (OAR) during RT treatment planning. Still, for these efforts to yield optimal results, we need accurate tools for predicting radiation-induced hypothyroidism.
A recent systematic review of normal tissue complication probability (NTCP) models relevant to RT of the HNC [13] critically evaluated five dose-response models for RIHT [14][15][16][17][18]. To overcome the difficulty in comparing dose-response models from different studies, a relevance score was introduced to evaluate the relevance of NTCP estimations for the given patient population. The authors concluded that the most relevant NTCP model for RIHT is the one by Rønjom et al. [17] published in 2015 and using only thyroid mean dose and volume as predictors.
We aimed to externally validate the five available NTCP models for RIHT in an independent cohort of patients with oropharyngeal cancer (OPC) to verify which, if any, should be used to inform clinical decision making. We also tested which of the variables included in the original NTCP models were predictive of RIHT in the validation cohort and verified whether recalibration would improve model performance. Furthermore, we addressed the shortcomings of the previous studies by prospectively collecting data from three radiation oncology centers in Poland and by including an additional timepoint (shortly after RT) in the study design, to test the hypothesis that monitoring the TSH level shortly after RT may be useful to predict RIHT.

Patient Inclusion and Outcome
In total, 195 patients met the inclusion criteria and were recruited to the study. Among these, 83 were excluded from the final analysis due to a lack of follow-up thyroid function assessment before recurrence/death or loss of contact, 3 due to baseline plasma TSH > 4 mIU/L and 1 due to known thyroid disease (Figure S1). Summary characteristics of the 108 patients included in the analysis (39 from center A, 13 from center B and 56 from center C) are shown in Table 1. In the whole group, the median follow-up was 28 months (IQR 21-38 months). Thirty-one patients (28.7%) developed RIHT that required thyroid replacement therapy, meeting our primary endpoint; the median time to RIHT development was 16 months (IQR 14-22 months). Neither RIHT frequency nor median time to RIHT development differed significantly between the three centers (p = 0.869 and p = 0.299, respectively).

External Validation of NTCP Models for RIHT
A comparison of the relevant variables (demographic, predictors and outcome) for the whole cohort and for the development cohort from each study is presented in Table S1.
A summary of models' performance in the whole validation cohort and in each center is presented in Table 2. The best performing models were those developed by Rønjom et al., characterized by a prediction accuracy of 0.87 (95%CI: 0.79-0.93), and by Boomsma et al., characterized by a prediction accuracy of 0.84 (95%CI: 0.76-0.91). All other models were characterized by high sensitivity (at least 90%) and low specificity. This was most evident for the model by Cella et al., which predicted RIHT in 106 (98.1%) patients and had 100% sensitivity and 3% specificity.
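The figures quoted for the model by Cella et al. can be cross-checked with simple confusion-matrix arithmetic. The sketch below reconstructs the counts only from the numbers stated in the text (108 patients, 31 RIHT cases, 106 patients predicted to develop RIHT, 100% sensitivity); it is an illustrative check, not the authors' analysis code, and the variable names are ours.

```python
# Counts reconstructed from figures quoted in the text (not read from Table 2).
n_total = 108          # patients in the validation cohort
n_riht = 31            # patients who developed RIHT
n_pred_positive = 106  # patients flagged by the Cella et al. model (NTCP >= 0.5)
sensitivity = 1.0      # reported: every RIHT case was flagged

tp = round(sensitivity * n_riht)   # true positives
fn = n_riht - tp                   # false negatives
fp = n_pred_positive - tp          # false positives
tn = (n_total - n_riht) - fp       # true negatives

specificity = tn / (tn + fp)
accuracy = (tp + tn) / n_total

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"specificity={specificity:.3f} accuracy={accuracy:.3f}")
```

Under these assumptions the specificity comes out at 2/77, about 3%, and the accuracy at 33/108, about 31%, which agrees with the reported specificity.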
The model by Rønjom et al. had better specificity for RIHT prediction (90% vs. 81%), and the model by Boomsma et al. had better sensitivity (94% vs. 81%). While NTCP scores for these two models were also highly correlated, as expected given that both models included the same parameters, the model by Rønjom et al. predicted slightly lower RIHT development probabilities in patients with low risk of developing RIHT (with NTCP values from both models <0.3) and slightly higher probabilities in the high-risk patients (with NTCP values >0.8; Figure 1).
The two best models' performance was similar and very good (areas under the curve (AUCs) 0.91-0.94) in patient cohorts from centers A and C and moderate in the smallest patient cohort (n = 13) from center B (AUC = 0.69 for both models; Figure 2A,B). Additionally, we visualized these models' calibration performance using calibration plots (Figure 2C,D). Calibration plots for the three models with worse overall performance are presented in Figure S2. It can be noted that although the models by Rønjom et al. and by Boomsma et al. performed well in patients at high risk of developing RIHT, the predictions were underestimating this risk in patients with lower probabilities of follow-up RIHT. After dividing the patient cohort into the training and test set, we recalibrated the models' outputs using Platt scaling [19], which, however, failed to improve model performance (Figure S2). Parameters of the logistic function describing recalibrated models are presented in Table S2.
We also analyzed separately the data from fourteen patients (13%) that had baseline TSH levels <0.3 mIU/L. Noteworthy, although the baseline TSH level is not included in any of the validated

Variables Associated with RIHT in the Validation Cohort
In univariate analyses, higher mean thyroid dose and lower thyroid volume were significant risk factors for RIHT development (p = 0.004 and p < 0.001, respectively); in the multivariable analysis, both these variables remained significant (p = 0.011 and p < 0.001, Table 3). No other factor, including baseline TSH or short-term change in the TSH level, was a significant predictor of RIHT.
Abbreviations: RT, radiation therapy; TSH, thyroid-stimulating hormone; HPV, human papillomavirus; OR, odds ratio; CI, confidence interval.

Short-Term TSH Level Changes and RIHT
Finally, we aimed to assess whether short-term changes in the TSH or fT4 level are indicative of RIHT development in the long term. Although TSH levels decreased significantly in the short term after RT completion (p < 0.001), the short-term change in TSH was not different between patients with and without RIHT in follow-up (p = 0.844; Figure 3). This change was also not predictive of RIHT development in the multivariate model including dosimetric and clinical parameters (Table 3). The same was true for the short-term change in fT4 levels (p = 0.280 for comparison of patients with and without RIHT in the follow-up).


Discussion
In this prospective, multicenter study, we evaluated factors associated with RIHT in an independent cohort of 108 OPC patients treated with IMRT in three radiation oncology centers in Poland and externally validated five published NTCP models for this complication.
Since hypothyroidism negatively impacts the patients' quality of life [8] and their morbidity and mortality [5,9-11], the establishment of clinically feasible models to predict RIHT is important to undertake preventive strategies in patients at high risk of this complication.
Previous reports suggested a dose-response relationship allowing a robust prediction of RIHT [13]. However, there was no quantitative analyses of normal tissue effects in the clinic (QUANTEC) report focusing on thyroid complications, as highlighted by the authors of a recent paper that attempted to revisit the dose constraints for head and neck OARs [20]. That is why we decided to focus on RIHT prediction in a contemporary cohort of OPC patients.
The best performing models in terms of accuracy and discriminative ability were those published by Rønjom et al. [17] and by Boomsma et al. [15]. Both are logistic regression-based with the thyroid mean dose and thyroid volume used as the only predictors; both were also developed in cohorts numbering >100 patients with HNC. Importantly, both models were characterized by a satisfactory performance in our dataset, with an accuracy of 87% and 84% for the Rønjom et al. and Boomsma et al. models respectively. Although the calibration was not perfect, especially in patients with lower risks of RIHT, recalibration using Platt scaling failed to improve the model performance, possibly due to the limited number of patients.
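The shared structure of these two models, a logistic function of mean thyroid dose and thyroid volume, can be sketched as follows. The coefficients below are hypothetical placeholders chosen only to show the expected direction of each effect (risk rising with dose and falling with volume); they are not the published values from Rønjom et al. or Boomsma et al., which should be taken from the original papers [15,17].

```python
import math

def ntcp_logistic(mean_dose_gy, volume_cm3, b0, b1, b2):
    """Two-predictor logistic NTCP: 1 / (1 + exp(-S)), S = b0 + b1*dose + b2*volume."""
    s = b0 + b1 * mean_dose_gy + b2 * volume_cm3
    return 1.0 / (1.0 + math.exp(-s))

# HYPOTHETICAL coefficients for illustration only; NOT the published values.
B0, B1, B2 = 0.0, 0.1, -0.2

small_gland = ntcp_logistic(50.0, 10.0, B0, B1, B2)  # S = 3
large_gland = ntcp_logistic(50.0, 25.0, B0, B1, B2)  # S = 0
lower_dose = ntcp_logistic(30.0, 10.0, B0, B1, B2)   # S = 1

for label, p in [("small gland", small_gland),
                 ("large gland", large_gland),
                 ("lower dose", lower_dose)]:
    # A patient is classified as predicted to develop RIHT when NTCP >= 0.5.
    print(f"{label}: NTCP={p:.3f}, predicted RIHT={p >= 0.5}")
```

With a positive dose coefficient and a negative volume coefficient, a smaller gland at the same mean dose yields a higher NTCP, in line with the direction of the odds ratios reported in this cohort.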

The three other evaluated models, by Cella et al. [16], Bakhshandeh et al. [14] and Vogelius et al. [18], performed significantly worse (accuracies below 50%). This may be explained in part by the limited size of the patient cohorts in which they were developed (n = 65 for Bakhshandeh et al. and n = 53 for Cella et al.) and by the differences between the original cohorts and the present one. While the NTCP models by Boomsma et al. and Rønjom et al. were developed in homogeneous populations of HNC patients treated with chemo-RT, the study cohort for Cella et al. consisted of patients with Hodgkin's lymphoma and the study by Vogelius et al. was a meta-analysis of four studies, two of which included patients with Hodgkin's lymphoma. Given the differences in treatment for HNC and Hodgkin's lymphoma, in particular the lower total target dose used in Hodgkin's lymphoma, a dose-response model developed using data for one of these conditions is unlikely to perform well for the other. This disparity might also explain why the model by Cella et al., which includes thyroid V30 as a predictor in the logistic regression formula, predicted RIHT in 98.1% of the patients in our cohort, where the median thyroid V30 was 100% (IQR 99.8-100%).
A recent retrospective study of RIHT after IMRT in 360 OPC patients [21] reached similar conclusions to our study, i.e., that higher thyroid mean dose and smaller thyroid volume are both significantly associated with risk of RIHT in the multivariate analysis. The authors also recalibrated and evaluated two of the five already mentioned NTCP models, by Boomsma  each yielding AUC ROC of 0.72 and 0.66, respectively. They proposed that multicenter, prospective studies are needed to validate the available NTCP models and that monitoring of the TSH level shortly after RT, during the acute/subacute phase, may better account for the possibility of spontaneous recovery of subclinical RIHT that could impact the accuracy of the thyroid status assessment. Our results indicate, however, that monitoring the TSH level in the short-term after RT does not allow for accurate prediction of RIHT.
Notably, the NTCP models published by Rønjom et al. [17] and by Boomsma et al. [15] also received the highest relevance scores in a systematic review of NTCP models for HNC RT [13]. This result highlights the pertinence of evaluating dose-response models with regard to the relevance of the patient material, study design, radiation therapy reporting and modeling approach. A comprehensive strategy summarizing possible solutions to the above-mentioned issues was recently published and should be adhered to in future studies on the matter [22].
One limitation of our study is the small number of patients in center B (n = 13), which prevents an appropriate comparison of the models' performance between all participating centers. The definition of the endpoint (clinical hypothyroidism) differs from the one used in most studies that reported the validated NTCP models (clinical or subclinical hypothyroidism, based on elevated TSH). We chose an endpoint based on the CTCAE classification because it is a robust and validated tool used both in state-of-the-art clinical trials and in everyday practice. Another difference from the original studies is the inclusion of patients with hyperthyroidism (baseline TSH < 0.3 mIU/L; n = 14 patients). We chose to include these patients to allow the evaluation of the models' performance in a pragmatic, real-life clinical setting. Notably, the NTCP models by Boomsma et al. and Rønjom et al. also performed well in this group, correctly predicting RIHT for all (Rønjom et al. [17]) or almost all (n = 13) patients (Boomsma et al. [15]).
Considering the potential long-term sequelae of hypothyroidism and the relatively high incidence of RIHT, it is important to include the thyroid gland as an organ at risk in routine RT treatment planning. Recommendations for dose constraints for the thyroid gland, specified with respect to the thyroid volume, have been proposed by Rønjom et al. [23] and may be incorporated in clinical practice to limit the incidence of RIHT. Our study also indicates that multivariable predictive models based on patients with hematological malignancies in the head and neck region should probably not be utilized to predict normal tissue complications in OPC patients.
The incidence of HPV-positive OPC is rapidly increasing [24] and treatment deintensification, in an effort to reduce toxicity while preserving high survival rates, is currently being explored in this group [25]. Establishing a robust NTCP model for RIHT could possibly allow treatment plan optimization [26] or selection of patients for emerging treatment techniques, such as proton therapy [27]. Given that in some clinical situations (e.g., patients with small thyroids) gland sparing is often impossible, the biggest benefits of improved prediction may be in patient counseling and tailored surveillance strategies. According to National Comprehensive Cancer Network (NCCN) guidelines, TSH levels should be checked every 6-12 months after irradiation [28]. Reliable models could allow one to stratify the patients and assess thyroid function more often in the high-risk group, especially within the first years of follow-up, when the incidence of RIHT is the highest [29]. In this context, our study outlines the feasibility of predicting RIHT using published dose-response models and their potential utility in planning patient follow-up and selecting patients most likely to benefit from preventive strategies.

Patients
The study cohort involved 108 patients prospectively recruited at three oncology centers in Poland: Copernicus Regional Specialist Hospital in Łódź (primary center, denominated A), Radom Oncology Centre and Maria Sklodowska-Curie National Research Institute of Oncology, Gliwice Branch (called satellite centers B and C, respectively) between 1 May 2016 and 31 December 2018 and followed up until February 2020; details regarding patient inclusion are presented in Figure S1. Inclusion criteria were: histologically diagnosed OPC, planned radical treatment, age >18 years and signed informed consent for the study. Exclusion criteria were: metastatic disease, known thyroid disease at baseline, elevated baseline serum TSH, history of thyroidectomy, history of radioiodine therapy, history of RT to the head and neck region and advanced chronic disease (heart failure: NYHA class III/IV; renal failure: eGFR <30 mL/min/1.73 m²; liver failure: Child-Pugh class C or D).
Staging was performed according to the American Joint Committee on Cancer (AJCC) 7th edition staging system [30]. All patients enrolled in the study were qualified for radiotherapy according to the centre protocol, which is based on the standard treatment according to the National Comprehensive Cancer Network (NCCN). Target volumes and OARs were contoured according to the consensus international guidelines [31,32]. Dose limitations for normal tissues used for planning purposes were based on quantitative analyses of normal tissue effects in the clinic (QUANTEC). All patients were treated with IMRT (bilateral irradiation in all patients) using conventional fractionation, with a planned total dose of 69.96-70.0 Gy in 33-35 fractions, 5 fractions/week (daily Monday-Friday) [33]. The dose to the target volume was planned according to the International Commission on Radiation Units and Measurements Reports 62 [34] and 83 [35]. Concomitant systemic treatment was allowed, including weekly platinum-based chemotherapy (cisplatin 40 mg/m²), three-weekly platinum-based chemotherapy (cisplatin 100 mg/m²) and induction chemotherapy according to the PF protocol (cisplatin 100 mg/m² on day 1 with 5-FU 1000 mg/m² administered by continuous infusion on days 1-4, every 21 days) or the TPF protocol (docetaxel 75 mg/m² and cisplatin 75 mg/m² on day 1 with 5-FU 1000 mg/m² administered by continuous infusion on days 1-4, every 21 days). A detailed description of the RT protocol, with or without systemic treatment, for all patients is presented in Table 1.
For p16 immunohistochemistry, we used a CINtec p16INK4a histology kit (DakoCytomation BV, Heverlee, Belgium) with a 70% nuclear and cytoplasmic staining cutoff [36]. Both positive and negative control specimens were included in every immunostaining run. Tobacco use was defined as having ≥10 pack-years of smoking history.

Treatment Planning and Contouring the Thyroid Gland
Treatment planning was carried out using Eclipse software (Varian Medical Systems, Inc., Palo Alto, CA, USA) version 13.6 or 15.1. Dose was calculated by means of an anisotropic analytical algorithm (AAA) with 2.5 mm grid size. The aim of the planning optimization was to cover at least 95% of the planning target volume (PTV) with 100% of the prescription dose. The main OARs considered were brain stem, spinal cord, larynx, mandible, parotid glands and esophagus. Sparing the thyroid gland was not part of the optimization objectives. For this study, thyroid glands were therefore retrospectively contoured by two experienced radiation oncologists according to the recent guidelines for OARs in the head and neck region [31].

Thyroid Function Assessment and Clinical Endpoint Definition
Laboratory assessments were performed at baseline (within the 7 days before the beginning of RT), shortly after RT (within the 7 days after the last fraction) and at follow-up (median time to follow-up 28 months). The specimens were collected during standard assessments associated with the RT treatment and follow-up visits. The normal range for TSH was defined as 0.3-4 mIU/L and the normal range for fT4 was defined as 7-22 pg/mL. To avoid bias, laboratory assessment, contouring and statistical analysis were performed by independent researchers, and the person responsible for the outcome assessments was blinded to the results of the NTCP models being validated.
We defined the primary endpoint as grade ≥2 RIHT per the Common Terminology Criteria for Adverse Events grading system, version 4.03 [37].

Sample Size and Missing Data
Since no generally accepted approaches exist to calculate the sample size for validation studies [38], we did not perform formal sample size calculations. Instead, to maximize the power and generalizability of the comparison between evaluated models, we utilized data from all patients that met the inclusion criteria during the recruitment period and for whom follow-up information was available. We analyzed only complete cases in whom the outcome was known, i.e., when the patient was unavailable for the follow-up assessment we excluded them entirely from analysis; no data imputation was used.

Statistical Analysis
Univariate analysis of the association of each variable with RIHT was conducted by creating logistic regression models with a single predictor; odds ratios (ORs) with 95% confidence intervals (95%CI) and p values were reported. The following variables were considered for multivariate analysis: age, sex, surgery, tobacco use, thyroid volume, minimum dose to thyroid, mean dose to thyroid, maximum dose to thyroid, percentage of thyroid gland volume receiving at least 30 Gy (V30), TSH level at baseline and short-term change in TSH level. Stepwise regression with backward feature elimination was implemented, with a threshold of p = 0.1 for feature elimination. Laboratory parameters from before and after RT were compared using the Wilcoxon rank sum test. Differences between RIHT and no-RIHT groups were evaluated using the Mann-Whitney U test.
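As a reminder of how an odds ratio and its confidence interval relate to a fitted logistic regression coefficient, the sketch below exponentiates a coefficient and its Wald limits. The text does not state which interval method was used, so the Wald interval is an assumption, and the coefficient and standard error are hypothetical values chosen only for illustration.

```python
import math

def odds_ratio_with_ci(beta, se, z=1.96):
    """Return (OR, lower, upper) for a logistic coefficient using a Wald interval."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# HYPOTHETICAL coefficient (per 1 Gy of mean dose) and standard error.
or_value, ci_low, ci_high = odds_ratio_with_ci(beta=0.10, se=0.04)
print(f"OR={or_value:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
```

An interval that excludes 1, as in this made-up example, corresponds to a predictor that is significant at the 5% level under the Wald test.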
NTCP predictions were calculated using the original formulas reported for each model [14][15][16][17][18]. A cutoff of NTCP = 0.5, representing a 50% predicted probability of RIHT development, was used for outcome prediction according to each respective model, i.e., patients with an NTCP score of 0.5 or above were defined as predicted to develop RIHT. Model performance was assessed by calculating the area under the receiver operating characteristic curve (AUC ROC), Nagelkerke R², the Brier score (mean squared difference between the predicted outcome probabilities and the actual outcomes) [39], and the accuracy, sensitivity, specificity and discrimination slope (difference of mean NTCP scores between patients with different outcomes) for predictions at threshold p = 0.5. The calibration performance of the NTCP models was assessed by calibration plots, depicting grouped observed outcome frequencies with 95% confidence intervals (95%CI) vs. mean predicted probabilities [39].
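Several of the measures listed above can be computed directly from a list of NTCP scores and observed outcomes. The following is a minimal sketch on made-up toy data, not the study data or the authors' code; the function name `evaluate_ntcp` is ours, and Nagelkerke R² and confidence intervals are omitted for brevity.

```python
def evaluate_ntcp(probs, outcomes, threshold=0.5):
    """Summarize predicted probabilities against binary outcomes (1 = RIHT)."""
    pairs = list(zip(probs, outcomes))
    tp = sum(1 for p, y in pairs if p >= threshold and y == 1)
    fn = sum(1 for p, y in pairs if p < threshold and y == 1)
    fp = sum(1 for p, y in pairs if p >= threshold and y == 0)
    tn = sum(1 for p, y in pairs if p < threshold and y == 0)
    events = [p for p, y in pairs if y == 1]
    non_events = [p for p, y in pairs if y == 0]
    # AUC as the share of (event, non-event) pairs ranked correctly; ties count 0.5.
    concordant = sum(1.0 if e > n else (0.5 if e == n else 0.0)
                     for e in events for n in non_events)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "auc": concordant / (len(events) * len(non_events)),
        # Brier score: mean squared difference between prediction and outcome.
        "brier": sum((p - y) ** 2 for p, y in pairs) / len(pairs),
        # Discrimination slope: mean score in events minus mean score in non-events.
        "discrimination_slope": (sum(events) / len(events)
                                 - sum(non_events) / len(non_events)),
    }

# Toy, made-up scores and outcomes for six patients.
probs = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1]
outcomes = [1, 1, 1, 0, 0, 0]
metrics = evaluate_ntcp(probs, outcomes)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that accuracy, sensitivity and specificity depend on the 0.5 cutoff, whereas the AUC, Brier score and discrimination slope use the predicted probabilities themselves.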
Smoothing function used to estimate observed outcome probability in relation to predicted probability was created using the loess algorithm. Correlation between NTCP scores from different models was assessed with Pearson correlation test.
Platt calibration was used to recalibrate the model predictions [19]. In this approach, a logistic function is fitted using estimates produced by the original models. The patient cohort was split in a 2:1 proportion into train and test sets. The logistic function was built using the train set and the model calibration, visualized in Figure S2 and described in Table S2, was assessed in the test set.
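The recalibration step described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the authors' code: the data are synthetic, the logistic function is fitted by plain gradient descent rather than a statistics package, and the split simply takes the first two thirds of the (already random) sample as the train set.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_platt(scores, outcomes, lr=0.5, n_iter=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the log-loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, outcomes):
            err = sigmoid(a * s + b) - y  # derivative of log-loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# SYNTHETIC data standing in for original NTCP scores and observed outcomes.
rng = random.Random(0)
scores = [rng.random() for _ in range(90)]
# Simulated truth: the original model overestimates risk (true risk = score ** 2).
outcomes = [1 if rng.random() < s ** 2 else 0 for s in scores]

# 2:1 split into train and test sets, as described in the text.
cut = 2 * len(scores) // 3
a, b = fit_platt(scores[:cut], outcomes[:cut])
recalibrated = [sigmoid(a * s + b) for s in scores[cut:]]
print(f"a={a:.2f}, b={b:.2f}, test-set size={len(recalibrated)}")
```

Because the mapping is monotone whenever `a` is positive, this kind of recalibration changes the predicted probabilities but not the ranking of patients, so it can affect calibration and threshold-based metrics while leaving the AUC unchanged.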
The analyses and reporting were done in accordance with the TRIPOD (Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis) statement [38] (File S1). Study data, including clinical and dosimetric data used to validate the NTCP models, are available in File S2.

Ethics
The study was approved by the Bioethics Committee of the Medical University of Lodz (KE/7/10, RNN/65/18).

Conclusions
Our results from a prospectively evaluated multicenter cohort of patients with OPC confirmed that low thyroid volume and high mean dose to the thyroid gland were both significant risk factors for RIHT. We showed that the models by Rønjom et al. [17] and by Boomsma et al. [15] reliably described the dose-response relationship for RIHT and can be used in the clinic; however, owing to the described calibration issues, the predictions should be taken with caution in patients with lower predicted risk of RIHT.
Supplementary Materials: The following are available online at http://www.mdpi.com/2072-6694/12/9/2716/s1, Figure S1: Diagram of patient flow through the study. Figure Table S1: Comparison of patient cohorts from the current study and from the studies developing the evaluated NTCP models. Table S2: Parameters of logistic function describing the recalibrated NTCP models. File S1: TRIPOD Checklist for prediction model validation. File S2: Study data.