Total Neoadjuvant Therapy for Rectal Cancer in the CAO/ARO/AIO-12 Randomized Phase 2 Trial: Early Surrogate Endpoints Revisited

Simple Summary
Multimodal treatment of rectal cancer is undergoing dynamic change. In phase II/III multimodal rectal cancer trials, long-term survival remains the most objective endpoint for reporting treatment efficacy, but long follow-up is required, and there is a risk that the study results will lose scientific significance over time. To address these limitations, early surrogate endpoints are increasingly used to identify treatment efficacy at an earlier timepoint. Here, we report the prognostic role of pCR (pathological complete response), TRG (tumor regression grade) and NAR score (neoadjuvant rectal score) for DFS (disease-free survival) in the CAO/ARO/AIO-12 trial. Surrogate markers were significant prognostic factors for DFS, but the higher pCR rate and improved TRG in trial Arm B did not lead to improved survival compared to Arm A. Therefore, early surrogate markers correlated with clinical outcome in the CAO/ARO/AIO-12 trial, but the early differences in pCR and TRG did not translate into a survival benefit.

Abstract
Background: Early efficacy outcome measures in rectal cancer after total neoadjuvant treatment are increasingly investigated. We examined the prognostic role of pathological complete response (pCR), tumor regression grading (TRG) and neoadjuvant rectal (NAR) score for disease-free survival (DFS) in patients with rectal carcinoma treated within the CAO/ARO/AIO-12 randomized phase 2 trial. Methods: Distribution of pCR, TRG and NAR score was analyzed using Pearson's chi-squared test. Univariable analyses were performed using the log-rank test, stratified by treatment arm. Discrimination ability of non-pCR for DFS was assessed by analyzing the ROC curve as a function of time. Results: Of the 311 patients enrolled, 306 patients were evaluable (Arm A: 156, Arm B: 150). After a median follow-up of 43 months, the 3-year DFS was 73% in both groups (HR, 0.95, 95% CI, 0.63–1.45, p = 0.82). pCR tended to be higher in Arm B (17% vs. 25%, p = 0.086).
In both treatment arms, pCR, TRG and NAR were significant prognostic factors for DFS, whereas survival in subgroups defined by pCR, TRG or NAR did not significantly differ between the treatment arms. The discrimination ability of non-pCR for DFS remained constant over time (C-Index 0.58) but was slightly better in Arm B (0.61 vs. 0.56). Conclusion: Although pCR, TRG and NAR were strong prognostic factors for DFS in the CAO/ARO/AIO-12 trial, their value in selecting one TNT approach over another could not be confirmed. Hence, the conclusion of a long-term survival benefit of one treatment arm based on early surrogate endpoints should be stated with caution.

In principle, OS is the most objective endpoint in phase 3 cancer trials, but large cohorts of patients and costly, extensive, and long follow-up protocols are needed to provide accurate long-term survival data [4]. Furthermore, survival analysis can be confounded by successful treatment of disease recurrences or non-cancer related death, particularly in elderly patients. To overcome these limitations and identify promising therapeutic approaches at an earlier stage, several early or intermediate efficacy endpoints have been proposed [4,15].
Pathological complete remission (pCR), neoadjuvant rectal (NAR) score, and tumor regression grading (TRG) have been established as early surrogate endpoints after standard neoadjuvant CRT to reflect both tumor biology and treatment efficacy, as reported in several randomized phase II-III trials [4,16,17]. However, the value of these early surrogate endpoints in the era of TNT, with varying sequences and intervals between treatment components, remains largely unexplored. Here, we investigated the prognostic value of these early endpoints in the CAO/ARO/AIO-12 randomized phase 2 trial. In this study, TNT with upfront CRT followed by consolidation CT resulted in a higher pCR rate (primary endpoint) compared to induction CT followed by CRT but did not affect secondary endpoints such as DFS and OS [18].

Patient Selection
The CAO/ARO/AIO-12 was a multicenter, randomized, phase 2 trial (ClinicalTrials.gov, accessed on 7 June 2022, NCT02363374). Eligible patients were ≥18 years old with rectal adenocarcinoma and an ECOG performance status of 0–1, and had cT3 tumors <6 cm from the anal verge, cT3 tumors in the middle third of the rectum (≥6–12 cm at rigid rectoscopy) with extramural tumor spread into the mesorectal fat of more than 5 mm (>cT3b), cT4 tumors, or clinical lymph node involvement. Evaluation of clinical nodal involvement was based on mandatory magnetic resonance imaging. Distant metastases were excluded by CT scan of the abdomen and chest. Laboratory tests for adequate organ function were conducted prior to enrollment in the trial.

Treatment
Patients were randomly assigned to treatment Arm A for induction CT prior to CRT or to Arm B for CRT followed by consolidation CT. Radiotherapy was prescribed to the primary tumor and the mesorectal, presacral, and internal iliac lymph nodes to a total dose of 50.4 Gy in 28 fractions. Fluorouracil as a continuous infusion (250 mg/m²) on days 1–14 and 22–35, intensified with oxaliplatin (50 mg/m²) on days 1, 8, 22 and 29, was administered concurrently with radiotherapy. Induction or consolidation CT consisted of oxaliplatin (100 mg/m²) as a 2 h infusion, followed by a 2 h infusion of leucovorin (400 mg/m²) and a continuous 46 h infusion of fluorouracil (2400 mg/m²), repeated on day 15 for a total of three cycles. If necessary, due to toxicity, doses were modified according to the trial protocol. Independently of primary tumor response, total mesorectal excision (TME) surgery was scheduled on day 123 after initiation of TNT. Nonoperative management was considered a protocol violation, but 10 patients with clinical complete response after TNT declined surgery. Adjuvant CT after curative surgery was not recommended.

Early Efficacy Endpoints
The analysis of the primary endpoint, pathological complete response (pCR, ypT0N0), has already been reported [19]. TRG was recorded prospectively in both arms of the study according to Dworak et al. [20]. The neoadjuvant rectal score (NAR) incorporates cT to account for tumor downstaging, and ypT and ypN, which are influenced directly by preoperative treatment [15]. The NAR formula is as follows: NAR = [5 pN − 3 (cT − pT) + 12]² / 9.61, where cT ∈ {1, 2, 3, 4}, pT ∈ {0, 1, 2, 3, 4} and pN ∈ {0, 1, 2}. NAR consists of 24 distinct scores that range from 0 to 100. For the ypT-category and ypN-category, relative weights of 3 and 5, respectively, were suggested to reflect the impact of these variables, based on the nomogram of Valentini [21]. The constant 12 is included to keep the expression inside the brackets non-negative. The scaling factor 9.61 was introduced to ensure that the final scores range from 0 to 100. The NAR score was classified as low (NAR < 8), intermediate (NAR = 8–16), and high (NAR > 16) as reported before [15].
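As an illustration of the NAR definition above, the following minimal Python sketch computes the score and its risk category. The function names nar_score and nar_category are hypothetical and chosen here for illustration only; the squared bracket term is what yields the stated 0 to 100 range with the scaling factor 9.61.

```python
def nar_score(cT: int, pT: int, pN: int) -> float:
    """Neoadjuvant rectal (NAR) score: [5*pN - 3*(cT - pT) + 12]^2 / 9.61."""
    if cT not in (1, 2, 3, 4):
        raise ValueError("cT must be 1, 2, 3 or 4")
    if pT not in (0, 1, 2, 3, 4):
        raise ValueError("pT must be 0, 1, 2, 3 or 4")
    if pN not in (0, 1, 2):
        raise ValueError("pN must be 0, 1 or 2")
    # The constant 12 keeps the bracket term non-negative (range 0..31).
    bracket = 5 * pN - 3 * (cT - pT) + 12
    return bracket ** 2 / 9.61


def nar_category(score: float) -> str:
    """Risk groups as used in the text: low (<8), intermediate (8-16), high (>16)."""
    if score < 8:
        return "low"
    if score <= 16:
        return "intermediate"
    return "high"


if __name__ == "__main__":
    # Example: cT3 tumor with complete response (ypT0N0).
    s = nar_score(cT=3, pT=0, pN=0)
    print(round(s, 2), nar_category(s))
```

Enumerating all admissible combinations of cT, pT and pN with this function reproduces the 24 distinct score values mentioned above, with the extremes at cT4/ypT0/ypN0 (score 0) and cT1/ypT4/ypN2 (score 100).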

Statistical Analysis
The distribution of pCR, TRG and NAR between the two treatment arms was examined with the chi-squared test. The secondary endpoint, DFS, was defined as the time between randomization and the first of the following events: macroscopically incomplete surgery (R2 resection), locoregional or metastatic recurrence, or death from any cause.
We used the log-rank test to determine the prognostic role of pCR, NAR and TRG for DFS. DFS between treatment arms within the subgroups defined by pCR (pCR, non-pCR), TRG (TRG 0/1, TRG 2/3, TRG 4) and NAR (low, intermediate, and high risk) was compared with the log-rank test as well. Unadjusted subgroup analyses to identify potentially different treatment effects in the pCR and non-pCR subgroups were performed using the "subtee" package in R [22]. The methodology of time-dependent ROC curve analysis is described in the Supplementary Methods. Patients with missing values were excluded. Statistical analyses were performed with the SPSS 25 software (SPSS Inc., Chicago, IL, USA) and the R system, version 4.1 (packages: "subtee", "risksetROC" and "timeROC").
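To make the log-rank comparison concrete, the sketch below implements a two-sided, unstratified two-group log-rank test in plain Python. It is an illustrative assumption, not the trial's analysis code (which used SPSS and R); the function name logrank_test and the toy data are hypothetical, and the stratification by treatment arm described for the univariable analyses is not reproduced here.

```python
import math
from typing import Sequence, Tuple


def logrank_test(times_a: Sequence[float], events_a: Sequence[int],
                 times_b: Sequence[float], events_b: Sequence[int]) -> Tuple[float, float]:
    """Two-group log-rank test; returns (chi-squared statistic, two-sided p-value).

    times_*  : follow-up time per patient
    events_* : 1 if a DFS event was observed, 0 if censored
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        # Patients still at risk just before time t, per group.
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)
        n_b = sum(1 for u, _, g in data if u >= t and g == 1)
        # Events occurring exactly at time t, per group.
        d_a = sum(1 for u, e, g in data if u == t and e and g == 0)
        d_b = sum(1 for u, e, g in data if u == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            # Hypergeometric variance of the event count in group A.
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical toy data (months, event indicator), not trial data.
    chi2, p = logrank_test([5, 8, 12, 20, 30], [1, 1, 1, 0, 0],
                           [10, 25, 36, 40, 44], [1, 0, 0, 0, 0])
    print(f"chi2 = {chi2:.2f}, p = {p:.3f}")
```

For real analyses, an established implementation such as the one in the R "survival" package is preferable; the sketch only shows how observed and expected event counts are accumulated over the distinct event times.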

Accrual and Patient Characteristics
From 15 June 2015 to 31 January 2018, 311 patients from 18 centers in Germany were recruited. Five patients proved ineligible after enrollment. Of the remaining 306 eligible patients, 156 patients were randomized to Arm A (sequence CT/CRT/surgery) and 150 patients to Arm B (sequence CRT/CT/surgery). All 156 patients started induction CT in Arm A, whereas consolidation CT in Arm B was started in 140 (93%) patients. In Arm A, 151 (97%) patients proceeded to CRT, and in Arm B, 149 (99%) received CRT. Surgery was performed in 143 (92%) patients in Arm A and 143 (95%) patients in Arm B [18,19].
There were no significant differences in the distribution of any of the three parameters between the two TNT groups, although trends toward a higher pCR rate (p = 0.086), a higher rate of TRG 4 (27% vs. 19%) and a higher rate of low NAR score (36% vs. 26%) were observed in Arm B (Table 1).
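The comparison of pCR rates between arms can be sketched as a Pearson chi-squared test on a 2×2 table. The counts used below (27 of 156 in Arm A, 38 of 150 in Arm B) are assumptions chosen only because they round to the reported 17% and 25%; they are not stated in this text. The function name chi2_2x2 is hypothetical, and no continuity correction is applied.

```python
import math
from typing import Tuple


def chi2_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-squared test (no continuity correction) for the table [[a, b], [c, d]].

    Returns (chi-squared statistic, two-sided p-value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    # ASSUMED counts (pCR, non-pCR) per arm, consistent with 17% of 156 and 25% of 150.
    arm_a_pcr, arm_a_total = 27, 156
    arm_b_pcr, arm_b_total = 38, 150
    chi2, p = chi2_2x2(arm_a_pcr, arm_a_total - arm_a_pcr,
                       arm_b_pcr, arm_b_total - arm_b_pcr)
    print(f"chi2 = {chi2:.3f}, p = {p:.3f}")
```

Under these assumed counts, the p-value should land close to the reported p = 0.086, which is above the conventional 0.05 threshold and matches the description of a trend rather than a significant difference.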

Treatment Efficacy
The pCR rate, pathological evaluation, treatment toxicity, surgical morbidity, adherence to treatment as well as oncological outcomes have previously been reported [18,19].
Regarding the prognostic value of early endpoints, in univariate analysis, pCR, TRG, and NAR score were significantly associated with 3-year DFS in the entire cohort (Figure 1). The strong prognostic value of the early endpoints for DFS remained in both treatment groups (Table 2). Furthermore, we examined the prognostic impact of each of the subgroups of pCR, TRG and NAR score on the 3-year DFS separately, as shown in Table 3. We did not observe a significant difference between the two TNT regimens in terms of 3-year DFS for any of the subgroups of pCR, TRG, and NAR score. Further, using the unadjusted estimates for subgroups, we failed to detect a differential treatment effect when testing pCR vs. non-pCR (Figure 2). A discrimination ability test was performed as shown in the Supplementary Materials and in Figure S1.
Figure 1. Prognostic significance of pCR (A), TRG (B) and NAR score (C) for disease-free survival. The log-rank test was used to assess statistical significance. The statistical test was two-sided.

Figure 2. Plot of unadjusted treatment effects for pCR and non-pCR subgroups for disease-free survival. The 90% confidence interval is plotted. The overall treatment effect under the model with no treatment-subgroup interactions is plotted as a dashed line, with its confidence interval shown as a gray shaded area. Statistical significance was examined and plotted with the "subtee" package in R.

Discussion
In this post hoc, secondary analysis of the CAO/ARO/AIO-12 trial, we examined the prognostic impact of the early efficacy measures pCR, TRG and NAR score for DFS. Although the early outcome measures were significantly prognostic for 3-year DFS in the entire cohort and within each TNT arm separately, there were no significant differences in early efficacy endpoints or 3-year DFS between the two TNT groups, or for any of the subgroups for pCR, TRG and NAR score [18,19]. In the initial report of the primary endpoint of the CAO/ARO/AIO-12 trial, we found a significantly higher pCR rate compared to an assumed historical pCR rate of 15% after standard preoperative fluorouracil-based CRT (p < 0.001) in Arm B but not in Arm A, based on the modified "pick-the-winner" statistical trial design. Notably, improved pCR did not translate to better oncologic outcome after a median follow-up of 43 months (DFS was 73% in both TNT arms) [19].
Previous clinical studies have reported heterogeneous results regarding whether pCR can serve as a surrogate for DFS/OS. Historically, the POLISH I trial and the TROG 01.04 trial compared SCRT followed by surgery within one week and adjuvant CT versus long-course CRT followed by delayed surgery and adjuvant CT. Both trials reported significantly higher pCR rates after CRT and delayed surgery with no significant differences in DFS and OS [23,24]. More recently, the STOCKHOLM III trial investigated delayed versus immediate surgery after SCRT and reported improved pCR rates but no DFS/OS benefit [25–27] (Table 4). These data reflect the limitation of pCR as a surrogate endpoint for DFS/OS [28], which was also shown in the meta-analysis of 22 studies in 10,050 patients by Petrelli et al. [29].
With respect to intensified neoadjuvant CRT regimens, our previous randomized CAO/ARO/AIO-04 trial showed that pCR was achieved in 17% of patients treated with oxaliplatin/fluorouracil-based CRT vs. 13% of those treated with fluorouracil-based CRT (p = 0.038). This higher pCR rate correlated with superior 3-year DFS of the experimental arm [5,30] (Table 4). The FORWARC trial also reported improved pCR rates by the addition of oxaliplatin to neoadjuvant fluorouracil-based CRT (28% vs. 14%), however without a significant improvement in 3-year DFS [31,32]. Conversely, a Chinese trial by Jia et al. reported a lower incidence of distant metastasis but no increase in pCR rates through intensified neoadjuvant CRT with oxaliplatin [33].
Regarding the value of pCR as a surrogate measure for survival in TNT, the recently reported clinical trials also provided heterogeneous results. In the RAPIDO trial, higher pCR rates translated to improved DFS and lower incidence of distant metastasis [10]. In the PRODIGE-23 trial, pCR, TRG 4 and low-risk NAR score in the TNT arm correlated with improved DFS, whereas in the STELLAR trial, improved pCR was not associated with better DFS [12,13] (Table 4).
Thus, tumor response is a dynamic process that is affected not only by tumor- and patient-related factors, such as tumor size, molecular profile, histology, or the host's immune system, but also by treatment-related factors: RT dose and fractionation, administration of concurrent CT and/or use of induction/consolidation CT and, most importantly, the time interval between radiotherapy and response assessment. Tumor response, as measured by pCR, may predict favorable long-term survival for individual patients within a certain treatment protocol, but a superior outcome in comparative trials cannot necessarily be inferred from it [4].
Accordingly, in our trial, the difference in pCR between the two TNT groups likely reflects the different interval and continuously ongoing response from the last radiotherapy fraction to surgery, which was (median) 85 days in Arm B vs. 42 days in Arm A. In addition, reduced adherence to CRT following induction CT, as well as selection and expansion of more radiation-resistant clones by induction CT (which may alter apoptotic pathways, upregulate epidermal growth factor receptor expression, and affect angiogenesis and stromal proliferation) may have contributed to the lower pCR rate in Arm A [34].
Table 4. Early endpoints (pCR, TRG and NAR score) and survival in major prospective rectal cancer trials. Abbreviations: pCR, pathological complete remission; TRG, tumor regression grading; TNT, total neoadjuvant treatment; n.r., not reported.
Regarding non-pCR, the discrimination ability for DFS remained largely constant over the follow-up period but differed slightly between the two treatment arms. The discrimination ability seems to be lower in Arm A than in Arm B (AUC 55 vs. 61), and DFS for non-pCR patients was higher in Arm A than in Arm B (69.5% vs. 66.3%), suggesting that the non-pCR subgroup in Arm A included patients with a good prognosis who would have developed pCR with a longer interval to response assessment [19].
Unlike pCR, TRG and the NAR score classify tumor response on a graded scale rather than a simple binary one, which may reflect treatment efficacy better and may give them a greater ability than pCR to predict DFS or OS, as proposed by Yothers et al. [4,35]. Although the surrogacy of TRG and NAR for improved DFS has been validated in the CAO/ARO/AIO-04 trial [16,17], and a significant trend toward higher tumor regression and low NAR score correlated with improved DFS and lower incidence of distant metastasis in the PRODIGE-23 trial [12], potential surrogacy of both parameters has not been broadly reported in recent trials (Table 4) [10,13,14]. Furthermore, assessment and reporting of TRG are heterogeneous, and no standardization method is universally accepted [4,36]. The NAR score was proposed by NRG Oncology as a surrogate endpoint for DFS and OS [15]. As no improvement in overall survival has been reported in any of the recent clinical trials in rectal cancer, except for the STELLAR trial, the ability to validate NAR as a surrogate for survival remains quite limited. Although the NAR score incorporates pre- and post-neoadjuvant CRT tumor extent, further analyses of its surrogacy based on the recently published TNT trials are lacking.
Our study has limitations. First, this study constitutes a post hoc analysis. Second, analyses of changes in discrimination ability over time for surrogate measures have thus far not been performed in rectal cancer. These analyses are based on highly complex mathematical models, and heterogeneous statistical methodologies have been published [37–39]. Therefore, the potential difference in discrimination ability between the two treatment arms should be interpreted with caution, even if the slightly weaker DFS in patients with pCR in Arm A supports the thesis of a weaker discrimination ability of non-pCR for DFS in Arm A. Third, a central pathologic review for tumor regression was not conducted.

Conclusions
In summary, pCR, TRG and NAR score were prognostic parameters for DFS in the entire cohort as well as in both arms of the CAO/ARO/AIO-12 trial and likely reflect (and unmask) different tumor biology. However, their value in selecting one TNT approach over another could not be confirmed, as a better response in TNT Arm B did not translate to superior DFS or OS. Altering the sequence and intervals between components in multimodal rectal cancer treatment may have a substantial impact on early efficacy endpoints, but thus far, significant differences between the two TNT sequences in long-term clinical outcome measures after TME have not been reported. With the advent of TNT with selective nonoperative management (NOM) for patients with (near) clinical complete response, as reported in the recent OPRA trial [14] and currently investigated in our ongoing ACO/ARO/AIO-18.1 trial (NCT04246684), sustained local control without regrowth (i.e., TME-free survival) and disease-free survival including NOM and events of salvage surgery [4] have been incorporated as relevant clinical endpoints.