Refining Outcomes in Technically Resectable Colorectal Liver Metastases: A Simplified Risk Model and the Role of Preoperative Chemotherapy
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The topic is clinically relevant (quick risk stratification for technically resectable CRLM, and how to interpret preoperative chemotherapy response). The key issue is that the core message, the reported multivariable results, and several numeric elements are internally inconsistent across the Abstract/Summary, main tables, figures, and supplementary tables—this currently undermines credibility and would almost certainly trigger major revision at Cancer.
My recommendation: Major revision (fix internal inconsistencies; strengthen methods/validation; clarify how the score is derived and compared; improve reporting quality).
Major comments
1) Internal inconsistencies in the “independent predictors” and in the proposed score
- Abstract/Simple Summary state that the combination of number ≥3 and largest diameter ≥5 cm is “the only independent risk factor for both recurrence and survival” (HR 2.05 and 2.24).
- Main Results tables report different independent predictors:
  - For recurrence: number ≥3 and NLR ≥2.9 (largest diameter not independent).
  - For OS: CEA ≥5 and number ≥3 (and Table 3 has additional issues—see next point).
- Later, you introduce “poor prognostic criteria” (low vs high risk) and show strong Kaplan–Meier separation (Figure 3), but the definition and the modelling strategy (is this a composite variable replacing the individual covariates?) need to be made explicit and consistent everywhere.
- Decide one primary model strategy and align the Abstract, Methods, Results tables, Figures, and Supplement accordingly.
- If the goal is a simple score, then:
  - Show how it was derived (pre-specified cut-offs vs data-driven ROC),
  - Fit a Cox model with the score as predictor (and specify what else is adjusted for),
  - Present discrimination + calibration (not just AIC),
  - State clearly whether NLR and/or chemo response are add-ons.
2) Table 3 / OS counts and percentages appear erroneous
In the OS univariate table (Table 3), the reported CEA strata (33 and 98) and CA19-9 strata (78 and 53) sum to 131, not 115—this is a hard numerical error that must be corrected. The same issue appears in the supplementary OS table.
- Audit the dataset used for OS analysis (and state whether OS was available for all 115).
- If there is missingness, report missing data per variable, how handled (complete-case? imputation?), and ensure denominators/percentages match.
3) Missing data are likely present but not reported
In Table 1, several binary items do not sum to 115 (e.g., LN status 36+76=112; ly 13+95=108; v 27+82=109). This strongly suggests missing/unknown values not presented.
- Add an “Unknown/Missing” category or explicitly report the number missing per variable.
- State how missingness was handled in Cox models.
4) Derivation of cut-offs (ROC-driven) needs full reporting and safeguards
You state the cutoffs for number of metastases and NLR were set using ROC “for recurrence” (number=3; NLR=2.9).
This is potentially optimistic in a small retrospective cohort (n=115), especially when subsequently used to create a prognostic score.
- Provide AUC, CI, and method (time-dependent ROC if using survival outcome).
- Prefer: pre-specified clinically meaningful cutoffs (≥3 is already common in the field) and justify; or if data-driven, perform internal validation (bootstrap).
5) Model performance: AIC alone is not sufficient for a prognostic claim
You compare AIC vs Beppu and Fong and call it “comparable predictive accuracy”. AIC is not “accuracy” in the clinical prediction sense (discrimination/calibration).
- Add at minimum: Harrell’s c-index, time-dependent AUC, and calibration (e.g., calibration slope at 3/5 years), ideally with bootstrap optimism correction.
- If comparing against Fong/Beppu, specify exactly how you implemented them (score reconstruction? Cox model with score?).
6) Chemotherapy subgroup analysis: risk of bias and underpowered PD group
Among patients receiving preoperative chemotherapy (n=72), PD is n=8. You show large survival differences, but this is fragile and confounded by treatment selection and biology.
- Provide HRs with CIs (not only p-values) for PD vs disease control.
- Clarify chemo intent (neoadjuvant for resectable vs conversion) and the criteria used.
- Consider a sensitivity analysis: exclude “borderline” cases, or perform a landmark approach if applicable.
7) Key prognostic variables are missing or only briefly acknowledged
You mention RAS/BRAF and MMR are unavailable in many cases. For Cancer, readers will expect at least:
- Proportion tested / missing, and
- A discussion of how this limitation may bias results.
Also consider reporting: bilobar distribution, margin width, anatomic vs parenchymal-sparing resections, use of ablation, repeat hepatectomy, and recurrence patterns—these can materially affect RFS/OS.
Minor comments
- Use one term: “recurrence-free survival” vs “relapse-free survival” (currently mixed, including the Figure 3 caption).
- Figure 4 panel (d) labels “CSS median” while the text says OS—likely a typo.
- Methods: “Patiend data” typo; “erthoxybenzyl-magnetic resonance imaging” should be gadoxetic acid–enhanced MRI (EOB-MRI).
- Define “technically resectable” precisely (criteria, who decided, and whether based on future liver remnant thresholds).
- State the follow-up schedule and recurrence ascertainment method (imaging frequency; how extrahepatic recurrences were counted).
- Confirm proportional hazards assumption testing.
Finally, I would strengthen the References by adding:
Igami T, Hayashi Y, Yokyama Y, Mori K, Ebata T. Development of real-time navigation system for laparoscopic hepatectomy using magnetic micro sensor. Minim Invasive Ther Allied Technol. 2024;33(3):129–139. doi:10.1080/13645706.2023.2301594.
It should be added to Methods 2.1, immediately after your sentence:
“…abdominal ultrasonography, computed tomography scan, and erthoxybenzyl-magnetic resonance imaging.”
perhaps adding the sentence “Beyond preoperative imaging, real-time navigation systems for laparoscopic hepatectomy are being developed to support intraoperative orientation and lesion targeting [NEW REF].”
then
Boretto L, Pelanis E, Regensburger A, Fretland ÅA, Edwin B, Elle OJ. Hybrid optical-vision tracking in laparoscopy: accuracy of navigation and ultrasound reconstruction. Minim Invasive Ther Allied Technol. 2024;33(3):176–183. doi:10.1080/13645706.2024.2313032.
It should be added to Methods 2.1, right after the sentence you just expanded with Igami et al.,
perhaps adding the sentence “Hybrid tracking approaches enabling more accurate laparoscopic ultrasound reconstruction represent another promising direction to improve intraoperative spatial understanding during minimally invasive liver surgery [NEW REF].”
and finally
Peng Z, Zhu ZR, He CY, Huang H. A meta-analysis: laparoscopic versus open liver resection for large hepatocellular carcinoma. Minim Invasive Ther Allied Technol. 2025;34(1):24–34. doi:10.1080/13645706.2024.2334762.
It should be added to the Discussion, in the paragraph where you contextualize surgical management/outcomes and future optimisation (a good anchor is right after you discuss the need to “refine” decision-making and outcomes),
perhaps adding the sentence “More broadly, the growing minimally invasive liver surgery literature—including recent meta-analytic evidence—suggests that laparoscopic liver resection can reduce length of stay and morbidity without clear compromise in oncologic surrogates in selected settings [NEW REF].”
Author Response
The topic is clinically relevant (quick risk stratification for technically resectable CRLM, and how to interpret preoperative chemotherapy response). The key issue is that the core message, the reported multivariable results, and several numeric elements are internally inconsistent across the Abstract/Summary, main tables, figures, and supplementary tables—this currently undermines credibility and would almost certainly trigger major revision at Cancer.
My recommendation: Major revision (fix internal inconsistencies; strengthen methods/validation; clarify how the score is derived and compared; improve reporting quality).
Response: Thank you for highlighting the internal inconsistencies and reporting gaps. We agree that these issues could undermine credibility. We therefore (i) re-checked the analytic dataset and endpoint definitions, (ii) rebuilt the primary models under a single, explicitly stated modeling strategy, and (iii) aligned the Abstract, Methods, Results tables, figures, and Supplementary tables accordingly. We also expanded model performance reporting beyond AIC to include discrimination and calibration, with internal validation.
Major comments
1) Internal inconsistencies in the “independent predictors” and in the proposed score
- Abstract/Simple Summary state that the combination of number ≥3 and largest diameter ≥5 cm is “the only independent risk factor for both recurrence and survival” (HR 2.05 and 2.24).
- Main Results tables report different independent predictors:
  - For recurrence: number ≥3 and NLR ≥2.9 (largest diameter not independent).
  - For OS: CEA ≥5 and number ≥3 (and Table 3 has additional issues—see next point).
- Later, you introduce “poor prognostic criteria” (low vs high risk) and show strong Kaplan–Meier separation (Figure 3), but the definition and the modelling strategy (is this a composite variable replacing the individual covariates?) need to be made explicit and consistent everywhere.
- Decide one primary model strategy and align the Abstract, Methods, Results tables, Figures, and Supplement accordingly.
- If the goal is a simple score, then:
  - Show how it was derived (pre-specified cut-offs vs data-driven ROC),
  - Fit a Cox model with the score as predictor (and specify what else is adjusted for),
  - Present discrimination + calibration (not just AIC),
  - State clearly whether NLR and/or chemo response are add-ons.
Response 1: Thank you for this important and detailed comment. We agree that the manuscript needed a clearer and more consistent explanation of the modelling strategy across the Abstract, Methods, Results tables, and figures.
The primary aim of this study was to develop a pragmatic, simple preoperative risk stratification approach after curative-intent resection of CRLM, applicable to contemporary cohorts that include patients treated with preoperative chemotherapy. To achieve this, we performed multivariable analyses to identify preoperative clinicopathologic factors associated with RFS and OS and then derived a simple rule-based classification using the most robust and clinically interpretable tumor-burden variables.
In the overall cohort, the only factor that consistently emerged as independently associated with both RFS and OS was the number of CRLMs (≥3). In contrast, among patients with 1–2 CRLMs, the largest tumor diameter (≥5 cm) was the only factor associated with RFS. Based on this hierarchical finding, we constructed a composite, rule-based high-risk criterion defined as: high risk = (number of CRLMs ≥3) OR (number of CRLMs = 1–2 AND largest diameter ≥5 cm). This composite variable is intended as a dichotomous classification rule rather than an additive score, and the two criteria are mutually exclusive by definition.
To address the reviewer’s concern regarding inconsistencies, we revised the Abstract/Simple Summary and the main text to make this strategy explicit and aligned throughout. Specifically, we now (i) clearly define the composite criterion at first mention, (ii) present the Cox models using this composite criterion as the primary predictor in the main Results and figures, and (iii) clarify that other covariates identified in outcome-specific multivariable models (e.g., NLR for recurrence and CEA for OS) reflect exploratory/secondary modelling and are not components of the proposed simple risk classification. We believe these revisions resolve the apparent discrepancies and ensure consistency across the Abstract, Methods, Results Tables, and Figure 3.
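For clarity, the composite rule defined in this response can be expressed as a one-line predicate. This is only an illustrative sketch; the function and parameter names are ours, not fields from the study dataset.

```python
def is_high_risk(n_crlm: int, largest_diameter_cm: float) -> bool:
    """Composite rule from the revised manuscript: high risk if the number
    of CRLMs is >=3, or if there are 1-2 CRLMs and the largest tumor
    diameter is >=5 cm. A dichotomous classification rule, not an
    additive score; the two criteria are mutually exclusive by definition."""
    return n_crlm >= 3 or (n_crlm <= 2 and largest_diameter_cm >= 5.0)
```

For example, a patient with four metastases is high risk regardless of size, while a patient with two 3 cm metastases is low risk.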
2) Table 3 / OS counts and percentages appear erroneous
In the OS univariate table (Table 3), the reported CEA strata (33 and 98) and CA19-9 strata (78 and 53) sum to 131, not 115—this is a hard numerical error that must be corrected. The same issue appears in the supplementary OS table.
- Audit the dataset used for OS analysis (and state whether OS was available for all 115).
- If there is missingness, report missing data per variable, how handled (complete-case? imputation?), and ensure denominators/percentages match.
Response 2: We appreciate this careful check. You are correct that the denominators reported in the original Table 3 (and the corresponding Table S3) were inconsistent with the study cohort size, indicating a reporting error.
In addition, we added a short “Missing data” paragraph detailing variable-specific missingness and the analytic handling strategy (complete-case analysis; see also Response to Major comment 3).
Edits in manuscript: Table 3; Table S3; Materials and Methods (2.3. Statistical analysis).
3) Missing data are likely present but not reported
In Table 1, several binary items do not sum to 115 (e.g., LN status 36+76=112; ly 13+95=108; v 27+82=109). This strongly suggests missing/unknown values not presented.
- Add an “Unknown/Missing” category or explicitly report the number missing per variable.
- State how missingness was handled in Cox models.
Response 3: We agree. In the revised Table 1, we now report unknown values explicitly for variables where applicable, either by adding an “Unknown” category or by providing a dedicated missingness column (Primary tumor T status: n=1 (0.9%), Primary tumor LN status: n=3 (2.6%), ly (primary tumor): n=7 (6.1%), v (primary tumor): n=6 (5.2%)). This change ensures that all variables reconcile to the cohort denominator and that the extent of missingness is transparent.
We also added a dedicated statement in the Statistical Analysis section describing how missing data were handled in regression modeling. Specifically, we used complete-case analysis for multivariable Cox models, and we report the effective sample size for each model.
Edits in manuscript: Table 1 (Missing data); Materials and Methods (2.3 Statistical analysis); Results (3.1 Subsection patient background).
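The per-variable missingness reporting and complete-case selection described above amount to simple bookkeeping, sketched below with hypothetical field names and toy values (not the study data):

```python
# Each patient is a dict; None marks a missing/unknown value.
patients = [
    {"t_status": 1, "ln_status": 0, "ly": 1, "v": None},
    {"t_status": 1, "ln_status": None, "ly": 0, "v": 0},
    {"t_status": 0, "ln_status": 1, "ly": 1, "v": 1},
]
variables = ["t_status", "ln_status", "ly", "v"]

# Missing count per variable (what a Table 1 "Unknown" column reports).
missing = {var: sum(p[var] is None for p in patients) for var in variables}

# Complete-case analysis set: patients with no missing model covariates;
# its length is the effective sample size reported for each Cox model.
complete_cases = [p for p in patients if all(p[v] is not None for v in variables)]
```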
4) Derivation of cut-offs (ROC-driven) needs full reporting and safeguards
You state the cutoffs for number of metastases and NLR were set using ROC “for recurrence” (number=3; NLR=2.9).
This is potentially optimistic in a small retrospective cohort (n=115), especially when subsequently used to create a prognostic score.
- Provide AUC, CI, and method (time-dependent ROC if using survival outcome).
- Prefer: pre-specified clinically meaningful cutoffs (≥3 is already common in the field) and justify; or if data-driven, perform internal validation (bootstrap).
Response 4: We are grateful to the reviewer for raising this issue. We have added the following description:
Line 136. The cutoff values for the number of liver metastases and NLR were 3 and 2.9, respectively. These cutoffs were determined by receiver operating characteristic (ROC) curve analysis for recurrence using logistic regression, maximizing Youden's index (defined as sensitivity + specificity − 1) for each variable.
Edit in manuscript: Materials and Methods (2.2. Variables evaluated for univariate analysis)
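The Youden-index cutoff selection quoted above can be sketched as follows. This is a toy illustration of the idea only; the actual analysis used logistic-regression-based ROC in JMP, and this sketch does not address the optimism concern the reviewer raises.

```python
def youden_cutoff(values, outcomes):
    """Scan each observed value as a candidate threshold (>= cutoff is
    test-positive) and return the cutoff maximizing Youden's
    J = sensitivity + specificity - 1, together with that J."""
    pos = [v for v, y in zip(values, outcomes) if y == 1]  # with recurrence
    neg = [v for v, y in zip(values, outcomes) if y == 0]  # without
    best_cut, best_j = None, float("-inf")
    for cut in sorted(set(values)):
        sens = sum(v >= cut for v in pos) / len(pos)
        spec = sum(v < cut for v in neg) / len(neg)
        j = sens + spec - 1
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j

# Toy data where values of 4 and above perfectly separate the outcomes.
cut, j = youden_cutoff([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])
```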
5) Model performance: AIC alone is not sufficient for a prognostic claim
You compare AIC vs Beppu and Fong and call it “comparable predictive accuracy”. AIC is not “accuracy” in the clinical prediction sense (discrimination/calibration).
- Add at minimum: Harrell’s c-index, time-dependent AUC, and calibration (e.g., calibration slope at 3/5 years), ideally with bootstrap optimism correction.
- If comparing against Fong/Beppu, specify exactly how you implemented them (score reconstruction? Cox model with score?).
Response 5: Thank you for this important comment. We reanalyzed the data using time-dependent ROC curve analysis and revised the description as follows:
Line 147. The models were evaluated by time-dependent ROC curve analysis at fixed time points. All statistical analyses were performed using JMP Pro 18 software (SAS Institute, Cary, NC, USA), with the exception of the time-dependent ROC curve analysis, which was performed using the statistical programming language R (version 4.4.1, R Development Core Team). A P-value <0.05 was considered statistically significant.
Line 272. We calculated the area under the curve (AUC) to assess the predictive accuracy of our model for recurrence and OS, applying time-dependent ROC curve analysis at the 36- and 60-month time points (Figure 4). The AUCs for recurrence in our model were 0.68 (95% CI: 0.58-0.77) at 36 months and 0.66 (95% CI: 0.55-0.77) at 60 months. For OS, the AUCs were 0.59 (95% CI: 0.45-0.72) at 36 months and 0.65 (95% CI: 0.54-0.76) at 60 months. For comparison, the Beppu nomogram yielded AUCs of 0.70 (p=0.683) at 36 months and 0.68 (p=0.766) at 60 months for recurrence, and Fong's clinical risk score yielded AUCs of 0.64 (p=0.430) at 36 months and 0.74 (p=0.074) at 60 months for OS.
Edits in manuscript: Abstract; Materials and Methods (2.3. Statistical analysis); Results (3.4. Novel prognostic criteria of CRLM after curative resection); Figure 4 (new addition); Discussion (paragraph comparing with the Beppu score and Fong's score)
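The cumulative/dynamic AUC at a fixed horizon can be sketched as below. This is a deliberately simplified, complete-case illustration: patients censored before the horizon are dropped rather than reweighted, so it is not a substitute for a censoring-weighted estimator such as the one used in the R reanalysis.

```python
def auc_at_time(risk, time, event, horizon):
    """Simplified cumulative/dynamic AUC at `horizon`: cases are patients
    with an event by the horizon; controls are patients still event-free
    and under follow-up beyond it. No IPCW (patients censored before the
    horizon are excluded). Tied risk scores count 0.5, as in the usual
    rank-based AUC definition."""
    cases = [r for r, t, e in zip(risk, time, event) if e == 1 and t <= horizon]
    controls = [r for r, t, e in zip(risk, time, event) if t > horizon]
    pairs = concordant = 0.0
    for rc in cases:
        for rk in controls:
            pairs += 1
            if rc > rk:
                concordant += 1
            elif rc == rk:
                concordant += 0.5
    return concordant / pairs if pairs else float("nan")
```

With toy data (risk scores 0.9, 0.3 for patients relapsing at 12 and 20 months; 0.5, 0.1 for patients followed past 36 months), the 36-month AUC is 3 of 4 concordant pairs, i.e. 0.75.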
6) Chemotherapy subgroup analysis: risk of bias and underpowered PD group
Among patients receiving preoperative chemotherapy (n=72), PD is n=8. You show large survival differences, but this is fragile and confounded by treatment selection and biology.
- Provide HRs with CI (not only p-values) for PD vs control disease.
- Clarify chemo intent (neoadjuvant for resectable vs conversion) and the criteria used.
- Consider a sensitivity analysis: exclude “borderline” cases, or perform a landmark approach if applicable.
Response 6: We agree that the preoperative chemotherapy subgroup analysis is vulnerable to limited power (particularly in the PD subgroup) and to confounding by indication. We revised this section to improve transparency and reduce over-interpretation:
- We now report effect sizes as HRs with 95% CIs for RFS and OS for PD versus disease control (CR/PR/SD), and we explicitly acknowledge the imprecision associated with small subgroup sizes (Results 3.5, RFS and OS related to therapeutic effects of preoperative chemotherapy; Figure 5a,b).
- We clarified the clinical intent of preoperative chemotherapy (neoadjuvant therapy for technically resectable cases and conversion therapy for technically and oncologically unresectable cases) and the criteria used at each institution to select patients for chemotherapy (Materials and Methods 2.1, Patient data; Results 3.1, Subsection patient background).
Edits in manuscript: Materials and Methods (2.1. Patient data); Results (3.1. Subsection patient background); Figure 5a,b.
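The survival curves compared across response groups rest on the Kaplan–Meier product-limit estimator, which a minimal sketch can make concrete. This is for illustration only; the actual analysis should rely on validated survival software (JMP/R), and this sketch uses the common convention that events precede censorings at tied times.

```python
def kaplan_meier(time, event):
    """Product-limit estimate S(t): returns (t, S(t)) pairs at each
    distinct event time. `event` is 1 for an observed event, 0 for
    censoring. At each event time, S is multiplied by (1 - d/n), where
    d = events at t and n = patients still at risk just before t."""
    data = sorted(zip(time, event))
    n_at_risk = len(data)
    s = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d = sum(1 for tt, ee in data[i:] if tt == t and ee == 1)
        c = sum(1 for tt, ee in data[i:] if tt == t and ee == 0)
        if d > 0:
            s *= 1 - d / n_at_risk
            curve.append((t, s))
        n_at_risk -= d + c
        i += d + c
    return curve
```

For instance, with events at months 1, 2, and 4 and a censoring at month 3 in four patients, S drops to 0.75, then 0.5, then 0 (the month-3 censoring shrinks the risk set but causes no drop).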
7) Key prognostic variables are missing or only briefly acknowledged
You mention RAS/BRAF and MMR are unavailable in many cases. For Cancer, readers will expect at least:
- Proportion tested / missing, and
- A discussion of how this limitation may bias results.
Also consider reporting: bilobar distribution, margin width, anatomic vs parenchymal-sparing resections, use of ablation, repeat hepatectomy, and recurrence patterns—these can materially affect RFS/OS.
Response 7: We agree. In the revised manuscript, we now report the testing rate and missingness for RAS, BRAF, and MMR/MSI status (RAS: tested in n=85 (73.9%), missing in n=30 (26.1%); BRAF: tested in n=47 (40.9%), missing in n=68 (59.1%); MSI/MMR: tested in n=21 (18.3%), missing in n=94 (81.7%)). We also added a dedicated Discussion paragraph explaining how missing molecular data could bias results (including the possibility that missingness is not random) and how this limitation affects generalizability.
Thank you for this thoughtful suggestion. We agree that bilobar distribution and resection margin width are clinically important factors that can influence RFS/OS. Unfortunately, these variables were not available in a form that allowed reliable analysis in the current dataset; we have therefore added this point to the Discussion as an important limitation, noting the possibility of residual confounding and selection bias.
Regarding the extent of resection (anatomic vs parenchymal-sparing), the comparative oncologic benefit remains uncertain and is strongly influenced by case selection and tumor biology. Because this analysis was not prespecified and would require a more granular and standardized operative classification, we did not include it in the present models.
With respect to ablation and repeat hepatectomy, our cohort was intentionally restricted to patients undergoing first-time hepatectomy for initial liver metastases; therefore, cases treated with concomitant ablation or repeat hepatectomy were not included.
Finally, we agree that recurrence patterns may affect outcomes; however, the primary aim of this study was to develop a purely preoperative prognostic stratification framework, and we therefore did not incorporate post-treatment variables such as recurrence patterns. We have added this as a future direction, as the relationship between our preoperative criteria and subsequent recurrence patterns is of clear interest.
Edits in manuscript: Table 1 (molecular testing availability); Results (3.1. Subsection patient background); Discussion (limitations and bias)
Minor comments
- Use one term: “recurrence-free survival” vs “relapse-free survival” (currently mixed, including Figure 3 caption).
- Figure 4 panel (d) labels “CSS median” while the text says OS—likely a typo.
- Methods: “Patiend data” typo; “erthoxybenzyl-magnetic resonance imaging” should be gadoxetic acid–enhanced MRI (EOB-MRI).
- Define “technically resectable” precisely (criteria, who decided, and whether based on future liver remnant thresholds).
- State follow-up schedule and recurrence ascertainment method (imaging frequency; how extrahepatic recurrences counted).
- Confirm proportional hazards assumption testing.
Response: Thank you for these detailed editorial points. We have corrected and standardized the following:
- Unified terminology to “recurrence-free survival (RFS)” throughout the manuscript, tables, and figure captions.
- Corrected Figure 5(d) labeling to match the endpoint reported in the text (OS, not CSS).
- Corrected typographical errors (“Patiend data” → “Patient data”; “erthoxybenzyl-…” → gadoxetic acid–enhanced MRI (EOB-MRI)).
- Added an explicit definition of “technically resectable CRLM”, including the decision process and minimum future liver remnant thresholds, to Materials and Methods (2.1. Patient data).
- Detailed the follow-up schedule and recurrence ascertainment (imaging intervals; handling of extrahepatic recurrences) in Materials and Methods (2.1. Patient data).
- We thank the reviewer for pointing out this issue. We checked the proportional hazards (PH) assumption using the model fit with the RFS predictors, as shown below. Based on the global test and a visualization of how the coefficients vary with time, we conclude that the PH assumption is met for all multivariable covariates.
Variable | Chi-square | Degrees of freedom | p-value
ly (primary tumor) | 0.4291 | 1 | 0.51
CEA level | 0.0726 | 1 | 0.79
CA19-9 level | 0.4524 | 1 | 0.5
Largest tumor diameter | 1.2132 | 1 | 0.27
Number of CRLM | 0.8762 | 1 | 0.35
NLR | 0.0372 | 1 | 0.85
GLOBAL | 3.4908 | 6 | 0.75
Finally, I would strengthen the References by adding:
Igami T, Hayashi Y, Yokyama Y, Mori K, Ebata T. Development of real-time navigation system for laparoscopic hepatectomy using magnetic micro sensor. Minim Invasive Ther Allied Technol. 2024;33(3):129–139. doi:10.1080/13645706.2023.2301594.
It should be added to Methods 2.1, immediately after your sentence:
“…abdominal ultrasonography, computed tomography scan, and erthoxybenzyl-magnetic resonance imaging.”
perhaps adding the sentence “Beyond preoperative imaging, real-time navigation systems for laparoscopic hepatectomy are being developed to support intraoperative orientation and lesion targeting [NEW REF].”
then
Boretto L, Pelanis E, Regensburger A, Fretland ÅA, Edwin B, Elle OJ. Hybrid optical-vision tracking in laparoscopy: accuracy of navigation and ultrasound reconstruction. Minim Invasive Ther Allied Technol. 2024;33(3):176–183. doi:10.1080/13645706.2024.2313032.
It should be added to Methods 2.1, right after the sentence you just expanded with Igami et al.,
perhaps adding the sentence “Hybrid tracking approaches enabling more accurate laparoscopic ultrasound reconstruction represent another promising direction to improve intraoperative spatial understanding during minimally invasive liver surgery [NEW REF].”
and finally
Peng Z, Zhu ZR, He CY, Huang H. A meta-analysis: laparoscopic versus open liver resection for large hepatocellular carcinoma. Minim Invasive Ther Allied Technol. 2025;34(1):24–34. doi:10.1080/13645706.2024.2334762.
It should be added to the Discussion, in the paragraph where you contextualize surgical management/outcomes and future optimisation (a good anchor is right after you discuss the need to “refine” decision-making and outcomes),
perhaps adding the sentence “More broadly, the growing minimally invasive liver surgery literature—including recent meta-analytic evidence—suggests that laparoscopic liver resection can reduce length of stay and morbidity without clear compromise in oncologic surrogates in selected settings [NEW REF].”
Response: We appreciate these suggestions and have added the requested references. In Methods 2.1 (Patient data), immediately after the imaging description, we inserted two brief sentences to contextualize emerging intraoperative navigation approaches and hybrid tracking for laparoscopic ultrasound reconstruction, citing Igami et al. and Boretto et al. We also added the meta-analysis by Peng et al. to the Discussion when contextualizing evolving minimally invasive liver surgery evidence.
Author Response File:
Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
1. The study employed a multicenter retrospective design, but the sample size was only 115 cases, which may have impacted statistical power and the generalizability of the results. This limitation could be further discussed in the Discussion section. Although this study is a multicenter retrospective analysis, it only included patients from two Japanese medical institutions, potentially introducing regional and population selection bias. If feasible, expanding the sample sources or conducting prospective studies could further validate the model's generalizability.
2. The study did not incorporate important molecular markers such as RAS/BRAF status and mismatch repair status, which hold potential value in CRLM prognosis and treatment selection. Further discussion is warranted regarding their potential impact on the model's predictive capability. If the missing rate is low, the authors may consider incorporating it into multivariate analyses; if the missing rate is high, sensitivity analyses can be conducted to assess its potential impact on the results.
3. The proposed two-factor model does indeed simplify clinical assessment, but its robustness in settings with multiple confounding factors requires further clarification. Validation results across different subgroups could be added to the model.
4. In Table 3, the grouped data for indicators such as CEA exhibit inconsistencies (e.g., n values do not match between Table 1 and Table 3). The data should be verified and corrected.
5. The comparison between the Fong score and the Beppu nomogram was based solely on AIC values, without evaluating discriminatory performance (e.g., C-index) or clinical utility (e.g., decision curve analysis).
6. The study mentions that tumor size and number were measured via abdominal ultrasound, CT, and MRI. Is the consistency of measurements across the different imaging modalities good?
7. The newly constructed two-factor model lacks explicit scoring rules (e.g., whether each criterion met counts as one point). It is recommended to supplement specific scoring criteria and incorporate either internal validation (e.g., Bootstrap method) or external validation (e.g., inclusion of data from other centers).
Author Response
1. The study employed a multicenter retrospective design, but the sample size was only 115 cases, which may have impacted statistical power and the generalizability of the results. This limitation could be further discussed in the Discussion section. Although this study is a multicenter retrospective analysis, it only included patients from two Japanese medical institutions, potentially introducing regional and population selection bias. If feasible, expanding the sample sources or conducting prospective studies could further validate the model's generalizability.
Response 1: We agree that the cohort size (n = 115) and the inclusion of only two Japanese institutions limit statistical power and generalizability. We expanded the Discussion to address (i) potential selection bias, (ii) uncertainty around effect estimates, and (iii) the need for broader validation. We additionally clarify that the proposed score is intended as a pragmatic risk stratification tool in technically resectable CRLM and should be externally validated prior to clinical adoption. We outline a concrete multicenter prospective validation framework in the revised Discussion (see also Reviewer 3 response regarding trial framework).
Edit in manuscript: Discussion (limitations paragraph)
2. The study did not incorporate important molecular markers such as RAS/BRAF status and mismatch repair status, which hold potential value in CRLM prognosis and treatment selection. Further discussion is warranted regarding their potential impact on the model's predictive capability. If the missing rate is low, the authors may consider incorporating it into multivariate analyses; if the missing rate is high, sensitivity analyses can be conducted to assess its potential impact on the results.
Response 2: We agree. In the revised manuscript, we now report the testing rate and missingness for RAS, BRAF, and MMR/MSI status (RAS: tested in n=85 (73.9%), missing in n=30 (26.1%); BRAF: tested in n=47 (40.9%), missing in n=68 (59.1%); MSI/MMR: tested in n=21 (18.3%), missing in n=94 (81.7%)). We also added a dedicated Discussion paragraph explaining how missing molecular data could bias results (including the possibility that missingness is not random) and how this limitation affects generalizability.
Edits in manuscript: Table 1; Results (3.1. Subsection patient background); Discussion (limitations paragraph)
3. The proposed two-factor model does indeed simplify clinical assessment, but its robustness in settings with multiple confounding factors requires further clarification. Validation results across different subgroups could be added to the model.
Response 3: One of the main limitations of this study is its small sample size. More precise criteria could be developed by assessing additional cases across multiple institutions, and we aim to carry out a prospective study to verify the validity of this study's findings.
4. In Table 3, the grouped data for indicators such as CEA exhibit inconsistencies (e.g., n values do not match between Table 1 and Table 3). The data should be verified and corrected.
Response4:Thank you. We audited the source dataset and corrected Table 3 so that all strata counts match the eligible OS analysis set and remain consistent with Table 1 and the Supplementary Materials. We also added explicit missingness reporting to prevent future ambiguities.
Edit in manuscript: Table 1; Table 3; Table S3
5. The comparison between the Fong score and the Beppu nomogram was based solely on AIC values, without evaluating discriminatory performance (e.g., C-index) or clinical utility (e.g., decision curve analysis).
Response 5: Thank you for your important comment. We have reanalyzed the data using time-dependent ROC curve analysis and modified the description as follows:
Line 147. The models were evaluated by time-dependent ROC curve analysis at prespecified time points. All statistical analyses were performed using JMP Pro 18 software (SAS Institute, Cary, NC, USA), with the exception of the time-dependent ROC curve analysis, which was performed using the statistical programming language R (version 4.4.1, R Development Core Team). A P-value <0.05 was considered statistically significant.
Line 272. We calculated the area under the curve (AUC) to assess the predictive accuracy of our model for recurrence and OS, applying time-dependent ROC curve analysis at the 36- and 60-month points (Figure 4). The AUCs for recurrence in our model were 0.68 (95% CI: 0.58-0.77) at 36 months and 0.66 (95% CI: 0.55-0.77) at 60 months. For OS, the AUCs were 0.59 (95% CI: 0.45-0.72) at 36 months and 0.65 (95% CI: 0.54-0.76) at 60 months. For comparison, the Beppu nomogram yielded AUCs of 0.70 (p=0.683) at 36 months and 0.68 (p=0.766) at 60 months for recurrence, and Fong's clinical risk score yielded AUCs of 0.64 (p=0.430) at 36 months and 0.74 (p=0.074) at 60 months for OS.
Edit in manuscript: Abstract; Materials and Methods (2.3. Statistical analysis); Results (3.4. Novel prognostic criteria of CRLM after curative resection); Figure 4 (new addition); Discussion (comparison with the Beppu score and Fong's score paragraph)
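For intuition only, the time-dependent AUC reported above can be illustrated with a deliberately simplified cumulative/dynamic estimator at a fixed horizon. This sketch ignores censoring weighting (which the dedicated R estimators used in the manuscript account for), and all function and variable names are ours, not from the manuscript:

```python
def auc_at_time(risk, time, event, horizon):
    """Naive cumulative/dynamic AUC at a fixed horizon (months).

    Cases: patients with an observed event at or before `horizon`.
    Controls: patients still event-free beyond `horizon`.
    Censoring weights are deliberately omitted; this is a
    conceptual sketch, not the estimator used in the analysis.
    """
    cases = [r for r, t, e in zip(risk, time, event) if e and t <= horizon]
    controls = [r for r, t, e in zip(risk, time, event) if t > horizon]
    pairs = 0
    concordant = 0.0
    for c in cases:
        for k in controls:
            pairs += 1
            if c > k:
                concordant += 1.0   # case ranked above control
            elif c == k:
                concordant += 0.5   # ties count half
    return concordant / pairs if pairs else float("nan")

# Hypothetical data: perfectly separating risk scores give AUC 1.0 at 36 months
print(auc_at_time([0.9, 0.8, 0.2, 0.1], [10, 20, 50, 60], [1, 1, 0, 0], 36))  # → 1.0
```

An AUC of 0.5 corresponds to chance-level discrimination at that horizon, which is why values such as 0.59-0.74 above represent modest-to-fair separation.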
6. The study mentions that tumor size and number were measured via abdominal ultrasound, CT, and MRI. Was measurement consistency across the different imaging modalities adequate?
Response 6: We agree that measurement heterogeneity across imaging modalities can influence tumor number and size. We clarified in the Methods that CT and/or EOB-MRI served as the primary modalities for preoperative staging, with ultrasound used as an adjunct when applicable. Where multiple modalities were available, we specified the hierarchy used to define tumor number and maximal diameter (e.g., CT/MRI prioritized; largest recorded diameter used). We also added a limitation statement acknowledging potential inter-modality and inter-center variability and its likely direction of bias.
Edit in manuscript: Materials and Methods (2.1. Patient data); Discussion (limitations paragraph)
7. The newly constructed two-factor model lacks explicit scoring rules (e.g., whether each criterion met counts as one point). It is recommended to supplement specific scoring criteria and incorporate either internal validation (e.g., the bootstrap method) or external validation (e.g., inclusion of data from other centers).
Response 7: Thank you for this helpful comment, and we apologize for any ambiguity. The proposed "two-factor model" is not intended as an additive point-based score (e.g., 0-2 points). Rather, it is a binary classification rule: patients are classified as high risk if they meet either criterion (≥3 CRLMs, or, among patients with 1-2 CRLMs, a largest tumor diameter ≥5 cm); all other patients are classified as low risk. Because the second criterion applies only to patients with 1-2 lesions, the two criteria are mutually exclusive and a "two-criteria-met" category does not exist. We have revised the Methods/Results to state this decision rule explicitly. We emphasize that external validation in additional centers is required and propose a prospective validation plan.
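The decision rule described in this response can be expressed in a few lines of code. This is a sketch of the stated rule only; the function name and argument types are ours, not from the manuscript:

```python
def risk_group(num_crlm: int, max_diameter_cm: float) -> str:
    """Binary two-factor classification (not an additive score).

    High risk if >=3 CRLMs; otherwise (i.e., with 1-2 CRLMs),
    high risk if the largest tumor diameter is >=5 cm. All other
    patients are low risk, so the two criteria never co-occur.
    """
    if num_crlm >= 3:
        return "high"
    if max_diameter_cm >= 5.0:
        return "high"
    return "low"

print(risk_group(3, 1.0))   # → high (number criterion)
print(risk_group(2, 5.0))   # → high (diameter criterion, 1-2 lesions)
print(risk_group(2, 4.9))   # → low
```

Writing the rule this way makes explicit why a "two-criteria-met" category cannot arise: the diameter check is only reached when the number criterion has already failed.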
Edit in manuscript: Results (3.4. Novel prognostic criteria of CRLM after curative resection); Discussion (future studies paragraph)
Reviewer 3 Report
Comments and Suggestions for Authors
Major Revision Recommendation
The manuscript addresses an important clinical question; however, substantial revisions are required before it can be considered for publication. The retrospective design, limited sample size, and heterogeneity in chemotherapy regimens significantly weaken the robustness and generalizability of the proposed risk model. Key molecular variables (e.g., RAS/BRAF, MSI status) are omitted, despite their established prognostic relevance. Additionally, methodological justification for cutoff selection and external validation are insufficient, necessitating deeper statistical rigor and clearer clinical contextualization.
- Given the retrospective, multicenter design, how did the authors control for center-specific surgical expertise, perioperative management, and follow-up protocols that could bias survival outcomes?
- The study spans 2010–2021, a period with major advances in systemic therapy. How do the authors address treatment-era bias, particularly regarding chemotherapy regimens and targeted agents?
- How sensitive is the proposed cutoff of ≥3 metastases to minor changes in categorization (e.g., 2 vs. 3 lesions), especially given the relatively small subgroup sizes?
- The multivariate model includes variables with p < 0.10 in univariate analysis. How did the authors assess potential multicollinearity among tumor burden variables?
- Why was AIC chosen as the sole metric for model comparison, without complementary measures such as C-index or calibration plots?
- The preoperative chemotherapy group is highly heterogeneous. How do the authors justify pooling doublet, triplet, and “other” regimens in outcome analyses?
- Was response assessment centrally reviewed or independently validated to minimize RECIST interpretation bias?
- Given similar AIC values to established models, what is the incremental clinical value of replacing rather than complementing existing tools?
- How might differences in imaging resolution and diagnostic practices across centers affect the accuracy of tumor number and size measurements?
- The authors acknowledge missing molecular data. Can they estimate how many patients lacked RAS/BRAF/MMR status, and whether this missingness was random?
- The manuscript proposes prospective validation, yet no concrete trial framework is outlined. Can the authors clarify how this score would be prospectively tested?
Author Response
Major Revision Recommendation
The manuscript addresses an important clinical question; however, substantial revisions are required before it can be considered for publication. The retrospective design, limited sample size, and heterogeneity in chemotherapy regimens significantly weaken the robustness and generalizability of the proposed risk model. Key molecular variables (e.g., RAS/BRAF, MSI status) are omitted, despite their established prognostic relevance. Additionally, methodological justification for cutoff selection and external validation are insufficient, necessitating deeper statistical rigor and clearer clinical contextualization.
1. Given the retrospective, multicenter design, how did the authors control for center-specific surgical expertise, perioperative management, and follow-up protocols that could bias survival outcomes?
Response 1: Thank you for raising this important concern. We sought to minimize center-related bias by restricting the study to two hospitals within the Yamaguchi University network, where hepatobiliary surgeons share similar training and where perioperative care and follow-up practices generally adhere to common institutional standards. Accordingly, we anticipate that inter-center differences in surgical expertise, perioperative management, and surveillance were limited; nevertheless, residual center effects cannot be entirely excluded and are acknowledged as a limitation.
We added the detailed follow-up schedule and recurrence ascertainment (imaging intervals; handling of extrahepatic recurrences) to Materials and Methods (2.1. Patient data).
2. The study spans 2010–2021, a period with major advances in systemic therapy. How do the authors address treatment-era bias, particularly regarding chemotherapy regimens and targeted agents?
Response 2: We agree that systemic therapy evolved substantially during 2010-2021. We therefore included calendar period (2010-2015 vs 2016-2021) as a covariate in the Results and performed sensitivity analyses restricted to later years, when modern regimens and targeted agents were more prevalent. We also expanded the Discussion to acknowledge residual era-related confounding that cannot be fully addressed in a retrospective cohort.
Edit in manuscript: Results (3.1. Subsection patient background; chemotherapy paragraph); Discussion (limitations paragraph)
3. How sensitive is the proposed cutoff of ≥3 metastases to minor changes in categorization (e.g., 2 vs. 3 lesions), especially given the relatively small subgroup sizes?
Response 3: We agree and performed a threshold sensitivity analysis examining alternative categorizations of CRLM number (1-2 vs 3). Using Kaplan-Meier curves with log-rank testing to compare RFS and OS, no significant difference was observed between the two groups: p=0.127 (HR 1.77, 95% CI 0.83-3.77) for RFS and p=0.124 (HR 2.12, 95% CI 0.79-5.67) for OS. However, because three CRLMs alone were associated with poorer recurrence and survival, we consider a cutoff of ≥3 metastases reasonable.
4. The multivariate model includes variables with p < 0.10 in univariate analysis. How did the authors assess potential multicollinearity among tumor burden variables?
Response 4: Thank you for your comment. We assessed multicollinearity in the final model using the variance inflation factor (VIF); all values were below 5, so the predictors were considered statistically acceptable.
5. Why was AIC chosen as the sole metric for model comparison, without complementary measures such as C-index or calibration plots?
Response 5: Thank you for your important comment. We have reanalyzed the data using time-dependent ROC curve analysis and modified the description as follows:
Line 147. The models were evaluated by time-dependent ROC curve analysis at prespecified time points. All statistical analyses were performed using JMP Pro 18 software (SAS Institute, Cary, NC, USA), with the exception of the time-dependent ROC curve analysis, which was performed using the statistical programming language R (version 4.4.1, R Development Core Team). A P-value <0.05 was considered statistically significant.
Line 272. We calculated the area under the curve (AUC) to assess the predictive accuracy of our model for recurrence and OS, applying time-dependent ROC curve analysis at the 36- and 60-month points (Figure 4). The AUCs for recurrence in our model were 0.68 (95% CI: 0.58-0.77) at 36 months and 0.66 (95% CI: 0.55-0.77) at 60 months. For OS, the AUCs were 0.59 (95% CI: 0.45-0.72) at 36 months and 0.65 (95% CI: 0.54-0.76) at 60 months. For comparison, the Beppu nomogram yielded AUCs of 0.70 (p=0.683) at 36 months and 0.68 (p=0.766) at 60 months for recurrence, and Fong's clinical risk score yielded AUCs of 0.64 (p=0.430) at 36 months and 0.74 (p=0.074) at 60 months for OS.
Edit in manuscript: Abstract; Materials and Methods (2.3. Statistical analysis); Results (3.4. Novel prognostic criteria of CRLM after curative resection); Figure 4 (new addition); Discussion (comparison with the Beppu score and Fong's score paragraph)
6. The preoperative chemotherapy group is highly heterogeneous. How do the authors justify pooling doublet, triplet, and “other” regimens in outcome analyses?
Response 6: We appreciate the reviewer's concern regarding regimen heterogeneity. Preoperative chemotherapy was given to 72 patients (62.6%), most of whom received doublet therapy (56/72, 77.8%); only 7 patients received FOLFOXIRI (triplet; targeted agents in 3) and 9 received other regimens. Because these non-doublet strata are small, regimen-specific survival analyses would be underpowered and yield unstable estimates, and regimen choice is also subject to confounding by indication. Therefore, in line with our study aim, we analyzed preoperative chemotherapy as a binary exposure and reported the detailed regimen distribution descriptively.
Regarding triplet therapy, the neoadjuvant survival advantage of FOLFOXIRI over standard doublet therapy in resectable CRLM has not been established, and given the small number of patients treated with FOLFOXIRI, any impact on our survival estimates is likely limited. Nonetheless, the efficacy of triplet therapy in this context warrants further study. We have revised the manuscript accordingly and note regimen heterogeneity as a limitation.
Edit in manuscript: Discussion (limitations paragraph)
7. Was response assessment centrally reviewed or independently validated to minimize RECIST interpretation bias?
Response 7: We agree. We clarified how radiologic response was assessed, including the imaging modality and timing relative to surgery, and noted that assessments were performed at a conference involving hepatobiliary and pancreatic surgeons from each institution. Because both facilities in this study are affiliated with Yamaguchi University, we consider the risk of inter-institutional RECIST interpretation bias to be small.
Edit in manuscript: Materials and Methods (2.1. Patient data, therapeutic effects paragraph); Discussion (limitations paragraph)
8. Given similar AIC values to established models, what is the incremental clinical value of replacing rather than complementing existing tools?
Response 8: We agree this point requires clearer articulation. In the revised Discussion, we emphasize that our aim is not to replace established tools, but to provide a parsimonious, bedside-usable stratifier for technically resectable CRLM using variables routinely available preoperatively. We also discuss practical scenarios where a two-factor tool may be advantageous (rapid counseling, triage for neoadjuvant consideration, and communication in multidisciplinary settings).
Edit in manuscript: Discussion (our model vs existing models (Fong and Beppu scores) paragraph)
9. How might differences in imaging resolution and diagnostic practices across centers affect the accuracy of tumor number and size measurements?
Response 9: We agree. We added a Methods description of the imaging protocols and modality hierarchy, along with a limitation acknowledging that differences in imaging quality and interpretation may introduce measurement error in tumor number/size. However, because both facilities in this study are affiliated with Yamaguchi University, we consider that there are no significant differences in imaging protocols or in CRLM assessment, including number/size.
Edit in manuscript: Discussion (limitations paragraph)
10. The authors acknowledge missing molecular data. Can they estimate how many patients lacked RAS/BRAF/MMR status, and whether this missingness was random?
Response 10: We agree. In the revised manuscript, we now report the testing rate and missingness for RAS, BRAF, and MMR/MSI status (RAS status tested in n=85 (73.9%), missing in n=30 (26.1%); BRAF status tested in n=47 (40.9%), missing in n=68 (59.1%); MSI/MMR status tested in n=21 (18.3%), missing in n=94 (81.7%)). We also added a dedicated Discussion paragraph explaining how missing molecular data could bias the results (including the possibility that missingness is not random) and how this limitation affects generalizability.
Edit in manuscript: Results (3.1. Subsection patient background); Discussion (limitations paragraph)
11. The manuscript proposes prospective validation, yet no concrete trial framework is outlined. Can the authors clarify how this score would be prospectively tested?
Response 11: We agree and have added a concrete prospective validation plan. Briefly, we propose a multicenter prospective observational cohort enrolling consecutive patients with technically resectable CRLM scheduled for hepatectomy (with or without planned neoadjuvant chemotherapy). The study would prespecify the two-factor score, the endpoints (RFS and OS), follow-up imaging intervals, and centralized (or standardized) radiology response assessment. Model performance would be evaluated using discrimination, calibration, and decision-curve analysis at prespecified time points.
Edit in manuscript: Discussion (limitations paragraph)
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
This version is fine.
Reviewer 3 Report
Comments and Suggestions for Authors
Accept in present form.

