Peer-Review Record

Development and Internal Validation of a Machine Learning-Based Colorectal Cancer Risk Prediction Model

Gastrointest. Disord. 2025, 7(2), 26; https://doi.org/10.3390/gidisord7020026
by Deborah Jael Herrera 1,†, Daiane Maria Seibert 2,†, Karen Feyen 2, Marlon van Loo 3, Guido Van Hal 1,*,‡ and Wessel van de Veerdonk 1,3,‡
Reviewer 1:
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 24 January 2025 / Revised: 12 March 2025 / Accepted: 14 March 2025 / Published: 24 March 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The authors have written an interesting manuscript presenting a machine learning-based model for the prediction of colorectal cancer risk based on common lifestyle parameters that are easy to obtain and modifiable through medical counseling. The study utilized secondary data from a previous large randomized controlled cancer screening trial.

The study is well designed, and the manuscript is well presented. The objectives are clear, the methods are adequately described, and the results are clearly presented. The conclusions are in line with the results obtained.

Only one question: the authors may indicate how a primary care physician can use the predictive model in clinical practice. The authors identified three risk factors and one protective factor. Can the model indicate the individual cancer risk a patient has according to their particular lifestyle habits? Does the model provide a quantification of that risk?

Author Response

Reviewer 1:

Comment 1: The authors have written an interesting manuscript presenting a machine learning-based model for the prediction of colorectal cancer risk based on common lifestyle parameters that are easy to obtain and modifiable through medical counseling. The study utilized secondary data from a previous large randomized controlled cancer screening trial. The study is well designed, and the manuscript is well presented. The objectives are clear, the methods are adequately described, and the results are clearly presented. The conclusions are in line with the results obtained.

 

Only one question: the authors may indicate how a primary care physician can use the predictive model in clinical practice. The authors identified three risk factors and one protective factor. Can the model indicate the individual cancer risk a patient has according to their particular lifestyle habits? Does the model provide a quantification of that risk?

 

Response:

We appreciate the reviewer’s request for clarification on how primary care physicians can use our predictive model in clinical practice and whether the model provides individualized risk quantification based on a patient’s lifestyle habits. In response, we have expanded the Results (3.6. Clinical Applicability of the Model in Practice) and Discussion (4.5. How Primary Care Physicians Can Use the Model) sections to provide a more detailed explanation of the model’s clinical application.

  1. Integration into an Interactive Risk Estimator
    1. To enhance usability in primary care, we integrated our CRC risk prediction model into an interactive risk estimator that dynamically adjusts based on patient responses. This tool enables real-time risk assessment during consultations, making it feasible for routine use. We have provided a reference link to the tool (https://bibopp-acc.vito.be/orient) in the manuscript to allow readers to explore its functionality.
  2. Individualized Risk Quantification
    1. The model generates a personalized risk score by assessing modifiable and non-modifiable risk factors, including smoking, alcohol consumption, BMI, hypertension, diabetes, and age. Using SHAP analysis, it identifies the most influential factors contributing to an individual's risk, providing explainability for clinical decision-making.
    2. The model categorizes individuals into average, increased, or high-risk groups and presents the risk score using a color-coded gauge to enhance interpretability for both clinicians and patients.
  3. Clinical Decision Support & Personalized Recommendations
    1. If a family history of CRC is reported, the tool automatically recommends colonoscopy, following established guidelines.
    2. For individuals without a family history, the model continues evaluating modifiable risk factors, allowing physicians to discuss personalized prevention strategies.
    3. The tool identifies underlying modifiable and non-modifiable risk factors, as well as protective factors, for risk prediction. These factors can be discussed during consultations with a nurse or GP, supporting informed conversations and adherence, and making the tool an effective support system for shared decision-making in primary care.
  4. Threshold Selection for Clinical Prioritization
    1. The model uses a flexible risk threshold (0.0082) to balance sensitivity (74.7%) and specificity (60.7%), ensuring that high-risk patients are identified while minimizing unnecessary follow-ups. This threshold can be adjusted based on clinical priorities or resource availability.
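
As a hedged illustration of how such a flexible threshold could be applied in a risk estimator, the short Python sketch below maps a predicted probability to the average/increased/high categories described above. The 0.0082 value is the manuscript's threshold; the high-risk cutoff and function name are hypothetical and are not the authors' implementation.

```python
# Minimal sketch (not the authors' implementation) of mapping a predicted
# CRC probability to the risk categories described above. The 0.0082
# decision threshold comes from the manuscript; the high-risk cutoff of
# 0.02 is a hypothetical value used here purely for illustration.

def categorize_risk(probability: float,
                    flag_threshold: float = 0.0082,
                    high_threshold: float = 0.02) -> str:
    """Return an 'average', 'increased', or 'high' risk label."""
    if probability < flag_threshold:
        return "average"
    if probability < high_threshold:
        return "increased"
    return "high"

print(categorize_risk(0.005))  # average
print(categorize_risk(0.012))  # increased
print(categorize_risk(0.050))  # high
```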

Reviewer 2 Report

Comments and Suggestions for Authors

I read with interest the manuscript by Herrera et al., titled 'Development and Internal Validation of a Machine Learning-Based Colorectal Cancer Risk Prediction Model'. I have the following comments:

- Please review the abstract to ensure consistency in abbreviations and extended terms.
- Overall, the abstract is difficult to read. I recommend a careful revision to improve clarity while maintaining scientific rigor.
- The aim of the study is unclear. Please refine it to convey a concrete, practical message, which will also enhance the manuscript's overall interpretation. While the results have significant potential, presenting a complex prediction model with the goal of "facilitating meaningful discussion" seems less impactful from a scientific perspective.
- A key discussion point is the use of a database in which the population is now 25 years older (or more) than at the initial screening. While this provides a longer follow-up, colorectal cancer incidence has changed dramatically, particularly with the rise in early-onset cases, which are more likely driven by environmental factors. I suggest the authors expand this point in the discussion, including the potential limitations of using this database.
- Although the issue of missing data is addressed, it remains a major limitation of the analysis. A 30% missing data cut-off may be too broad to ensure statistical significance. Additionally, I have concerns regarding the 489 patients with missing age data. Given the critical importance of age as a baseline variable, I suggest excluding these patients from the analysis.
- The discussion needs to be expanded. The topic addressed is broad, yet the evidence presented for comparison is limited. I encourage the authors to elaborate further, particularly on risk factors and the practical utility of predictive models like the one proposed.

Author Response

Reviewer 2:

I read with interest the manuscript by Herrera et al., titled 'Development and Internal Validation of a Machine Learning-Based Colorectal Cancer Risk Prediction Model'. I have the following comments:

  1. Please review the abstract to ensure consistency in abbreviations and extended terms.

Response:

All abbreviations and extended terms have been reviewed and revised for consistency.

  2. Overall, the abstract is difficult to read. I recommend careful revision to improve clarity while maintaining scientific rigor.

Response:

We have carefully revised the abstract to enhance clarity while maintaining scientific rigor. Specifically, we have changed the sentence structures, ensured consistency in terminology and abbreviations, and clearly articulated the study’s aim, methodology, and key findings.

  3. The aim of the study is unclear. Please refine it to convey a concrete, practical message, which will also enhance the manuscript's overall interpretation. While the results have significant potential, presenting a complex prediction model with the goal of "facilitating meaningful discussion" seems less impactful from a scientific perspective.

Response:

We agree that the study’s aim should be presented with greater clarity and a stronger practical focus. To address this, we have refined the abstract and manuscript to explicitly state that our goal was to develop and internally validate a CRC risk prediction model based on health and lifestyle factors, which has been integrated into a risk estimator for clinical use. Instead of simply facilitating discussions, this tool categorizes individuals as average, increased, or high risk, highlighting modifiable risk factors to support informed decision-making and personalized lifestyle modifications. Additionally, we have strengthened the conclusion to emphasize the need for external validation to confirm the model’s applicability across diverse populations and its effectiveness in real-world healthcare settings.

  4. A key discussion point is the use of a database in which the population is now 25 years older (or more) than at the initial screening. While this provides a longer follow-up, colorectal cancer incidence has changed dramatically, particularly with the rise in early-onset cases, which are more likely driven by environmental factors. I suggest the authors expand this point in the discussion, including the potential limitations of using this database.

Response:

We have expanded Section 4.6 (Temporal Limitations) to address the implications of using the PLCO dataset, particularly the limitations posed by the evolving epidemiology of CRC. Specifically, we:

  • Acknowledged the role of environmental and lifestyle factors, such as ultra-processed food consumption, obesity, sedentary behavior, and gut microbiome changes, in shaping CRC risk profiles. However, early-onset CRC was not the focus of our study, as the ORIENT age range was limited to 50 years and above.
  • Highlighted that these emerging risk factors were not adequately captured in the PLCO dataset, potentially affecting the generalizability of our model for younger individuals at risk of CRC.
  • Compared our dataset to newer CRC risk models, which incorporate dietary trends and metabolic risk factors that have become more relevant in recent years.
  • Discussed the continued relevance of traditional lifestyle factors (smoking, diet, family history) in CRC risk prediction while acknowledging the need for models to evolve to include newly emerging risk determinants.

 

  5. Although the issue of missing data is addressed, it remains a major limitation of the analysis. A 30% missing data cut-off may be too broad to ensure statistical significance. Additionally, I have concerns regarding the 489 patients with missing age data. Given the critical importance of age as a baseline variable, I suggest excluding these patients from the analysis.

 

Response:

We have strengthened our methodology by providing a detailed explanation of the mode imputation method for categorical variables and k-nearest neighbors (KNN) imputation for numerical variables (see Section 2.3.3: Handling Missing Data). These methods ensure that missing values are replaced with the most statistically appropriate values based on similar patient characteristics, thus maintaining the robustness of our dataset while avoiding excessive case exclusions.

However, we acknowledge the critical importance of age as a baseline variable in CRC risk prediction. As a result, we have now excluded the 489 patients with missing age data to maintain model accuracy and interpretability. This update is reflected in the Participant Selection section (see Section 3.1: Participant Selection and Figure 2: Inclusion Criteria Flowchart), ensuring that only individuals with complete and reliable data are included in model development.

Additionally, we have performed data restriction based on age to align with CRC screening age recommendations (see Section 2.3.4: Data Restriction by Age).

 

  6. The discussion needs to be expanded. The topic addressed is broad, yet the evidence presented for comparison is limited. I encourage the authors to elaborate further, particularly on risk factors and the practical utility of predictive models like the one proposed.

Response:

We have significantly elaborated on these aspects in the Discussion section (Sections 4.3–4.7).

We have added an in-depth interpretation of the most significant predictors identified by SHAP analysis, such as age, weight, and smoking status, and their relationship with CRC risk (see Section 4.4: Interpretability and Feature Importance). Additionally, we discuss the protective effect of heart medications, exploring the existing literature and potential mechanisms that could explain this observation.
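
For readers unfamiliar with SHAP-based feature ranking, the following minimal Python sketch shows the general pattern on a synthetic LightGBM classifier; it is illustrative only and is not the authors' code or data.

```python
# Illustrative SHAP feature-ranking sketch on synthetic data (not the
# authors' code or dataset).
import lightgbm as lgb
import numpy as np
import shap

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                       # 4 toy features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 1).astype(int)

model = lgb.LGBMClassifier(n_estimators=50).fit(X, y)

# TreeExplainer computes exact SHAP values for tree ensembles like LightGBM.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
if isinstance(shap_values, list):                   # some shap versions return
    shap_values = shap_values[1]                    # one array per class

# Global importance: mean absolute SHAP value per feature.
importance = np.abs(shap_values).mean(axis=0)
print(np.argsort(importance)[::-1])                 # most influential first
```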

To highlight our model’s added value, we have included a comparison with widely used CRC risk prediction tools, such as QCancer, the NHS Bowel Cancer Screening Model, the APCS Score, Kaminski’s Risk Score, and the NCI-CRC Risk Assessment Tool (see Section 4.2: Comparison with Conventional Risk Models). This discussion clarifies how our model addresses key limitations in existing models, including interpretability, feasibility in primary care, and adaptability to different populations.

We have expanded Section 4.6: How Primary Care Physicians Can Use the Model, providing a step-by-step explanation of how the risk estimator supports real-time risk assessment, possible lifestyle counseling, and referral decisions. The model’s ability to generate personalized, color-coded risk scores makes it easier for clinicians and patients to interpret their risk level and engage in shared decision-making.

We have further elaborated on the temporal limitations of using the PLCO dataset (1993–2001) and how changing CRC epidemiology, lifestyle risk factors, and screening practices may affect the generalizability of our model (see Section 4.6.2: Evolving Screening Practices and Their Impact on Risk Prediction). Additionally, we highlight the need for external validation using more recent population-based datasets to confirm our model’s real-world applicability.

Reviewer 3 Report

Comments and Suggestions for Authors

Thank you to the Editor for the opportunity to review this manuscript, and thank you as well to the authors for their thoughtful work on an important topic in colorectal cancer risk stratification. Below, I have provided a series of comments and recommendations that I hope will aid in further strengthening the manuscript and clarifying key points for readers.

  1. Introduction and background
    • The Introduction provides a solid rationale for individualized CRC risk prediction over simple age‐based approaches. You may wish to highlight any notable differences between your approach and other commonly cited CRC risk models (e.g., QCancer, NHS, or other widely validated algorithms) to underscore your model’s unique contributions.
  2. Methods
    • Data Imputation: The paper outlines a median‐based imputation strategy using similar patients and correlated factors. While the rationale is understandable, please clarify more explicitly how “similar” patients are chosen (e.g., are you using a nearest‐neighbors approach with a certain distance metric, or is it purely based on a few categorical/demographic variables?). This is a key point, as imputation can significantly influence downstream model performance.
    • Feature Selection: You explain that multicollinearity was addressed by removing highly correlated features. Consider providing more detail on thresholds used (e.g., Pearson’s r > 0.80 or VIF > 10) and any iterative approach taken to eliminate redundant features.
    • Threshold Choice: The threshold of 0.007 for classifying risk is explained primarily in terms of achieving a balance between sensitivity and specificity. You might strengthen the discussion of how this balance was determined—for example, were there formal cost‐benefit tradeoffs or clinical preferences that guided your threshold?
    • Model Comparisons: Although you mention testing different algorithms, the paper would benefit from a succinct table or figure comparing the performance (e.g., AUROC, sensitivity, specificity) of each ML algorithm you considered (Random Forest, Logistic Regression, etc.) before selecting LightGBM.
    • Figure Order: It appears that Figure 2 is introduced in the text before Figure 1, which disrupts the expected flow of the manuscript. It would be helpful to revise the manuscript so that each figure is introduced in the sequence corresponding to its numbering (i.e., Figure 1 should appear before Figure 2) to maintain clarity for readers.
  3. Results
    • The results are generally well presented, and the flowchart (Figure 1) is helpful.
    • The performance metrics—AUROC of 0.73, along with sensitivity and specificity tradeoffs—are clearly stated. You might consider reporting additional performance measures (e.g., PPV, NPV) for further clinical interpretability.
  4. Discussion
    • The Discussion appropriately places your findings within the broader literature, especially the notion that modifiable factors can be highlighted for primary prevention.
    • One point that could be elaborated is the model’s performance vis-à-vis more conventional risk tools or even simpler logistic regression–based approaches. Showing that your model provides added value (e.g., in reclassification indices) would further strengthen the paper.
    • The observation that certain heart medications may confer a protective effect is noteworthy. However, given the mixed evidence on this topic, as you mention, emphasizing the exploratory nature of this finding might help mitigate any potential overinterpretation.
  5. Limitations
    • You rightly note the potential issues arising from older data (from 1993–2001). Explicitly discussing how shifts in lifestyle or screening practices over the last two decades might affect external validity would be valuable. While you mention the need for external validation, you could also outline possible next steps: for example, prospective validation in a contemporary cohort with updated risk factor definitions.

 

Overall, the manuscript is well written and the English language usage is clear. The text flows logically, and the figures/tables are helpful in illustrating the methods and results.

 

Author Response

Reviewer 3:

 

Thank you to the Editor for the opportunity to review this manuscript, and thank you as well to the authors for their thoughtful work on an important topic in colorectal cancer risk stratification. Below, I have provided a series of comments and recommendations that I hope will aid in further strengthening the manuscript and clarifying key points for readers.

 

  1. Introduction and background
    • The Introduction provides a solid rationale for individualized CRC risk prediction over simple age‐based approaches. You may wish to highlight any notable differences between your approach and other commonly cited CRC risk models (e.g., QCancer, NHS, or other widely validated algorithms) to underscore your model’s unique contributions.

Response:

We appreciate the reviewer’s suggestion to better differentiate our model from existing CRC risk tools. In response, we have refined the Introduction to present key established models and their limitations. Instead of directly comparing them to our approach in the Introduction, we now provide a clear discussion of their shortcomings and how our model addresses these gaps in the Discussion section.

 

 

  2. Methods
    • Data Imputation: The paper outlines a median‐based imputation strategy using similar patients and correlated factors. While the rationale is understandable, please clarify more explicitly how “similar” patients are chosen (e.g., are you using a nearest‐neighbors approach with a certain distance metric, or is it purely based on a few categorical/demographic variables?). This is a key point, as imputation can significantly influence downstream model performance.

Response:

We have expanded Section 2.3.3: Handling Missing Data to explicitly describe how missing values were addressed.

For categorical variables (e.g., smoking status, hypertension, diabetes), we applied an imputation approach similar to hot-deck imputation, where missing values were replaced using a subgroup of patients with at least four shared characteristics (age, sex, BMI, smoking quantity, and alcohol-related factors). This method, which resembles a k-nearest neighbors approach, ensures that imputed values reflect the most probable category based on clinically relevant similarities.

For numerical variables (e.g., weight, height, BMI), we used the k-nearest neighbors (KNN) imputation method with 10 nearest neighbors, identified based on Euclidean distance between participants with complete data. This approach allowed us to use the most similar cases in terms of demographic and clinical attributes to predict missing numerical values while preserving variability within the dataset.
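
A minimal sketch of these two strategies, assuming a pandas DataFrame with hypothetical column names (this is not the authors' code):

```python
# Sketch of mode imputation for a categorical variable and KNN imputation
# for numerical variables. Column names and values are hypothetical.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "smoking_status": ["never", "current", None, "former", "current"],
    "weight_kg": [70.0, np.nan, 85.0, 60.0, 92.0],
    "height_cm": [172.0, 180.0, np.nan, 165.0, 178.0],
})

# Categorical: fill with the modal value. The paper restricts this to a
# subgroup sharing at least four characteristics (age, sex, BMI, etc.);
# the whole sample is used here only for brevity.
df["smoking_status"] = df["smoking_status"].fillna(df["smoking_status"].mode()[0])

# Numerical: KNN imputation; sklearn's default metric is a NaN-aware
# Euclidean distance. The paper uses 10 neighbors; 2 suits this toy data.
numeric_cols = ["weight_kg", "height_cm"]
df[numeric_cols] = KNNImputer(n_neighbors=2).fit_transform(df[numeric_cols])
print(df)
```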

 

    • Feature Selection: You explain that multicollinearity was addressed by removing highly correlated features. Consider providing more detail on thresholds used (e.g., Pearson’s r > 0.80 or VIF > 10) and any iterative approach taken to eliminate redundant features.

 

Response:

To clarify our feature selection process, we have updated Section 2.4: Model Development to explicitly state the thresholds and techniques used:

We excluded factors with a Pearson correlation of ≥ 0.85 between features and factors with a correlation of ≤ 0.001 with the target (CRC outcomes). This initial selection process reduced the number of features to 98. SHAP was then used iteratively to identify the most informative features by training models with random factors, assessing their contributions, and prioritizing those supported by the existing literature, ultimately narrowing the selection to 12 features.
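
A minimal sketch of this correlation-based pre-filtering step, using synthetic data and hypothetical feature names (not the authors' code); the subsequent SHAP narrowing is omitted here:

```python
# Correlation-based pre-filtering sketch: drop one feature from each pair
# with |r| >= 0.85, then drop features with |r| <= 0.001 against the target.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(300, 5)),
                 columns=[f"f{i}" for i in range(5)])
X["f4"] = X["f0"] * 0.99 + rng.normal(scale=0.05, size=300)  # near-duplicate of f0
y = (X["f1"] > 0.5).astype(int)                              # toy binary outcome

# Upper triangle of the absolute correlation matrix (avoids double-counting).
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
collinear = [c for c in upper.columns if (upper[c] >= 0.85).any()]

# Features with essentially no association with the outcome.
uninformative = [c for c in X.columns
                 if c not in collinear and abs(X[c].corr(y)) <= 0.001]

X_selected = X.drop(columns=collinear + uninformative)
print("dropped:", collinear + uninformative)
```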

    • Threshold Choice: The threshold of 0.007 for classifying risk is explained primarily in terms of achieving a balance between sensitivity and specificity. You might strengthen the discussion of how this balance was determined—for example, were there formal cost‐benefit tradeoffs or clinical preferences that guided your threshold?

 

Response:

We have expanded the discussion to clarify the rationale behind choosing 0.0082 as the classification threshold. Specifically, we now explain that our decision was driven by a balance between sensitivity and specificity, prioritizing sensitivity to minimize missed CRC cases, given the significant consequences of undiagnosed cancer. We also discuss how alternative thresholds can be adapted to different clinical contexts, such as using a lower threshold for high-risk populations to improve early detection or a stricter threshold in resource-constrained settings to reduce unnecessary follow-ups (see discussion section 4.3 Risk Stratification and Threshold Selection).

 

While we acknowledge the importance of formal cost-benefit tradeoff analyses in threshold determination, our study did not conduct such an analysis due to the absence of real-world cost data specific to CRC screening workflows in different healthcare settings. We instead relied on clinical priorities and public health literature to inform our selection. However, we recognize that future work incorporating economic modeling could further refine the threshold's cost-benefit balance across different healthcare systems.
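
As an illustration of how a threshold can be read off the ROC curve for a target sensitivity, the sketch below uses synthetic, imbalanced data; the printed numbers are illustrative and are not the manuscript's.

```python
# Sketch of choosing a decision threshold for a target sensitivity from
# the ROC curve. Data are synthetic and imbalanced (~1:89), so the printed
# threshold is illustrative, not the manuscript's 0.0082.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)
y_true = (rng.random(10_000) < 1 / 90).astype(int)
y_score = np.clip(rng.normal(0.010, 0.005, 10_000) + 0.010 * y_true, 0, 1)

fpr, tpr, thresholds = roc_curve(y_true, y_score)

target_sensitivity = 0.75
idx = int(np.argmax(tpr >= target_sensitivity))  # first index meeting the target
print(f"threshold={thresholds[idx]:.4f}  "
      f"sensitivity={tpr[idx]:.3f}  specificity={1 - fpr[idx]:.3f}")
```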

 

    • Model Comparisons: Although you mention testing different algorithms, the paper would benefit from a succinct table or figure comparing the performance (e.g., AUROC, sensitivity, specificity) of each ML algorithm you considered (Random Forest, Logistic Regression, etc.) before selecting LightGBM.

 

Response:

We have added Table 3 (Results, Section 3.3: Model Development and Performance), which summarizes the performance of the most promising models, including Neural Network (NN), Random Forest (RF), XGBoost, and LightGBM (LGBM). This table presents key performance metrics, including true negatives (TN), false positives (FP), false negatives (FN), true positives (TP), accuracy, sensitivity, and specificity, to help readers understand the strengths and limitations of each model. Additionally, we have updated the discussion on model selection, highlighting that LightGBM was chosen because it achieved the highest sensitivity (0.747), making it better suited for early CRC detection, where reducing false negatives is a priority.
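
The sketch below shows how these metrics follow from a confusion matrix; the toy predictions are made up and do not reproduce Table 3.

```python
# Computing TN, FP, FN, TP, accuracy, sensitivity, and specificity from
# predictions. Toy labels only; these do not reproduce Table 3.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([0, 1, 0, 0, 1, 0, 0, 1, 0, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # recall on the positive (CRC) class
specificity = tn / (tn + fp)
print(tn, fp, fn, tp, round(accuracy, 2),
      round(sensitivity, 2), round(specificity, 2))
```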

 

    • Figure Order: It appears that Figure 2 is introduced in the text before Figure 1, which disrupts the expected flow of the manuscript. It would be helpful to revise the manuscript so that each figure is introduced in the sequence corresponding to its numbering (i.e., Figure 1 should appear before Figure 2) to maintain clarity for readers.

 

Response:

We have revised the manuscript to ensure that Figure 1 is introduced before Figure 2, maintaining a logical flow for readers. We have also added tables and figures to present new results (e.g., in Section 3.3.1: Model Selection and Evaluation and Section 3.6: Clinical Applicability of the Model).

 

  3. Results

The results are generally well presented, and the flowchart (Figure 1) is helpful. The performance metrics—AUROC of 0.73, along with sensitivity and specificity tradeoffs—are clearly stated. You might consider reporting additional performance measures (e.g., PPV, NPV) for further clinical interpretability. Overall, the manuscript is well written and the English language usage is clear. The text flows logically, and the figures/tables are helpful in illustrating the methods and results.

Response:

We appreciate the reviewer’s suggestion to include additional performance measures for further clinical interpretability. In response, we have expanded our results section to report the Positive Predictive Value (PPV) and Negative Predictive Value (NPV) in addition to AUROC, sensitivity, and specificity.

  • Specifically, we now state that the LightGBM model achieved an NPV of 0.996 and a PPV of 0.017 (Results, Section 3.3: Model Development and Performance). While the NPV is high, indicating that the model is effective at identifying true negative cases, the PPV is relatively low due to the highly imbalanced nature of our dataset, where positive cases are much less frequent than negative cases (1:89 ratio).
  • We also clarify in the discussion that, in screening settings, NPV is often the more relevant metric, as ensuring that low-risk individuals are correctly identified can reduce unnecessary follow-ups and patient anxiety. However, given the low PPV, we acknowledge the need for further calibration or external validation in populations with different CRC prevalence rates.
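
To make the arithmetic behind these values concrete, the small sketch below uses hypothetical counts scaled to roughly a 1:89 class ratio; it shows why PPV stays low under heavy imbalance even when sensitivity and specificity are reasonable.

```python
# PPV/NPV under heavy class imbalance. Counts are hypothetical, scaled to
# roughly 100 positives vs. 8,900 negatives (~1:89), with sensitivity 0.75
# and specificity ~0.61 near the manuscript's operating point.
tp, fn = 75, 25        # positives: 75 caught, 25 missed
tn, fp = 5400, 3500    # negatives: specificity = 5400 / 8900 ≈ 0.61

ppv = tp / (tp + fp)   # P(CRC | flagged)           -> ~0.021
npv = tn / (tn + fn)   # P(no CRC | not flagged)    -> ~0.995
print(f"PPV={ppv:.3f}  NPV={npv:.3f}")
```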
  4. Discussion
    • The Discussion appropriately places your findings within the broader literature, especially the notion that modifiable factors can be highlighted for primary prevention. One point that could be elaborated is the model’s performance vis-à-vis more conventional risk tools or even simpler logistic regression–based approaches. Showing that your model provides added value (e.g., in reclassification indices) would further strengthen the paper.

 

Response:

We have addressed this point by expanding Section 4.2 (Model Performance Compared to Conventional Risk Tools) to compare our model’s performance with traditional logistic regression-based CRC risk models. Specifically, we:

  • Highlighted key differences between machine learning models and conventional logistic regression approaches, particularly in their ability to capture non-linear relationships and complex interactions among risk factors.
  • Provided comparative AUC values from prior logistic regression-based CRC risk models (e.g., Cai, 2011; Imperiale, 2015; Briggs, 2022; Deng, 2023) and discussed how our model’s AUROC of 0.726 aligns with or improves upon previously reported performance metrics.
  • Discussed practical advantages of our model, including its reliance on easily obtainable clinical variables, which enhances its applicability in routine consultations, in contrast to some traditional models that require genetic markers or biochemical tests.
  • Emphasized the interpretability of our approach by integrating SHAP analysis, making it easier for clinicians to understand and communicate risk predictions compared to black-box machine learning models and logistic regression models with less direct interpretability.

 

    • The observation that certain heart medications may confer a protective effect is noteworthy. However, given the mixed evidence on this topic, as you mention, emphasizing the exploratory nature of this finding might help mitigate any potential overinterpretation.

 

Response:

To ensure that this finding is not overstated, we have revised Section 4.4 (Interpretability and Feature Importance) to clarify that this is an associative, rather than causal, observation. Specifically, we now:

  • Acknowledge that SHAP analysis identifies associations, not causal effects.
  • Highlight conflicting evidence in the literature, citing studies that report both positive and null associations.
  • Discuss selection bias and the possibility that patients on heart medications receive better preventive care and screenings, leading to lower observed CRC risk.
  • Emphasize that further research is needed using causal inference methods (e.g., propensity score matching) and prospective validation with detailed medication data.

 

  5. Limitations
    • You rightly note the potential issues arising from older data (from 1993–2001). Explicitly discussing how shifts in lifestyle or screening practices over the last two decades might affect external validity would be valuable. While you mention the need for external validation, you could also outline possible next steps: for example, prospective validation in a contemporary cohort with updated risk factor definitions.

Response:

We have revised Section 4.4.2 (Evolving Screening Practices and Their Impact on Risk Prediction) to further elaborate on how changes in screening guidelines and practices may affect our model’s external validity. The following key points were added:

  • Recognized that CRC screening guidelines have changed significantly since the PLCO trial, with new recommendations lowering the starting age from 50 to 45 years (USPSTF, 2021), which is not reflected in older datasets like PLCO.
  • Noted that modern screening programs increasingly rely on non-invasive tests such as FIT and FIT-DNA, whereas PLCO was conducted during an era when flexible sigmoidoscopy (FSG) was a more commonly used screening modality.
  • Discussed how these changes could impact the calibration of our model, as screening uptake, diagnostic sensitivity, and patient referral patterns have evolved.

  • Outlined next steps, emphasizing the need for external validation using more recent population-based cohorts that incorporate updated risk factor definitions.

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

The authors replied to my queries in a satisfactory way.

Author Response

Thank you very much for this positive feedback!
