Background and Objectives: Post-hepatectomy liver failure (PHLF) remains the leading cause of mortality following hepatic resection, with reported incidence rates ranging from 1.2% to 32%. Traditional scoring systems such as the Child–Pugh score, Model for End-Stage Liver Disease (MELD), and Albumin–Bilirubin (ALBI) grade have demonstrated limited predictive accuracy for PHLF. Machine learning (ML) algorithms have emerged as promising tools capable of integrating complex, multidimensional clinical data to improve predictive performance. This systematic review aims to evaluate the current evidence on ML-based prediction models for PHLF, assessing their predictive accuracy, methodological quality, clinical applicability, and the key variables utilized across models.
Methods: A systematic literature search was conducted across PubMed, Embase, Web of Science, and the Cochrane Library from inception to January 2026. Studies that developed or validated ML models for predicting PHLF after hepatic resection were included. The Prediction Model Risk of Bias Assessment Tool (PROBAST) was used to evaluate the risk of bias. Data on model performance, algorithms employed, sample sizes, predictor variables, and validation strategies were extracted. The review was conducted in accordance with the PRISMA 2020 guidelines and registered in PROSPERO.
Results: Twelve PubMed-verified studies involving 6913 patients were included in the final analysis. Publication years ranged from 2020 to 2025, with five studies published in 2025. Gradient boosting approaches (LightGBM, XGBoost, or phase-specific boosting models) were the most frequent best-performing architectures, while artificial neural network (ANN)/deep learning, radiomics-integrated, and ensemble approaches also achieved clinically relevant discrimination. The best areas under the curve (AUCs) reported on non-training data ranged from 0.7927 to 0.981 (median, 0.873). The strongest generalization signals came from studies with temporal, external, or prospective validation designs. Common predictor domains included bilirubin-based liver function measures, coagulation variables, platelet count, volumetry or extent of resection, imaging-derived radiomics features, and perioperative dynamic data.
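To make the reported workflow concrete, the sketch below shows how a gradient boosting classifier can be trained on liver-reserve predictors and scored with a held-out AUC, the discrimination metric summarized above. This is an illustrative sketch only: the data are synthetic, the predictor names (bilirubin, INR, platelet count, extent of resection) merely echo the predictor domains listed in the review, and scikit-learn's `GradientBoostingClassifier` stands in for the LightGBM/XGBoost implementations used in the included studies.

```python
# Illustrative sketch: synthetic cohort with hypothetical PHLF predictors.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1000
# Columns loosely mirror the review's predictor domains (all values synthetic):
# total bilirubin (mg/dL), INR (coagulation), platelet count (x10^9/L),
# extent of resection (number of segments).
X = np.column_stack([
    rng.normal(1.0, 0.5, n),
    rng.normal(1.1, 0.2, n),
    rng.normal(200, 60, n),
    rng.integers(1, 5, n).astype(float),
])
# Synthetic outcome with a plausible direction of effect for each predictor.
logit = 1.5 * X[:, 0] + 2.0 * (X[:, 1] - 1.0) - 0.01 * (X[:, 2] - 200) \
        + 0.4 * X[:, 3] - 3.0
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# Hold out 30% of patients so the AUC reflects non-training performance,
# as in the "best reported non-training AUCs" summarized above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(f"held-out AUC: {auc:.3f}")
```

In practice, the reviewed studies report this statistic on temporal, external, or prospective cohorts rather than a random split, which is a stronger test of generalization.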
Conclusions: Machine learning models remain promising for PHLF prediction, but the evidence base is small and heterogeneous. Performance is highest in studies that combine clinical liver-reserve markers with imaging or perioperative temporal data; however, widespread clinical adoption is still limited by the predominance of retrospective designs, inconsistent outcome definitions, and incomplete external validation.