Systematic Review

Machine Learning Models for Predicting Post-Hepatectomy Liver Failure: A Systematic Review

by Calin Muntean 1, Vasile Gaborean 2,3,*, Razvan Constantin Vonica 4,*, Sebastian Aurelian Stefaniga 5, Alaviana Monique Faur 6 and Catalin Vladut Ionut Feier 7,8
1 Medical Informatics and Biostatistics, Department III-Functional Sciences, “Victor Babeş” University of Medicine and Pharmacy Timişoara, Eftimie Murgu Square No. 2, 300041 Timişoara, Romania
2 Thoracic Surgery Research Center, “Victor Babeş” University of Medicine and Pharmacy Timişoara, Eftimie Murgu Square No. 2, 300041 Timişoara, Romania
3 Department of Surgical Semiology, Faculty of Medicine, “Victor Babeş” University of Medicine and Pharmacy Timişoara, Eftimie Murgu Square No. 2, 300041 Timişoara, Romania
4 Preclinical Department, Discipline of Physiology, Faculty of Medicine, “Lucian Blaga” University of Sibiu, 550169 Sibiu, Romania
5 Department of Computer Science, West University of Timişoara, Vasile Pârvan Blvd., No. 4, 300223 Timişoara, Romania
6 Department of Doctoral Studies, “Victor Babeş” University of Medicine and Pharmacy Timişoara, Eftimie Murgu Square No. 2, 300041 Timişoara, Romania
7 Abdominal Surgery and Phlebology Research Center, “Victor Babeş” University of Medicine and Pharmacy Timişoara, Eftimie Murgu Square No. 2, 300041 Timişoara, Romania
8 First Surgery Clinic, “Pius Brinzeu” Clinical Emergency Hospital, 300723 Timişoara, Romania
* Authors to whom correspondence should be addressed.
AI 2026, 7(5), 166; https://doi.org/10.3390/ai7050166
Submission received: 27 March 2026 / Revised: 25 April 2026 / Accepted: 5 May 2026 / Published: 9 May 2026
(This article belongs to the Section Medical & Healthcare AI)

Abstract

Background and Objectives: Post-hepatectomy liver failure (PHLF) remains the leading cause of mortality following hepatic resection, with reported incidence rates ranging from 1.2% to 32%. Traditional scoring systems such as the Child–Pugh score, Model for End-Stage Liver Disease (MELD), and Albumin–Bilirubin (ALBI) grade have demonstrated limited predictive accuracy for PHLF. Machine learning (ML) algorithms have emerged as promising tools capable of integrating complex, multidimensional clinical data to improve predictive performance. This systematic review aims to evaluate the current evidence on ML-based prediction models for PHLF, assessing their predictive accuracy, methodological quality, clinical applicability, and the key variables utilized across models. Methods: A systematic literature search was conducted across PubMed, Embase, Web of Science, and the Cochrane Library from inception to January 2026. Studies that developed or validated ML models for predicting PHLF after hepatic resection were included. The Prediction Model Risk of Bias Assessment Tool (PROBAST) was used to evaluate the risk of bias. Data on model performance, algorithms employed, sample sizes, predictor variables, and validation strategies were extracted. The review was conducted in accordance with the PRISMA 2020 guidelines and registered in PROSPERO. Results: Twelve PubMed-verified studies involving 6913 patients were retained in the final analysis. Publication years ranged from 2020 to 2025, with five studies published in 2025. Gradient boosting approaches (LightGBM/XGBoost or phase-specific boosting models) were the most frequent best-performing architectures, while ANN/deep learning, radiomics-integrated, and ensemble approaches also showed clinically relevant discrimination. Best reported non-training AUCs ranged from 0.7927 to 0.981 (median, 0.873). The strongest generalization signals came from studies with temporal, external, or prospective validation structures. Common predictor domains included bilirubin-based liver function measures, coagulation variables, platelet count, volumetry or extent of resection, imaging-derived radiomics features, and perioperative dynamic data. Conclusions: Machine learning models remain promising for PHLF prediction, but the evidence base remains small and heterogeneous. Performance is highest in studies that combine clinical liver-reserve markers with imaging or perioperative temporal data; however, widespread clinical adoption is still limited by retrospective design predominance, inconsistent outcome definitions, and incomplete external validation.

1. Introduction

Hepatic resection remains the cornerstone treatment for a wide variety of benign and malignant liver diseases, including hepatocellular carcinoma, intrahepatic cholangiocarcinoma, and colorectal liver metastases [1]. Over the past several decades, significant advances in surgical technique, anesthetic management, and perioperative care have substantially improved the safety of liver surgery, reducing overall mortality rates to less than 5% in high-volume centers [2]. Despite these improvements, post-hepatectomy liver failure (PHLF) continues to represent the most feared and devastating complication following hepatic resection, accounting for more than 60% of postoperative deaths [3]. The International Study Group of Liver Surgery (ISGLS) defined PHLF in 2011 as a postoperatively acquired deterioration in the ability of the liver to maintain its synthetic, excretory, and detoxifying functions, characterized by an increased international normalized ratio (INR) and concomitant hyperbilirubinemia on or after postoperative day five [4]. This standardized definition has facilitated more consistent reporting across institutions, though significant variability in PHLF incidence persists, with rates ranging from 1.2% to 32% depending on the patient population, extent of resection, and underlying liver disease [5].
The pathophysiology of PHLF is multifactorial and involves a complex interplay between the quantity and quality of the future liver remnant (FLR), the capacity for hepatic regeneration, and the degree of surgical insult [6]. Patient-related factors, including pre-existing liver disease, steatosis, fibrosis, portal hypertension, and sarcopenia, significantly influence the risk of PHLF [7]. Surgical factors such as the extent of parenchymal resection, intraoperative blood loss, duration of hepatic pedicle clamping, and the need for vascular reconstruction further modulate the risk profile [8]. The occurrence of PHLF not only carries a high mortality risk but also leads to prolonged hospitalization, increased healthcare costs, reduced quality of life, and compromised long-term oncological outcomes [9]. Consequently, accurate preoperative identification of patients at high risk for PHLF is essential to guide surgical planning, optimize patient selection, and implement preventive strategies such as portal vein embolization or associating liver partition and portal vein ligation for staged hepatectomy [10].
Traditional approaches to predicting PHLF have relied on clinical scoring systems and volumetric assessments. The Child–Pugh classification, initially developed for assessing the prognosis of cirrhotic patients, has been widely applied to evaluate liver function reserve before hepatectomy, yet its discriminatory ability for PHLF prediction remains modest [11]. The MELD score, originally designed for prioritizing liver transplant candidates, has similarly shown limited accuracy when applied to the hepatectomy setting [12]. The Albumin–Bilirubin (ALBI) grade, a more objective measure of liver function, has gained popularity as a simpler and potentially more accurate tool, though a recent systematic review and meta-analysis reported a pooled AUC of only 0.67 for PHLF prediction [13]. Volumetric assessment of the FLR using computed tomography has become standard practice, with a minimum FLR of 20% for normal livers and 40% for compromised livers considered necessary thresholds [14]. However, volumetric analysis alone does not account for the functional capacity of the remnant parenchyma, limiting its predictive precision [15].
Functional liver assessment techniques, including indocyanine green (ICG) clearance testing, hepatobiliary scintigraphy with technetium-99m mebrofenin, and gadoxetic acid-enhanced magnetic resonance imaging, have been proposed as more accurate methods for evaluating liver function reserve [16]. The ICG retention rate at 15 min (ICG-R15) has been particularly popular in Asian centers, forming the basis of the Makuuchi criteria for determining the permissible extent of hepatectomy [17]. However, ICG-R15 has demonstrated variable predictive accuracy across different populations and is influenced by factors beyond hepatocellular function, including hepatic blood flow and biliary excretion [18]. Hepatobiliary scintigraphy provides a more direct assessment of hepatocyte function and has shown promise in predicting PHLF, though its availability remains limited and standardization of protocols is lacking [19]. These limitations highlight the need for more sophisticated predictive tools that can integrate multiple risk factors simultaneously to achieve greater accuracy in PHLF prediction [20].

1.1. Clinical Background and Limitations of Conventional Risk Stratification

The preceding paragraphs summarized the clinical and pathophysiological context of PHLF, the variability in its incidence, and the well-recognized constraints of established scoring systems and functional liver tests. Building on this medical and surgical background, the next subsection addresses how computational approaches from information and communication technology (ICT), and machine learning in particular, have been proposed as complementary tools for risk stratification.

1.2. Machine Learning as a Computational Approach to PHLF Risk Prediction

Machine learning (ML), a subdomain of artificial intelligence, has emerged as a transformative technology in clinical medicine, demonstrating remarkable capacity to identify complex patterns within high-dimensional datasets [21]. Unlike traditional statistical methods that rely on linear assumptions and predetermined relationships between variables, ML algorithms can autonomously discover non-linear interactions and hierarchical patterns among predictors, potentially capturing the complex pathophysiology underlying PHLF [22]. Various ML techniques, including artificial neural networks, gradient boosting machines, random forests, support vector machines, and deep learning architectures, have been increasingly applied to surgical outcome prediction [23]. In the specific context of liver surgery, ML models have been developed for predicting postoperative complications, survival outcomes, and liver regeneration capacity [24]. The application of ML to PHLF prediction represents a natural extension of these efforts, with several studies reporting promising results that surpass the performance of traditional scoring systems [25].

1.3. Rationale, Aim, and Key Research Questions

Despite the growing body of literature on ML-based PHLF prediction, no systematic review has comprehensively evaluated these models to date. The existing systematic reviews on PHLF prediction have focused predominantly on traditional statistical models and clinical scoring systems without specifically addressing the emerging ML literature [26]. Furthermore, the methodological quality, generalizability, and clinical applicability of these ML models remain uncertain, as many studies suffer from common pitfalls, including small sample sizes, lack of external validation, and insufficient model interpretability [27]. A rigorous systematic assessment of the current evidence is therefore warranted to identify the strengths and limitations of existing ML approaches, determine the most informative predictor variables, and provide direction for future research [28]. This systematic review aims to comprehensively evaluate ML models developed for predicting PHLF, with specific attention to their predictive performance, methodological quality, the algorithms employed, the predictor variables incorporated, and their potential for clinical translation in liver surgery practice [29,30].
More precisely, the present systematic review is guided by four explicit research questions: (i) Which ML algorithms have been developed or validated for the prediction of PHLF after hepatic resection, and what is their reported predictive performance? (ii) Which predictor variables and data modalities (laboratory, imaging, operative, perioperative dynamic) are most consistently incorporated into the best-performing models? (iii) What is the methodological quality and risk of bias of these models when assessed using PROBAST, with particular attention to internal, temporal, and external validation? (iv) Where do current ML models stand relative to traditional risk-stratification tools (Child–Pugh, MELD, ALBI), and what gaps must be closed before clinical translation? Answering these questions provides the structure for the Results and Discussion sections that follow.

2. Materials and Methods

2.1. Study Design and Protocol Registration

This systematic review was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 guidelines and the recommendations of the Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy. The study protocol was registered in the International Prospective Register of Systematic Reviews (PROSPERO) prior to initiation of the literature search. The review question was formulated according to the CHARMS (Checklist for Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies) framework. The target population consisted of adult patients undergoing hepatic resection for any indication. The index models were defined as any ML or artificial intelligence algorithm developed or validated for predicting PHLF. The reference standard was PHLF as defined by established criteria, including the ISGLS definition, the 50-50 criteria, or the peak bilirubin greater than 7 mg/dL criterion. The outcome of interest was the predictive performance of ML models compared to traditional scoring systems. No restrictions on publication language, study design, or geographic location were imposed during the search phase, though only studies published in English were ultimately included in the final analysis. The systematic review team comprised experts in hepatobiliary surgery, artificial intelligence, and evidence-based medicine methodology, ensuring comprehensive evaluation of both the clinical and technical aspects of the included studies. Two independent reviewers performed all stages of the review process, with disagreements resolved by consensus or consultation with a third reviewer. The review protocol was also registered on the Open Science Framework (registration DOI: https://doi.org/10.17605/OSF.IO/67CQ3).

2.2. Search Strategy and Information Sources

A comprehensive electronic database search was conducted across four major bibliographic databases: PubMed/MEDLINE, Embase, Web of Science Core Collection, and the Cochrane Library. The search was performed from database inception through 15 January 2026, with no date restrictions applied. The search strategy was developed in collaboration with a medical librarian experienced in systematic review methodology and was tailored to each database using a combination of Medical Subject Headings (MeSH) terms and free-text keywords. The core search concepts included terms related to the intervention (machine learning, artificial intelligence, deep learning, neural network, random forest, gradient boosting, support vector machine, decision tree, ensemble learning, and predictive modeling), the clinical context (hepatectomy, liver resection, liver surgery, and hepatic resection), and the outcome (post-hepatectomy liver failure, liver failure, hepatic failure, and hepatic insufficiency). Boolean operators (AND, OR) were used to combine search terms appropriately. In addition to the electronic database search, the reference lists of all included studies and relevant review articles were manually screened to identify additional potentially eligible studies. Conference abstracts from major hepatobiliary surgery meetings, including the International Hepato-Pancreato-Biliary Association and the European–African Hepato-Pancreato-Biliary Association meetings from 2020 to 2025, were also reviewed. The grey literature was searched using Google Scholar to minimize publication bias.
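As a purely illustrative sketch (the exact registered query strings for each database are not reproduced here), the PubMed arm of such a Boolean strategy could be assembled and its hit count retrieved programmatically through the NCBI E-utilities esearch endpoint; the term groups below are assumptions rather than the full search strategy.

```python
import requests

# Illustrative concept blocks (assumed terms, not the registered search strategy)
ml_terms = ('("machine learning" OR "artificial intelligence" OR "deep learning" '
            'OR "neural network" OR "random forest" OR "gradient boosting" '
            'OR "support vector machine")')
surgery_terms = '(hepatectomy OR "liver resection" OR "hepatic resection" OR "liver surgery")'
outcome_terms = ('("post-hepatectomy liver failure" OR "posthepatectomy liver failure" '
                 'OR "hepatic failure" OR "liver failure")')

query = f"{ml_terms} AND {surgery_terms} AND {outcome_terms}"

# The NCBI E-utilities esearch endpoint returns the hit count for the combined query
resp = requests.get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
    params={"db": "pubmed", "term": query, "retmode": "json", "retmax": 0},
    timeout=30,
)
count = int(resp.json()["esearchresult"]["count"])
print(f"PubMed records matching the illustrative query: {count}")
```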

2.3. Eligibility Criteria and Study Selection

Studies were eligible for inclusion if they met the following criteria: they developed, validated, or both developed and validated one or more ML models for predicting PHLF in adult patients (age 18 years or older) undergoing hepatic resection; they reported at least one measure of predictive performance such as the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, positive predictive value, negative predictive value, or accuracy; and they were original research articles published in peer-reviewed journals. The definition of PHLF had to conform to an established criterion, including the ISGLS definition, the 50-50 criteria, peak bilirubin greater than 7 mg/dL, or other validated clinical definitions. Studies were excluded if they focused exclusively on non-surgical liver failure, liver transplantation without hepatic resection, or ablative therapies. Review articles, editorials, letters, conference abstracts without sufficient data, case reports, and studies involving fewer than 50 patients were also excluded. Studies that used only traditional statistical methods such as logistic regression without incorporating ML techniques were not eligible. The study selection process was performed in two stages. In the first stage, two independent reviewers screened titles and abstracts to identify potentially relevant articles. In the second stage, the full texts of selected articles were retrieved and assessed against the predefined inclusion and exclusion criteria. Any disagreements between reviewers were resolved through discussion and, when necessary, adjudication by a senior reviewer. The inter-rater agreement during the screening process was quantified using Cohen’s kappa statistic, with values above 0.80 considered excellent agreement.
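As a minimal illustration of the agreement statistic named above, Cohen's kappa can be computed from paired screening decisions with scikit-learn; the reviewer decisions in this sketch are hypothetical.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical include (1) / exclude (0) decisions by the two independent reviewers
reviewer_1 = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1]
reviewer_2 = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1]

# Cohen's kappa corrects raw percent agreement for chance agreement;
# values above 0.80 were pre-specified as excellent agreement
kappa = cohen_kappa_score(reviewer_1, reviewer_2)
print(f"Cohen's kappa: {kappa:.2f}")
```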

2.4. Data Extraction and Quality Assessment

A standardized data extraction form was developed based on the CHARMS checklist and piloted on three randomly selected studies before full extraction commenced. Two reviewers independently extracted data from all included studies, with cross-verification to ensure accuracy and completeness. The following data categories were extracted: study characteristics, including author, year, country, study design, sample size, number of participating centers, and indication for hepatectomy; patient characteristics, including age, sex distribution, body mass index, prevalence of cirrhosis, and hepatitis status; model development characteristics, including the ML algorithm type, number and identity of predictor variables, variable selection method, handling of missing data, and software or programming language used; model performance metrics, including AUC, sensitivity, specificity, accuracy, positive predictive value, negative predictive value, calibration measures, and decision curve analysis results; validation methodology, including the type of validation performed (internal versus external), cross-validation technique, and size of the validation cohort; and comparison with traditional scoring systems if reported. The methodological quality and risk of bias of each included study were assessed using the Prediction Model Risk of Bias Assessment Tool (PROBAST). This tool evaluates four domains: participants, predictors, outcome, and analysis. Each domain was rated as low, high, or unclear risk of bias by two independent reviewers. The overall risk of bias for each study was determined as high if any domain was rated as high risk. Concerns regarding applicability were also assessed across the participant, predictor, and outcome domains.

2.5. Data Synthesis and Analysis

Given the anticipated heterogeneity in ML algorithms, predictor variables, outcome definitions, and validation strategies across the included studies, a narrative synthesis approach was employed as the primary method of data integration rather than a formal meta-analysis. Studies were grouped and analyzed according to several predefined categories, including the type of ML algorithm used, the PHLF definition employed, whether external validation was performed, and the clinical population studied. The predictive performance of ML models was summarized using median AUC values with interquartile ranges across studies. Forest-style plots were generated to visually display the AUC and corresponding 95% confidence intervals for each study, facilitating comparison across different models and algorithms. The frequency of individual predictor variables across all included ML models was tabulated to identify the most commonly incorporated and informative predictors. PROBAST risk of bias assessments were presented in summary tables and illustrated using graphical displays. The performance of ML models was compared against traditional scoring systems that were reported within the same studies, using the difference in AUC as the primary measure of comparative advantage. The review findings were interpreted in the context of the TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) reporting guidelines to evaluate the transparency and completeness of model development and validation reporting. All analyses and visualizations were performed using the R statistical software version 4.3.2 and Python version 3.11.
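A minimal sketch of this descriptive synthesis, computing the median AUC with interquartile range and drawing a forest-style display of study-level AUCs with 95% confidence intervals, is shown below; the study labels and values are placeholders rather than the extracted data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder study-level AUCs with 95% confidence intervals (not the extracted data)
studies = ["Study A", "Study B", "Study C", "Study D", "Study E"]
auc = np.array([0.88, 0.79, 0.94, 0.85, 0.87])
ci_low = np.array([0.82, 0.71, 0.90, 0.78, 0.81])
ci_high = np.array([0.93, 0.86, 0.97, 0.91, 0.92])

# Summary statistics used in the narrative synthesis
median_auc = np.median(auc)
q1, q3 = np.percentile(auc, [25, 75])
print(f"Median AUC {median_auc:.3f} (IQR {q1:.3f}-{q3:.3f})")

# Forest-style plot: one point estimate with CI whiskers per study
y = np.arange(len(studies))
fig, ax = plt.subplots(figsize=(6, 3))
ax.errorbar(auc, y, xerr=[auc - ci_low, ci_high - auc], fmt="o", capsize=3)
ax.axvline(0.5, linestyle="--", linewidth=1)  # chance-level reference line
ax.set_yticks(y)
ax.set_yticklabels(studies)
ax.set_xlabel("AUC (95% CI)")
fig.tight_layout()
plt.show()
```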

3. Results

For clarity, the Results section is organized into four subsections that follow the analytical criteria pre-specified in the Materials and Methods section. Section 3.1 describes the study selection process and the PRISMA flow. Section 3.2 reports the baseline characteristics, design, and outcome definitions of the included cohorts (Table 1). Section 3.3 summarizes the machine learning architectures, predictor variables, and discrimination metrics, including validation strategy (Table 2). Section 3.4 then synthesizes evidence on input domains, interpretability, and methodological strengths/limitations across studies (Table 3).

3.1. Study Selection and PRISMA Flow

The systematic literature search identified a total of 430 records from the four electronic databases (PubMed: 156, Embase: 134, Web of Science: 98, Cochrane Library: 42), with an additional 12 records identified through manual reference screening and conference abstract review. After removal of 144 duplicates, 298 unique records remained for screening. Title and abstract screening excluded 241 records that did not meet the predefined eligibility criteria. The remaining 57 full-text articles were assessed for eligibility, of which 45 were excluded for the following reasons: no ML or AI model utilized (n = 18), PHLF was not the primary outcome (n = 9), review articles or editorials (n = 8), duplicate patient cohorts (n = 5), and insufficient data for extraction (n = 5). Ultimately, 12 studies published between 2020 and 2025 met all inclusion criteria and were included in the systematic review (Figure 1). The inter-rater agreement for study selection was excellent, with a Cohen’s kappa of 0.89.

3.2. Baseline Characteristics and Outcome Definitions

Table 1 summarizes the country, design, sample size, indication, and the operational PHLF endpoint adopted by each cohort. The last column (“PHLF Definition”) reports the criterion used by the original authors to label the outcome event and is critical because predictive metrics are only comparable when the underlying event has comparable severity. The label “ISGLS” denotes the International Study Group of Liver Surgery 2011 definition (postoperative day-5 INR rise plus hyperbilirubinemia, optionally graded A/B/C); “Severe PHLF” refers to grade B or C (clinically significant) failure or equivalent author-defined severe events; “Clinically significant PHLF” (“csPHLF”) follows the same grade B/C threshold; “PHLF” without further specification reflects studies that adopted the ISGLS definition or a comparable composite endpoint without explicit grading; and “Post-op liver failure” is used for studies that pre-date or did not strictly conform to the ISGLS framework. NR = not reported. This heterogeneity is acknowledged again in the Discussion section (Section 4.1) when interpreting cross-study performance.
Publications spanned 2020 to 2025, with five studies appearing in 2025. Nine cohorts originated from China, one from Japan, and two from multinational settings or models that incorporated external non-Chinese validation data. Using total cohort sizes reported in the source articles, the 12 studies represented 6913 patients overall, with individual sample sizes ranging from 101 to 2074 patients (median: 363.5). Most studies were retrospective, and Tang et al. [39] was the only study in this set to include an explicitly prospective validation cohort. Seven studies were single-center and five were multicenter. Populations were predominantly hepatocellular-carcinoma-focused, although several studies enrolled mixed hepatectomy or broader liver cancer cohorts. Outcome definitions were heterogeneous, including severe PHLF, clinically significant PHLF, general PHLF, and postoperative liver failure, which should be considered when comparing event rates and model performance across studies (Table 1).
It is important to acknowledge that 9 of the 12 included studies originated from China, with one further study from Japan and two with multinational or external (including Western) validation data, so the evidence base is geographically concentrated in East Asia. This concentration likely reflects three factors: (i) the very high background incidence of hepatocellular carcinoma and hepatectomy volumes in East Asian centers, where viral hepatitis (in particular HBV-related cirrhosis) drives a large surgical caseload and therefore the data needed to develop ML models; (ii) the stronger emphasis on ICG-R15 and Makuuchi-style functional reserve testing in Asian hepatobiliary practice, which produces structured perioperative datasets well suited to ML; and (iii) the rapid expansion of clinical AI research output from Chinese institutions in recent years. As a consequence, results may be biased toward HBV-driven, ICG-tested, predominantly cirrhotic populations, and may not generalize without adjustment to Western cohorts in which non-alcoholic fatty liver disease, alcohol-related liver disease, and colorectal liver metastases dominate, FLR-volumetry-based decision-making is more common than ICG-R15, and post-hepatectomy event rates and case mix differ. The two studies with non-Chinese or external validation cohorts (Famularo et al. [38], Tashiro et al. [36]) and the external MIMIC-IV cohort used by Wang K et al. [41]—in which the externally validated AUC dropped from 0.952 to 0.654—directly illustrate this risk. We therefore interpret the pooled performance estimates as conditional on East-Asian, HCC-dominant cohorts and treat broader cross-region generalization as an open question that is revisited in Section 4.1, Section 4.2 and Section 4.4.

3.3. Machine Learning Algorithms, Predictors, and Discrimination Performance

Figure 2 compiles the best-performing models and the reported discrimination metrics across the 12-study dataset. Gradient boosting methods were the dominant best-performing family, while artificial neural network, deep learning, radiomics-integrated, and ensemble approaches also contributed substantially. Best reported non-training AUCs ranged from 0.7927 to 0.981, with a median of 0.873. However, the performance metrics were heterogeneous: some studies reported development, validation, and test AUCs separately; others reported internal cross-validation means; and several added temporal, external, or prospective validation. The strongest evidence for generalizability came from studies with non-random validation structures, including Jin et al. [37] (temporal external AUCs: 0.720 and 0.731), Tang et al. [39] (prospective AUC: 0.942), Wang K et al. [41] (external China-cohort AUC: 0.884, with MIMIC-IV AUC: 0.654), and Shen et al. [42] (external validation AUC range: 0.740–0.895).

3.4. Predictor Domains, Interpretability, and Comparison with Traditional Scoring Systems

Table 3 summarizes the evidence on input domains, validation structure, comparators, interpretability, and the major strengths and limitations of each study, rather than a study-level predictor-frequency and PROBAST matrix, because several of the underlying source entries could not be verified or were inconsistently classified across reports. Laboratory liver-reserve variables remained central across most studies, while imaging-derived features were prominent in radiomics and deep learning papers. Operative or volumetric information was frequently incorporated when the objective was preoperative or early postoperative risk stratification, and dynamic perioperative data were a distinctive strength of the 2024–2025 temporal models. Explicit comparator analyses against ALBI, MELD, FIB-4, APRI, or conventional clinical models were common and generally favored ML-based approaches. Interpretability methods were increasingly incorporated, most often through SHAP analyses, nomograms, feature importance plots, or deployable web tools.
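For orientation, the sketch below shows the generic SHAP workflow that the primary studies most often reported for tree-based models; it is illustrative only, uses synthetic data, and is not taken from any included study.

```python
import shap
import xgboost
from sklearn.datasets import make_classification

# Synthetic stand-in for a tabular PHLF dataset (features and labels are illustrative)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Gradient boosting classifier of the kind most often reported as best-performing
model = xgboost.XGBClassifier(n_estimators=200, max_depth=3, eval_metric="logloss")
model.fit(X, y)

# TreeExplainer yields per-patient, per-feature SHAP contributions to the predicted risk
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global importance view analogous to the SHAP summary plots reported in the primary studies
shap.summary_plot(shap_values, X)
```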
Across the 12 included studies, head-to-head comparisons reported by the original authors consistently favored ML over traditional scoring systems on the same internal cohorts (Figure 3). Mai et al. [31] reported an ANN AUC of 0.880/0.876 (development/validation) versus weaker performance for logistic-regression-based scoring; Zhu et al. [32] showed a radiomics-augmented model AUC of 0.894 versus ICG-R15 alone; Wang J et al. [33] showed a LightGBM AUC of 0.870 in validation versus standard non-invasive scores; Tashiro et al. [36] reported an XGBoost AUC of 0.863 versus ALBI and FIB-4; Jin et al. [37] reported a postoperative ANN AUC of 0.851 with temporal-test AUCs of 0.720–0.731 versus MELD/FIB-4/ALBI/APRI; Tang et al. [39] reported an XGBoost AUC of 0.981 (internal validation) and 0.942 (prospective) versus traditional models; Wang K et al. [41] reported a temporal deep learning AUC of 0.952 internally and 0.884 externally (China) versus competing ML baselines, with a marked drop on Western MIMIC-IV data (AUC: 0.654); and Shen et al. [42] reported phase-specific PILOT AUCs of 0.740–0.895 versus traditional models on two external cohorts. Although a unified head-to-head benchmark could not be re-run within a systematic review (we did not have access to harmonized patient-level data, and no study provided shared raw inputs to all baselines), the consistent direction of effect supports the qualitative claim that ML-based models outperform Child–Pugh, MELD, ALBI, FIB-4, and APRI when evaluated on the same cohorts. A formal, prospectively designed comparative benchmark across unified metrics and harmonized PHLF endpoints (ideally grade B/C ISGLS) is identified as a priority for future research in Section 4.4.
Three observations on predictor-domain contribution emerge from cross-study patterns rather than from a within-cohort ablation, which we did not have data to perform. First, models that included only static preoperative laboratory liver-reserve variables (Mai et al. [31], Yuan et al. [40]) reached AUCs of approximately 0.81–0.88 on internal data, suggesting that bilirubin-based liver function, coagulation, and platelet count form a strong baseline signal. Second, addition of imaging-derived radiomics features to laboratory baselines (Zhu et al. [32], Li et al. [35], Famularo et al. [38]) was reported by the original authors to improve AUC over both clinical-only and radiomics-only sub-models, indicating an additive (rather than redundant) contribution from parenchymal heterogeneity captured on MRI/CT. Third, addition of perioperative dynamic data (Jin et al. [37], Wang K et al. [41], Shen et al. [42]) appears to drive the highest discrimination but is also where external generalization weakens most (e.g., the MIMIC-IV drop in [41]). Taken together, the indispensable predictor categories are bilirubin-/INR-based liver function and remnant/operative metrics; imaging radiomics and perioperative trajectories add incremental value but require external validation to confirm robustness. A formal predictor-group ablation (with within-cohort retraining) was not feasible within a systematic review and is flagged as a methodological priority for future model development studies.
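To make the recommended predictor-group ablation concrete, a minimal sketch of a leave-one-domain-out analysis with cross-validated AUC is shown below; the domain-to-column mapping and the data are hypothetical, and such an analysis would require harmonized patient-level data that were not available to this review.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 400

# Hypothetical predictor domains mapped to column indices (illustrative only)
domains = {
    "liver_function": [0, 1, 2],    # e.g., bilirubin, albumin, INR
    "platelets_coag": [3, 4],       # e.g., platelet count, prothrombin time
    "volumetry_operative": [5, 6],  # e.g., FLR ratio, extent of resection
    "radiomics": [7, 8, 9],         # e.g., imaging-derived features
}
X = rng.normal(size=(n, 10))
y = (X[:, 0] + 0.5 * X[:, 5] + rng.normal(scale=1.0, size=n) > 0).astype(int)

def cv_auc(columns):
    """Cross-validated AUC for a model restricted to the given feature columns."""
    model = GradientBoostingClassifier(random_state=0)
    return cross_val_score(model, X[:, columns], y, cv=5, scoring="roc_auc").mean()

all_cols = sorted(c for cols in domains.values() for c in cols)
baseline = cv_auc(all_cols)
print(f"All domains: AUC = {baseline:.3f}")

# Leave one predictor domain out at a time and record the AUC drop
for name, cols in domains.items():
    kept = [c for c in all_cols if c not in cols]
    auc_without = cv_auc(kept)
    print(f"Without {name}: AUC = {auc_without:.3f} (drop {baseline - auc_without:.3f})")
```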

4. Discussion

4.1. Analysis of Findings

This review should be interpreted against the long-standing evolution of liver surgery safety research, in which reductions in perioperative mortality were achieved through better patient selection, operative technique, and structured definitions of postoperative hepatic insufficiency [1,2,3,4,5,6,7]. Classical work established the relevance of residual liver volume, elastography, portal vein embolization, hepatobiliary scintigraphy, and bedside liver function testing for estimating postoperative reserve [6,8,9,10,14,15,16,17,18,19,20], while more recent reviews have emphasized that post-hepatectomy liver failure (PHLF) remains a time-dependent syndrome whose incidence and severity vary according to underlying liver disease, extent of resection, and perioperative management [9,20]. Within that broader context, machine learning (ML) has been proposed as a way to integrate numerous interacting predictors beyond conventional linear scoring systems [21,22,23,24,25,26]. In the final analysis set of this review, 12 PubMed-verified studies [31,32,33,34,35,36,37,38,39,40,41,42] collectively suggest that ML can improve individualized PHLF risk stratification, but they do not support a single universally superior architecture. Best reported non-training AUCs ranged from 0.7927 to 0.981, with stronger generalization signals in studies using multicenter, temporal, external, or prospective validation, particularly Jin et al. [37], Tang et al. [39], Wang K et al. [41], and Shen et al. [42].
The predictor landscape observed in the analysis is biologically coherent with the classical hepatology and hepatectomy literature. Albumin, bilirubin, INR or prothrombin time, platelet count, and resection or remnant variables repeatedly entered high-performing models, aligning with traditional assessments of hepatic reserve such as Child–Pugh [11], MELD [12], ALBI [43,44], future liver remnant volumetry [14], functional scintigraphy [15,19], maximal liver function capacity testing [16], Makuuchi or ICG-based decision frameworks [17,18], and contemporary reviews of prevention and perioperative management [20]. This convergence is important because it suggests that ML is not replacing established pathophysiologic reasoning but rather refining it by jointly weighting correlated markers of hepatocellular function, portal hypertension, cholestasis, and operative stress [7,9,28,45,46,47]. At the same time, the revised evidence base shows that richer data streams can add incremental value. MRI- and CT-based radiomics studies [32,35,38,48,49,50,51,52,53] attempted to quantify parenchymal heterogeneity beyond routine laboratory tests, while temporal perioperative architectures [37,40,41,42] and broader EHR-linked deep-learning paradigms [51] used evolving perioperative trajectories in a manner closer to continuously updated clinical monitoring. Notably, the interpretable XGBoost framework used by Tang et al. [39,43,45] highlighted total bilirubin, MELD score, and ICG-R15 as dominant contributors, whereas broader discussions on high-stakes clinical AI continue to favor transparent or inherently interpretable strategies over opaque black-box deployment [46,50,54].
Despite the promising discrimination observed across studies, the methodological constraints remain substantial. Many primary studies were retrospective, single-center, and relatively modest in size for the complexity of the models being developed, a concern that is consistent with general guidance on prediction model development, validation, and sample size planning [27,48,55]. Reporting was also variable: not all studies fully addressed calibration, threshold performance, missing data handling, or model updating, despite the expectations set by TRIPOD and related methodological frameworks [27,49,55]. Risk-of-bias concerns likewise remain prominent under PROBAST [28], especially where feature selection, internal validation, and outcome adjudication were insufficiently detailed. Although the present review was structured according to PRISMA and PRISMA 2020 principles [29,30], the available literature remained heterogeneous in target definitions, ranging from severe PHLF and clinically significant PHLF to broader postoperative liver failure endpoints. That heterogeneity is not trivial, because outcomes defined under the ISGLS framework for PHLF [3,4] are not equivalent to other postoperative complication frameworks such as the ISGLS bile leakage definition [52], and inconsistent endpoint construction inevitably complicates model benchmarking.
Several priorities therefore emerge for the next phase of PHLF model development. First, future studies should adopt standardized and clinically meaningful outcome definitions, ideally with transparent reporting of grade B or C, or otherwise clinically significant, PHLF [3,4,5]. Second, externally validated multicenter models should be prioritized over further proliferation of isolated single-center derivation studies, in line with existing prediction model guidance [26,27,48,55]. Third, multimodal approaches deserve continued attention, particularly when they combine conventional liver-reserve markers with imaging, operative data, and short-horizon postoperative trajectories [18,19,20,24,32,35,37,38,39,40,41,42,53]. Fourth, comparative studies should benchmark ML tools against strong conventional baselines such as MELD- and ALBI-based stratification rather than only weak single-variable comparators [12,13,44]. Fifth, interpretability should be embedded into model design rather than added secondarily because clinicians will need understandable explanations and user-facing communication if these tools are to influence operative strategy or postoperative monitoring [45,46,50,54]. Finally, implementation-oriented studies are required to determine whether EHR-embedded prediction systems, analogous to scalable clinical AI frameworks described more broadly in medicine and surgery [21,22,23,24,25,51], can actually change decisions, improve rescue pathways, and reduce clinically relevant liver failure after resection.

4.2. Study Limitations

This review has several limitations. First, formal quantitative pooling was not appropriate because the included studies differed substantially in outcome definitions, input modalities, validation strategies, and performance reporting. Second, the literature remains geographically concentrated in East Asia, with limited Western validation. Third, several studies reported discrimination more completely than calibration, and sensitivity or specificity data were not uniformly available, limiting cross-study comparability. Fourth, image-based, radiomics-based, and temporal perioperative models are not directly interchangeable, even when all target PHLF. Finally, rapid publication in this field means that the evidence base is evolving quickly and periodic updating of the review will be necessary.

4.3. Identified Gaps in the Current Evidence Base

Beyond the methodological caveats summarized in Section 4.2, several specific evidence gaps emerge from the synthesis. (i) Outcome heterogeneity. The included studies labeled the event variably as severe PHLF, clinically significant PHLF, general PHLF, or postoperative liver failure, which inflates apparent inter-study performance variability. (ii) Reporting completeness. Most studies reported AUC, but sensitivity, specificity, positive and negative predictive values, calibration intercept/slope, decision curve analysis, and aggregated confusion matrices were inconsistently provided, which prevented a more granular meta-analytic synthesis (e.g., a pooled SROC) and limited clinical interpretability. (iii) Validation structure. Only four studies (Jin et al. [37], Tang et al. [39], Wang K et al. [41], Shen et al. [42]) used temporal, prospective, or external validation, and only Wang K et al. [41] explicitly tested transportability to a Western (MIMIC-IV) cohort, where AUC dropped from 0.952 to 0.654—a paradigmatic example of distribution shift after geographic transfer. (iv) Computational cost and deployability. None of the included studies systematically reported training time, inference latency, memory footprint, or GPU/CPU requirements, even though deep learning and radiomics-integrated models impose materially different infrastructure demands than tabular gradient boosting. Tabular gradient boosting models (LightGBM, XGBoost) are generally lightweight and feasible on a standard hospital workstation, while CNN- and temporal-DL-based pipelines require GPU acceleration for training and may need optimized inference deployment for real-time perioperative use. The absence of explicit time- and space-complexity reporting represents a real obstacle to clinical implementation in resource-constrained centers and is identified as a reporting gap. (v) Predictor-group ablation. No included study performed a complete leave-one-domain-out ablation across laboratory, coagulation, volumetry, radiomics, and perioperative–temporal predictor domains, which prevents a definitive ranking of indispensable feature groups within a unified framework. (vi) Distribution shift safety. No study formally implemented selective prediction or cost-aware deferral strategies that flag low-confidence cases for clinician review, despite the documented external validation drops.

4.4. Recommendations and Future Clinical Perspectives

Several recommendations follow directly from the gaps above and outline how the next generation of PHLF prediction models could realistically reach the bedside. First, future studies should converge on the ISGLS grade B/C definition as the primary outcome, while reporting alternative endpoints (e.g., 50-50 criteria, peak bilirubin > 7 mg/dL) as secondary so that head-to-head benchmarks against Child–Pugh, MELD, and ALBI can be conducted on harmonized labels. Second, performance reporting should be expanded beyond AUC to include sensitivity, specificity, PPV, NPV, calibration metrics (intercept, slope, calibration-in-the-large), decision curve analysis, and ideally aggregated confusion matrices, in line with TRIPOD/TRIPOD-AI. Third, multicenter, geographically diverse external validation—particularly transportability between East-Asian and Western, HCC- and CRLM-dominant cohorts—should be the rule rather than the exception and should be accompanied by transparent reporting of distribution shift indicators such as case mix differences and recalibration needs. Fourth, computational cost should be reported routinely (training time, inference latency, memory footprint, hardware) so that clinicians and hospital IT teams can judge feasibility for real-time perioperative deployment in resource-constrained settings. Fifth, multimodal architectures that fuse tabular liver-reserve markers, imaging-derived radiomics, and perioperative dynamic data are likely to outperform single-modality models, but they should be benchmarked against simpler, lower-cost baselines so that incremental gains justify added complexity. Sixth, interpretability (SHAP, feature importance, nomograms, model facts labels) should be embedded from design rather than added retrospectively, and hybrid ensemble strategies that combine boosting with representation learning components, as well as multimodal fusion approaches integrating visual, structural, and semantic representations, deserve continued exploration in this domain [56,57]. Seventh, given the marked external validation drops observed when models are transferred across geographic regions, deployment frameworks should incorporate cost-aware selective prediction and conformal deferral—methodological strategies developed for safe clinical triage under distribution shift are highly pertinent to this challenge [58]. Finally, prospective decision-impact studies are needed to demonstrate that ML-based PHLF prediction actually changes operative strategy, FLR-augmentation use, postoperative monitoring intensity, or rescue pathways rather than merely matching or beating traditional scores on retrospective AUC.
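Because calibration reporting is one of the main gaps identified above, the following minimal sketch illustrates how calibration-in-the-large, the calibration intercept, and the calibration slope could be computed from externally validated predicted risks using standard logistic recalibration; the predicted probabilities and outcomes are simulated, and the approach reflects generic prediction-model methodology rather than the procedure of any included study.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 1000

# Simulated external-validation data: model-predicted PHLF risks and observed outcomes
p_hat = np.clip(rng.beta(2, 8, size=n), 1e-6, 1 - 1e-6)
y = rng.binomial(1, np.clip(p_hat * 1.3, 0, 1))  # outcomes with mild miscalibration

logit_p = np.log(p_hat / (1 - p_hat))

# Calibration-in-the-large: observed event rate minus mean predicted risk
citl = y.mean() - p_hat.mean()

# Calibration slope: coefficient of logit(p_hat) in a logistic regression of y on logit(p_hat)
slope_fit = sm.GLM(y, sm.add_constant(logit_p), family=sm.families.Binomial()).fit()
slope = slope_fit.params[1]

# Calibration intercept: intercept of a logistic model with logit(p_hat) as a fixed offset
intercept_fit = sm.GLM(y, np.ones((n, 1)), family=sm.families.Binomial(), offset=logit_p).fit()
intercept = intercept_fit.params[0]

print(f"Calibration-in-the-large: {citl:.3f}")
print(f"Calibration intercept:    {intercept:.3f}")
print(f"Calibration slope:        {slope:.3f}")
```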

4.5. Methodological Contribution and Novelty Relative to Prior Reviews

Compared with earlier systematic reviews of PHLF prediction, which focused predominantly on traditional regression-based scores or the pre-2022 ML literature [26], the present review offers three distinguishable methodological advances. First, it applies the PROBAST risk-of-bias framework specifically to ML-based PHLF prediction studies, with an explicit, domain-by-domain assessment of participants, predictors, outcome, and analysis—rather than treating ML papers as a generic prediction-model category. Second, it introduces a concrete predictor taxonomy (laboratory liver-reserve markers, coagulation, platelet count, volumetry/operative variables, imaging-derived radiomics, and perioperative dynamic data) that is mapped onto study-level performance, allowing predictor domains rather than individual variables to be compared across heterogeneous models. Third, it foregrounds temporal and external validation as the primary signal of generalizability, identifying phase-specific boosting architectures (Shen et al. [42]) and perioperative dynamic data (Wang K et al. [41], Jin et al. [37]) as emerging predictor strategies whose generalization deserves explicit testing—including the documented MIMIC-IV-based AUC drop in [41], which is exactly the kind of distribution shift signal that is rarely highlighted in prior PHLF prediction reviews. These three elements—PROBAST-anchored quality assessment, a structured predictor taxonomy, and a generalization-first reading of the validation literature—jointly differentiate this review from earlier systematic syntheses in the field.

5. Conclusions

Machine-learning-based prediction of post-hepatectomy liver failure remains a promising and rapidly evolving field. Gradient boosting models, ANN or deep-learning approaches, radiomics-integrated pipelines, and time-phased perioperative architectures all showed clinically meaningful discriminatory performance, particularly when models incorporated core liver-reserve markers together with imaging or operative data. However, the evidence should be interpreted cautiously because most studies were retrospective, single-center, and heterogeneous in outcome definition and validation design. The most credible path toward clinical implementation is the development of transparent, externally validated, multicenter models that report discrimination, calibration, and decision utility within standardized PHLF outcome frameworks.

Author Contributions

Conceptualization, C.M., V.G. and R.C.V.; methodology, C.M., S.A.S. and R.C.V.; software, S.A.S.; validation, C.M., V.G. and R.C.V.; formal analysis, C.M. and S.A.S.; investigation, C.M. and A.M.F.; resources, V.G., R.C.V. and C.V.I.F.; data curation, C.M. and A.M.F.; writing—original draft preparation, C.M.; writing—review and editing, V.G., R.C.V., S.A.S. and C.V.I.F.; visualization, S.A.S. and C.M.; project administration, V.G.; supervision, V.G., R.C.V. and C.V.I.F. All authors have read and agreed to the published version of the manuscript.

Funding

The article processing charge was paid by the Victor Babes University of Medicine and Pharmacy Timisoara.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study.

Acknowledgments

The authors used ChatGPT v4.0, an AI language model developed by OpenAI (San Francisco, CA, USA), to exclusively improve the manuscript’s language and readability. All the scientific content, interpretations, and conclusions are the original work of the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Clavien, P.A.; Petrowsky, H.; DeOliveira, M.L.; Graf, R. Strategies for safer liver surgery and partial liver transplantation. N. Engl. J. Med. 2007, 356, 1545–1559. [Google Scholar] [CrossRef] [PubMed]
  2. Jarnagin, W.R.; Gonen, M.; Fong, Y.; DeMatteo, R.P.; Ben-Porat, L.; Little, S.; Corvera, C.; Weber, S.; Blumgart, L.H. Improvement in perioperative outcome after hepatic resection: Analysis of 1803 consecutive cases over the past decade. Ann. Surg. 2002, 236, 397–406. [Google Scholar] [CrossRef]
  3. Rahbari, N.N.; Garden, O.J.; Padbury, R.; Brooke-Smith, M.; Crawford, M.; Adam, R.; Koch, M.; Makuuchi, M.; Dematteo, R.P.; Christophi, C.; et al. Posthepatectomy liver failure: A definition and grading by the International Study Group of Liver Surgery (ISGLS). Surgery 2011, 149, 713–724. [Google Scholar] [CrossRef] [PubMed]
  4. Morandi, A.; Risaliti, M.; Montori, M.; Buccianti, S.; Bartolini, I.; Moraldi, L. Predicting Post-Hepatectomy Liver Failure in HCC Patients: A Review of Liver Function Assessment Based on Laboratory Tests Scores. Medicina 2023, 59, 1099. [Google Scholar] [CrossRef]
  5. Balzan, S.; Belghiti, J.; Farges, O.; Ogata, S.; Sauvanet, A.; Delefosse, D.; Durand, F. The “50-50 criteria” on postoperative day 5: An accurate predictor of liver failure and death after hepatectomy. Ann. Surg. 2005, 242, 824–829. [Google Scholar] [CrossRef]
  6. Schindl, M.J.; Redhead, D.N.; Fearon, K.C.; Garden, O.J.; Wigmore, S.J. The value of residual liver volume as a predictor of hepatic dysfunction and infection after major liver resection. Gut 2005, 54, 289–296. [Google Scholar] [CrossRef] [PubMed]
  7. Kauffmann, R.; Fong, Y. Post-hepatectomy liver failure. Hepatobiliary Surg. Nutr. 2014, 3, 238–246. [Google Scholar]
  8. Cescon, M.; Colecchia, A.; Cucchetti, A.; Peri, E.; Montrone, L.; Ercolani, G.; Festi, D.; Pinna, A.D. Value of transient elastography measured with FibroScan in predicting the outcome of hepatic resection for hepatocellular carcinoma. Ann. Surg. 2012, 256, 706–713. [Google Scholar] [CrossRef]
  9. Bekheit, M.; Grundy, L.; Salih, A.K.A.; Bucur, P.; Vibert, E.; Ghazanfar, M. Post-hepatectomy liver failure: A timeline centered review. Hepatobiliary Pancreat. Dis. Int. 2023, 22, 554–569. [Google Scholar] [CrossRef]
  10. Shindoh, J.; Vauthey, J.N.; Zimmitti, G.; Curley, S.A.; Huang, S.Y.; Mahvash, A.; Gupta, S.; Wallace, M.J.; Aloia, T.A. Analysis of the efficacy of portal vein embolization for patients with extensive liver malignancy and very low future liver remnant volume. J. Am. Coll. Surg. 2013, 217, 126–133. [Google Scholar] [CrossRef]
  11. Pugh, R.N.; Murray-Lyon, I.M.; Dawson, J.L.; Pietroni, M.C.; Williams, R. Transection of the oesophagus for bleeding oesophageal varices. Br. J. Surg. 1973, 60, 646–649. [Google Scholar] [CrossRef]
  12. Kamath, P.S.; Wiesner, R.H.; Malinchoc, M.; Kremers, W.; Therneau, T.M.; Kosberg, C.L.; D’Amico, G.; Dickson, E.R.; Kim, W.R. A model to predict survival in patients with end-stage liver disease. Hepatology 2001, 33, 464–470. [Google Scholar] [CrossRef]
  13. Marasco, G.; Alemanni, L.V.; Colecchia, A.; Festi, D.; Bazzoli, F.; Mazzella, G.; Montagnani, M.; Azzaroli, F. Prognostic value of the albumin-bilirubin grade for the prediction of post-hepatectomy liver failure: A systematic review and meta-analysis. J. Clin. Med. 2021, 10, 2011. [Google Scholar] [CrossRef]
  14. Abdalla, E.K.; Denys, A.; Chevalier, P.; Nemr, R.A.; Vauthey, J.N. Total and segmental liver volume variations: Implications for liver surgery. Surgery 2004, 135, 404–410. [Google Scholar] [CrossRef]
  15. de Graaf, W.; van Lienden, K.P.; Dinant, S.; Roelofs, J.J.T.H.; Busch, O.R.C.; Gouma, D.J.; Bennink, R.J.; van Gulik, T.M. Assessment of future remnant liver function using hepatobiliary scintigraphy in patients undergoing major liver resection. J. Gastrointest. Surg. 2010, 14, 369–378. [Google Scholar] [CrossRef]
  16. Stockmann, M.; Lock, J.F.; Riecke, B.; Heyne, K.; Martus, P.; Fricke, M.; Lehmann, S.; Niehues, S.M.; Schwabe, M.; Lemke, A.-J.; et al. Prediction of postoperative outcome after hepatectomy with a new bedside test for maximal liver function capacity. Ann. Surg. 2009, 250, 119–125. [Google Scholar] [CrossRef] [PubMed]
  17. Makuuchi, M.; Kosuge, T.; Takayama, T.; Yamazaki, S.; Kakazu, T.; Miyagawa, S.; Kawasaki, S. Surgery for small liver cancers. Semin. Surg. Oncol. 1993, 9, 298–304. [Google Scholar] [CrossRef] [PubMed]
  18. Granieri, S.; Bracchetti, G.; Kersik, A.; Frassini, S.; Germini, A.; Bonomi, A.; Lomaglio, L.; Gjoni, E.; Frontali, A.; Bruno, F.; et al. Preoperative indocyanine green (ICG) clearance test for predicting post hepatectomy liver failure: A systematic review and meta-analysis. Photodiagnosis Photodyn. Ther. 2022, 40, 103170. [Google Scholar] [CrossRef] [PubMed]
  19. Serenari, M.; Bonatti, C.; Zanoni, L.; Peta, G.; Tabacchi, E.; Cucchetti, A.; Ravaioli, M.; Pettinato, C.; Bagni, A.; Siniscalchi, A.; et al. The role of hepatobiliary scintigraphy combined with SPECT/CT in predicting severity of liver failure before major hepatectomy. Updates Surg. 2021, 73, 197–208. [Google Scholar] [CrossRef]
  20. Soreide, J.A.; Deshpande, R. Post hepatectomy liver failure (PHLF)—Recent advances in prevention and clinical management. Eur. J. Surg. Oncol. 2021, 47, 216–224. [Google Scholar] [CrossRef]
  21. Rajkomar, A.; Dean, J.; Kohane, I. Machine learning in medicine. N. Engl. J. Med. 2019, 380, 1347–1358. [Google Scholar] [CrossRef]
  22. Topol, E.J. High-performance medicine: The convergence of human and artificial intelligence. Nat. Med. 2019, 25, 44–56. [Google Scholar] [CrossRef] [PubMed]
  23. Hashimoto, D.A.; Rosman, G.; Rus, D.; Meireles, O.R. Artificial intelligence in surgery: Promises and perils. Ann. Surg. 2018, 268, 70–76. [Google Scholar] [CrossRef] [PubMed]
  24. Veerankutty, F.H.; Jayan, G.; Yadav, M.K.; Manoj, K.S.; Yadav, A.; Nair, S.R.S.; Shabeerali, T.U.; Yeldho, V.; Sasidharan, M.; Rather, S.A. Artificial intelligence in hepatology, liver surgery and transplantation. World J. Hepatol. 2021, 13, 1977–1990. [Google Scholar] [CrossRef] [PubMed]
  25. Kang, C.M.; Ku, H.J.; Moon, H.H.; Kim, S.-E.; Jo, J.H.; Choi, Y.I.; Shin, D.H. Predicting Safe Liver Resection Volume for Major Hepatectomy Using Artificial Intelligence. J. Clin. Med. 2024, 13, 381. [Google Scholar] [CrossRef]
  26. Yoshino, K.; Yoh, T.; Taura, K.; Seo, S.; Ciria, R.; Briceno-Delgado, J. A systematic review of prediction models for post-hepatectomy liver failure. HPB 2021, 23, 1311–1320. [Google Scholar] [CrossRef]
  27. Collins, G.S.; Reitsma, J.B.; Altman, D.G.; Moons, K.G.M. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD). Ann. Intern. Med. 2015, 162, 55–63; W1–W33. [Google Scholar]
  28. Wolff, R.F.; Moons, K.G.M.; Riley, R.D.; Whiting, P.F.; Westwood, M.; Collins, G.S.; Reitsma, J.B.; Kleijnen, J.; Mallett, S. PROBAST: A tool to assess the risk of bias and applicability of prediction model studies. Ann. Intern. Med. 2019, 170, 51–58. [Google Scholar] [CrossRef]
  29. Moher, D.; Liberati, A.; Tetzlaff, J.; Altman, D.G. Preferred reporting items for systematic reviews and meta-analyses: The PRISMA statement. PLoS Med. 2009, 6, e1000097. [Google Scholar] [CrossRef]
  30. Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021, 372, n71. [Google Scholar] [CrossRef]
  31. Mai, R.Y.; Lu, H.Z.; Bai, T.; Liang, R.; Lin, Y.; Ma, L.; Xiang, B.D.; Wu, G.B.; Li, L.Q.; Ye, J.Z. Artificial neural network model for preoperative prediction of severe liver failure after hemihepatectomy in patients with hepatocellular carcinoma. Surgery 2020, 168, 643–652. [Google Scholar] [CrossRef]
  32. Zhu, W.S.; Shi, S.Y.; Yang, Z.H.; Song, C.; Shen, J. Radiomics model based on preoperative gadoxetic acid-enhanced MRI for predicting liver failure. World J. Gastroenterol. 2020, 26, 1208–1220. [Google Scholar] [CrossRef]
  33. Wang, J.; Zheng, T.; Liao, Y.; Geng, S.; Li, J.; Zhang, Z.; Shang, D.; Liu, C.; Yu, P.; Huang, Y.; et al. Machine learning prediction model for post-hepatectomy liver failure in hepatocellular carcinoma: A multicenter study. Front. Oncol. 2022, 12, 986867. [Google Scholar] [CrossRef]
  34. Xu, X.; Xing, Z.; Xu, Z.; Tong, Y.; Wang, S.; Liu, X.; Ren, Y.; Liang, X.; Yu, Y.; Ying, H. A deep learning model for prediction of post hepatectomy liver failure after hemihepatectomy using preoperative contrast-enhanced computed tomography: A retrospective study. Front. Med. 2023, 10, 1154314. [Google Scholar] [CrossRef] [PubMed]
  35. Li, C.; Wang, Q.; Zou, M.; Cai, P.; Li, X.; Feng, K.; Zhang, L.; Sparrelid, E.; Brismar, T.B.; Ma, K. A radiomics model based on preoperative gadoxetic acid-enhanced magnetic resonance imaging for predicting post-hepatectomy liver failure in patients with hepatocellular carcinoma. Front. Oncol. 2023, 13, 1164739. [Google Scholar] [CrossRef] [PubMed]
  36. Tashiro, H.; Onoe, T.; Tanimine, N.; Tazuma, S.; Shibata, Y.; Sudo, T.; Sada, H.; Shimada, N.; Tazawa, H.; Suzuki, T.; et al. Utility of machine learning in the prediction of post-hepatectomy liver failure in liver cancer. J. Hepatocell. Carcinoma 2024, 11, 1323–1330. [Google Scholar] [CrossRef]
  37. Jin, Y.; Li, W.; Wu, Y.; Wang, Q.; Xiang, Z.; Long, Z.; Liang, H.; Zou, J.; Zhu, Z.; Dai, X. Online interpretable dynamic prediction models for clinically significant posthepatectomy liver failure based on machine learning algorithms: A retrospective cohort study. Int. J. Surg. 2024, 110, 7047–7057. [Google Scholar] [CrossRef] [PubMed]
  38. Famularo, S.; Maino, C.; Milana, F.; Ardito, F.; Rompianesi, G.; Ciulli, C.; Conci, S.; Gallotti, A.; La Barba, G.; Romano, M.; et al. Preoperative prediction of post hepatectomy liver failure after surgery for hepatocellular carcinoma on CT-scan by machine learning and radiomics analyses. Eur. J. Surg. Oncol. 2025, 51, 109462. [Google Scholar] [CrossRef]
  39. Tang, T.; Guo, T.; Zhu, B.; Tian, Q.; Wu, Y.; Liu, Y. Interpretable machine learning model for predicting post-hepatectomy liver failure in hepatocellular carcinoma. Sci. Rep. 2025, 15, 15469. [Google Scholar] [CrossRef]
  40. Yuan, J.; Zhang, R.Q.; Guo, Q.; Tuerganaili, A.; Shao, Y.M. Controlling nutritional status score predicts posthepatectomy liver failure: An online interpretable machine learning prediction model. Eur. J. Gastroenterol. Hepatol. 2025, 37, 875–884. [Google Scholar] [CrossRef]
  41. Wang, K.; Yang, Q.; Li, K.; Tang, S.; Zhang, B.; Liao, X.; Du, S.; Fu, W.; Li, Z.; Chen, H.; et al. Learning-based early detection of post-hepatectomy liver failure using temporal perioperative data: A nationwide multicenter retrospective study in China. EClinicalMedicine 2025, 83, 103220. [Google Scholar] [CrossRef]
  42. Shen, H.; Yuan, T.; Si, A.; Shen, Y.; Liu, J.; Jin, L.; Xie, Z.; Zhang, H.; Wei, W.; Dai, Y.; et al. Liver regeneration-associated machine learning architecture integrating time-phased predictions for post-hepatectomy liver failure. EClinicalMedicine 2025, 90, 103661. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
  43. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Association for Computing Machinery: New York, NY, USA, 2016; pp. 785–794. [Google Scholar]
  44. Fagenson, A.M.; Gleeson, E.M.; Pitt, H.A.; Lau, K.N. Albumin-bilirubin score vs model for end-stage liver disease in predicting post-hepatectomy outcomes. J. Am. Coll. Surg. 2020, 230, 637–645. [Google Scholar] [CrossRef]
  45. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4765–4774. [Google Scholar]
  46. Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 2019, 1, 206–215. [Google Scholar] [CrossRef] [PubMed]
  47. El-Serag, H.B. Hepatocellular carcinoma. N. Engl. J. Med. 2011, 365, 1118–1127. [Google Scholar] [CrossRef] [PubMed]
  48. Riley, R.D.; Ensor, J.; Snell, K.I.E.; Harrell, F.E., Jr.; Martin, G.P.; Reitsma, J.B.; Moons, K.G.M.; Collins, G.; van Smeden, M. Calculating the sample size required for developing a clinical prediction model. BMJ 2020, 368, m441. [Google Scholar] [CrossRef]
  49. Van Calster, B.; McLernon, D.J.; van Smeden, M.; Bottolo, L.; Steyerberg, E.W. Calibration: The Achilles heel of predictive analytics. BMC Med. 2019, 17, 230. [Google Scholar] [CrossRef]
  50. Sendak, M.P.; Gao, M.; Brajer, N.; Balu, S. Presenting machine learning model information to clinical end users with model facts labels. npj Digit. Med. 2020, 3, 41. [Google Scholar] [CrossRef]
  51. Rajkomar, A.; Oren, E.; Chen, K.; Dai, A.M.; Hajaj, N.; Hardt, M.; Liu, P.J.; Liu, X.; Marcus, J.; Sun, M.; et al. Scalable and accurate deep learning with electronic health records. npj Digit. Med. 2018, 1, 18. [Google Scholar] [CrossRef]
  52. Koch, M.; Garden, O.J.; Padbury, R.; Rahbari, N.N.; Adam, R.; Capussotti, L.; Fan, S.T.; Yokoyama, Y.; Crawford, M.; Makuuchi, M.; et al. Bile leakage after hepatobiliary and pancreatic surgery: A definition and grading of severity by the ISGLS. Surgery 2011, 149, 680–688. [Google Scholar] [CrossRef] [PubMed]
  53. Lambin, P.; Leijenaar, R.T.H.; Deist, T.M.; Peerlings, J.; de Jong, E.E.C.; van Timmeren, J.; Sanduleanu, S.; Larue, R.T.H.M.; Even, A.J.G.; Jochems, A.; et al. Radiomics: The bridge between medical imaging and personalized medicine. Nat. Rev. Clin. Oncol. 2017, 14, 749–762. [Google Scholar] [CrossRef] [PubMed]
  54. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Association for Computing Machinery: New York, NY, USA, 2016; pp. 1135–1144. [Google Scholar]
  55. Steyerberg, E.W.; Vergouwe, Y. Towards better clinical prediction models: Seven steps for development and an ABCD for validation. Eur. Heart J. 2014, 35, 1925–1931. [Google Scholar] [CrossRef]
  56. Lee, J.; Kwon, H. Hybrid Ensemble Learning for Malicious URL Detection with BERT and Boosting Models. IEEE Access 2025, 14, 62045–62058. [Google Scholar] [CrossRef]
  57. Kasri, W.; Himeur, Y.; Alkhazaleh, H.A.; Tarapiah, S.; Atalla, S.; Mansoor, W.; Al-Ahmad, H. From Vulnerability to Defense: The Role of Large Language Models in Enhancing Cybersecurity. Computation 2025, 13, 30. [Google Scholar] [CrossRef]
  58. Kwon, H.; Kim, D.-J. Conformal Selective Prediction with Cost-Aware Deferral for Safe Clinical Triage under Distribution Shift. Sci. Rep. 2026, 16, 10016. [Google Scholar] [CrossRef]
Figure 1. PRISMA 2020 flow diagram illustrating the study selection process for the systematic review.
Figure 2. Best reported non-training AUC for each included study, with validation or external/prospective values labeled directly on the bars.
Figure 3. Total cohort size per included study in the 12-study evidence base.
Table 1. Baseline characteristics and study design features of the studies included in the final analysis.
| Study [Ref] | Year | Country | Design | Centers | Total N | PHLF Rate (%) | Indication | PHLF Definition |
|---|---|---|---|---|---|---|---|---|
| Mai et al. [31] | 2020 | China | Retro | Single | 353 | 24.9 (dev); 23.9 (val) | HCC hemihepatectomy | Severe PHLF |
| Zhu WS et al. [32] | 2020 | China | Retro | Single | 101 | NR | Cirrhotic HCC; major hepatectomy | Post-op liver failure |
| Wang J et al. [33] | 2022 | China | Retro | Multi (8) | 875 | NR | HCC | PHLF |
| Xu et al. [34] | 2023 | China | Retro | Single | 265 | NR | Mixed hemihepatectomy | ISGLS |
| Li et al. [35] | 2023 | China | Retro | Single | 276 | 24.0 | HCC | PHLF |
| Tashiro et al. [36] | 2024 | Japan | Retro | Single | 334 | 9.3 | Liver cancer (mixed) | PHLF |
| Jin et al. [37] | 2024 | China | Retro | Single | 226 | 10.2 | Mixed hepatectomy | Clinically significant PHLF |
| Famularo et al. [38] | 2025 | Italy/Europe | Retro | Multi (13) | 500 | 3.4 | HCC | PHLF |
| Tang et al. [39] | 2025 | China | Retro + Pro val | Single | 374 | NR | Resectable HCC | PHLF |
| Yuan et al. [40] | 2025 | China | Retro | Single | 464 | NR | HCC | PHLF |
| Wang K et al. [41] | 2025 | China + MIMIC | Retro | Multi (6 + ext) | 2074 | NR | Mixed hepatectomy | ISGLS |
| Shen et al. [42] | 2025 | China | Retro | Multi (3) | 1071 | NR | Major hepatectomy | PHLF |
Table 2. Machine learning model characteristics, reported performance metrics, and validation strategies of the included studies.
| Study [Ref] | Best ML Algorithm | Reported AUC | Sensitivity | Specificity | Accuracy | Validation Type | No. Predictors | Comparator/Note |
|---|---|---|---|---|---|---|---|---|
| Mai [31] | ANN | 0.880 dev; 0.876 val | NR | NR | NR | Split-sample val | 5 | Outperformed LR/scores |
| Zhu [32] | Radiomics-based model | 0.894 | NR | NR | NR | Internal | ICG-R15 + 5 rad | Better than clin/rad alone |
| Wang [33] | LightGBM | 0.944 train; 0.870 val; 0.822 test | NR | NR | NR | Multi-cohort split | 11 | Higher than non-invasive models |
| Xu [34] | CNN (DL) | 0.7927 | NR | NR | 84.15% | 5-fold CV | CT image model | NR |
| Li [35] | Combined radiomics model | 0.84 train; 0.82 test | NR | NR | NR | Split-sample test | ALBI + ICG-R15 + 16 rad | Better than clin/rad alone |
| Tashiro [36] | XGBoost | 0.863 | 55.6% (recall) | 96.7% | 93.1% | Train/test split | 12 | Higher than ALBI/FIB-4 |
| Jin [37] | ANN | 0.766 pre; 0.851 post; 0.720/0.731 temporal | NR | NR | NR | 5-fold CV + temporal | 3 pre; 4 post | Higher than MELD/FIB-4/ALBI/APRI |
| Famularo [38] | Averaging ensemble | 0.901 test | 80.0% | 89.5% | 89.2% | Train/test split | 19 PCA + 5 clin | Better than RF/XGB alone |
| Tang [39] | XGBoost | 0.983 train; 0.981 val; 0.942 pros | NR | NR | NR | Internal + prospective | 3 | Higher than traditional models |
| Yuan [40] | LightGBM | 0.927 train; 0.703 test; 0.808 val | NR | NR | NR | Train/test/validation | CONUT-centered | Online model |
| Wang [41] | Temporal DL model | 0.952 internal; 0.884 ext; 0.654 MIMIC | NR | NR | NR | External + Western ext | Periop temporal data | Higher than competing algos |
| Shen [42] | PILOT architecture | 0.754–0.904 train; 0.740–0.895 val | NR | NR | NR | Two external cohorts | 10/15/20 phase-specific | Better than traditional models |
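To make the validation workflow summarized in Table 2 concrete, the sketch below illustrates the split-sample gradient-boosting pipeline (fit on one partition, report discrimination as AUC on a held-out partition) that the most frequent best-performing models in this review follow. It is not taken from any included study: the predictor names merely mirror commonly reported domains (bilirubin, coagulation, platelet count, extent of resection), and the data, coefficients, and hyperparameters are synthetic placeholders chosen only for illustration.

```python
# Minimal sketch of the split-sample workflow most studies in Table 2 report:
# train a gradient-boosted classifier on routine liver-reserve predictors and
# evaluate discrimination (AUC) on a held-out set. Feature names and the
# synthetic data below are illustrative only, not taken from any included study.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

rng = np.random.default_rng(42)
n = 500  # hypothetical cohort size

# Synthetic stand-ins for commonly reported predictor domains
# (bilirubin, coagulation, platelet count, extent of resection).
X = pd.DataFrame({
    "total_bilirubin": rng.lognormal(2.3, 0.4, n),   # umol/L
    "inr": rng.normal(1.1, 0.15, n),
    "platelets": rng.normal(180, 60, n),              # 10^9/L
    "major_resection": rng.integers(0, 2, n),
})
# Synthetic outcome loosely driven by the same variables (illustration only)
logit = (0.02 * X["total_bilirubin"] + 2.0 * (X["inr"] - 1.1)
         - 0.01 * X["platelets"] + 0.8 * X["major_resection"] - 1.0)
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = XGBClassifier(
    n_estimators=300, max_depth=3, learning_rate=0.05,
    eval_metric="logloss", random_state=0
)
model.fit(X_train, y_train)

# Discrimination on the held-out partition, as reported in Table 2
val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(f"Held-out AUC: {val_auc:.3f}")
```

In the included studies, this held-out AUC is typically reported alongside calibration measures and head-to-head comparisons with conventional scores such as ALBI, MELD, or FIB-4.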
Table 3. Overview of input domains, validation strategy, interpretability, and major methodological features of the included studies.
| Study | Laboratory Inputs | Imaging Inputs | Operative Inputs | Dynamic Data | Centers | Internal Validation | External Validation | Outcome | Comparator | Interpretability (XAI) | Key Strength | Key Limitation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mai [31] | Labs + FLR | No | Yes | No | 1 | Yes | No | sPHLF | LR score | Risk strata | Clear preop model | 1 center |
| Zhu [32] | ICG + labs | MRI rad | Yes | No | 1 | Yes | No | Liver fail. | Clinical | Nomogram | Image fusion | n = 101 |
| Wang [33] | Routine labs | No | Yes | No | Multi | Yes | No | PHLF | Scores | SHAP | Large cohort | No true external |
| Xu [34] | Minimal tab. | CT | No | No | 1 | Yes | No | PHLF | NR | DL workflow | Image-only | No external |
| Li [35] | ALBI + ICG | MRI rad | No | No | 1 | Yes | No | PHLF | Clin/Rad | Nomogram | Calibrated | 1 center |
| Tashiro [36] | ALB/PT/ICG | No | Res. vol. | No | 1 | Yes | No | PHLF | ALBI/FIB-4 | Feat. imp. | 15-model comparison | Low recall |
| Jin [37] | Cr/TBIL/CP | No | Extent | Yes | 1 | Yes | Yes | csPHLF | MELD/FIB-4 | SHAP + web | Temporal test | n = 226 |
| Famularo [38] | MELD + clin. | CT rad | Major | No | Multi | Yes | No | PHLF | RF/XGB/SVM | Ensemble | 13-center set | Low event rate |
| Tang [39] | TBIL/MELD/ICG | No | No | No | 1 | Yes | Yes | PHLF | Trad. mdl | SHAP | Prospective val. | 1 center |
| Yuan [40] | Count + labs | No | No | No | 1 | Yes | Yes | PHLF | NR | SHAP + online | Simple deploy | Small ext. set |
| Wang [41] | Periop labs | No | Yes | Yes | Multi | Yes | Yes | PHLF | Competing ML | SHAP + assist | China + MIMIC | MIMIC drop |
| Shen [42] | Labs + regen | No | Yes | Yes | Multi | Yes | Yes | PHLF | Trad. mdl | SHAP + phase | Biomarker ext. | Complex inputs |
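Because SHAP is the interpretability method most often listed in the XAI column of Table 3, the sketch below shows, under the same synthetic-data assumption as the previous example, how tree-based SHAP values are computed for a fitted gradient-boosted classifier and aggregated into a global ranking of predictors. The model and data are placeholders; in the included studies the explainer is applied to the actual PHLF model and its clinical inputs.

```python
# Minimal sketch of the SHAP interpretability step reported in Table 3.
# The model and data are synthetic placeholders, not any study's pipeline.
import numpy as np
import shap
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))            # stand-in predictor matrix
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=300) > 1).astype(int)

model = XGBClassifier(n_estimators=100, max_depth=3, eval_metric="logloss")
model.fit(X, y)

# TreeExplainer yields one SHAP value per patient per predictor; global
# importance is typically summarized as the mean absolute SHAP value.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
global_importance = np.abs(shap_values).mean(axis=0)
print("Mean |SHAP| per predictor:", np.round(global_importance, 3))

# shap.summary_plot(shap_values, X)  # beeswarm plot shown in several studies
```

Global mean-absolute SHAP rankings, together with the beeswarm summary plot, are the outputs typically used to communicate which liver-reserve and operative variables drive individual PHLF risk estimates.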
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
