Enhancing Pulmonary Embolism Mortality Risk Stratification Using Machine Learning: The Role of the Neutrophil-to-Lymphocyte Ratio

(1) Background: Acute pulmonary embolism (PE) is a significant public health concern that requires efficient risk estimation to optimize patient care and resource allocation. The purpose of this retrospective study was to show the correlation of NLR (neutrophil-to-lymphocyte ratio) and PESI (pulmonary embolism severity index)/sPESI (simplified PESI) in determining the risk of in-hospital mortality in patients with pulmonary thromboembolism. (2) Methods: A total of 160 patients admitted at the County Clinical Emergency Hospital of Sibiu from 2019 to 2022 were included and their hospital records were analyzed. (3) Results: Elevated NLR values were significantly correlated with increased in-hospital mortality. Furthermore, elevated NLR was associated with PESI and sPESI scores and their categories, as well as the individual components of these parameters, namely increasing age, hypotension, hypoxemia, and altered mental status. We leveraged the advantages of machine learning algorithms to integrate elevated NLR into PE risk stratification. Utilizing two-step cluster analysis and CART (classification and regression trees), several distinct patient subgroups emerged with varying in-hospital mortality rates based on combinations of previously validated score categories or their defining elements and elevated NLR, WBC (white blood cell) count, or the presence of COVID-19 infection. (4) Conclusion: The findings suggest that integrating these parameters in risk stratification can aid in improving the predictive accuracy of estimating the in-hospital mortality of PE patients.


Introduction
Venous thromboembolism (VTE), comprising pulmonary embolism (PE) and deep venous thrombosis (DVT), is the third most prevalent acute cardiovascular syndrome, after myocardial infarction and stroke [1]. VTE, with its potentially debilitating and often even fatal progression, poses a significant public health concern, particularly given its rising incidence in an aging population [2][3][4].
Accurate risk estimation in PE is of paramount importance for the efficient allocation of medical resources in an effort to enhance patient care. The pulmonary embolism severity index (PESI) stands as a prominent and validated tool for evaluating 30-day mortality risk, incorporating eleven distinct factors [5]. Additionally, a condensed version, the simplified PESI (sPESI), has been developed, demonstrating a high efficiency [6].
Other measurements include those obtained by echocardiography. While echocardiographic parameters alone may not possess high specificity or sensitivity for PE, certain measurements can suggest compromised right ventricular function. These include an enlarged right ventricle, diminished pulmonary acceleration time, reduced tricuspid annular plane systolic excursion (TAPSE), and an elevated RV/LV ratio [7,8].
The COVID-19 pandemic has notably intensified the focus on managing patients with pulmonary embolism (PE), primarily due to the hypercoagulability associated with the virus [9]. This heightened interest has also been driven by the well-established link between COVID-19 and a wide array of thrombotic complications [10][11][12].
Efforts to further refine risk stratification in acute PE are underway and certain parameters have shown promise in this regard. NLR, for example, has been established as a useful metric in predicting outcomes in PE patients. Elevated NLR levels have been shown to correlate with increased mortality and length of hospital stay, suggesting that NLR could augment traditional risk stratification scores like PESI and sPESI [13,14].
Furthermore, NLR has emerged as a potential prognostic marker across diverse conditions such as sepsis, pneumonia, COVID-19, and neoplastic diseases. Despite the absence of a consensus on the optimal NLR threshold, a higher value of this parameter has been recognized as an independent marker of immune system imbalance and mortality risk in both general and disease-specific cohorts [15]. At its core, the NLR is postulated to mirror the balance between acute inflammation (neutrophil count) and adaptive immunity (lymphocyte count) [16], explaining its value in the realm of chronic diseases, where inflammation and immunity play pivotal roles. Moreover, inflammation and oxidative stress are acknowledged as key contributors to the pathogenesis of cardiovascular disease, spurring extensive research into inflammatory biomarkers. Particularly, elevated NLR levels have been independently and significantly linked with a more severe prognosis in a wide range of cardiovascular afflictions, as well as increased risks of all-cause mortality, cardiovascular mortality, and mortality from other causes [17][18][19]. NLR stands out due to its cost-effectiveness, accessibility, and capability to enhance risk stratification beyond traditional scores, offering crucial insights into predicting both in-hospital and long-term mortality [20,21].
Emerging evidence underscores a significant overlap in the pathogenic mechanisms of hypercoagulability and inflammation in cases of COVID-19 and pulmonary embolism, mainly showcasing the involvement of cellular-mediated immunity and the use of NLR as a potential marker in this regard [22]. The convergence of these mechanisms is particularly relevant in the pursuit of improved mortality prediction methods.
Although prior research has identified associations between NLR [14], COVID-19 [23], and mortality in PE, integrating these factors with established clinical scores has not been sufficiently explored. The purpose of the present study was to show the correlation of NLR and PESI/sPESI in determining the risk of in-hospital mortality in patients with pulmonary thromboembolism. In addition, we sought to investigate the impact of COVID-19 infection and other parameters on the aforementioned outcome. By employing machine learning techniques, specifically two-step cluster analysis and classification and regression trees, we aimed to refine patient risk stratification and enhance the predictive accuracy of existing tools.

Study Design and Data Collection
We conducted a retrospective analysis of data extracted from the hospital records of 160 patients admitted to the County Clinical Emergency Hospital of Sibiu diagnosed with acute pulmonary embolism between January 2019 and December 2021. To achieve this, the records were searched within the primary and secondary diagnoses fields for the diagnosis codes corresponding to acute pulmonary embolism according to the International Classification of Diseases, ICD-10 (I26.0; I26.9). Only entries containing acute events were included in the study; i.e., instances where the aforementioned diagnosis codes referred to previous events in the patient's history were excluded. Diagnostic criteria were checked according to the most recent guidelines published by the European Society of Cardiology (ESC) on the diagnosis and management of acute pulmonary embolism [8] and relied mainly on imaging confirmation of pulmonary embolism via computed tomography pulmonary angiography.
Characteristics describing patient demographics (age, gender), medical history, and clinical presentation (including vital parameters on arrival), as well as laboratory and imaging findings were extracted. PESI and sPESI scores were subsequently calculated retrospectively according to the instructions in the guidelines mentioned above. Retrospective computation of these scores has been validated in previous, more extensive retrospective cohort studies, providing reliable results [24]. Classification into risk classes according to PESI and sPESI risk scores was implemented using the following cut-offs, as endorsed by the same guidelines published by the ESC [8]:

• PESI
Very low risk: ≤65 points
Low risk: 66-85 points
Intermediate risk: 86-105 points
High risk: 106-125 points
Very high risk: >125 points

• sPESI
Low risk: 0 points
High risk: ≥1 point(s)

In addition, the presence or absence of concomitant deep vein thrombosis was noted, as well as the presence of COVID-19 infection either on admission or in the 14 days prior to the patient's presentation as documented according to local hospital protocols, which were implemented during the first wave of the COVID-19 pandemic. The neutrophil-to-lymphocyte ratio was computed from the first available blood sample drawn within the first 14 days of hospital admission. Due to the known variability in NLR over time, patients were stratified according to the timeframe within which the first complete blood count (CBC) was available, and a subanalysis of patients with a CBC available in the first 24 h of PE diagnosis was performed. Analysis of NLR values recorded in the first 24 h after PE diagnosis is a similar approach to the one implemented by Efros et al. [14]. Patients with no available blood samples were excluded from the study.
Blood tests were performed after venous blood sample collection. CBCs, including total white blood cell, neutrophil, and lymphocyte counts, were computed using fluorescent flow cytometry on an automatic hematology analyzer. NLR was calculated as the ratio between the absolute number of neutrophil granulocytes and the absolute number of lymphocytes, as described previously [16,17]. Leukocytosis was defined as a WBC (white blood cell) count above 10 × 10³/µL, similarly to Afzal et al. [25]. Variables that presented missing data within the study group were excluded from the analysis. Among other determinations, this included C-reactive protein levels, which were not routinely measured.
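The derived variables described in this section (PESI and sPESI risk classes, NLR, and leukocytosis) can be expressed compactly in code. The following is a minimal sketch rather than the procedure actually used in the study; the function names and the choice of 10³/µL as the input unit are assumptions made for illustration.

```python
def pesi_class(score: int) -> str:
    """Map a PESI point total to the risk classes listed in the text."""
    if score <= 65:
        return "very low"
    if score <= 85:
        return "low"
    if score <= 105:
        return "intermediate"
    if score <= 125:
        return "high"
    return "very high"


def spesi_class(score: int) -> str:
    """Map an sPESI point total to its two-level risk class."""
    if score < 0:
        raise ValueError("sPESI score cannot be negative")
    return "low" if score == 0 else "high"


def neutrophil_lymphocyte_ratio(neutrophils: float, lymphocytes: float) -> float:
    """NLR = absolute neutrophil count / absolute lymphocyte count (same units)."""
    if lymphocytes <= 0:
        raise ValueError("lymphocyte count must be positive")
    return neutrophils / lymphocytes


def has_leukocytosis(wbc: float) -> bool:
    """Leukocytosis as defined in the text: WBC above 10 x 10^3/uL.

    Assumes `wbc` is already expressed in 10^3/uL.
    """
    return wbc > 10.0
```

Both counts passed to `neutrophil_lymphocyte_ratio` must use the same unit, since only their ratio matters.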
Echocardiographic data were also extracted, as all the patients admitted with the diagnosis of PE had undergone echocardiographic evaluation at the time of diagnosis. The presence of a dilated right ventricle (parasternal long axis proximal RVOT diameter above 30 mm), altered right ventricle function (tricuspid annular plane systolic excursion under 16 mm), a dilated inferior vena cava (above 20 mm), or the presence of an intracavitary thrombus was documented. These measurements and their cut-offs were based on the current recommendations published by the European Society of Cardiology (ESC) on the diagnosis and management of acute pulmonary embolism [8] and current guidelines on echocardiographic chamber quantification [26].

Data Processing
Elevated NLR was defined similarly to Efros et al. [14], whereby, in the absence of a unanimously accepted cut-off value for NLR for predicting PE outcomes, patients with an NLR above the median of the collected sample were compared to those with values below this value, essentially providing a dichotomous variable in this regard. Consequently, patient stratification according to the timeframe of blood sample collection (i.e., within the first 24 h of PE diagnosis vs. all patients, regardless of the moment in which the first CBC was acquired within the first 14 days of hospital admission) yielded different cut-offs in our stratified analysis. Our primary outcome variable was in-hospital mortality. Statistical analysis was executed utilizing the IBM SPSS Statistics 21 software package. Numerical variables were described by their mean, median, standard deviation, 95% confidence interval for the mean, minimum, maximum, and interquartile range values. To evaluate the normality of continuous variables, the Shapiro-Wilk and Kolmogorov-Smirnov tests were utilized where appropriate, together with the evaluation of the skewness and kurtosis of the data. Categorical variables were described by computing their frequency distribution.
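The median-based dichotomization described above can be sketched as follows. The function name and the sample values are hypothetical; the point is only that the cut-off is recomputed from whichever group is being analyzed, which is why the whole cohort and the 24 h subgroup end up with different thresholds.

```python
from statistics import median


def flag_elevated_nlr(nlr_values):
    """Return the sample median and a per-patient 'elevated' flag.

    A value counts as elevated only when it is strictly above the
    median of the group passed in, mirroring the definition in the text.
    """
    cutoff = median(nlr_values)
    return cutoff, [value > cutoff for value in nlr_values]


# Hypothetical NLR values, for illustration only.
cutoff, flags = flag_elevated_nlr([1.0, 2.0, 3.0, 4.0, 10.0])
```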

Bivariate Analysis
For continuous variables conforming to a normal distribution, Student's t-tests were applied for comparative analysis. Otherwise, a Mann-Whitney U test was implemented. Chi-square or Fisher exact tests were used to identify significant associations between categorical variables. A p-value less than 0.05 was regarded as indicative of statistical significance.
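As an illustration of one of the categorical tests named above, the sketch below computes a two-sided Fisher exact p-value for a 2×2 table using only the standard library. The study itself ran these tests in SPSS; this version assumes the common convention of summing the probabilities of all tables that are no more likely than the observed one, and the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]].

    With the margins fixed, the top-left cell follows a hypergeometric
    distribution; the p-value sums the probabilities of all tables that
    are at most as likely as the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k: int) -> float:
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_observed = prob(a)
    total = 0.0
    for k in range(max(0, col1 - row2), min(row1, col1) + 1):
        p = prob(k)
        if p <= p_observed * (1 + 1e-9):  # tolerance for float ties
            total += p
    return min(1.0, total)
```

For example, `fisher_exact_two_sided(3, 1, 1, 3)` evaluates the hypothetical table with rows (3, 1) and (1, 3).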

Multivariate Logistic Regression and ROC Curve Comparison
In order to quantify the impact of each predictor for in-hospital mortality identified in the initial bivariate analysis, binary logistic regression was performed.
The methodology employed involved iterative inclusion or exclusion of variables to identify the best-fitting regression model. Numerical variables were mean-centered to mitigate multicollinearity, while categorical variables were transformed into dummy variables. Bootstrapping with 1000 samples was performed to determine 95% confidence intervals for the regression coefficients using the bias-corrected and accelerated (BCa) method. Variables that significantly influenced in-hospital mortality prediction (p < 0.05 and both limits of the 95% confidence interval for coefficients being positive) were retained in the model.
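The resampling idea behind the bootstrapped confidence intervals can be sketched on a simplified case. The example below is not the authors' procedure: it uses the plain percentile interval rather than BCa, and instead of a multivariable model it bootstraps the log odds ratio for a single binary predictor (computed with a 0.5 continuity correction, an assumption added here to avoid division by zero in resamples). All names and data are hypothetical.

```python
import math
import random


def log_odds_ratio(pairs):
    """Log odds ratio of the outcome for exposed vs. unexposed patients.

    `pairs` holds (exposed, died) tuples of 0/1 values. A 0.5 continuity
    correction keeps the ratio finite when a cell is empty.
    """
    a = sum(1 for x, y in pairs if x and y)          # exposed, died
    b = sum(1 for x, y in pairs if x and not y)      # exposed, survived
    c = sum(1 for x, y in pairs if not x and y)      # unexposed, died
    d = sum(1 for x, y in pairs if not x and not y)  # unexposed, survived
    return math.log(((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5)))


def percentile_bootstrap_ci(pairs, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval (simpler than BCa) for the log odds ratio."""
    rng = random.Random(seed)
    n = len(pairs)
    stats = sorted(
        log_odds_ratio([pairs[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical data: 20 exposed patients (10 deaths), 20 unexposed (2 deaths).
data = [(1, 1)] * 10 + [(1, 0)] * 10 + [(0, 1)] * 2 + [(0, 0)] * 18
point_estimate = log_odds_ratio(data)
ci_low, ci_high = percentile_bootstrap_ci(data)
```

Under the retention rule stated above, a predictor would be kept only if both interval limits were positive.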
In addition, ROC curves were computed for numerical variables, in order to further illustrate their comparative accuracy in predicting in-hospital mortality.
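For reference, the area under a ROC curve equals the probability that a randomly chosen patient who died has a higher value of the variable than a randomly chosen survivor, with ties counted as one half. The sketch below computes it directly from that definition; it is an illustration with hypothetical names and inputs, not the SPSS routine used in the study.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    `labels` uses 1 for the event (in-hospital death) and 0 otherwise.
    Each (event, non-event) pair scores 1 if the event case has the higher
    value, 0.5 on a tie, and 0 otherwise.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both outcome classes are required")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))
```

This quadratic-time form is adequate for a cohort of this size; larger datasets would normally use a rank-based computation.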

Machine Learning Algorithms
To further enhance our findings, we employed two machine learning methods, namely a two-step cluster analysis and a classification and regression tree algorithm. This approach was undertaken to discern complex patterns and relationships within the dataset. In our iterative process, variables demonstrating significant correlations with in-hospital mortality were systematically incorporated and subsequently eliminated from the models, aiming to identify the most effective combinations of variables that could reliably predict our target outcome. Two-step clustering used the k-means algorithm and hierarchical agglomerative clustering to delineate groups of patients with similar characteristics regarding the variables employed. While this is an unsupervised method, by feeding the algorithm with variables that correlate with a specific outcome (in our case, in-hospital mortality), the traits of each resulting cluster can converge with respect to this outcome, yielding distinct populations in this regard. We used Akaike's information criterion (AIC) to determine the optimal model fit and allowed for automatic selection of the number of clusters. We selected models with an average silhouette measure of cohesion and separation of at least 0.5 to indicate their robustness. In addition, variables with a predictor importance under 0.5 were discarded from the models to enhance their quality.
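The silhouette criterion used to accept a clustering model can be illustrated with a small sketch. It assumes Euclidean distance between numeric points, which is a simplification: the distance measure applied by the SPSS two-step procedure to mixed categorical data may differ. The function name and inputs are hypothetical.

```python
import math


def mean_silhouette(points, labels):
    """Average silhouette over all points (Euclidean distance).

    For each point, a = mean distance to the other members of its own
    cluster (cohesion) and b = smallest mean distance to the members of
    any other cluster (separation); its silhouette is (b - a) / max(a, b).
    Points in singleton clusters contribute 0.
    """
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    total = 0.0
    for i, point in enumerate(points):
        distances = {c: [] for c in clusters}
        for j, other in enumerate(points):
            if i != j:
                distances[labels[j]].append(math.dist(point, other))
        own = distances[labels[i]]
        if not own:
            continue  # singleton cluster
        a = sum(own) / len(own)
        b = min(
            sum(d) / len(d)
            for c, d in distances.items()
            if c != labels[i] and d
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)
```

When every cluster consists of identical points that differ from all other clusters, each point has a = 0 and the average reaches 1.0, the upper bound of the measure.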
CART decision trees also delineate between different patient groups based on specified characteristics. This technique is, however, supervised, whereby the outcome variable is predefined, thus enabling such algorithms to provide prediction models for the investigated outcome. In addition, it delivers a visual model to illustrate the complex interplay between predictors and outcomes, without attempting, however, to provide a causal explanation for the defined rule set.
The construction of the model is executed from the primary root and expands through branching until further division is no longer feasible, correlating all predictors to anticipate the investigated outcome (in our case, in-hospital mortality). Branching is guided by conditions (internal nodes) imposed on predictor variables, which iteratively segment the data. The endpoint of a branch (referred to as a "leaf" or child node) signifies the conclusive decision of the algorithm. The defining parameters involved in tree growth in CART decision trees are based on the principle of entropy, whereby data segmentation across nodes is governed by the reduction in node impurity from one split to the next. The primary objective is to pinpoint the optimal split point (cut-off value) for a predictor variable. Division criteria are optimized based on the Gini index and the Twoing impurity metrics for categorical variables or the LSD (least squares deviation) impurity measure for continuous variables. The algorithm then ascertains the best node division by choosing the predictor that optimizes the division criterion, culminating in the maximal decrease in node impurity, and repeats the process for each "child" node until no further enhancement is feasible or pre-established stopping criteria are met. The CART decision tree is characterized by its adaptability for managing various data types and distributions, resilience against outliers, and efficient treatment of missing values through surrogate divisions.
Following the tree's full expansion, pruning trims the tree (eliminating nodes that contribute minimal additional information) to the most compact subtree with an acceptable risk level. This mitigates the risk of overfitting the model to the input data and enhances its stability.
In this study, CART models were computed in pruning mode, considering variables that correlated with in-hospital mortality. To grow the decision tree model, we allowed for automatic selection of maximum growth levels (5 by default), with 5 as the minimum number of cases for parent nodes and 3 for child nodes. For the Gini impurity measure, we selected a minimum change in improvement of 0.0001, and the maximum difference in risk in standard errors was set to 0.
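The splitting rule and growth limits described above can be illustrated for a single node. The sketch below searches one numeric predictor for the cut-off that gives the largest decrease in Gini impurity for a binary outcome, using the stated minimums (5 cases for a parent node, 3 for a child node, improvement of at least 0.0001) as defaults. It is a simplified illustration with hypothetical names: candidate cut-offs are taken as midpoints between adjacent observed values (an assumption), and recursion, surrogate splits, and pruning are omitted.

```python
def gini(labels):
    """Gini impurity of a node with binary (0/1) outcome labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)  # equals 1 - p**2 - (1 - p)**2


def best_split(values, labels, min_parent=5, min_child=3, min_improvement=0.0001):
    """Find the cut-off on one predictor with the largest impurity decrease.

    Returns (cutoff, improvement), or (None, 0.0) when the node is too
    small to split or no admissible split improves impurity enough.
    """
    n = len(values)
    if n < min_parent:
        return None, 0.0
    parent_impurity = gini(labels)
    best_cut, best_gain = None, 0.0
    unique = sorted(set(values))
    for low, high in zip(unique, unique[1:]):
        cut = (low + high) / 2  # midpoint between adjacent observed values
        left = [y for v, y in zip(values, labels) if v <= cut]
        right = [y for v, y in zip(values, labels) if v > cut]
        if len(left) < min_child or len(right) < min_child:
            continue
        gain = (
            parent_impurity
            - len(left) / n * gini(left)
            - len(right) / n * gini(right)
        )
        if gain >= min_improvement and gain > best_gain:
            best_cut, best_gain = cut, gain
    return best_cut, best_gain
```

A full tree would apply `best_split` to every candidate predictor at each node, keep the predictor with the largest improvement, and recurse into the two child nodes.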
Both techniques are adept at analyzing continuous and categorical variables, despite employing different underlying mathematical constructs. In addition, they have demonstrated their usefulness in enhancing insights from clinical data, even in small sample sizes. Due to the different approaches of the two algorithms towards classifying data, the results they yield are complementary to each other, offering valuable perspectives on patient categorization. These aspects were described in more detail in our previous work [27].

Bivariate Analysis
There were 160 patients included in our study, 76 (47.5%) of whom were female and 84 (52.5%) male. Table 1 shows the distribution of patients according to the first available CBC timeframe. Tables 2 and 3 show the characteristics of the studied group across genders. No cases with a body temperature under 36 °C were recorded within our study group. Median NLR was 3.7 when considering all patients and 4.69 when analyzing the subgroup of patients with a CBC available within the first 24 h of PE diagnosis. Patients with NLR values above the median were categorized as having elevated NLR. Patients with a CBC available within the first 24 h of PE diagnosis were recategorized according to the median of their group when subanalyzed.
Tables 4 and 5 provide information on the distribution of variables across NLR categories. Tables 6 and 7 exhibit the distribution of variables in reference to in-hospital mortality.

Multivariate Binary Logistic Regression and ROC Curves
Four numerical and ten categorical variables showed significant correlations with in-hospital mortality within the entire group. WBC count and the presence of chronic heart failure, chronic pulmonary disease, or a respiratory rate above 30 breaths/min on admission correlated with mortality when considering the entire group, but not within the <24 h CBC subanalysis. Binary regression was performed to identify the strongest predictors for in-hospital mortality. When analyzing the group as a whole, an adequate binary logistic regression model for predicting in-hospital mortality was obtained when retaining the variables defining the presence of COVID-19 infection, elevated NLR, and altered mental status or hypoxemia on admission.
The model had an overall efficiency of 90.6% (57.7% for predicting in-hospital mortality and 97% for predicting survival) and satisfactory goodness of fit (Hosmer-Lemeshow p-value = 0.589). The results containing the statistical significance of the selected variables and the 95% confidence intervals for the regression coefficients calculated via the BCa method are presented in Table 8.
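The three percentages reported for the model correspond to overall accuracy, sensitivity (correctly predicted deaths), and specificity (correctly predicted survivors) of a 2×2 classification table. The sketch below shows how they relate; the counts in the example are purely hypothetical and are not taken from the study.

```python
def classification_summary(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Accuracy, sensitivity, and specificity from confusion-matrix counts.

    tp: deaths predicted as deaths      fn: deaths predicted as survival
    tn: survivors predicted as survival fp: survivors predicted as deaths
    """
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts for illustration only.
summary = classification_summary(tp=6, fn=4, tn=38, fp=2)
```

Because survivors greatly outnumber deaths in a cohort like this one, overall accuracy is dominated by specificity, which is why a model can reach high overall accuracy while detecting only a modest share of deaths.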
ROC curves for numerical variables found to correlate with in-hospital mortality in the bivariate analysis are illustrated in Figure 1, and the areas under the resulting ROC curves are presented in Table 9. A subanalysis of the group with a CBC available in the first 24 h after PE diagnosis was also performed; however, an adequate binary logistic regression model for predicting in-hospital mortality could not be obtained. ROC curves for numerical variables found to correlate with in-hospital mortality in the bivariate analysis of this subgroup are illustrated in Figure 2, and the areas under the resulting ROC curves are presented in Table 10.

Two-Step Cluster Analysis
We performed a two-step cluster analysis to enhance the understanding of the interplay between traditional risk scores, the presence of COVID-19 infection, and NLR values.
Of the tested models, a robust variant with an average silhouette of cohesion separation of approximately 1.0 was obtained by using the sPESI category, NLR category (i.e., above or below median), and COVID-19 coinfection. The distribution of variables within the model and its clusters is presented in Table 11, and a visual representation of the model is illustrated in Figure 3.

The frequency of in-hospital mortality across resulting clusters is presented in Figure 4. The differences observed were statistically significant (p < 0.01).
Cluster 5, with the highest mortality, was exclusively comprised of COVID-19 patients, who were classified into the high-risk sPESI category and had elevated NLR. Clusters 1-4 were mainly composed of non-COVID-19 patients (with a single COVID-19 case in cluster 3). Cluster 4 contained patients classified into the high-risk sPESI category who had elevated NLR, while patients in cluster 1 (which showed the lowest in-hospital mortality) were categorized as low-risk according to the sPESI score and did not have elevated NLR values. Cluster 2 contained patients classified as high-risk according to the sPESI score without having elevated NLR values, while cluster 3 was comprised of patients categorized as low-risk according to the sPESI score but who had elevated NLR values.
When analyzing the subgroup of patients with a CBC available in the first 24 h after admission, two-step cluster analysis based on the same variables yielded a model containing only four clusters, with an average silhouette of cohesion separation of 0.9. The distribution of variables within the model and its clusters is presented in Table 12. The frequency of in-hospital mortality across resulting clusters is presented in Figure 5, with the differences being statistically significant (p < 0.01).
Similar clustering tendencies were observed in the subanalysis, with one cluster comprised exclusively of COVID-19 patients (cluster 4a) classified as high-risk according to the sPESI score, while the rest of the clusters (clusters 1a-3a) were composed of non-COVID-19 patients. Cluster 3a contained patients who were both classified into the high-risk sPESI category and additionally presented elevated NLR levels, while patients in cluster 1a (which showed the lowest in-hospital mortality) were all categorized as low-risk according to the sPESI score, in addition to most frequently not having elevated NLR values. Cluster 2a contained patients categorized as high-risk according to sPESI, while not exhibiting elevated NLR levels, and showed an in-hospital mortality intermediate between that of clusters 1a and 3a.

CART Decision Tree
A CART decision tree was generated using the following variables: the presence of COVID-19 infection on admission or in the 14 days prior, arterial oxyhemoglobin saturation <90%, the presence of altered mental status, and the presence of NLR above the median. The resulting model is presented in Figure 6.
The CART decision tree showed an overall accuracy of 90% (97.3% for predicting survival and 53.8% for predicting in-hospital death). The decision paths in the algorithm distinguished between several patient groups with distinct characteristics regarding the presence of COVID-19 infection, elevated NLR, and particular defining elements of the PESI/sPESI scores. Notably, the following patient subgroups emerged:
• A group of 9 patients with arterial oxyhemoglobin saturation <90% infected with COVID-19, with an 88.9% predicted chance of in-hospital mortality.
• A group of 9 patients without COVID-19 who presented with altered mental status, arterial oxyhemoglobin saturation <90%, and elevated NLR and had a 66.7% predicted chance of in-hospital mortality.
• A group of 18 patients without COVID-19 who presented with arterial oxyhemoglobin saturation <90% and elevated NLR and had a 27.8% predicted chance of in-hospital mortality.
• A group of 15 patients without COVID-19 who presented with arterial oxyhemoglobin saturation <90% but did not have elevated NLR. These patients had a 6.7% predicted chance of in-hospital mortality.
• A group of 109 patients who presented with normal arterial oxyhemoglobin saturation. These patients were not further stratified and had a 5.5% predicted chance of in-hospital mortality.
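The subgroups listed above can be read as a nested rule set. The sketch below encodes one such reading; the leaf probabilities are those reported in the text, but the order of the splits below the saturation node is inferred from the subgroup descriptions (Figure 6 is not reproduced here) and should be treated as an assumption, as should the function and parameter names.

```python
def predicted_mortality(
    sao2_below_90: bool,
    covid19: bool,
    elevated_nlr: bool,
    altered_mental_status: bool,
) -> float:
    """Predicted in-hospital mortality for the leaf a patient falls into.

    Split order is an inferred reading of the reported subgroups:
    saturation, then COVID-19 status, then NLR, then mental status.
    """
    if not sao2_below_90:
        return 0.055  # 109 patients, not further stratified
    if covid19:
        return 0.889  # 9 patients
    if not elevated_nlr:
        return 0.067  # 15 patients
    if altered_mental_status:
        return 0.667  # 9 patients
    return 0.278      # 18 patients
```

Read this way, the five leaves account for all 160 patients (109 + 9 + 15 + 9 + 18).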
When analyzing the subgroup of patients with a CBC available in the first 24 h, compared to the whole group analysis, a more robust model was obtained when implementing mostly numerical variables concerning traditional PE risk estimation strategies.The result is presented in Figure 7.
This iteration delivered an overall accuracy of 94% (95.5% for predicting survival and 88.2% for predicting in-hospital death). Based on the presence of COVID-19 infection, NLR levels, and PESI score, the algorithm identified the following patient subgroups:
• A group of 5 patients infected with COVID-19 and a PESI score above 131 with a 100% predicted chance of in-hospital mortality.
• A group of 3 patients without COVID-19 who presented with a WBC count above 18.975 × 10³/µL and an NLR above 14.525 with a 100% predicted chance of in-hospital mortality.
• A group of 4 patients without COVID-19 who had a WBC count up to 18.975 × 10³/µL and an NLR up to 14.525 but presented a PESI score above 189. In this group, the predicted chance of in-hospital mortality was 75%.
• A group of 6 patients without COVID-19 and with a WBC count up to 18.975 × 10³/µL but with a PESI score above 131 and an NLR above 14.525. This group was predicted to have a 66.7% chance of in-hospital mortality.
• A group of 52 patients with a PESI score under 131, who had a 3.8% predicted chance of in-hospital mortality.
• A group of 13 patients with a PESI score between 131 and 189, who had a WBC count up to 18.975 × 10³/µL and an NLR up to 14.525. This group had a 0% predicted chance of in-hospital mortality.

Discussion
We conducted a retrospective analysis of 160 patients presenting with acute pulmonary embolism to investigate the significance of NLR concerning in-hospital mortality and its correlation with established prognostic tools, particularly PESI and sPESI scores.
In our study group, males were more susceptible to malignancies and chronic pulmonary diseases.These findings are in agreement with previously described results [28,29].The inclusion of gender-based analysis in our study stemmed from recognized differences in pulmonary embolism (PE) presentation, risk factors, and outcomes between genders, as substantiated by the existing literature [30,31].Acknowledging gender as a potential confounding factor, we aimed to ensure the comprehensive applicability of our findings across both genders.
Elevated NLR, defined by values above the median of the studied group, was significantly associated with a wide array of characteristics correlated with poor prognosis in pulmonary embolism.Importantly, our data demonstrated statistical significance in the association between elevated NLR and in-hospital mortality, as well as higher PESI and sPESI scores.This finding reinforces the previously described results, which support the idea that a high NLR is a reliable predictor of mortality in pulmonary embolism [32].Moreover, elevated NLR showed significant correlations with a series of individual parameters used in the PESI and sPESI scores, known to influence outcomes independently in PE [33].Namely, more advanced age, the presence of neoplasms, arterial hypotension, altered mental status, and oxygen desaturation were associated with elevated NLR.Similar findings have been reported in the literature concerning the link between NLR and age [19] and with cancer [34].
During the COVID-19 pandemic, NLR gained recognition for its potential to identify immune and inflammatory imbalances.Though preliminary in nature, due to the small sample size (16 COVID-19 patients), our data indicated a significant correlation between elevated NLR and COVID-19 infection.Both of these entities have been linked to increased mortality rates among hospitalized patients [35].
The association between COVID-19 and pulmonary embolism has been a subject of considerable attention in previous research [36,37] and the deleterious impact of COVID-19 infection on the mortality of PE patients has been thoroughly documented [23].Our study showed similar correlations.
PESI and sPESI scores were, as anticipated, significantly increased in patients who experienced a fatal outcome. With regard to the individual elements of the PESI/sPESI scores, chronic heart failure, chronic pulmonary disease, arterial hypotension, tachypnoea, altered mental status, and hypoxemia were all correlated with increased in-hospital mortality, as also described in the original article by Aujesky et al. that first defined the PESI score [33].
To explore the specific role of NLR in risk stratification for in-hospital mortality, we utilized a range of machine learning algorithms as a novel methodology in this area of study.
Our two-step cluster analysis yielded a highly robust model that categorized patients based on the presence of COVID-19, sPESI classification, and elevated NLR.This model delineated distinct clusters with significantly disparate in-hospital mortality rates.Notably, COVID-19 emerged as a differentiating factor, identifying a subset within cluster 5, which contained nearly all the COVID-19-positive patients of our study group and had a mortality of 60%.In addition, they displayed elevated NLR, and most were classified as high-risk according to sPESI.In the remaining patient groups, the cluster characterized by both increased NLR and high-risk sPESI (cluster 4) exhibited the highest mortality rate.In contrast, patients with low NLR and low-risk sPESI class (Cluster 1) experienced a 0% mortality rate.The transition to clusters 2 and 3 underscores the potential modulatory effect of NLR on risk stratification.Despite cluster 3 patients being classified as low-risk according to sPESI, they exhibited more frequent elevated NLR levels and were associated with a significantly higher mortality rate compared to patients in cluster 2, who were deemed high-risk based on sPESI criteria.It is important to note that although cluster 3 included a COVID-19-positive patient, this individual did not succumb to the illness, suggesting that factors other than COVID-19 status, such as elevated NLR, maintain their validity in mortality prediction.
The CART algorithm added nuance to the role of elevated NLR, showing an overall accuracy of 90% based on hypoxemia, the presence of COVID-19 infection, elevated NLR, and altered mental status, while also offering a visual framework for risk assessment. The algorithm's performance was particularly high in predicting survival (97.3%), while showing a more modest performance for in-hospital death prediction (53.8%). This discrepancy highlights the potential utility of the algorithm in developing screening tools that could expediently stratify patients, particularly low-risk individuals, utilizing readily available data.
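The gap between overall accuracy and death-prediction performance is what one would expect when survivors greatly outnumber deaths, since overall accuracy is the prevalence-weighted average of the two class-wise rates. The following minimal Python sketch illustrates this relationship. The function name and the counts are hypothetical and chosen only to reproduce a similar pattern; they are not the study's confusion matrix.

```python
def classification_rates(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts.

    The positive class is taken to be in-hospital death (an assumption of
    this sketch): tp = deaths predicted as deaths, fn = deaths predicted as
    survivors, tn = survivors predicted as survivors, fp = survivors
    predicted as deaths.
    """
    sensitivity = tp / (tp + fn)   # share of deaths correctly predicted
    specificity = tn / (tn + fp)   # share of survivors correctly predicted
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical counts (NOT study data): 20 deaths and 100 survivors.
sens, spec, acc = classification_rates(tp=11, fn=9, tn=97, fp=3)

# Overall accuracy equals the prevalence-weighted mean of the two rates.
prevalence = 20 / 120
weighted = prevalence * sens + (1 - prevalence) * spec

print(f"sensitivity={sens:.3f} specificity={spec:.3f} accuracy={acc:.3f}")
```

With these illustrative counts, a sensitivity of 55% and a specificity of 97% still produce an overall accuracy of 90%, because the majority class dominates the weighted average.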
The decision tree identified oxygen saturation below 90% as the primary stratifying factor, significantly correlating with increased in-hospital mortality rates. Subsequent bifurcations in the tree revealed that the presence of COVID-19 infection may influence the risk of in-hospital mortality. Further divisions within the tree highlighted elevated NLR as a modulatory factor, suggesting its utility as a prognostic marker in the hierarchical assessment of patient risk. The recognition of elevated NLR as a significant predictor of mortality invites further investigation into its pathophysiological roles and potential integration into comprehensive risk assessment models. Ultimately, this decision tree provides a data-driven model for prioritizing clinical interventions and resource allocation.
To address potential variations in NLR measurements due to timing, a focused subanalysis was conducted on patients with a complete blood count (CBC) obtained within the first 24 h after PE diagnosis. This timeframe has been previously used in the literature exploring the significance of NLR in PE prognosis estimation [14]. The subanalysis recalibrated the elevated NLR threshold based on the median of this subset, revealing a minor deviation in the cut-off value (4.69 compared to the initial 3.7). While slight variations in the correlation between elevated NLR and PE prognostic factors were noted, the fundamental prognostic significance of NLR, particularly in its association with in-hospital mortality when integrated with conventional risk scores, was reaffirmed.
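A median-based dichotomization of this kind can be sketched in a few lines of Python. The helper names and the cell counts below are hypothetical, and treating values strictly above the median as "elevated" is an assumption of this sketch rather than a detail confirmed here.

```python
from statistics import median


def nlr(neutrophils, lymphocytes):
    """Neutrophil-to-lymphocyte ratio from two absolute counts (same units)."""
    return neutrophils / lymphocytes


def flag_elevated(nlr_values):
    """Dichotomize NLR at the sample median.

    Returns the cut-off and one boolean per patient. Values strictly above
    the median are flagged as elevated (assumed convention).
    """
    cutoff = median(nlr_values)
    return cutoff, [value > cutoff for value in nlr_values]


# Hypothetical (neutrophil, lymphocyte) counts; NOT study data.
counts = [(6.0, 2.0), (9.0, 1.5), (4.0, 2.0), (12.0, 1.0), (5.0, 2.5)]
values = [nlr(n, l) for n, l in counts]
cutoff, elevated = flag_elevated(values)

print(cutoff, elevated)
```

Because the cut-off is recomputed from whichever sample is analyzed, restricting the analysis to a subset (for example, patients with a CBC in the first 24 h) naturally shifts the threshold, as observed between the two analyses.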
In addition, cluster analysis revealed consistent patient stratification patterns, automatically distinguishing COVID-19 patients and highlighting their distinct prognosis in both the entire group and the subanalyzed group. The clustering of non-COVID-19 patients according to sPESI category and elevated NLR showed patterns similar to those in the analysis of the whole group, again showcasing the modulatory impact of NLR on patient prognosis: mortality differed significantly across clusters in a consistent manner in both study groups.
The CART-based algorithm, however, displayed enhanced precision with numerical variables in the subanalysis, while in the analysis of the entire group, the use of categorical variables yielded a satisfactory model. Despite employing an adjusted NLR threshold for patient sub-stratification, this algorithm nonetheless remained consistent in illustrating the overall predictive relevance of NLR for in-hospital mortality across both the complete and subanalyzed groups.

Strengths and Limitations
The current study utilized well-established prognostic tools, notably PESI and sPESI scores, which have been critically validated for evaluating mortality risk. This alignment with clinical standards underpins the methodological robustness of the research. Incorporating advanced machine learning techniques, such as two-step cluster analysis and classification and regression trees (CART), introduces a novel element to the field. These techniques offer new insights into complex patterns that may not be apparent through traditional statistical methods, offering significant advantages in pathologies that are influenced by a wide array of variables, such as pulmonary embolism. The analysis of the interplay between diverse parameters, focusing on NLR and COVID-19 status, provides comprehensive risk stratification insights relevant to clinical practice, highlighting potential avenues for optimizing patient care and resource allocation.
While this study provides valuable insights into the prognostic utility of NLR and its integration with PESI/sPESI scores in the context of pulmonary embolism, particularly during the COVID-19 pandemic, we must acknowledge its limitations. The retrospective, single-center design may limit the generalizability of our findings. Data collected retrospectively can introduce biases that prospectively designed studies might avoid, such as selection bias and information bias. Our findings reflect a single institution's patient population and practices, which may not be representative of broader clinical settings. The absence of randomization is a further limitation that may hinder the generalizability of our findings. Furthermore, the relatively modest sample size constrained our ability to detect smaller effect sizes and may limit the statistical power of our analyses. Future studies should aim to validate our findings through multicentric, prospective research designs, which could provide a more diverse patient population and reduce potential institutional biases. Additionally, larger sample sizes would enhance the reliability of the machine learning models developed and provide a more robust predictive framework for clinical use.
Notwithstanding, the consistency of the results with findings from previous research provides a measure of validation, lending credibility to the data presented. In addition, the novel approaches described can serve as a framework for future larger studies spanning multiple centers that could make use of randomized sample selection and prospective data collection.
A notable limitation is the undefined optimal threshold for NLR, which, in this study, was based on the sample's median. While practical, this may not be the optimal threshold for broader patient populations and different clinical environments. However, this approach aligns with methodologies from other large-scale studies that have yielded significant results [14]. The NLR thresholds used in this study (i.e., 3.7 and 4.69) were relatively close to the range of cut-off values identified in the literature. In particular, one meta-analysis mentioned NLR cut-off values for mortality prediction in PE varying between 5.4 and 9.2 [38]. However, the split identified by the CART algorithm in the subanalysis of patients with a CBC available in the 24 h after PE diagnosis (14.525) further raised concerns regarding the optimal interpretation of this parameter, particularly in the context of time-sensitivity. More extensive studies should therefore explore the impact of this parameter's variation, as well as its definitive cut-off points for predicting cardiovascular outcomes.
Additionally, while the use of machine learning algorithms is innovative, there is a risk of overfitting the models to the particular dataset, which could reduce their predictive accuracy when applied to other populations. For the CART algorithm, however, we mitigated this risk by pruning the resulting trees.

Conclusions
The current study presents a novel contribution to pulmonary embolism risk stratification by incorporating advanced machine learning techniques, which have elucidated complex patterns in patient data, particularly emphasizing the prognostic significance of elevated NLR in PE patients. While the study was retrospective in nature and based on data from a single center, the findings underscore the additive value of NLR in enhancing the predictive accuracy of existing tools in pulmonary embolism, while providing a nuanced perspective on patient risk assessment. Our results emphasize the possibility of refining risk prediction in PE based on NLR values, as well as additional parameters such as WBC count and COVID-19 infection status, setting a precedent for future studies to build upon these findings and methodologies.

Figure 1. ROC curves for numerical variables predicting in-hospital mortality.

Figure 3. Cluster comparison. The frequency of in-hospital mortality across resulting clusters is presented in Figure 4. The differences observed were statistically significant (p < 0.01).

Table 1. First available CBC timeframe.

Table 6. Numerical variables and in-hospital mortality.

Table 7. Categorical variables and in-hospital mortality (% of column categories).

Table 9. Area under ROC curves for numerical variables.

Table 11. Two-step cluster analysis results.

Table 12. Two-step cluster analysis results.