Comparison of Machine Learning Techniques for Mortality Prediction in a Prospective Cohort of Older Adults

As global demographics change, ageing is increasingly of interest in our modern and rapidly changing society. Thus, the application of proper prognostic indices in clinical decisions regarding mortality prediction has assumed significant importance for personalized risk management (i.e., identifying patients who are at high or low risk of death) and for ensuring effective healthcare services to patients. Consequently, prognostic modelling expressed as all-cause mortality prediction is an important step for effective patient management. Machine learning has the potential to transform prognostic modelling. In this paper, results on the development of machine learning models for all-cause mortality prediction in a cohort of healthy older adults are reported. The models are based on features covering anthropometric variables, physical and lab examinations, questionnaires, and lifestyles, as well as wearable data collected in free-living settings, obtained for the “Healthy Ageing Initiative” study conducted on 2291 recruited participants. Several machine learning techniques, including feature engineering, feature selection, data augmentation, and resampling, were investigated for this purpose. A detailed empirical comparison of the impact of the different techniques is presented and discussed. The achieved performance was also compared with a standard epidemiological model. This investigation showed that, for the dataset under consideration, the best results were achieved with Random UnderSampling in conjunction with Random Forest (either with or without probability calibration). While including probability calibration slightly reduced the average performance, it increased the model robustness, as indicated by the narrower 95% confidence intervals.
The analysis showed that machine learning models could provide comparable results to standard epidemiological models while being completely data-driven and disease-agnostic, thus demonstrating the opportunity for building machine learning models on health records data for research and clinical practice. However, further testing is required to significantly improve the model performance and its robustness.


Introduction
According to a UK study carried out in 2010, 15% of the European population (those aged >65 years) consumed approximately 60% of healthcare resources. This work investigates the development of an ML model able to predict mortality in a two to seven year time frame in a cohort of healthy older adults, and aims to provide a comprehensive comparison of the impact of the standard techniques for imbalanced datasets on the overall model. For this purpose, several ML techniques, including feature engineering, feature selection, data augmentation, and resampling, were investigated. The predictive model is based on features covering anthropometric variables, physical and lab examinations, questionnaires, and lifestyles, as well as wearable data collected in free-living settings. Finally, a comparison of the achieved performance with a standard epidemiological model was also undertaken.

Dataset Description
The dataset used in this investigation was provided by the "Healthy Ageing Initiative" (HAI) study [38], conducted in Umeå, Sweden. HAI is an ongoing primary prevention study conducted at a single clinic with the aim of identifying traditional and potentially new risk factors for cardiovascular disorders, falls, and fractures among 70-year-olds in Umeå [39]. The eligibility criteria were: residence in Umeå municipality and an age of exactly 70. There were no exclusion criteria, and population registers were used for recruitment.
For this work, the data collected in the period from January 2013 to December 2017 were considered. The data collection involved a 3-h health examination for each participant, who was then asked to wear an ActiGraph GT3X+ (ActiGraph, LLC, Pensacola, FL, USA) [40] on the hip for one week. The status of the subjects was monitored using population registers in order to know which patients passed away in the time between their data collection and the end of the study date (31 December 2019).
The study aimed to generate a dataset including various and heterogeneous features to evaluate all the possible aspects that influenced older people's daily life, which are summarized below:
• Anthropometry: gender, height, weight, hip and waist circumference, body mass index (BMI);
• Medication/Medical history: diabetes, rheumatoid arthritis, secondary osteoporosis, stroke, heart infarction, blood pressure medication, statins, glucocorticoids, previous fracture, previous fall, parent fractured hip;
• Lab analysis: systolic-diastolic blood pressure, plasma glucose, total-HDL-LDL cholesterol, triglycerides, heart rate, manual muscle test (MMT), peak expiratory flow, hand grip strength non-dominant hand, time up and go (TUG);
• Questionnaires/Lifestyles: mental health/depression (Geriatric Depression Scale-GDS [41]), tobacco and alcohol consumption, AUDIT-C score [42], physical activity and exercise (IPAQ-International Physical Activity Questionnaire [43]);
• Lab tools: GAITRite [44] gait analysis data (i.e., step time, step length, etc.), balance test (sway with full and no vision), bone scans of the non-dominant cortical, trabecular, radius and tibia (via computed tomography-pQCT), total T-score, and analysis of fat and lean mass of various body parts (via X-ray absorptiometry-DXA);
• ActiGraph 1-week accelerometer data (e.g., steps taken, time in light, sedentary, moderate, vigorous activities, energy expenditure, etc.). For the data to be acceptable, the minimum wear time per day was 600 min, for at least four days.
In contrast with standard datasets on mortality prediction, this dataset also includes data collected from wearable sensors based on the suggestion by Burnham et al. in [45] that data obtained from wearable technology could be predictive of, or significantly associated with, health outcomes.
The overall dataset consisted of 156 parameters for 2291 recruited participants. Of those participants, 92 subjects (approx. 4%) died in the two to seven year follow-up period. For this reason, several techniques for imbalanced data have been considered in the implementation of the ML modelling.

Machine Learning Modelling
The following section describes a number of approaches considered during the implementation of the ML modelling for this highly imbalanced dataset [46]. The experiments were implemented in Python 3 (Python Software Foundation, Wilmington, DE, USA). The dataset was split into three partitions: training, validation, and test sets. The test set comprised 30% of the whole available data, while the remaining 70% was split again into 30% assigned to the validation set and 70% to the training set. The split was stratified to guarantee that the proportion between positive ("subjects who passed away by the end of the study") and negative ("subjects who were still alive") cases was the same in every set.
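As a minimal sketch of this splitting scheme (assuming scikit-learn; the feature matrix below is a random placeholder, not the HAI data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2291, 156))            # placeholder for the 156 HAI features
y = (rng.random(2291) < 0.04).astype(int)   # ~4% positive class, as in the cohort

# 30% of the whole data held out as the test set, stratified on the label
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

# the remaining 70% is split again: 70% training, 30% validation
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.30, stratify=y_tmp, random_state=0)
```

Stratification keeps the ~4% positive rate consistent across all three partitions.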

Data Pre-Processing
Categorical features (e.g., "IPAQ" results) were transformed utilizing one-hot encoding. Features characterized by "True/False" values were transformed according to a binary association (True was set to 1, False to 0). Age was discarded from the analysis as all the recruited subjects were 70 years old. Moreover, since the variables presented different scales, a normalization process was performed by estimating the mean and the standard deviation of each feature; each normalized variable was obtained by subtracting the feature mean and dividing by the corresponding standard deviation. The means and standard deviations were calculated on the training set and then applied to the validation and test sets to avoid any possible leakage.
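This z-score normalization, fit on the training set only, matches what scikit-learn's `StandardScaler` does (toy data for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # toy continuous features
X_val = rng.normal(loc=5.0, scale=2.0, size=(30, 3))

# estimate per-feature mean and standard deviation on the training set only...
scaler = StandardScaler().fit(X_train)
X_train_n = scaler.transform(X_train)
X_val_n = scaler.transform(X_val)  # ...then reuse them on validation/test data
```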
Finally, in case of missing entries in the dataset (e.g., because a subject did not complete a specific test), imputation was performed using the mean of the feature for continuous variables and the mode for categorical ones. Again, the means/modes were calculated on the training set and then applied to the validation and test sets to avoid any possible leakage. Only 0.124% of the data was missing overall, which is a negligible value. The features with the most missing values were the AUDIT-C score (data missing from 11.78% of subjects), LDL cholesterol (data missing from 0.35% of subjects), and peak expiratory flow (data missing from 0.26% of subjects).
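The imputation step can be sketched with scikit-learn's `SimpleImputer` (toy values; a categorical column would instead use `strategy="most_frequent"`, i.e., the mode):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, np.nan],
                    [2.0, 3.0],
                    [np.nan, 3.0],
                    [3.0, 5.0]])
X_test = np.array([[np.nan, np.nan]])

# continuous features: fill gaps with the mean computed on the training set
num_imp = SimpleImputer(strategy="mean").fit(X_train)
X_test_imp = num_imp.transform(X_test)  # gaps filled with train means (2.0, ~3.67)
```

Fitting the imputer on the training set and reusing it on validation/test data avoids leakage, as described above.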

Feature Engineering
Feature engineering is the process of using domain knowledge to create new variables to increase the predictive power and accuracy of the model. Given the advanced age of the participants, a first analysis on the implementation of a frailty index was conducted from the existing variables. Frailty is the clinically recognized state of increased vulnerability, related to the ageing process, which is manifested through a decline in the body's physical and psychological reserves. In order to quantify frailty state, it is required to construct a frailty index (FI), based on the accumulation of a variety of functional deficits. Many frailty indices are available in the literature [47]; however, this work adopted the FI implemented by Williams et al. [48] on the UK Biobank dataset, showing the use of multiple features to quantify the FI. The FI is obtained by a combination of variables related to deficits, where for each condition a value is assigned (0-absence, 1-severe presence). The final FI value of each participant is computed as the sum of deficits accrued, divided by the total number of deficits considered. While the original study considered 49 traits covering health, the presence of diseases and disabilities, and mental well-being, in this paper the index was created based on only the 10 common traits which were also included in the available dataset (e.g., MMT score, GDS, diabetes, heart infarction, high blood pressure, high cholesterol, total T-score, secondary osteoporosis, rheumatoid arthritis, and previous fractures).
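The FI computation reduces to a simple ratio. As an illustration (the deficit values below are hypothetical for one participant; the trait names mirror the 10 shared traits listed above):

```python
# Hypothetical deficit indicators for one participant (0 = absent, 1 = present)
deficits = {
    "mmt_deficit": 0, "gds_depression": 0, "diabetes": 0,
    "heart_infarction": 1, "high_blood_pressure": 1, "high_cholesterol": 0,
    "low_t_score": 1, "secondary_osteoporosis": 0,
    "rheumatoid_arthritis": 0, "previous_fracture": 1,
}

# FI = number of deficits accrued / total number of deficits considered
fi = sum(deficits.values()) / len(deficits)
print(fi)  # 0.4
```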
A Mortality Index (MI) was also developed based on the work of Kim et al. [49]. MI can be seen as a lifestyle index generated to detect the effects on health of different behavioral factors. Each feature used for developing the index represented a risk factor with a certain point score, depending on the value assumed by the feature itself or by the characteristics of the participants. The MI for each participant was evaluated by adding up these points, obtaining a total risk score ranging from 0 to 21 points. While the study described in [49] was based on nine risk factors (e.g., age, male sex, smoking, diabetes, systolic blood pressure, triglyceride, total cholesterol, white blood cell count, and hemoglobin), in this work the MI was developed considering only the first seven factors, as the last two were not included in the dataset under consideration in this analysis.

Models and Hyper-Parameters Tuning
Four base classifiers with their related hyper-parameters were considered for the analysis: Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), and AdaBoost. A grid search was employed on the training set to attain optimal values for the hyper-parameters. For each combination of hyper-parameter values, a fivefold stratified cross-validation (CV) procedure was carried out. The combination of hyper-parameters that returned estimates with the highest score was considered to be the optimum. The models were evaluated on the validation set to show that they were able to generalize their predictions with the selected hyper-parameter values without over-fitting. Subsequently, training and validation sets were merged into a single new training set, the models with the selected hyper-parameter values were re-fit on this new training set, and they were finally evaluated on the test set. The prediction scores for both training and test sets were calculated.
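The tuning procedure can be sketched with scikit-learn (toy imbalanced data; the actual grids per model differ and are not specified here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# toy imbalanced training data standing in for the HAI training set
X, y = make_classification(n_samples=300, weights=[0.9], random_state=0)

# exhaustive search over a small illustrative grid, scored by AUC-ROC,
# with a fivefold stratified CV as described in the text
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
).fit(X, y)
best = grid.best_params_  # combination with the highest mean CV score
```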
The scoring metric utilized to optimize the overall model performance is the AUC-ROC; however, given the highly imbalanced dataset available, other useful metrics (AUC-PR, Brier score, F1 score, accuracy, precision, and recall) are also provided for evaluation.

Feature Selection and Outlier Removal
Feature selection is essential to reduce the risk of over-fitting, especially when using a dataset with very high dimensionality. Among all the well-known feature selection methods [50], the algorithm chosen in this work was Forward Selection Component Analysis (FSCA) [51]. FSCA is a technique performing variable selection and dimensionality reduction at the same time. As shown in [52], FSCA can also be successfully adopted to build interpretable and robust systems for anomaly detection. This is possible because, unlike other feature selection techniques, FSCA focuses on selecting the features that best discriminate between the two classes. The pseudo-code for FSCA is shown in Figure S1A (Supplementary Data 1). The main limitation of the method is the need to define the number of features to select, K, a priori. For this analysis, values of K between 5 and 20 were taken into account.
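A compact sketch of the greedy forward-selection loop follows (our own reconstruction from the description above, not the authors' implementation): at each step, the feature whose inclusion best reconstructs the full data matrix in a least-squares sense is added.

```python
import numpy as np

def fsca(X, k):
    """Greedy forward selection: add, one at a time, the column whose
    inclusion maximizes the variance of X explained by a least-squares
    reconstruction from the selected columns."""
    selected = []
    total = np.linalg.norm(X) ** 2
    for _ in range(k):
        best_j, best_explained = None, -np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            S = X[:, selected + [j]]
            # least-squares reconstruction of the full matrix X from subset S
            coef, *_ = np.linalg.lstsq(S, X, rcond=None)
            explained = 1.0 - np.linalg.norm(X - S @ coef) ** 2 / total
            if explained > best_explained:
                best_explained, best_j = explained, j
        selected.append(best_j)
    return selected
```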
Outlier removal is also an important step required to minimize over-fitting. In this work, Isolation Forest was used to detect and remove possible participant outliers in the training set. Isolation Forest is a simple but effective method for identifying possible anomalies in a dataset while requiring few parameters [53]. In particular, the contamination level was set to 0.1, and a total of 50 decision trees was used.
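With scikit-learn's `IsolationForest`, this configuration looks as follows (toy data with a few planted outliers):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(size=(95, 4)),
                     rng.normal(loc=8.0, size=(5, 4))])  # a few planted outliers

# 50 trees and a contamination level of 0.1, as in the text
iso = IsolationForest(n_estimators=50, contamination=0.1, random_state=0)
mask = iso.fit_predict(X_train) == 1  # +1 = inlier, -1 = outlier
X_clean = X_train[mask]               # training set with outliers removed
```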

Monte Carlo Data Augmentation
In case of highly imbalanced datasets, it could be helpful to generate new synthetic samples. This can be achieved by means of data augmentation, a process that increases the amount of training data using information available from the training data itself. While data augmentation is well-known in image-related problems, its application to tabular data or electronic health records (EHRs) is less obvious. This is because it is challenging to create records for new synthetic patients which could be recognized as data describing real patients [54].
In this work, the data augmentation algorithm chosen was the Monte Carlo approach proposed in [55]. This approach was investigated for its ease-of-use and the optimal results achieved in other contexts. In particular, the algorithm computes an initial matrix P, having as many rows as the desired new synthetic samples, where each feature value is randomly drawn within a range defined by the minimum and maximum of the original feature itself. To preserve the fundamental characteristics of the original dataset, the algorithm then uses geometric distances and K-Nearest Neighbors to compute the labels of the new samples. Once each row of the matrix is properly labelled, the algorithm returns P as the final matrix containing all the new synthetic samples. Finally, the generated variables were post-processed to make the synthetic data more realistic (e.g., by forcing the categorical variables to remain categorical). Figure S2A (Supplementary Data 1) shows the pseudo-code for the process.
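The core of the procedure can be sketched as follows (a simplified reconstruction, without the categorical post-processing step; `k` and the seed are illustrative choices):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def monte_carlo_augment(X, y, n_new, k=5, seed=0):
    """Sketch of the Monte Carlo augmentation described above: each feature
    of the matrix P is drawn uniformly within the [min, max] range of the
    original feature; the synthetic rows are then labelled with a
    K-Nearest-Neighbors model fitted on the original data."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    P = rng.uniform(lo, hi, size=(n_new, X.shape[1]))
    labels = KNeighborsClassifier(n_neighbors=k).fit(X, y).predict(P)
    return P, labels

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = (rng.random(100) < 0.3).astype(int)
P, lab = monte_carlo_augment(X, y, n_new=50)
```

Note that, because every feature is drawn independently, correlations between features are not preserved; this is precisely the weakness discussed later in the paper.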

Over/Under-Sampling, Cost-Sensitive Learning, and Probability Calibration
A technique generally used to deal with highly imbalanced datasets is resampling. When resampling is applied, the data used for model training change by under-sampling the majority class, over-sampling the minority class, or both. In this work, several techniques have been considered covering over-sampling (e.g., SMOTE [56], ADASYN [57]), under-sampling (e.g., RUS [58]), and over/under-sampling (e.g., SMOTEENN [59]). SMOTE, ADASYN, RUS, and SMOTEENN were chosen as those techniques are the most widespread sampling techniques in literature.
While standard ML classifiers generally weigh all misclassification errors equally, in imbalanced classification it is common to rely on cost-sensitive learning approaches that account for the different misclassification costs while training the ML model [60]. A simple approach to include cost-sensitive learning in the base classifiers is by introducing a "class_weight" hyper-parameter which controls the ratio of class weight between samples of the majority class and samples of the minority class. A higher ratio gives more emphasis to the minority class. In this work, the ratios considered were 1:10, 1:20, and 1:100. The combination of a sampling technique and cost-sensitive learning is usually performed to handle the data imbalance while achieving high recall and reasonable precision [61].
Finally, another important aspect to be considered is the ability of ML models to predict a probability or probability-like score for class samples. It is usually desired that the distribution of the predicted probability is similar to the distribution observed in the training data. If this condition does not hold, the model may be over-confident in some cases and under-confident in others, especially with highly imbalanced datasets. Hence, probability calibration is required and sometimes needs to be forced by rescaling the achieved probability values to match the distribution observed in the training data [62]. Calibrating probability may be even more essential if resampling techniques are used, as sampling can introduce a bias in the posterior probability [63]. In this work, isotonic regression was used to achieve this purpose via a 3-fold CV.
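Combining these pieces, a sketch of one such configuration (random under-sampling, then RF with isotonic calibration) in plain scikit-learn, with a minimal hand-rolled under-sampler standing in for a library implementation such as imbalanced-learn's `RandomUnderSampler`:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

# Random Under-Sampling: keep every minority sample plus an equally sized
# random subset of the majority class
rng = np.random.default_rng(0)
pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
keep = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])
X_rus, y_rus = X[keep], y[keep]

# RF with isotonic probability calibration via a 3-fold CV, as in the text;
# a cost-sensitive alternative would pass class_weight={0: 1, 1: 10} to RF
clf = CalibratedClassifierCV(
    RandomForestClassifier(random_state=0), method="isotonic", cv=3
).fit(X_rus, y_rus)
proba = clf.predict_proba(X)[:, 1]  # calibrated probability-like scores
```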

Results
All the techniques previously discussed are analyzed in this section. All the possible combinations across the techniques and models have been investigated to ensure a thorough performance evaluation. All the tested models share the data pre-processing steps, feature engineering, FSCA, and Isolation Forest, to prevent over-fitting. The number of features K selected by FSCA was adjusted, together with the models' hyper-parameters, again to prevent over-fitting while achieving the optimal performance. Furthermore, the analysis was repeated six times with different data splits across training, validation, and test sets to show the repeatability of the model performance. The results in this section are reported as the mean performance of every model for every metric considered, as well as its 95% confidence interval (C.I.). It is important to underline that special attention was paid to ensuring that none of the developed models was affected by over-fitting, as is evident from the limited difference in performance between training and testing results.

Epidemiological Model
A standard epidemiological model based on a multivariate Cox proportional-hazards model was developed to provide a benchmark for the evaluation of the ML models implemented. The model was developed based on the methodology reported in [36]. Firstly, all the variables that in the training set reported a p-value larger than 0.1 when compared against the label were removed. Moreover, to eliminate any multi-collinearity in the training dataset, a stepwise approach was adopted to remove features with a Variance Inflation Factor (VIF) larger than 5. This process preserved only 28 variables from the dataset. Finally, a systematic backwards elimination method was performed based on the p-value of each feature, removing at every step the feature with the largest p-value. At each iteration, the AUC-ROC was checked, and the process was repeated until a significant loss in performance occurred. The optimal model was identified as the model with the least number of variables without a significant decrease in model performance. Moreover, the model was also validated on the test set to ensure over-fitting did not occur. Results of the Cox model are shown in Table 1.
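The VIF-based stepwise pruning can be sketched with plain NumPy (our own reconstruction; the threshold of 5 matches the text, and the collinear toy data are illustrative):

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor of each column: VIF_j = 1 / (1 - R^2_j),
    where R^2_j comes from regressing column j on all other columns."""
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        A = np.column_stack([others, np.ones(len(X))])  # add an intercept
        coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ coef
        r2 = 1.0 - resid.var() / X[:, j].var()
        out.append(1.0 / max(1.0 - r2, 1e-12))
    return np.array(out)

def drop_high_vif(X, threshold=5.0):
    """Stepwise removal: repeatedly drop the column with the largest VIF
    until all remaining columns fall below the threshold."""
    cols = list(range(X.shape[1]))
    while True:
        v = vif(X[:, cols])
        if v.max() <= threshold:
            return cols
        cols.pop(int(v.argmax()))
```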

Base Learners
Firstly, base versions of LR, DT, RF, and AdaBoost models have been considered. No further techniques of re-sampling, probability calibration, or cost-sensitive learning have been performed. The results are reported in Table 2, with results on both test and training set. The best performance was achieved by AdaBoost (AUC-ROC: 0.512), closely followed by the other models.

Enhanced Base Learners
All the previous base models have been combined with the different techniques described in the Methods Section.
The results for the LR model are shown in Table S1 (Supplementary Data 2). The best result was achieved when using ADASYN with probability calibration (AUC-ROC: 0.573). Results with only over/under-sampling techniques have an AUC-ROC between 0.512 and 0.532. However, when probability calibration was applied, the general performance tended to increase (AUC-ROC between 0.510 and 0.573) compared to the models without it. Furthermore, applying cost-sensitive learning did not produce significant improvements (AUC-ROC: 0.510-0.536).
The results for the DT model are shown in Table S2 (Supplementary Data 2). The model performance is slightly worse compared to the LR case, with the highest AUC-ROC (0.541) achieved with SMOTE alone, even though the difference with the other over/under-sampling techniques was minimal (minimum AUC-ROC: 0.511). Again, the cost-sensitive learning approach slightly decreased the results (AUC-ROC between 0.507 and 0.535), while the results achieved using probability calibration ranked in between (AUC-ROC between 0.508 and 0.536).
The results for the RF model are shown in Table S3 (Supplementary Data 2). In this case, the highest AUC-ROC value (0.642) was reached with RUS without probability calibration. With RF models probability calibration decreased the large variability across the different sampling techniques (AUC-ROC between 0.532 and 0.606, while it was between 0.516 and 0.642 without probability calibration). Cost-sensitive learning, instead, generally decreased the overall performance (AUC-ROC: 0.506-0.534).
The results for the AdaBoost model are shown in Table S4 (Supplementary Data 2). No cost-sensitive learning was tested with AdaBoost as this classifier does not provide the "class_weight" hyper-parameter. The performance between with and without probability calibration is not significantly different (AUC-ROC: 0.518-0.541 without probability calibration, AUC-ROC: 0.519-0.543 with probability calibration), with the best result obtained with ADASYN.
In summary, among all the possible combinations, Random Forest showed the best performance. A statistical analysis between the best model (RF + RUS with/without probability calibration) and the best base learner shows a statistical difference only in the case with probability calibration; i.e., mean difference: 0.13 (95% C.I.: −0.00851, 0.2685), p-value: 0.065 (without calibration); mean difference: 0.094 (95% C.I.: 0.084, 0.1032), p-value < 0.001 (with calibration). Cost-sensitive learning and probability calibration, used in conjunction with resampling, generally reduced the variability of the model performance, except for LR. Using the different re-sampling techniques tended to improve the performance of some base learners; however, there was no clear winner among the different over/under-sampling techniques, which did not show significant differences in general. A graphical depiction of the results is shown in Figure 1.

Enhanced Base Classifiers with Monte Carlo Data Augmentation
To compensate for the high imbalance of the dataset, the Monte Carlo data augmentation technique was adopted by creating synthetic data for both classes, after the adoption of the FSCA and Isolation Forest techniques. Again, all the possible combinations of models with probability calibration, cost-sensitive learning, and sampling techniques have been investigated.
The results for the LR model are shown in Table S5 (Supplementary Data 2). The highest AUC-ROC was achieved with SMOTE and ADASYN without probability calibration (AUC-ROC: 0.539). The highest performance was obtained with over/under-sampling techniques alone, with AUC-ROC between 0.515 and 0.539, while probability calibration resulted in an AUC-ROC between 0.507 and 0.530, and cost-sensitive learning between 0.514 and 0.535.
The results for the DT model are shown in Table S6 (Supplementary Data 2). In this case, the adoption of sampling techniques alone provided the worst results (AUC-ROC between 0.510 and 0.519, obtained with ADASYN). Results tended to be much higher when probability calibration was adopted (AUC-ROC: 0.514-0.530), and with cost-sensitive learning (AUC-ROC: 0.521-0.529).
The results for the RF model are shown in Table S7 (Supplementary Data 2). The best performance was obtained with SMOTE combined with probability calibration (AUC-ROC: 0.535). In this case, cost-sensitive learning tended to slightly reduce the overall performance (AUC-ROC: 0.503-0.530) compared to using sampling techniques only (AUC-ROC 0.511-0.530), with the best results obtained when using probability calibration (AUC-ROC 0.511-0.535).
The results for the AdaBoost model are shown in Table S8 (Supplementary Data 2). The performance of the models without probability calibration ranged between 0.507 and 0.521, while it ranged between 0.510 and 0.529 (using SMOTE) with probability calibration. A graphical depiction of the results is shown in Figure 2. Comparing Figures 1 and 2, as well as Tables S1-S4 and S5-S8, it can be noticed that including synthetic data generated by the Monte Carlo process negatively affected the performance of every model. Across the different techniques, the reduction in performance for LR averaged 0.0055 points (max 0.049), for DT it was 0.004 points (max 0.027), for RF it was 0.028 points (max 0.13), and for AdaBoost it was 0.016 points (max 0.033). When comparing the best models obtained with and without data augmentation, there is a statistically significant difference in favor of the model without Monte Carlo; i.e., mean difference: 0.067 (95% C.I.: 0.04, 0.093), p-value < 0.001.
To investigate the validity of the synthesized samples, a simple but effective test was performed via the development of a model which can classify original vs. synthetic samples. The process for the test was as follows:

• Given the original dataset of 2291 participants, an equal number of new synthetic samples was generated with the Monte Carlo data augmentation technique and merged into the original dataset;
• The labels related to the mortality prediction problem were removed from every subject (both original and synthetic);
• All the data belonging to the original dataset were re-labelled as class 1, while all the synthetic data were re-labelled as class 0;
• The dataset was divided 70/30 (with the 30% as test set), and the 70% was again split 70/30 for training and validation purposes, with stratification applied;
• A standard RF classifier was used to discriminate between the original and the synthetic data. The model was trained on the training set with the hyper-parameters tuned on the validation set, and the optimized model was finally evaluated on the test set. RF was adopted as it was the model which presented the largest difference in performance when using the Monte Carlo technique.
The accuracy achieved by the model was 99% (the relevant confusion matrix is shown in Figure 3), indicating how easily the classifier could discriminate between original and synthetic samples.
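The steps above can be sketched on toy data ("real" samples with correlated features vs. "synthetic" samples with each feature drawn independently within the observed ranges; the 99% figure applies to the paper's data, not this toy example, but the separation is similarly easy):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
X_real = z + 0.1 * rng.normal(size=(500, 3))  # three strongly correlated features
lo, hi = X_real.min(axis=0), X_real.max(axis=0)
X_syn = rng.uniform(lo, hi, size=(500, 3))    # features drawn independently

X = np.vstack([X_real, X_syn])
y = np.array([1] * 500 + [0] * 500)           # 1 = original, 0 = synthetic

# stratified 70/30 split, then a standard RF discriminator
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)
acc = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
```

A high discriminator accuracy indicates that the synthetic samples do not resemble the real ones, which is exactly the failure mode discussed above.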

Discussion
All-cause mortality prediction is significant for the development of personalized risk management. ML has been shown in the literature to outperform existing models for predicting all-cause mortality, not only in subjects with specific health conditions but also in prospective studies [36]. This work developed an ML model able to predict two to seven year all-cause mortality in a cohort of healthy older adults, based on features covering anthropometric variables, physical and lab examinations, questionnaires and lifestyles, and wearable data collected in free-living settings, and provided a comprehensive comparison of the impact of the standard techniques for imbalanced datasets on the overall model. For this purpose, several ML techniques (e.g., data augmentation, resampling) were investigated. A summary of the main results shown in Tables S1-S8, limited to AUC-ROC, AUC-PR, precision, and recall, is illustrated in Table 3 to better support the reader.

Table 3. Enhanced base learners performance (selected cases summary).

In terms of data augmentation, the Monte Carlo approach proposed in [55] was investigated for its ease-of-use and the optimal results achieved in other contexts (i.e., industrial processes). However, with the present dataset, this approach delivered unsatisfactory results, consistently underperforming compared to models without data augmentation. This was due to the creation of new synthetic patient records which were easily distinguishable from real patients' data, as proven in the present investigation: the value of each feature was generated randomly and independently of the other features.
This was a clear signal of the limitations of this Monte Carlo data augmentation technique despite the attention paid in making sure the synthetic data met the criteria of the original dataset (e.g., by forcing categorical variables). Despite this technique being introduced in the literature as a method for augmenting data in predictive fault detection, it seems evident this Monte Carlo technique may not be feasible for health-related datasets of this type.

For this reason, it may be useful to investigate the adoption of other data augmentation techniques in health-related datasets, such as Bayesian networks, a mixture of product of multinomials, or Generative Adversarial Networks (GAN) [64]. The generation of realistic synthetic EHR patient data is still an open question which can potentially mitigate the challenges associated with limited sample sizes [65]. Yet this is an under investigated area in the field.
Moreover, when dealing with a high-class imbalanced dataset, different techniques at data-level (e.g., over/under-sampling) or algorithm-level (e.g., cost-sensitive) are popular in addressing the imbalance. The results obtained show that, as expected, those techniques provide an overall improvement in the performance when compared to the base learners' models. However, the obtained results did not show significant differences across the different techniques, thus confirming the findings shown in [66], namely that there is essentially no performance difference between the over-sampling techniques. As indicated in [67], which compared 85 algorithms carried out on 104 datasets, no sampling technique provided consistently superior performance on each model, therefore indicating that sampling algorithms need to be designed to accommodate the intrinsic characteristics of the dataset under consideration, and that model selection and the sampler's hyperparameters selection are a challenging dataset-dependent process.
This investigation showed that, for the dataset under consideration, the best results were achieved with RUS in conjunction with RF (either with or without probability calibration). While including probability calibration slightly reduced the average performance, it increased the model robustness, as indicated by the narrower 95% confidence intervals, and provided a better AUC-PR score. The 95% C.I. of the model without probability calibration was large (0.504-0.781), highlighting a possible concern in ML models associated with a potential lack of reproducibility. This may be even more evident in small datasets, as the results could be more significantly affected by a particular training/test data split. Therefore, this still presents a challenge for the adoption of ML models into clinical practice. Moreover, the large 95% C.I. shows that the ML model could potentially outperform the Cox epidemiological model used as a baseline; however, on average, Cox still performed better. These results confirm the findings reported in [68], namely that random forests provided good results, comparable to expert-based models, but still did not outperform Cox models, despite their inherent ability to accommodate nonlinearities and interactions and despite being trained on a large dataset of over 80,000 patients. Therefore, more complex approaches, such as ensemble models, should be investigated in the future to further improve the performance when dealing with this highly imbalanced, small dataset.
However, RUS + RF without probability calibration achieved an AUC-PR of 0.337, and its variant with probability calibration an AUC-PR of 0.467, two to three times higher than the baseline obtained with the standard Cox model (AUC-PR: 0.172). Given the similar AUC-ROC performance of the models, the fact that ML provides a better AUC-PR than the standard Cox model indicates that ML misclassifies fewer examples of the minority positive class than the baseline [69] (i.e., fewer false negatives), highlighting the benefits of ML over epidemiological models. Moreover, as is evident in Table S3, RUS + RF with probability calibration showed a high recall (>0.87) with a low precision (0.05), indicating that the model detects nearly all true positives, but that the number of false positives is much larger than the number of false negatives. That is, the model correctly identifies >87% of the deceased participants in the dataset, while producing a large number of false positives (i.e., subjects predicted to die who are actually still alive). In clinical practice, the main goal is to identify "high mortality risk" subjects as accurately as possible, so minimizing false negatives is more important than minimizing false positives (especially when dealing with imbalanced data); a high recall is therefore essential, and this is achieved by the developed model. While these results are promising, further work is still required to improve the overall precision of the model before it is acceptable for clinical practice.
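The relationship between these metrics can be sketched with scikit-learn on a small hypothetical score vector (20% prevalence, purely for illustration): AUC-PR summarizes precision over the recall range, and lowering the decision threshold trades precision for recall, exactly the high-recall/low-precision regime described above.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, precision_score,
                             recall_score, roc_auc_score)

# Hypothetical predicted risk scores; the last two subjects are true positives
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
scores = np.array([0.10, 0.20, 0.15, 0.30, 0.25, 0.10, 0.40, 0.35, 0.50, 0.30])

auc_roc = roc_auc_score(y_true, scores)           # chance level: 0.5
auc_pr = average_precision_score(y_true, scores)  # chance level: prevalence (0.2)

# A low threshold yields high recall at the cost of precision
y_pred = (scores >= 0.30).astype(int)
rec = recall_score(y_true, y_pred)      # both positives caught -> recall 1.0
prec = precision_score(y_true, y_pred)  # 2 TP vs. 3 FP -> precision 0.4
```

Note that the chance-level baseline for AUC-PR is the class prevalence, not 0.5, which is why AUC-PR is the more informative summary on heavily imbalanced data.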
The presented results are comparable to the outcomes reported in similar studies (for example, [70], whose best mortality prediction model showed an AUC-ROC of 0.69, 95% CI: 0.61-0.76, in a sepsis cohort of 2510 patients with 11.5% positive cases). However, as indicated in another example [71], which investigated 90-day mortality prediction models in a cohort of 800 patients (8% positive class), AUC-ROC can portray an overly optimistic picture of a classifier's risk score when applied to imbalanced data, whereas AUC-PR provides better insight into a classifier's performance by focusing on the minority class. Even though this issue is well reported in the clinical literature, AUC-ROC is still the metric generally adopted; for this reason, the present study reports both AUC-ROC and AUC-PR. When comparing [71] with the results achieved in the present analysis, our model provides comparable, if not better, results despite dealing with an even more imbalanced dataset (AUC-PR: 0.467 vs. 0.43 in [71]).
It is worth highlighting that most of the models considered in this study can only provide binary indications of the mortality of the subjects under test, and behave as black boxes, without indicating the rationale behind those indications. This lack of interpretability could limit adoption in clinical settings, as clinicians also focus on model interpretation, which is critical for obtaining domain knowledge from the computational analysis. Decision tree-based models can provide domain knowledge that is easy for clinicians to understand (unlike ensembles, for example), but compared to the other considered models they generally achieve lower performance (as shown in Tables S1-S8). Cox models, on the other hand, can provide this domain knowledge despite the limitations described before. As interpretability is becoming a significant factor in the development of predictive modelling in healthcare [72], it should be considered in future research in the area. A recent example in the field is [73], which highlighted how explainable ML can generate explicit knowledge of how models make their predictions, and that ML-based models can perform at least as well as classical Cox models, and in some cases even better.
While there is a large number of papers in the literature comparing the predictive performance of ML models, the strength of this study lies in the breadth of techniques taken into consideration, such as over-/under-sampling, cost-sensitive learning, data augmentation, and probability calibration, and in the methodology adopted, which guarantees a fair comparison between the models. None of the developed models was affected by overfitting, as is evident from the limited difference in performance between training and testing results.
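As an illustration of the kind of domain knowledge tree-based models can expose, scikit-learn random forests report normalized impurity-based feature importances, which can be ranked and inspected by clinicians. The sketch below uses synthetic data and hypothetical feature names, not the actual study variables:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names standing in for cohort variables
names = ["age", "grip_strength", "gait_speed", "bmi", "daily_steps"]
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
# Importances are normalized to sum to 1; sort to rank the features
ranking = sorted(zip(names, rf.feature_importances_),
                 key=lambda t: t[1], reverse=True)
```

Model-agnostic alternatives such as permutation importance or SHAP values provide similar rankings for black-box models, at additional computational cost.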
Even though past studies have compared different approaches within the same technique (for example, several data augmentation approaches [64], or different over-/under-sampling techniques [67]), very few studies have investigated the impact that combinations of these techniques (e.g., oversampling with probability calibration and data augmentation) could have on the model's performance. However, despite the numerous over-/under-sampling techniques adopted in this study, they are still a fraction of the huge number of sampling variants currently proposed in the literature [67], and further work will be required to include those methodologies as well.
Moreover, while a number of recent studies have compared ML models for mortality prediction [74][75][76], those works only considered data from acute patients in ICUs, not the more complex scenario of mortality prediction in the general population. Nor did they investigate the impact that several techniques for handling imbalanced datasets had on the overall results. This study therefore provides an empirical baseline for the field and indicates possible directions for future research on ML applied to mortality prediction in the general population, such as the generation of realistic synthetic EHR patient data, improvements in model robustness for effective adoption into clinical practice, model interpretability, and the development of more complex models (e.g., ensembles) able to outperform expert-based models.
Finally, another limitation of this study is that these findings are derived from a single study dataset; replication in other settings (e.g., ICUs) is therefore required to further demonstrate the generalizability of the findings. Likewise, the dataset was collected from older adults in Sweden, so it is unclear how well these results generalize to other populations worldwide.

Conclusions
Ageing is a global phenomenon of much relevance to our rapidly changing society, and all-cause mortality prediction represents an important tool for essential healthcare resource allocation and risk management. In this paper, results on the development of ML models for all-cause mortality prediction in a cohort of healthy older adults were reported. The analysis is based on features covering anthropometric variables, physical and lab examinations, questionnaires, and lifestyles, as well as wearable data collected in free-living settings, which are not generally included in survival analysis. Several techniques, such as feature engineering, feature selection, data augmentation, over-/under-sampling, probability calibration, and a cost-sensitive approach, were discussed and investigated for this purpose. The models were also compared to a Cox epidemiological model as a reference, and considerations were drawn on the performance of the data augmentation, resampling, and modelling techniques investigated. It was demonstrated that ML models can provide results comparable to standard epidemiological models while being completely data-driven and disease-agnostic; however, further testing is required to significantly improve model performance and robustness. Future steps include the investigation of ensemble models that could tackle the highly imbalanced, small-sample-size nature of the present dataset, as well as the application to disease-specific sub-cohorts.

Data Availability Statement:
The datasets analyzed during the current study are not publicly available due to ethical and national regulations, but are available from the corresponding author on reasonable request.

Conflicts of Interest:
The authors have no competing interests to declare.