A Systematic Review of Machine Learning Techniques in Hematopoietic Stem Cell Transplantation (HSCT)

Machine learning (ML) techniques are now widely used in healthcare for the diagnosis, prognosis, and treatment of disease. These techniques also have applications in hematopoietic cell transplantation (HCT), a potentially curative therapy for hematological malignancies. Herein, we conducted a systematic review of the application of ML techniques in the HCT setting, examining the types of data streams included, the specific ML techniques used, and the clinical outcomes measured. A systematic review of English-language articles was performed using the PubMed, Scopus, Web of Science, and IEEE Xplore databases. Search terms included “hematopoietic cell transplantation (HCT),” “autologous HCT,” “allogeneic HCT,” “machine learning,” and “artificial intelligence.” Only full-text studies reported between January 2015 and July 2020 were included. Data were extracted by two authors using predefined data fields. Following PRISMA guidelines, a total of 242 studies were identified, of which 27 met the inclusion criteria. These studies were sub-categorized into three broad topics, and the ML techniques used included ensemble learning (63%), regression (44%), Bayesian learning (30%), and support vector machines (30%). The majority of studies examined models to predict HCT outcomes (e.g., survival, relapse, graft-versus-host disease). Clinical and genetic data were the most commonly used predictors in the modeling process. Overall, this review provides a systematic summary of ML techniques applied in the context of HCT; however, the evidence is not yet sufficiently robust to determine the optimal ML technique for the HCT setting or the minimal set of data variables required.

Classification and Regression Tree (CART) [1]: A decision tree learning technique that produces a classification or a regression tree depending on the type of dependent variable (categorical or numerical).
Alternating Decision Tree (ADT) [2]: A generalized version of decision trees that uses boosting and generates smaller, more interpretable rules.
Naïve Bayes (NB) [3]: A supervised ML technique based on Bayes' theorem that assumes independence between the features.
Bayesian Network (BN) [4]: A probabilistic graphical model that uses Bayesian inference.
Random Survival Forest (RSF) [5]: An ensemble learning technique applicable to survival data; an extension of random forest.
Random Forest (RF) [6]: An ensemble learning technique that fits multiple trees on random samples of the input data and predicts the class from the combined predictions.
Adaptive Boosting (AdaBoost) [7]: A supervised ML technique based on boosting that combines a set of weak classifiers into a strong one for classification.
Boosted Regression Trees (BRT) [8] / Boosted Decision Trees (BDT) [9]: An ensemble learning technique that combines regression trees with boosting, building and combining multiple fits to improve performance.
Gradient Boosting Machine (GBM) [10]: An ensemble learning technique that uses gradient-based boosting to convert weak learners into strong ones.
Super Learner (SL) [11]: An ensemble learning technique that combines the predictions of multiple ML techniques using cross-validation and produces a weighted average of those model predictions.
Stacked Learning [12]: An ensemble technique that builds a first level of predictions from base ML techniques and then uses those predictions to predict the outcome.
Bayesian Additive Regression Tree (BART) [13]: An ensemble learning technique that sums the contributions of weak learners.
Ensemble Learning [14]: A family of ML techniques in which multiple models are combined to predict a given outcome.
Artificial Neural Network (ANN) [15]: A supervised ML technique that mimics the structure of the human brain and consists of three layers (input, hidden, and output) for processing.
Multilayer Perceptron (MLP) [16]: A supervised ML technique based on feedforward neural networks.
Support Vector Machine (SVM) [17]: A supervised ML technique that transforms the finite-dimensional input space into a higher-dimensional space via linear or non-linear transformations and separates classes with a hyperplane; it can be used for classification or regression.
Multivariate Adaptive Regression Spline (MARS) [18]: A flexible adaptive regression technique that automatically captures nonlinearity in the data and is applicable to high-dimensional data.
Ridge Regression (RR) [19]: A regression technique used for multivariate regression when there is multicollinearity among the variables.
Elastic Net Regression (ENR) [20]: A regularized regression method that combines the LASSO and ridge penalties.
Logistic Regression [21]: A type of regression that determines the probability (odds) of an outcome based on a combination of predictors.
k-Nearest Neighbor (k-NN) [23]: A supervised ML technique that labels a given input data point based on the labels of its k nearest neighbors.
Linear Discriminant Analysis (LDA) [24]: A dimensionality reduction technique that determines a linear combination of features that maximizes class separation.
Shrinkage Discriminant Analysis (SDA) [25]: A dimensionality reduction technique that adds a shrinkage parameter to linear discriminant analysis to make it applicable in high dimensions.
k-Means [26]: An unsupervised ML technique that clusters a given data set using distance metrics.
Reinforcement Learning [27]: A type of ML technique that produces a sequence of decisions and continually learns from prior decisions to maximize a reward.
Decision Trees (DT) [28]: A supervised ML technique that extracts a set of rules to predict the labels for a given input.

1 The total count of techniques is not equal to the total number of reviewed studies (27) because multiple ML techniques were used in each study.
2 Percentages are calculated as the number of studies falling under each broad category divided by the total number of studies (27).
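As a concrete illustration of one of the simpler techniques in the glossary above, the k-nearest neighbor rule can be sketched in a few lines of Python. This is a minimal sketch, not an implementation from any reviewed study; the feature values and outcome labels below are entirely synthetic and hypothetical, chosen only to show the majority-vote mechanism.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    """Label `query` by majority vote among the k nearest training
    points under Euclidean distance, as described for k-NN above."""
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    top_k = [label for _, label in dists[:k]]
    return Counter(top_k).most_common(1)[0][0]

# Synthetic toy data: two hypothetical (normalized) clinical features
# and a binary outcome label. Not drawn from any real cohort.
X = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
y = ["no-relapse", "no-relapse", "relapse", "relapse"]

print(knn_predict(X, y, (0.15, 0.15), k=3))  # -> no-relapse
```

The query point lies next to the two "no-relapse" training points, so two of its three nearest neighbors carry that label and the vote returns it.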