The Role of Genetic Factors in Characterizing Extra-Intestinal Manifestations in Crohn’s Disease Patients: Are Bayesian Machine Learning Methods Improving Outcome Predictions?

(1) Background: The high heterogeneity of inflammatory bowel disease (IBD) makes the study of this condition challenging. In subjects affected by Crohn’s disease (CD), extra-intestinal manifestations (EIMs) have a remarkable potential impact on health status. Increasing numbers of patient characteristics and the small size of analyzed samples make EIMs prediction very difficult. Under such constraints, Bayesian machine learning techniques (BMLTs) have been proposed as a robust alternative to classical models for outcome prediction. This study aims to determine whether BMLT could improve EIM prediction and statistical support for the decision-making process of clinicians. (2) Methods: Three of the most popular BMLTs were employed in this study: Naϊve Bayes (NB), Bayesian Network (BN) and Bayesian Additive Regression Trees (BART). They were applied to a retrospective observational Italian study of IBD genetics. (3) Results: The performance of the model is strongly affected by the features of the dataset, and BMLTs poorly classify EIM appearance. (4) Conclusions: This study shows that BMLTs perform worse than expected in classifying the presence of EIMs compared to classical statistical tools in a context where mixed genetic and clinical data are available but relevant data are also missing, as often occurs in clinical practice.


Introduction
Inflammatory bowel disease (IBD) can have various clinical presentations, making the classification of patients very challenging. In subjects affected by Crohn's disease (CD), extraintestinal manifestations (EIMs) significantly impact health status. Several European studies report EIMs are present in 20-40% of Crohn's disease (CD) patients, are more common in females than males and are associated with a duration of disease >10 years (48.9% vs. 29.9% in patients with diseases <10 years) [1][2][3]. Joint manifestations (peripheral or axial arthropathies) are the most common EIMs in IBD and occur in 20-30% of patients, with symptoms ranging from noninflammatory arthralgia to acute arthritis with painful swollen joints, usually associated with active IBD [4]. Articular manifestations may be more frequent in CD patients with stenosing/penetrating disease [4].

Introduction
Inflammatory bowel disease (IBD) can have various clinical presentations, making the classification of patients very challenging. In subjects affected by Crohn's disease (CD), extra-intestinal manifestations (EIMs) significantly impact health status. Several European studies report EIMs are present in 20-40% of Crohn's disease (CD) patients, are more common in females than males and are associated with a duration of disease >10 years (48.9% vs. 29.9% in patients with diseases <10 years) [1][2][3]. Joint manifestations (peripheral or axial arthropathies) are the most common EIMs in IBD and occur in 20-30% of patients, with symptoms ranging from noninflammatory arthralgia to acute arthritis with painful swollen joints, usually associated with active IBD [4]. Articular manifestations 2 of 14 may be more frequent in CD patients with stenosing/penetrating disease [4]. Erythema nodosum, pyoderma gangrenosum, and aphthous stomatitis are the most common cutaneous manifestations in IBD. Prevalence rates are 5-15% in CD patients with female predominance, and they parallel disease activity in up to 92% of episodes and may recur in approximately 20-30% of patients [5]. The most common ocular manifestations include uveitis and episcleritis. Their prevalence is between 3% and 6% in CD, and they are associated with disease activity in up to 78% of episodes, with recurrent episodes in approximately 30% of patients [5]. Ocular manifestations may frequently occur together with other (joint or cutaneous) EIMs [6]. Primary sclerosing cholangitis (PSC) may be present in 0.7-2% patients with CD; however, if small-duct PSC is included, the overall prevalence ranges from 2.4 to 11% [4,7].
Many strategies have been proposed to appraise the role played by genetic factors in the risk of EIMs occurrence [8]. Knowledge of the risk of EIMs associated with a patient's characteristics is important for better planning and more tailored treatment strategies [9]. Nevertheless, the relationship between the genetic characteristics of the patient and the occurrence of EIMs remains unclear [10].
Predicting the occurrence of EIMs on the basis of a patient's characteristics is a challenging task for two main reasons: there is an increasing amount of information on subjects' characteristics and a limited number of patients for whom data are available [11][12][13]. The high complexity of EIM prediction requires more powerful and complex statistical methods that provide valuable information from a clinical point of view. Indeed, traditional statistical methods, such as simple/multivariate logistic regression, may not represent the best solutions to characterize such a complex risk structure, which may often be characterized by nonlinear relationships between outcome and predictors and interactions among covariates.
Traditional Machine Learning Techniques (MLTs) have been promoted as a promising approach for modeling the role of genetic factors in EIM prediction [14]. The integration of the Bayesian frameworks in the MLTs field has been recently proposed and the use of Bayesian machine learning techniques (BMLTs) is rapidly becoming popular in the medical setting. The ability of these methods to embrace the modeling flexibility of MLTs [15][16][17] with the advantages of Bayesian inference makes them potentially robust in situations in which standard methods may fail, such as handling missing data, robustness to overfitting in presence of small sample size and embedding of external information [18,19].
Some of the most used BMLTs in the medical field are Na  29.9% in patients with diseases <10 t manifestations (peripheral or axial arthropathies) are the most common EIMs in n 20-30% of patients, with symptoms ranging from noninflammatory arthralgia to with painful swollen joints, usually associated with active IBD [4]. Articular may be more frequent in CD patients with stenosing/penetrating disease [4].
The goal of this study is to understand whether the prediction of EIMs using patients' clinical and genetic characteristics is enhanced by using novel and existing BMLT approaches. Moreover, we aim to compare the predictive performances of these methods to provide suggestions and support in terms of implementation of modeling strategies.

Material and Methods
The dataset employed in this study is the same set used in Giachino et al. [14]. It refers to an observational study on the role of genetic factors for IBD. Information on subjects with CD and ulcerative colitis (UC) was retrieved in cooperation with three gastroenterology units in Torino, Italy. Overall, 152 subjects were enrolled in the study. Clinical and familial information on patients was acquired and labeled following the Vienna classification employed at the time of the study. EIMs were defined as manifestations of rheumatological, dermatological, ocular, liver, biliary, and amyloidosis symptoms. Patient characteristics that could potentially affect EIM occurrence were retrieved and divided into two main groups: (i) characteristics of the disease and factors that are known to be associated with the clinical endpoint-age at onset (age), location (location), behavior (behavior), presentation of the disease (onset), gender (gender), smoker status (smoker), family history (history); and (ii) genetic polymorphisms of the following genes: NOD2, CD14, TNF, IL12B and IL1RN.
According to the Vienna classification, patients are defined as A1 if the diagnosis is made before 40 years of age and A2 if it was made after 40 years of age. Regarding location, L1 corresponds to disease located in the terminal ileum and possibly involving the caecum, L2 corresponds to colonic disease, L3 to ileocolonic disease, and L4 to CD involving the upper gastrointestinal tract (irrespective of the other locations of the disease). Regarding behavior, B1 corresponds to a non-structured and nonpenetrating disease, B2 to a structured disease, and B3 to a penetrating disease.
Presentation of the disease (onset) refers to medical when a medical diagnosis related to signs and symptoms that provided a suggestion of Crohn's disease to a medical doctor was made without previous bowel surgery. Surgical refers to cases when the first event of CD was surgery; i.e., a patient was admitted to a hospital for a surgical indication, and a diagnosis of Crohn's disease was made during the surgical operation or was based on histological analysis of the removed bowel. Family history refers to having a relative with CD (the strongest risk factor for CD, i.e., first-degree relatives of patients with CD have a 12-to 15-fold greater risk of developing CD than do people of comparable age in the general population).

BMLTs
NB, BN, and BART are marked by high flexibility and good prediction ability, which make them very popular among BMLTs and widely applied in the medical field [26].
The NB classifier can be represented as a simple Directed Acyclic Graph (DAG), and is very easy to build. In graphical notation, each variable is represented by a node, and the relationships between variables are described by arcs with edges from node to node. The simplicity of NB lies in its naïve assumptions, i.e., the outcome variable is considered the parent node of all other nodes, which are also called child nodes, and there is no connection between child nodes [27]. The fixed structure of this graph makes its implementation very simple. Once the structure of the graph is built, classification is carried out using Bayes' theorem by computing the posterior predictive distribution of the probability of observing the outcome of interest given the posterior distributions of the parameters. In the literature, the NB classifier has been shown to perform well [28], especially with small sample size datasets [29].
BNs are DAGs [30] that model the joint probability distribution of a set of variables. The graphical structure of a BN can be implemented by imposing the relationships between nodes with expert opinions on the phenomenon under study or by defining the connections between the variables with learning algorithms. Bayes' theorem is then used to compute the conditional probabilities of the parameters of the models given the values of the variables, which quantify the relationships among nodes connected. Like NBs, BNs provide information on the probability that an event of interest will occur, given the values of the nodes to which the outcome node is connected. From a clinical point of view, this can be very helpful for a researcher to profile patients given the predicted event risks associated with several characteristics.
BART was introduced by Chipman et al. (2010) [31]. It belongs to the popular family of "ensemble-of-trees" methods, such as Bagging, Random Forest, and Boosting. BART is essentially a sum-of-trees model: it describes the relationship between an outcome of interest and a set of covariates as a sum of many regression or classification trees plus a random component. The contribution of each tree to the total sum is weakened by imposing regularizing prior distributions on the parameters that control the sum-of-trees. Regularizing priors are then able to prevent the contribution of a single tree from dominating the total sum, avoiding the problem of overfitting. BART is then able to provide information on the probability of occurrence of an event of interest given subject characteristics by flexibly relating the clinical endpoint to the potential explanatory variables.
In summary, NB and BN are both graphical models. The former imposes a fixed structure on the network and, despite its strong unrealistic assumptions, it has performed very well in several situations [32]. The latter learns the structure of the network using several algorithms or by relying on expert opinions, which makes its classification ability highly dependent on the learning procedure. Several studies compared NB and BN, showing that both techniques have similar performance [33][34][35]. Moreover, many studies compared the performance of NB and BN with other classical MLTs, showing that they have similar predictive performances with respect to the other classical methods [36][37][38][39][40]. BART belongs to the family of ensemble-of-trees models, and its regularizing prior distributions and ability to put together weak classifiers avoid overfitting. It is a good alternative to other more popular MLTs, such as lasso regression, random forest, and neural networks [31].

Statistical Analysis
Patients were classified according to the presence or absence of EIMs if the predicted probability of the event was higher than 0.5.
Missing data were inferred with the same strategies used in the study of Giachino et al. [14]: the median of available sample values for continuous variables and an additional level indicating the presence of missingness for categorical variables.
BMLTs were compared with the statistical approaches employed in the study of Giachino et al. [14], i.e., logistic regression (LR), generalized additive model (GAM), projection pursuit regression (PPR), linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and artificial neural networks (ANN). Statistical analysis was implemented following the same model validation approach adopted in the study mentioned above to perform a fair comparison. Parameters of the models were chosen using repeated k-fold cross-validation, setting the number of folds equal to 10 and repeating the process 10 times. Regarding BART models, the Gibbs sampler was used to draw from the posterior distribution using 2000 draws and discarding the first 500 as "burn-in" draws. Convergence of the posterior distribution was assessed by inspecting the acceptance percentage of MCMC proposals across the trees, the average number of leaves, and tree depth at each MCMC iteration after the "burn-in" draws. All of the patients in the sample were considered in the training set. Since no external data for testing the models were available, model validation was carried out using 1000 bootstrapped samples of the original dataset [41], a procedure that has been shown to give unbiased estimates of the error rate [42]. The ability to correctly classify the presence or absence of EIMs was evaluated by comparing several indicators: Somers' Dxy (Somers' D), positive and negative predicted values of model predictions (PPV and NPV, respectively), overall misclassification error (MCR, which corresponds to the percentage of observations wrongly classified) and the area under the ROC curve (AUC). Moreover, the sensitivity and specificity of BMLTs were also assessed to provide more information on their predictive ability. We did not use the discrimination index-defined as the likelihood ratio divided by the sample size [42]-to compare the techniques as in Giachino et al. [14]. Rather, following Gelman and Rubin [43], we preferred to explore their posterior distributions by means of their ability to correctly classify patients according to the indicators mentioned above.
For each model, two analyses were performed: one including only the first group of potential explanatory variables as covariates, and one also including genetic variables. Such a strategy was conducted to evaluate the impact that genetic factors have on the overall ability of the models to predict the presence or absence of EIMs.
Regarding the analysis with BN, two elucidations needed to be done. First, several algorithms can be used to learn the structure of the network. The choice of the algorithm is nontrivial, and currently, no rigorous method covers this issue. Since the definition of a procedure for choosing the best algorithm is beyond the scope of this study, the network that achieved the lowest MCR was selected. Second, to implement a more robust model, the structure of the network was chosen with a model averaging approach for each algorithm [44]. The structural learning process was repeated on 1000 bootstrap replicates of the original sample, and the structure of the network was created according to the approach in Scutari and Nagarajan (2013) [45].
All analyses were performed using version 3.4.3 of R software [46] on a HP ProDesk 490 G3 MT Business PC with a Inter(R) Core(TM) i7-6700 CPU @ 3.40GHz processor. NB and BN were implemented using the "bnlearn" package [47], while BART was implemented using the "bartMachine" package [48]. Default prior distributions of the "bnlearn" package were used for NBs and BNs. The data that support the findings of this study are available from the corresponding author upon request. The R code to simulate the data and reproduce the analysis is available on Github (https://github.com/UBESP-DCTV/ibd-bmlt).

Results
The distributions of the patient's characteristics by absence/presence of EIMs are reported in Table 1. Overall, 75 subjects (nearly 49%) had EIMs, whereas 77 patients (nearly 51%) did not have EIMs. Onset, behavior, location, age, family history, and polymorphism of NOD2 and CD14 showed similar distributions across patients with and without EIMs, whereas the distributions of the other variables were more unbalanced. MCRs of the BNs learned with different algorithms are reported in Table 2. All the available algorithms in the "bnlearn" package were used to learn the structure of the network. Among the models without genetic variables, the network learned with the SI-HITON-PC algorithm showed the lowest MCR, whereas the poorest performance was observed for the networks learned with the IAMB, Fast-IAMB, and TS algorithms. Among the models with the genetic variables, the best performances were observed for the networks learned with the TS and the HC algorithms, whereas the network learned with the IAMB algorithm showed the highest MCR. Table 2. MCR of the networks learned with different algorithms. The selected algorithms were the ones implemented in the "bnlearn" R package. Performances for networks with and without genetic variables are reported. Performance indexes are reported in Table 3. MCRs of LR, GAM, PPR, LDA, QDA, and ANN are taken from the study of Giachino et al. [14].  When considering genetic variables, NB slightly improved the performance, with higher indicator values compared to those achieved without genetic variables. With respect to the techniques in Giachino et al. [14], it showed a similar MCR value (MCR = 0.33), lower PPV, AUC and Somer's D values (PPV = 0.69, AUC = 0.75, Somer's D = 0.51) and a higher NPV value (NPV = 0.66). Sensitivity and specificity were equal to 0.65 and 0.69, respectively. Among the BNs, the network learned with Tabu-Search (TS) and Hill-Climbing (HC) algorithms showed the same structure achieving the lowest MCRs. For comparison purposes, the network learned by TS was selected. Predictive performance was not satisfactory: despite higher values with respect to the BN without genetic variables, BN performance metrics were still lower than those reported in Giachino et al. [14] (PPV = 0.68, NPV = 0.65, AUC = 0.67, Somer's D = 0.33); however, MCR showed a similar value (0.34). A sensitivity and a specificity of 0.64 and 0.69, respectively, were observed. Like NB, BART improved its performances when also considering genetic factors. With respect to the previous results in Giachino et al. [14], it showed similar MCR, AUC and Somer's D values, a lower PPV value and a higher NPV value (MCR = 0.32, sensitivity = 0.66, specificity = 0.69, PPV = 0.67, NPV = 0.69, AUC = 0.78, Somer's D = 0.56). Convergence of the posterior distribution was reached for both BART models: the acceptance percentage of MCMC proposals across the trees, the average number of leaves and tree depth at each MCMC iteration after the "burn-in" period showed stationary processes.
The analyses with NB and BN took less than 2 min. The analysis carried out with BART was computationally expensive and took approximately 7 h.

Discussion
In the current study, BMLTs did not show any major improvements in classification accuracy compared to the methods reported in Giachino et al. [14]. In many scenarios, especially when genetic information was also considered, they performed worse. Models with genetic variables showed better performance metrics than models without genetic information. Nevertheless, the improvement did not seem as strong as that shown in Giachino et al. [14], with the exception of BN.
Very low performances of BN without genetic factors were observed because the network was not able to associate the node of the clinical endpoint (i.e., the appearance of EIMs) to any other node. Indeed, since the outcome node was not present in the structure of the network, the BN was not able to discriminate between the presence and absence of EIMs, and it classified all of the individuals with a presence of EIMs (EIMs+). By adding genetic factors to the BN, the outcome was included in the network (Figure 1), which allowed it to achieve a higher predictive performance.
percentage of MCMC proposals across the trees, the average number of leaves and tree depth at each MCMC iteration after the "burn-in" period showed stationary processes.
The analyses with NB and BN took less than 2 min. The analysis carried out with BART was computationally expensive and took approximately 7 h.

Discussion
In the current study, BMLTs did not show any major improvements in classification accuracy compared to the methods reported in Giachino et al. [14]. In many scenarios, especially when genetic information was also considered, they performed worse. Models with genetic variables showed better performance metrics than models without genetic information. Nevertheless, the improvement did not seem as strong as that shown in Giachino et al. [14], with the exception of BN.
Very low performances of BN without genetic factors were observed because the network was not able to associate the node of the clinical endpoint (i.e., the appearance of EIMs) to any other node. Indeed, since the outcome node was not present in the structure of the network, the BN was not able to discriminate between the presence and absence of EIMs, and it classified all of the individuals with a presence of EIMs (EIMs+). By adding genetic factors to the BN, the outcome was included in the network (Figure 1), which allowed it to achieve a higher predictive performance. Even though the BN structure with genetic factors was much more articulated and complex than the BN structure without genetic factors, only a few connections between the variables were found in both networks. The structure of the networks learned with all of the other algorithms was further investigated for both scenarios with and without genetic factors. Regarding BN without genetic factors, most of the networks showed an identical structure (the same structure of the network used in the analysis). Even with a more complicated structure, the node of the clinical endpoint was not Even though the BN structure with genetic factors was much more articulated and complex than the BN structure without genetic factors, only a few connections between the variables were found in both networks. The structure of the networks learned with all of the other algorithms was further investigated for both scenarios with and without genetic factors. Regarding BN without genetic factors, most of the networks showed an identical structure (the same structure of the network used in the analysis). Even with a more complicated structure, the node of the clinical endpoint was not included in the structure. Adding genetic factors, only networks learned with TS and HC algorithms showed some degree of complexity in their structures, which also included the outcome node.
The predictive performances of BMLTs were investigated by looking at the individual predicted probabilities of the presence of EIMs assigned by each model averaged across every bootstrap sample (Table 4). Two interesting findings were observed. First, all BMLTs predicted individual probabilities of the presence of EIMs less than 0.5 from the first patient to the 72nd patient. From the 73rd patient to the last patient, all BMLTs assigned individual probabilities over 0.5. This was probably because after the 73rd patient, all of the information for the genes IL12B, TNFA-308, TNFA-238, and IL1RN was systematically missing, indicating that this portion of missing values could be considered as missing not at random (MNAR). This missing pattern likely affected the performance of the models: patients with missing information for the genes IL12B, TNFA-308, TNFA-238, and IL1RN were classified as having EIMs, since a probability higher than 0.5 was assigned to them, whereas the opposite predictive pattern was observed for patients without missing information for the genes. Indeed, looking at Figure 2, IL12B and TNFA-238, two of the four variables that were MNAR, played a crucial role in the BN structure and therefore in predicting the presence of EIMs. Figure 2 displays BART variable importance, indicating the proportion of times a variable is considered in the definition of the splitting rule on each of the trees constructed by the model. The "Missing" category of gene TNFA-238 played an important role in the definition of the model.
From a clinical perspective, the relationship between TNFA-308 and gender depicted in Figure 2 raises some issues since it is somewhat counterintuitive. The unexpected link between TNF-α single nucleotide polymorphism (-308G>A, rs1800629) and gender may be somehow explained by the increased prevalence of association between ADA (22G>A) and TNF-α (-308G>A) polymorphisms in Italian males investigated by Napolioni and Predazzi [49].   Table 1. On the y-axis, the percentage of times for each variable was used to determine a splitting rule.

Study Limitations
Systematic missing entries for the IL12B, TNFA-308, TNFA-238, and IL1RN genes in some patients highly affected the performance of the models.
Moreover, the predictive posterior risk of EIMs was dichotomized to identify the presence of EIMs (presence of EIMs if the risk was higher than 0.5) as in the study from Giachino et al. [14] to make the comparison as fair as possible. As noted in many studies [50][51][52][53], the dichotomization of continuous variables results in a loss of information. The potential gains obtained by exploiting the full posterior predictive probability of EIMs will be explored in future research studies.
Finally, the procedure adopted to choose which BN to use during the analysis was quite simplistic, although it represents a default approach in terms of BN tuning. Robust methods that can define the best algorithm in terms of structure learning should be investigated.

Final Remarks
This study shows that emerging BMLTs do not provide a major improvement in correctly classifying the presence of EIMs compared to the classical statistical tools. When genetic variables are considered in the models, they show even lower performances with respect to the classical methods employed in the study from Giachino et al. [14]. The limited sample size of the datasets and absence  Table 1. On the y-axis, the percentage of times for each variable was used to determine a splitting rule.

Study Limitations
Systematic missing entries for the IL12B, TNFA-308, TNFA-238, and IL1RN genes in some patients highly affected the performance of the models.
Moreover, the predictive posterior risk of EIMs was dichotomized to identify the presence of EIMs (presence of EIMs if the risk was higher than 0.5) as in the study from Giachino et al. [14] to make the comparison as fair as possible. As noted in many studies [50][51][52][53], the dichotomization of continuous variables results in a loss of information. The potential gains obtained by exploiting the full posterior predictive probability of EIMs will be explored in future research studies.
Finally, the procedure adopted to choose which BN to use during the analysis was quite simplistic, although it represents a default approach in terms of BN tuning. Robust methods that can define the best algorithm in terms of structure learning should be investigated.

Final Remarks
This study shows that emerging BMLTs do not provide a major improvement in correctly classifying the presence of EIMs compared to the classical statistical tools. When genetic variables are considered in the models, they show even lower performances with respect to the classical methods employed in the study from Giachino et al. [14]. The limited sample size of the datasets and absence of an external source of data most likely limited the validation process and the predictive ability of the models. Nevertheless, BMLTs were expected to improve risk prediction in situations where the available amount of data and information is not optimal to build predictive models, as in our study. Our findings did not support the expectations, and some issues may arise around preference for BMLTs over other classical methods.