2.1. Bayesian Rule Learning
A classifier is learned from historical gene expression data to explain disease states. The variable of interest that is predicted is called the target variable (or simply the target), and the variables used for prediction are called the predictor variables (or simply features).
Rule-based classifiers are a class of easily comprehensible supervised machine learning models that explain the distribution of the target in the observed data using a set of IF-THEN rules described using predictor variables. The ‘IF’ part of the rule specifies a condition, also known as the rule antecedent, which, if met, fires the ‘THEN’ part of the rule, known as the rule consequent. The rule consequent makes a decision on the class label, given the value assignments of the predictor variables met by the rule antecedent. A set of rules is called a rule base, which is a type of knowledge base. The C4.5 algorithm learns a decision tree, where each path in the decision tree (from the root of the tree to a leaf) can be interpreted as a rule. Here, the variables selected along the path compose the rule antecedent as a conjunction of predictor variables and value assignments to those variables. We infer the rule consequent from the distribution over the target of the instances that match the rule antecedent.
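The path-to-rule translation described above can be sketched as follows. This is a minimal illustration with a hypothetical toy tree and made-up class counts, not output from C4.5 or BRL.

```python
# Hedged sketch: turning decision-tree paths into IF-THEN rules.
# An internal node is ("variable", {value: subtree}); a leaf is a dict
# of class -> instance count (hypothetical numbers for illustration).
tree = ("GeneA", {
    "UP":   ("GeneB", {
        "UP":   {"disease": 50, "normal": 5},
        "DOWN": {"disease": 3,  "normal": 40},
    }),
    "DOWN": {"disease": 2, "normal": 30},
})

def paths_to_rules(node, antecedent=()):
    """Enumerate (antecedent, majority_class, counts) for every leaf."""
    if isinstance(node, dict):                      # reached a leaf
        majority = max(node, key=node.get)
        return [(antecedent, majority, node)]
    var, branches = node
    rules = []
    for value, child in branches.items():
        rules.extend(paths_to_rules(child, antecedent + ((var, value),)))
    return rules

rules = paths_to_rules(tree)
for conds, cls, counts in rules:
    cond_str = " AND ".join(f"{v} = {val}" for v, val in conds)
    print(f"IF {cond_str} THEN class = {cls}  {counts}")
```

Each root-to-leaf path yields one rule, so a tree with three leaves yields a rule base of three rules.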
Bayesian Rule Learning (BRL) infers a rule base from a learned Bayesian network (BN). A BN is a probabilistic graphical model with two components: a graphical structure and a set of probability parameters [12]. The graphical structure is a directed acyclic graph, in which nodes represent variables, and variables are related to each other by directed arcs that do not form any directed cycles. When there is a directed arc from node A to node B, node B is said to be the child node, and node A is said to be the parent node. A probability distribution is associated with each node, X, in the graphical structure, given the state of its parent nodes, where a state is one of the possible discrete value assignments of the parents of node X. This probability distribution is generally called a conditional probability distribution (CPD). For discrete-valued random variables, the CPD can be represented in the form of a table called a conditional probability table (CPT). Furthermore, any CPT can be represented as a rule base. Here, we consider only the CPT for the target variable. Each possible value assignment of the parents represents a different rule in the rule base. The evidence, in the form of the distribution of instances over each target value in the training data, helps infer the rule consequent. The resulting rule base consists of rules that are mutually exclusive and exhaustive. In other words, every instance is matched by at least one rule from the rule base, and by only that one rule.
We learn a BN from a training dataset using a heuristic search over the decision trees that represent the CPT described above. We evaluate how likely the learned BN is to have generated the observed data using a Bayesian score (the K2 metric [13]). We demonstrated this process in our previous work [7].
Decision trees are popular compact representations of the CPT of a node in a BN. Most of the BN literature is dedicated to learning global independence constraints in the domain. Global constraints capture only which variables are parents of a node in the graphical representation. The number of parameters needed to describe the CPT is the number of joint value assignments of the node's parent variables, so the size of this CPT grows exponentially with the number of parents of the node.
As an example, consider a node representing a disease state. Let there be 10 genes (henceforth, when we mention a gene as a variable, we are referring to its expression) that lead to the change of disease state. Let each gene take two discrete values (UP: upregulated; DOWN: downregulated). This requires 2^10 = 1024 parameters to be represented in the CPT. Consequently, our rule base has 1024 rules, one for each value assignment of the parent variables. Biomedical datasets, especially gene expression datasets, rarely have enough training instances to provide sufficient evidence for class inference from the 1024 rules in our example scenario. It is therefore important to derive a more efficient representation of the CPT.
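The combinatorial growth can be checked directly. This sketch simply enumerates the joint parent-value assignments, each of which corresponds to one row of the full CPT and hence one rule; the gene names are hypothetical placeholders.

```python
from itertools import product

# Hedged sketch: a full CPT over the target needs one parameter row
# (and hence one rule) per joint value assignment of the parents.
values = ("UP", "DOWN")                         # binary discretization
parents = [f"Gene{i}" for i in range(1, 11)]    # 10 hypothetical genes

assignments = list(product(values, repeat=len(parents)))
n_rules = len(assignments)                      # 2 ** 10 = 1024
print(n_rules)
```

Adding one more binary parent doubles the rule count, which is why a more compact CPT representation is needed.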
2.1.1. Bayesian Rule Learning-Global Structure Search (BRL-GSS)
We constrain the model space to those models in which the predictor variables are direct parents of the target variable. BRL uses breadth-first marker propagation (BFMP) [14] for this algorithm, which provides a significant speed-up, since database look-up is an expensive operation. BFMP permits bi-directional look-up using vectors of pointers by linking a sample to its respective variable-values, and a variable-value to the samples that have it. This enables efficient generation of counts of matches for all possible specializations of a rule using these pointers.
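A minimal sketch of the bidirectional look-up idea follows. The toy dataset and dictionary-based index are our own illustration, not the cited BFMP implementation: each sample is linked to its variable-value pairs, each variable-value pair is linked back to its samples, and counting the matches of a specialized antecedent becomes a set intersection.

```python
# Toy dataset: sample id -> {variable: value} (hypothetical values).
data = {
    0: {"GeneA": "UP",   "GeneB": "UP"},
    1: {"GeneA": "UP",   "GeneB": "DOWN"},
    2: {"GeneA": "DOWN", "GeneB": "UP"},
    3: {"GeneA": "UP",   "GeneB": "UP"},
}

# Forward direction: `data` itself maps sample -> variable-values.
# Inverted index: (variable, value) -> set of samples that have it.
index = {}
for sample, assignment in data.items():
    for var, val in assignment.items():
        index.setdefault((var, val), set()).add(sample)

def count_matches(antecedent):
    """Count samples matching a conjunction of (variable, value) tests."""
    sets = [index.get(cond, set()) for cond in antecedent]
    return len(set.intersection(*sets)) if sets else len(data)

n = count_matches([("GeneA", "UP"), ("GeneB", "UP")])
```

Specializing a rule by one more conjunct only intersects one more pre-computed set, so candidate counts never require a fresh pass over the database.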
Figure 1 depicts a BN (Figure 1a), the corresponding global CPT representation using a decision tree (Figure 1b), and the rules corresponding to the decision tree (Figure 1c). The BN in Figure 1a contains one child variable, the binary target D, and two parent variables (the predictor variables), Gene A and Gene B. Each predictor variable, Gene A and Gene B, is binary: when the gene is upregulated, it takes the value UP; when the gene is downregulated, it takes the value DOWN. Figure 1b shows the CPT represented as a decision tree with global constraints. Since both of the predictor variables are binary, the decision tree has 2^2 = 4 parameters, each represented by a leaf in the decision tree. Each leaf of the decision tree is a parameter: the conditional probability distribution over the target, given the values assigned to the predictor variables along the path in the tree. This distribution for target D is shown in the leaf node. For example, given that Gene A takes value UP and Gene B takes value UP, the probability of D = true is 0.89 and the probability of D = false is 0.11. Figure 1c depicts the decision tree represented as a rule base. The rule antecedent (IF part) contains a conjunction of predictor variable assignments as shown in the path of the decision tree. The rule consequent is the conditional probability distribution over the target values (in square brackets), followed by the distribution of the instances from the training data, for each target value, that match the rule antecedent. In rule 1, we see the evidence to be (50, 5): there are 50 instances in the training dataset that match the rule antecedent and have value D = true, and only five instances that match the rule antecedent and have value D = false. These counts are then smoothed with a factor α, set to 1 as a default. This simplifies the posterior odds to the ratio (TP + 1)/(FP + 1), where TP is the number of true positives for the rule (where both the antecedent and consequent match the test instance) and FP is the number of false positives (the antecedent matches, but the consequent does not match the test instance).
During prediction, the class is determined simply by the higher conditional probability. In our example, since D = true has a probability of 0.89, the prediction for a test case that matches this rule antecedent is D = true. If there is a tie, by default, the value of the majority class is the prediction.
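The consequent inference and prediction step can be sketched numerically. The counts (50, 5) are Rule 1's evidence from the text and α = 1 is the stated default; the function name and dict encoding are our own illustration.

```python
# Hedged sketch: smooth a rule's class counts with factor alpha and
# predict the class with the higher conditional probability.
def rule_consequent(counts, alpha=1.0):
    """Return smoothed conditional probabilities and the predicted class."""
    total = sum(counts.values()) + alpha * len(counts)
    probs = {c: (n + alpha) / total for c, n in counts.items()}
    # Ties would fall back to the majority class (not modeled here).
    prediction = max(probs, key=probs.get)
    return probs, prediction

# Rule 1: 50 training instances with D = true, 5 with D = false.
probs, pred = rule_consequent({"true": 50, "false": 5})
```

With α = 1 this gives P(D = true) = 51/57 ≈ 0.89, matching the leaf probability in the running example, and the predicted class is D = true.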
We developed and tested two variants of global structure search using BRL in our previous work [7], distinguished by subscripts that indicate the number of BN models kept in memory during the best-first search. We concluded that one of these variants was statistically significantly better than the other and than C4.5 on Balanced Accuracy and RCI (relative classifier information). For this paper, we choose that version of the algorithm and rename it BRL-GSS, to be consistent with the nomenclature for the local structure search algorithm that we present in the next section. For a dataset with n variables and m instances, where each variable i takes some number of discrete values, the worst-case time-complexity depends on n, m, and the number of discrete values per variable. If we constrain the maximum number of discrete values that a variable can take (for example, assume all variables are binary-valued), then the time-complexity is reduced accordingly.
2.1.2. Bayesian Rule Learning-Local Structure Search (BRL-LSS)
We adapted the method developed in [15], which can be used for developing an entire global network based on local structure. In Figure 1a, we see the same BN with two parents as the one we saw in BRL-GSS. Figure 1d shows the local decision tree structure. In Figure 1b, we saw that the distribution of the target, when Gene A takes one particular value, is the same regardless of the value of Gene B. To be precise, the conditional distributions over the target in those two leaves are identical. The more general representation in Figure 1d merges the two redundant leaves into a single leaf. As a result, Figure 1e reduces the number of rules to three, down from four. Thus, Figure 1e provides a more parsimonious representation of the data when compared to Figure 1c.
Next, we describe our algorithmic implementation to learn local decision trees as seen in Figure 1d. At a high level, our algorithm initializes a model with a single variable (gene) node as the root. Each unique variable in the dataset can serve as the root of a distinct decision tree. A leaf in the initial model represents a specific value assignment of the root variable. By observing the classes of instances in the dataset that match this variable-value assignment, we infer the likely class of a matching test instance. To evaluate the overall model, we use the Bayesian score to obtain the likelihood that this model generated the observed data. The algorithm then iteratively explores further specialized models by adding other variables as nodes to one of the leaves of the decision tree. The model is then re-evaluated using the Bayesian score. The model space here is huge, so our algorithm adds some greedy constraints to reduce the space. In the following paragraphs, we specify how we constrain the search.
Algorithm 1 is the pseudocode of the local structure search module in BRL. This algorithm takes as input the data D and two parameters, maxConj and beamWidth, similar to BRL-GSS. We also used the heuristic of a maximum number of parents (maxConj) to prevent over-specialization, as well as to reduce the running time (the default is set to eight variables per path). The beamWidth parameter is the size of the priority queue (beam) that limits the number of BNs that the search algorithm stores in memory at a given step of the search. This beam sorts the BNs in decreasing order of their Bayesian score. Line 2 initializes this beam with singleton models. These BNs have a single child and a single parent. The child is fixed to be the target. The parent is set iteratively to each of the predictor variables in the training data D. During the search, this initial parent variable is set as the root node of the tree. A variable node is split in two ways: (1) binary split and (2) complete split. In a binary split, the variable's values are partitioned into two groups; if the variable has more than two discrete values, the binary split creates several different candidate local decision trees, one for each such partition. The complete split generates one path for each discrete value of the variable.
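The two split operators can be sketched as follows. For a variable with three hypothetical values {LOW, MED, HIGH}, a complete split gives each value its own branch, while a binary split partitions the values into two non-empty groups; the enumeration below is a generic illustration of these operators, not BRL's internal representation.

```python
from itertools import combinations

def complete_split(values):
    """One branch per discrete value of the variable."""
    return [(v,) for v in values]

def binary_splits(values):
    """All ways to partition the values into two non-empty groups."""
    splits = []
    n = len(values)
    for r in range(1, n // 2 + 1):
        for group in combinations(values, r):
            rest = tuple(v for v in values if v not in group)
            if r == n - r and group > rest:
                continue            # skip the mirror of an equal-size split
            splits.append((group, rest))
    return splits

vals = ("LOW", "MED", "HIGH")
branches = complete_split(vals)     # 3 branches, one per value
pairs = binary_splits(vals)         # 3 two-way partitions of 3 values
```

For d discrete values, the complete split yields d branches and there are 2^(d-1) - 1 distinct two-way partitions, so the number of candidate binary splits grows quickly with d.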
In line 3, the search algorithm specializes each model on the beam by adding a new parent variable as a candidate conjunct for each leaf in the decision tree. The best models from this specialization step are added to the final beam (line 6), which keeps track of the best models seen by the search algorithm so far. Line 7 checks to ensure that any candidate models for further specialization do not exceed the maxConj
limit for the number of parents of the target in the BN. The loop at line 8 iterates through each unexplored variable in D
for specialization. The loop in line 10 iterates through all the leaves of the local structure decision tree inferred from the BN. From lines 11 through 17, the algorithm performs a binary and complete split using the variable currently being explored at the specific leaf of the decision tree. It stores only the best model (as determined by the Bayesian Score) seen in this iteration.
Algorithm 1: Bayesian Local Structure Search
Lines 18 through 21 check whether the specialization process led to an improvement (a better Bayesian score) over the model it started with. If the score improves, the new model is queued for further specialization in the subsequent iterations of the search algorithm.
Finally, in line 23, the best model seen during the search so far is returned by the search algorithm. This best-first search algorithm uses a beam to search through a space of local structured CPTs of BNs. As shown in Figure 1e, BRL interprets this decision tree as a rule base.
The worst-case time-complexity of BRL-LSS remains the same as that of BRL-GSS. We achieve this with the same global constraint on the maximum number of parents that the model can have, in line 7 of the algorithm. However, in practice, BRL-LSS tends to be slower than BRL-GSS. This is because, in BRL-GSS, we keep track of the variables already explored for the entire beam, whereas in BRL-LSS, we keep track of the explored variables for each model separately. We still have only a constant number of models, as constrained by the beam width; as a result, the worst-case time-complexity remains the same as BRL-GSS. If we restrict the maximum number of discrete values that each variable can take, the complexity is likewise reduced, as in BRL-GSS. As a result, with BRL-LSS, we now explore a much richer space of models with the same time-complexity as BRL-GSS.
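The beam discipline used by this best-first search can be sketched as follows. The "models" here are just (score, name) pairs with hypothetical scores; BRL's actual models are BNs scored by the Bayesian score.

```python
import heapq

# Hedged sketch: keep at most beamWidth candidate models, ordered by
# score, while the search proposes specializations one at a time.
beamWidth = 3
beam = []                          # min-heap of (score, model)

def push(score, model):
    """Add a candidate; evict the worst model if the beam is full."""
    heapq.heappush(beam, (score, model))
    if len(beam) > beamWidth:
        heapq.heappop(beam)        # drop the lowest-scoring model

# Hypothetical candidates proposed during the search.
for score, model in [(0.2, "m1"), (0.9, "m2"), (0.5, "m3"), (0.7, "m4")]:
    push(score, model)

best = max(beam)                   # best-scoring model retained so far
```

Because the beam holds only a constant number of models, the memory used by the search is bounded regardless of how many specializations are proposed.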
2.2. Experimental Design
For each biomedical dataset, we split the data into training and test sets using cross-validation. BRL-GSS and BRL-LSS require discrete data, so the training dataset is discretized. After learning the discretization scheme for each of the features from the training data, we apply that scheme to the same features in the test dataset. Finally, we learn a rule model with each of the algorithms on the training data, use this model to predict on the test data, and evaluate performance. The cross-validation design, discretization method, classification algorithms, and performance metrics used for evaluation are described in detail below.
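The per-fold protocol can be sketched as follows. The discretizer here is a deliberately simple stand-in (a single cut at the training median), not the EBD method the paper actually uses; the point is only that the scheme is fit on the training fold and then applied unchanged to the test fold.

```python
# Hedged sketch of the fold protocol: fit discretization on the training
# fold only, then apply the same learned scheme to the test fold.
def fit_discretizer(train_col):
    """Stand-in for EBD: one cut-point at the training median element."""
    s = sorted(train_col)
    return s[len(s) // 2]

def discretize(col, cut):
    return ["UP" if x >= cut else "DOWN" for x in col]

train = [0.1, 0.4, 0.9, 1.2]       # hypothetical expression values
test = [0.2, 1.1]

cut = fit_discretizer(train)       # learned from training data only
train_d = discretize(train, cut)
test_d = discretize(test, cut)     # same scheme applied to test data
```

Fitting the cut-point on the training fold alone avoids leaking information from the test fold into the model.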
2.2.1. Classification Algorithms
We test three algorithms in the modeling step of the experimental design framework in order to generate our rule models: (1) BRL-GSS, which was significantly the best model in our previous study [7] comparing state-of-the-art rule models; (2) BRL-LSS, the method proposed in this paper, with a promise of model parsimony; and, finally, (3) C4.5: since we have shown previously [7] that C4.5 outperforms other readily available rule learners, we consider C4.5 the state-of-the-art for the purposes of comparison in this paper. Decision trees can be translated into a rule base by inferring a rule from each path in the decision tree. C4.5 [11] is the most popular decision-tree-based method. It is an extension of the earlier ID3 algorithm.
Both BRL methods take two parameters: maxConj (the maximum number of features used in the Bayesian network model) and beamWidth (the maximum number of models stored in the search memory). For both BRL-GSS and BRL-LSS, we set maxConj = 8 and beamWidth = 1000. These were arbitrary choices that we use as defaults for the BRL models. C4.5 uses the default parameters specified by Weka (Version 3.8) [16].
We performed our experiments on 10 binary-class, high-throughput biomedical datasets. Each of the 10 datasets chosen here represents a cancer diagnostic problem of distinguishing cancer patients from normal patients using their gene expression profiles. The gene expression data were generated with high-throughput microarray technology. Table 1 shows the dataset dimensions and sources for the 10 datasets.
In addition to the high-throughput microarray gene expression data used in our experiments, we also conduct a case study using data generated by the newer RNA-sequencing (RNA-Seq) technology for gene expression. We obtain Illumina HiSeq 2000, RNA-Seq Version 2, normalized gene expression data of patients with Kidney Renal Clear Cell Carcinoma (KIRC), processed using the RNA-Seq Expectation Maximization (RSEM) pipeline, from The Cancer Genome Atlas (TCGA) [25]. The samples are primary nephrectomy specimens obtained from patients with histologically confirmed clear cell renal cell carcinoma, and the specimens conform to the requirements for genomic study by TCGA. We develop a model to differentiate the gene expression in tumor samples from matched normal samples (normal samples from patients with tumors). This KIRC dataset has 606 samples (534 tumor, 72 normal) and 20,531 mapped genes.
We pre-process the KIRC dataset by removing genes with sparse expression (more than 50% of the samples have value 0), leaving 17,694 genes. As recommended in the RNA-Seq analysis literature [26], we use Limma’s voom transformation [27] to remove heteroscedasticity from the RNA-Seq count data and to reduce the influence of outliers in the data. In this case study, described in the results Section 3.1, we demonstrate the feasibility of our rule learning methods for analyzing RNA-Seq data.
As described in the experimental design framework, our datasets need to be discretized before applying our algorithms. All the biomedical datasets in Table 1 contain continuous measurements of the markers. Each training fold of data is discretized using the efficient Bayesian discretization (EBD) method [28] with the default setting of its parameter that controls the expected number of cut-points for each variable in the dataset.
For each of the 10 datasets, we performed a 10-fold stratified cross-validation. We measure each performance metric (described below) on each fold of the cross-validation and then average that metric across the 10 folds to obtain an estimate of that performance metric.
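The cross-validated estimate of a metric is simply the mean of its per-fold values; the fold scores below are hypothetical numbers for illustration only.

```python
# Hedged sketch: estimate a performance metric as the mean across the
# 10 folds of a stratified cross-validation (fold scores hypothetical).
fold_accuracies = [0.90, 0.85, 0.88, 0.92, 0.86, 0.89, 0.91, 0.87, 0.90, 0.92]

estimate = sum(fold_accuracies) / len(fold_accuracies)
```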
We measure four performance metrics: two to evaluate classifier predictive performance and two to evaluate model parsimony. The first metric for classifier predictive performance is the Area Under the Receiver Operating Characteristic Curve (AUC). It indicates the class discrimination ability of the algorithm on each dataset and ranges from 0.0 to 1.0; a higher value indicates a better predictive classifier. The second metric for classifier performance is Accuracy, given as a percentage. Again, a higher value indicates a better predictor.
The first parsimony metric is the Number of Rules. All the algorithms tested produce rule bases that are mutually exclusive and exhaustive, meaning that each instance in the dataset is covered by exactly one rule. A small number of rules in the rule base indicates greater coverage by individual rules; the coverage of a rule is the fraction of the instances in the training data that satisfy the antecedent of the rule. A large number of rules indicates that each rule has small coverage and, as a result, less evidence. A small number here is attractive because a parsimonious model with few rules to describe all the observed data indicates generalized rules with stronger evidence per rule. The second parsimony metric is the Number of Variables. Typically, in a biomarker discovery task that involves high-dimensional gene expression data, we would like fewer variables to describe the observed data, because the validation of those markers is time-consuming and expensive. Having fewer variables to verify is appealing in this domain. Therefore, we prefer the smaller number of variables that gives us the best predictor.