Knowledge Generation with Rule Induction in Cancer Omics

The explosion of omics data availability in cancer research has boosted our knowledge of the molecular basis of cancer, although strategies for its definitive resolution are still not well established. The complexity of cancer biology, driven by the high heterogeneity of cancer cells, leads to the development of pharmacoresistance in many patients, hampering the efficacy of therapeutic approaches. Machine learning techniques have been applied to extract knowledge from cancer omics data in order to address fundamental issues in cancer research, such as the classification of clinically relevant sub-groups of patients and the identification of biomarkers for disease risk and prognosis. Rule induction algorithms are a group of pattern discovery approaches that represent discovered relationships in the form of human-readable associative rules. Applying such techniques to the large volume of cancer omics data now being collected can effectively boost our understanding of cancer-related mechanisms. In fact, the capability of these methods to extract a large amount of human-readable knowledge can help to uncover unknown relationships between molecular attributes and the malignant phenotype. In this review, we describe applications and strategies for the usage of rule induction approaches in cancer omics data analysis. In particular, we explore the canonical applications and the future challenges and opportunities posed by multi-omics integration problems.

its complexity by applying a pruning step, aiming to remove branches with minimum contribution to the overall accuracy. Once the tree has been pruned, knowledge can be extracted and presented in the form of (if-then) rules.

Grow & prune algorithms
RIPPER
RIPPER (Repeated Incremental Pruning to Produce Error Reduction) [3] is one of the most efficient and widely used rule learning algorithms. It implements a separate-and-conquer (sequential covering) strategy for rule induction.
RIPPER applies the so-called Incremental Reduced Error Pruning (IREP) to compile an initial set of rules for each class. Then, an additional optimization step considers each rule in the current set in turn and creates two alternative rules from it: a replacement rule and a revision rule. After that, a decision is made on whether the model should keep the original rule, the replacement rule, or the revision rule, based on the minimum description length criterion.
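The covering loop underlying this family of algorithms can be illustrated with a minimal sketch. It is not RIPPER itself: the IREP pruning step and the description-length-based optimization are omitted, features are assumed to be binary, and the names `covers`, `grow_rule` and `learn_rules` are hypothetical helpers introduced only for this example.

```python
from typing import Dict, List, Tuple

Example = Tuple[Dict[str, int], int]   # (binary features, class label)
Rule = List[Tuple[str, int]]           # conjunction of (feature, required value)


def covers(rule: Rule, features: Dict[str, int]) -> bool:
    """A rule covers an example when every one of its conditions holds."""
    return all(features[name] == value for name, value in rule)


def grow_rule(examples: List[Example], target: int) -> Rule:
    """Greedily add conditions until only target-class examples are covered."""
    rule: Rule = []
    covered = examples
    while any(label != target for _, label in covered):
        used = {name for name, _ in rule}
        best, best_score = None, (-1.0, 0)
        for name in covered[0][0]:
            if name in used:
                continue
            for value in (0, 1):
                subset = [(f, y) for f, y in covered if f[name] == value]
                positives = sum(1 for _, y in subset if y == target)
                if positives == 0:
                    continue
                # Assumed heuristic: precision first, then number of positives.
                score = (positives / len(subset), positives)
                if score > best_score:
                    best, best_score = (name, value), score
        if best is None:
            break  # no unused feature left; accept an impure rule
        rule.append(best)
        covered = [(f, y) for f, y in covered if f[best[0]] == best[1]]
    return rule


def learn_rules(examples: List[Example], target: int) -> List[Rule]:
    """Learn one rule, remove the examples it covers, and repeat."""
    rules: List[Rule] = []
    remaining = list(examples)
    while any(label == target for _, label in remaining):
        rule = grow_rule(remaining, target)
        rules.append(rule)
        remaining = [(f, y) for f, y in remaining if not covers(rule, f)]
    return rules


toy = [
    ({"g1": 1, "g2": 0}, 1),
    ({"g1": 1, "g2": 1}, 1),
    ({"g1": 0, "g2": 1}, 1),
    ({"g1": 0, "g2": 0}, 0),
]
rules = learn_rules(toy, target=1)
```

On real omics data a pruning step would be essential, since rules grown to purity as above tend to overfit.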

PART
PART [4] is a rule learning algorithm based on partial decision trees. In particular, PART generates a set of rules according to the separate-and-conquer strategy: it builds a rule, removes all instances covered by this rule from the training collection, and proceeds recursively until no instance remains. To generate a single rule, PART builds a partial C4.5 decision tree for the current set of instances and selects the leaf with the largest coverage as the new rule. Afterwards, the partial decision tree is discarded and the instances covered by the new rule are removed from the training data, in order to avoid early generalization. The process is repeated until all instances are covered by the extracted rules.

CAMUR
CAMUR (Classifier with Alternative and MUltiple Rule-based models) [5,6] is based on the RIPPER algorithm. It extracts multiple and equivalent rule bases by iteratively computing a rule-based classification model. CAMUR includes an ad-hoc knowledge repository (database) and querying tool.
CAMUR is implemented as a standalone Java application and as a web application available at http://dmb.iasi.cnr.it/camur.php. Implementation (language): Java/Web.

BIGBIOCL
BIGBIOCL (CAMUR improved implementation) [7] is an improved version of CAMUR designed to handle hundreds of thousands of features. Following the CAMUR strategy, it learns multiple alternative and equivalent classification models through the iterative deletion of selected features. BIGBIOCL is implemented as a standalone Java application. Implementation (language): Java.
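The idea of extracting alternative models by iterative feature deletion can be sketched as follows. This is a simplified illustration, not the CAMUR or BIGBIOCL procedure: the base learner is a hypothetical single-threshold rule (`fit_stump`) standing in for a rule-based classifier, every feature used by a model is removed after each round, and the stopping threshold `min_accuracy` is an assumed parameter.

```python
from typing import Dict, List, Optional, Tuple

Sample = Dict[str, float]
Stump = Tuple[str, float, float]  # (feature, threshold, training accuracy)


def fit_stump(samples: List[Sample], labels: List[int],
              features: List[str]) -> Optional[Stump]:
    """Stand-in learner: best rule of the form 'feature > threshold => class 1'."""
    best: Optional[Stump] = None
    for name in features:
        for threshold in sorted({s[name] for s in samples}):
            predicted = [1 if s[name] > threshold else 0 for s in samples]
            hits = sum(p == y for p, y in zip(predicted, labels))
            accuracy = hits / len(labels)
            if best is None or accuracy > best[2]:
                best = (name, threshold, accuracy)
    return best


def alternative_models(samples: List[Sample], labels: List[int],
                       min_accuracy: float = 0.9) -> List[Stump]:
    """Refit after deleting the features already used, until accuracy drops."""
    available = list(samples[0])
    models: List[Stump] = []
    while available:
        model = fit_stump(samples, labels, available)
        if model is None or model[2] < min_accuracy:
            break
        models.append(model)
        available.remove(model[0])  # delete the selected feature
    return models


samples = [
    {"geneA": 1.0, "geneB": 0.5, "geneC": 5.0},
    {"geneA": 2.0, "geneB": 0.7, "geneC": 1.0},
    {"geneA": 8.0, "geneB": 3.0, "geneC": 4.0},
    {"geneA": 9.0, "geneB": 3.5, "geneC": 2.0},
]
labels = [0, 0, 1, 1]
models = alternative_models(samples, labels)
```

In this toy data set, two features separate the classes equally well, so the loop returns two equivalent models before accuracy falls below the threshold.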

Based on fuzzification
FURIA
FURIA (Fuzzy Unordered Rule Induction Algorithm) [8] is an improved version of the RIPPER algorithm. It uses a modified RIPPER algorithm as a basis and learns fuzzy rules and unordered rule sets. The main strength of this algorithm is the rule stretching method, which addresses the problem of new records that, at classification time, fall outside the space covered by the previously induced rules. The representation of rules is also more advanced: a fuzzy rule is obtained by replacing intervals with fuzzy intervals, namely fuzzy sets with a trapezoidal membership function.
FURIA is implemented in Java as a module of the WEKA suite. Implementation (language): Weka (Java).
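A trapezoidal membership function of the kind mentioned above can be written in a few lines. The sketch below is a generic definition and is not taken from the FURIA code base; the parameter names `a`, `b`, `c`, `d` (support bounds and core bounds, with a <= b <= c <= d) are assumptions made for illustration.

```python
def trapezoid(x: float, a: float, b: float, c: float, d: float) -> float:
    """Membership of x in a fuzzy interval with support [a, d] and core [b, c]."""
    if b <= x <= c:
        return 1.0                  # inside the core: full membership
    if a < x < b:
        return (x - a) / (b - a)    # rising edge
    if c < x < d:
        return (d - x) / (d - c)    # falling edge
    return 0.0                      # outside the support


# A crisp condition "4 <= expression <= 6" softened to the support [2, 8].
memberships = [trapezoid(v, 2.0, 4.0, 6.0, 8.0) for v in (1.0, 3.0, 5.0, 7.0)]
```

Setting `a == b` and `c == d` recovers the original crisp interval, which shows that the fuzzy rule generalizes the interval-based rule it replaces.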

Based on probability estimation
MLRules
MLRules (Maximum Likelihood Rule Ensembles) [9] is an induction algorithm for solving classification problems via probability estimation. The basic idea is to treat single rules as individual classifiers and to build an ensemble classification system on top of them. Unlike classic sequential covering procedures (also known as separate-and-conquer approaches), new rules are added without adjusting those that have already been added. The main advantage of the MLRules algorithm is that a simple and powerful statistical technique is used to induce the rules, which in turn leads to final ensembles with very high prediction accuracy. MLRules is implemented in Java as a module of the WEKA suite. Implementation (language): Weka (Java).

Rough set theory
LERS (LEM1, LEM2, MLEM2)
LERS (Learning from Examples using Rough Sets) [10] uses rough set theory to handle inconsistencies in decision rules. Rough set theory is used to obtain the lower and upper approximations of a crisp set representing a concept. These approximations are then used to build two different sets of rules: certain and possible. LERS applies a bottom-up strategy in order to define rules incrementally. At each iteration, it identifies the certain rules and combines the remaining possible rules to get the next set of certain and possible rules. This process ends when there are no possible rules left to build. LERS represents a general approach to the knowledge acquisition problem. However, its use in machine learning needs an additional module or computational step aiming to learn a discriminant description [13], i.e., to learn the smallest set of minimal rules describing the concept.
Currently, there are three algorithms for this step: LEM1, LEM2 and MLEM2.
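The lower and upper approximations on which certain and possible rules are built can be computed directly from the indiscernibility classes. The following is a minimal sketch of the standard rough set definitions, not of the LERS implementation; the function name `approximations` and the toy patient table are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Set, Tuple


def approximations(objects: Dict[str, Tuple[Hashable, ...]],
                   concept: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return the (lower, upper) approximations of `concept`.

    `objects` maps an object id to its tuple of attribute values; objects
    with identical tuples are indiscernible and fall in the same block.
    """
    blocks = defaultdict(set)
    for name, attributes in objects.items():
        blocks[attributes].add(name)

    lower: Set[str] = set()
    upper: Set[str] = set()
    for block in blocks.values():
        if block <= concept:   # block entirely inside the concept
            lower |= block
        if block & concept:    # block overlaps the concept
            upper |= block
    return lower, upper


patients = {
    "p1": ("high", "yes"),
    "p2": ("high", "yes"),  # indiscernible from p1 but not in the concept
    "p3": ("low", "no"),
    "p4": ("high", "no"),
}
tumor = {"p1", "p4"}
lower, upper = approximations(patients, tumor)
```

Here `p1` and `p2` share the same attribute values but differ in class, so they appear only in the upper approximation: rules induced from `lower` are certain, while rules induced from `upper` are possible.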

TSP
TSP (Top Scoring Pair) [11] is a rule induction technique based on relative values between pairs of features. TSP was developed for microarray data and builds rules on a feature space constituted by pairwise comparisons of gene expression levels. The main advantage of the TSP approach is that, being based on relative values, it alleviates the problem of integrating data from different sources, which may be represented on different scales and can suffer from batch effects. In addition, the TSP classifier provides decision rules that are easy to interpret, since they involve relative values between pairs of features (genes in this case). TSP is implemented in the R language and available from the tspair Bioconductor package. Implementation (language): R/tspair package (R).
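The pair selection step can be sketched as follows, assuming the usual TSP score: the absolute difference between the two classes in the frequency with which one gene is expressed below the other. This is an illustration in Python rather than the tspair API; `pair_score` and `top_scoring_pair` are hypothetical names, and the tie-breaking step used by full implementations is omitted.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def pair_score(expr: Dict[str, List[float]], labels: List[int],
               gene_i: str, gene_j: str) -> float:
    """|P(gene_i < gene_j | class 0) - P(gene_i < gene_j | class 1)|."""
    frequencies = []
    for cls in (0, 1):
        idx = [k for k, y in enumerate(labels) if y == cls]
        below = sum(expr[gene_i][k] < expr[gene_j][k] for k in idx)
        frequencies.append(below / len(idx))
    return abs(frequencies[0] - frequencies[1])


def top_scoring_pair(expr: Dict[str, List[float]],
                     labels: List[int]) -> Tuple[str, str]:
    """Return the gene pair with the highest score (first one wins on ties)."""
    return max(combinations(expr, 2),
               key=lambda pair: pair_score(expr, labels, *pair))


# Rows are genes, columns are samples; two samples per class.
expr = {
    "g1": [1.0, 2.0, 9.0, 8.0],
    "g2": [5.0, 6.0, 3.0, 4.0],
    "g3": [4.0, 1.0, 2.0, 10.0],
}
labels = [0, 0, 1, 1]
best_pair = top_scoring_pair(expr, labels)
```

Because only the ordering of the two values within a sample matters, the score is unchanged by any monotone rescaling applied to a whole sample, which is the property that makes the approach robust across platforms.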

k-TSP
k-TSP [12] is an extension of the TSP algorithm that uses exactly k pairs of genes for classifying gene expression data. Instead of using a single comparison as a literal, k-TSP uses groups of k comparisons and applies majority voting among them to decide the truth value of the complex literal.
When k = 1, the algorithm is equivalent to the TSP algorithm. k-TSP is implemented in the R language and available from the switchbox Bioconductor package. Implementation (language): R/switchbox package (R).
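The majority vote can be sketched in a few lines. The representation of a trained pair as `(gene_i, gene_j, class_if_less)` and the function name `ktsp_predict` are assumptions made for this example and do not mirror the switchbox interface; k is assumed to be odd so that ties cannot occur.

```python
from typing import Dict, List, Tuple

# (gene_i, gene_j, class predicted when gene_i < gene_j); classes are 0 and 1.
Pair = Tuple[str, str, int]


def ktsp_predict(sample: Dict[str, float], pairs: List[Pair]) -> int:
    """Each pair casts one vote; the class with more than half the votes wins."""
    votes = [cls if sample[gi] < sample[gj] else 1 - cls
             for gi, gj, cls in pairs]
    return int(sum(votes) > len(votes) / 2)


pairs = [("g1", "g2", 0), ("g3", "g4", 1), ("g5", "g6", 0)]
sample_a = {"g1": 1.0, "g2": 2.0, "g3": 5.0, "g4": 3.0, "g5": 2.0, "g6": 7.0}
sample_b = {"g1": 3.0, "g2": 2.0, "g3": 1.0, "g4": 3.0, "g5": 2.0, "g6": 7.0}
predictions = [ktsp_predict(sample_a, pairs), ktsp_predict(sample_b, pairs)]
```

With a single pair in `pairs`, the function reduces to the plain TSP decision rule.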

Genetic Algorithm based
BioHEL
BioHEL (Bioinformatics-Oriented Hierarchical Learning) [13] is an evolutionary machine learning system designed to handle large-scale bioinformatics datasets. BioHEL employs the Iterative Rule Learning (IRL) paradigm. The IRL procedure begins with an empty rule set and the complete set of observations as input, and evolves rules one at a time using a genetic algorithm. Each time a rule is evolved by the system, it is added to the current rule set and all observations covered by the rule are removed from the training set. By iterating this process, rules are added to the rule set until all the samples in the training set are covered. BioHEL is implemented as a standalone tool in C++ and can be run serially on CPU or in parallel on GPUs. Implementation (language): http://icos.cs.nott.ac.uk/software/biohel.html (C++).

CN2-SD
The CN2-SD algorithm [14] finds rules covering subsets of the population that are sufficiently large and "statistically unusual".
It works iteratively, searching in each iteration for a set of relationships between features (a complex) that covers a large number of examples of a single class and few of the other classes. Having found a good complex, the algorithm removes the covered observations from the training set and adds the corresponding rule(s) to the rule set. The procedure is repeated until no more satisfactory complexes can be found. CN2-SD is implemented in Java as a module of the KEEL suite. Implementation (language): KEEL (Java).
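The trade-off between subgroup size and unusualness is commonly quantified with weighted relative accuracy (WRAcc), the measure usually associated with CN2-SD. The sketch below assumes that definition, WRAcc = p(Cond) * (p(Class | Cond) - p(Class)), and is not taken from the KEEL implementation; the function name `wracc` and the boolean-list encoding are hypothetical.

```python
from typing import List


def wracc(covered: List[bool], is_target: List[bool]) -> float:
    """Weighted relative accuracy of a rule 'Cond => Class'.

    covered[k]   : whether the condition holds for example k
    is_target[k] : whether example k belongs to the target class
    """
    n = len(covered)
    n_cond = sum(covered)
    if n_cond == 0:
        return 0.0  # an empty subgroup carries no information
    n_class = sum(is_target)
    n_both = sum(c and t for c, t in zip(covered, is_target))
    coverage = n_cond / n                          # how large the subgroup is
    gain = n_both / n_cond - n_class / n           # how unusual it is
    return coverage * gain


is_target = [True, True, True, True, False, False, False, False]
small_pure = [True, True, True, False, False, False, False, False]
everything = [True] * 8
scores = [wracc(small_pure, is_target), wracc(everything, is_target)]
```

A rule that covers every example scores zero because its class distribution equals that of the whole population, whereas a pure but smaller subgroup obtains a positive score.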

SDEFSR
SDEFSR (Subgroup Discovery with Evolutionary Fuzzy System) [15] is a collection of rule induction algorithms based on subgroup discovery that make use of fuzzy logic to improve the interpretability of results. SDEFSR algorithms are able to evolve fuzzy rules and use fuzzy set definitions. SDEFSR algorithms are implemented in the R language and available from the SDEFSR CRAN package. Implementation (language): R/SDEFSR package (R).