2.2. Software Packages to Be Evaluated
For multiclass classification, we can construct multiple binary classification models (in-class versus out-of-class), one for each class. Each model estimates the probability of its class, and a sample is assigned to the class with the largest probability. Furthermore, many cell-type classification problems are binary (e.g., stimulated versus wild-type).
For simplicity, this study therefore considers only binary classification performance. When describing classification procedures, we use 1 to denote True and 0 to denote False. The software packages to be evaluated are listed below.
rpart [32] (Recursive Partitioning and Regression Trees) is one of the most commonly used R packages that perform both classification and regression. It uses the Gini impurity, which is the probability that a randomly selected element of a data set is classified incorrectly. The score is calculated by summing the squared class probabilities and subtracting that sum from 1, that is,
$$Gini = 1 - \sum_{i=1}^{C} {p}_{i}^{2},$$
where ${p}_{i}$ is the proportion of samples that belong to class i and C is the number of classes.
ctree [33] (Conditional Inference Trees) is part of the partykit R package. The algorithm tests a hypothesis of independence between the variables and the response, and then splits if the hypothesis is rejected.
evtree [34] (Evolutionary Tree) is based on an evolutionary algorithm. The package first creates a random tree and then updates it at each iteration. The algorithm halts when no further change to the current model is required.
tree is a package used for classification or regression. It measures impurity to decide where to create a split, and splitting halts when a further split is not worthwhile.
C5.0 is a package that extends the C4.5 classification algorithm, which is itself an extension of the ID3 algorithm. The algorithm builds a tree by calculating the entropy of the samples under testing and splitting each branch accordingly.
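The two impurity measures mentioned in this list, Gini impurity and entropy, can be illustrated with a minimal Python sketch. This is not the code of rpart, tree or C5.0 (those are R packages with their own implementations); the function names and the toy labels are assumptions made only for illustration.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Proportion of each class among a non-empty list of labels."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def gini_impurity(labels):
    """1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Shannon entropy (in bits) of the class proportions."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))


def weighted_impurity(left, right, impurity):
    """Impurity of a candidate split: size-weighted mean over both children."""
    n = len(left) + len(right)
    return (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)


# Toy example (made-up labels): a balanced node and a split that separates it.
node = [1, 1, 1, 0, 0, 0]
print(gini_impurity(node))
print(entropy(node))
print(weighted_impurity([1, 1, 1], [0, 0, 0], gini_impurity))
```

A split is attractive when the weighted impurity of the children is clearly lower than the impurity of the parent node; how each package turns that comparison into a stopping rule differs and is not reproduced here.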
2.3. The Benchmark Data Sets
The data sets under testing were extracted from the Conquer (consistent quantification of external RNA-Sequencing data) repository, developed by C. Soneson and M. Robinson at the University of Zurich, Switzerland [31]. Three organisms are included in the repository: Homo sapiens, Mus musculus and Danio rerio. Each data set contains a different number of cells. Six protocols were followed to obtain the cell sequences: SMARTer C1, Smart-Seq2, SMART-Seq, Tang, Fluidigm C1Auto prep and SMARTer.
We explored all data sets in the Conquer repository, but not all of them suit our test. For a data set to be suitable for our testing methodology, its samples must be divisible into two groups, each sharing a common phenotype, so that we can label one phenotype with 1s and the other with 0s. For example, the data set GSE80032 was excluded because all its samples have the same phenotype, so we have no basis on which to divide them. Also, neither group of samples may be so small that testing produces misclassification or execution errors. We found that both sample groups must have at least 30 samples to avoid such errors.
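The suitability rule above (two phenotype groups, each with at least 30 samples) can be sketched as follows. This is a Python illustration under assumed names; the phenotype counts in the example are invented and do not describe any Conquer data set.

```python
from collections import Counter

MIN_GROUP_SIZE = 30  # threshold stated in the text


def usable_phenotypes(sample_phenotypes, min_size=MIN_GROUP_SIZE):
    """Phenotypes that have at least `min_size` samples, in sorted order."""
    counts = Counter(sample_phenotypes)
    return sorted(p for p, count in counts.items() if count >= min_size)


def is_suitable(sample_phenotypes, min_size=MIN_GROUP_SIZE):
    """A data set qualifies when at least two phenotypes are large enough."""
    return len(usable_phenotypes(sample_phenotypes, min_size)) >= 2


# Hypothetical data sets: one with a single phenotype, one with two large groups.
single = ["treated"] * 40
mixed = ["A"] * 45 + ["B"] * 38 + ["C"] * 7
print(is_suitable(single), is_suitable(mixed), usable_phenotypes(mixed))
```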
There are 22 data sets that fit our test. These data sets are listed in Table 1 as they are presented in the Conquer repository, along with the protocol type used.
Accessing the information within these data sets is not straightforward. Thus, before proceeding with our test, each data set had to be prepared and normalized before it could be passed to our methods.
To access gene abundances, we closely followed the procedural steps provided by the Conquer repository authors. Using the R programming language, we retrieved ExperimentList instances that contained a RangedSummarizedExperiment object for the gene level. This gave us access to abundances that included the TPM (transcripts per million) abundance for each gene, the gene count, the length-scaled TPMs, and the average length of the transcripts found in each sample for each gene. For our test, we chose the gene TPM abundances and used them as the input predictor matrix ${X}_{i,j}$, where i indexes genes and j indexes samples.
At this point, we had full access to the desired type of information; nevertheless, we had to normalize the matrix ${X}_{i,j}$ to fit our test. First, we transposed X so that samples became rows and genes became columns, giving ${X}_{j,i}$.
We then looked into the phenotypes associated with the data set, using the R command table(colData(dataset)), in order to find two distinguishing phenotypic characteristics associated with groups of samples in X. We denoted the first characteristic with 1 and the second with 0, and replaced the sample IDs with either 1 or 0 so that we could distinguish the groups within our classification model. For example, in the data set EMTAB2805, different phenotypes are associated with each group of samples. We picked two phenotypes based on cell cycle stage, $G1$ and $G2M$. Samples associated with the first stage, $G1$, were replaced by 1 (True), and samples associated with the second stage, $G2M$, were replaced by 0 (False). We eliminated any extra samples associated with other cell cycle stages.
At this point, each row j in ${X}_{j,i}$ carries a binary label: the group of samples with the first distinguishing phenotype is represented by 1 and the other group is represented by 0.
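The labeling step can be sketched in Python as follows. The authors worked in R, so this is only an illustration under assumed names; the expression values are made up, and the phenotype names follow the EMTAB2805 example in the text.

```python
def binarize_samples(rows, phenotypes, positive, negative):
    """Keep samples of the two chosen phenotypes and label them 1 or 0.

    rows[k] holds the expression values of sample k and phenotypes[k] its
    phenotype. Samples with any other phenotype are dropped.
    """
    kept_rows, labels = [], []
    for row, phenotype in zip(rows, phenotypes):
        if phenotype == positive:
            kept_rows.append(row)
            labels.append(1)
        elif phenotype == negative:
            kept_rows.append(row)
            labels.append(0)
    return kept_rows, labels


# Made-up values: four samples, two genes each.
rows = [[5.0, 0.1], [4.2, 0.0], [0.3, 7.9], [1.1, 1.2]]
phenotypes = ["G1", "G2M", "G1", "S"]
kept, labels = binarize_samples(rows, phenotypes, positive="G1", negative="G2M")
print(kept, labels)
```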
Table 2 lists the same data sets after identifying the phenotypes that were used in our test.
Finally, we conducted a Wilcoxon test on the data in X to obtain the p-value associated with each gene.
Because of memory exhaustion and excessive execution times, we had to trim our matrix; thus, we kept only the 1000 genes with the lowest p-values in X. Consequently, the final form of our matrix was ${X}_{j,i}$, where i ranged from 1 to 1000 and j ranged from 1 to n, with the rows separated into two groups of 1s and 0s.
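The gene filter can be sketched as below. The sketch assumes that the test is the two-sample Wilcoxon rank-sum test and uses the normal approximation without tie or continuity corrections, so its p-values will not match those of R's wilcox.test exactly. All function names and the toy matrix are assumptions for illustration.

```python
import math
from statistics import NormalDist


def average_ranks(values):
    """1-based ranks of `values`, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def rank_sum_p_value(group_a, group_b):
    """Two-sided rank-sum p-value via the normal approximation (simplified)."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    u = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    if sd_u == 0:
        return 1.0
    z = (u - n_a * n_b / 2) / sd_u
    return 2 * (1 - NormalDist().cdf(abs(z)))


def select_top_genes(matrix, labels, k):
    """Indices of the k genes (columns) with the lowest p-values, plus all p-values."""
    p_values = []
    for gene in range(len(matrix[0])):
        ones = [row[gene] for row, y in zip(matrix, labels) if y == 1]
        zeros = [row[gene] for row, y in zip(matrix, labels) if y == 0]
        p_values.append(rank_sum_p_value(ones, zeros))
    ranked = sorted(range(len(p_values)), key=lambda gene: p_values[gene])
    return ranked[:k], p_values


# Toy matrix (made up): rows are samples, columns are genes.
matrix = [[10, 1], [11, 5], [12, 2], [13, 6], [1, 3], [2, 7], [3, 4], [4, 8]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
selected, p_values = select_top_genes(matrix, labels, k=1)
print(selected, p_values)
```

In the study itself k is 1000; the toy call uses k=1 only to keep the example small.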
At this point, we could proceed with testing and easily split our matrix. At the beginning of each testing iteration, we shuffled the rows of the matrix (the samples denoted by 1 and 0) to prevent any bias that may result from having identical sample IDs.
We then proceeded to our 10-fold cross-validation testing. We took 10% of the samples (rows) as our testing set and the rest of the samples as our training set. After recording the system time (the method’s start time), we fitted the first method, $rpart$, to the training set. Immediately after the method finished executing, we recorded the system time again (the method’s end time). We subtracted the start time from the end time and took the result as the method’s Runtime.
After constructing the model, we used it to predict the testing set. We first extracted the confusion matrix (a $2\times 2$ matrix). In a few cases the matrix had only one dimension because no misclassification occurred, so we programmatically forced the formation of a $2\times 2$ matrix in order to always be able to calculate the true positive and false positive values. We then calculated the method’s Precision by dividing the number of true positives by the sum of the number of true positives and the number of false positives. Similarly, we calculated the method’s Recall by dividing the number of true positives by the sum of the number of true positives and the number of false negatives from the same confusion matrix.
With the method’s precision and recall, we calculated the ${F}_{1}$ score, that is, twice the product of the method’s precision and recall divided by the sum of the method’s precision and recall.
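The confusion-matrix step can be sketched in Python as below. Counting all four cells explicitly always yields a full $2\times 2$ table, even when one class is never predicted, which mirrors the forced matrix described above. Returning 0 when a denominator is zero is an assumption of this sketch, not a rule stated in the text, and the function names are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN and TN for binary labels coded as 1 and 0."""
    pairs = list(zip(y_true, y_pred))
    return {
        "TP": sum(1 for t, p in pairs if t == 1 and p == 1),
        "FP": sum(1 for t, p in pairs if t == 0 and p == 1),
        "FN": sum(1 for t, p in pairs if t == 1 and p == 0),
        "TN": sum(1 for t, p in pairs if t == 0 and p == 0),
    }


def precision_recall_f1(counts):
    """Precision, Recall and F1 from a dictionary of confusion counts."""
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    # Assumed convention: a ratio with a zero denominator is reported as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up predictions for five test samples.
counts = confusion_counts([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
print(counts, precision_recall_f1(counts))
```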
We then computed the receiver operating characteristic (ROC) curve using the pROC package and, from it, obtained the AUC measurement. Finally, we took the model tree and computed its number of nodes and leaves (the complexity score).
We then tested the second method, $ctree$. We repeated the same steps applied to the first method and collected the scores for $ctree$ Precision, Recall, ${F}_{1}$ score, AUC, Complexity and Runtime. We then proceeded to test $evtree$, $tree$ and $C5.0$, respectively.
In the second iteration of our cross-validation testing, we took the second fold (the adjacent 10% of the data set) as our testing set and the rest of the samples (including those in the previously tested fold) as our training set. We repeated the same steps using these newly created testing and training sets. At the end of this 10-fold cross-validation testing, we had 10 scores for each method’s Precision, Recall, ${F}_{1}$ score, AUC, Complexity and Runtime. We collected these scores and took their $mean$ as the final score.
The testing procedure explained above (from shuffling the samples to collecting the mean scores of each method’s Precision, Recall, ${F}_{1}$ score, AUC, Complexity and Runtime) was repeated 100 times. Each time we collected the mean scores; we then plotted the final results and proceeded to the analysis.
2.4. Design of Evaluation Experiments
In this study, we compared classification performance using a cross-validation approach. Each data set was randomly split into folds. At each iteration, one fold was taken as the testing set (usually 10% of the total data set) and the remaining folds were taken as the training set (90% of the total data set). At the following iteration, the adjacent fold was set as the testing set and the remaining folds as the training set, and so on. Scores were defined to describe the classification criteria, and the arithmetic mean of the scores collected from all iterations is the final score used to measure performance. The six evaluation scores (Precision, Recall, ${F}_{1}$ score, AUC, Complexity and Runtime) are defined below.
We define Precision as the number of correctly predicted true signals $TP$ over the sum of the number of false signals that are classified as true signals $FP$ and the number of correctly predicted true signals. That is denoted as:
$$Precision = \frac{TP}{TP + FP}$$
We define Recall as the number of correctly predicted true signals $TP$ over the sum of the number of true signals that are classified as false signals $FN$ and the number of correctly predicted true signals. That is denoted as:
$$Recall = \frac{TP}{TP + FN}$$
Consequently, we define the ${F}_{1}$ score as:
$${F}_{1} = 2 \times \frac{Precision \times Recall}{Precision + Recall}$$
We define Complexity as the total number of all splits in the model tree, that is, the total number of all nodes and leaves, including the first node.
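As an illustration of this definition, the sketch below counts every node of a binary tree, including the root and the leaves. The nested-dictionary representation is hypothetical; each R package stores its fitted tree in its own structure.

```python
def count_nodes(tree):
    """Total number of nodes in a binary tree, root and leaves included.

    A node is a dict; internal nodes hold their children under the keys
    "left" and "right", and leaves have neither key.
    """
    children = [tree[key] for key in ("left", "right") if key in tree]
    return 1 + sum(count_nodes(child) for child in children)


# Hypothetical tree: the root splits once, and its right child splits again.
toy_tree = {
    "split": "gene_17 > 3.2",
    "left": {"label": 0},
    "right": {
        "split": "gene_4 > 0.5",
        "left": {"label": 0},
        "right": {"label": 1},
    },
}
print(count_nodes(toy_tree))
```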
The Area Under the Curve (AUC) is the area under an ROC (Receiver Operating Characteristic) curve, which plots the True Positive Rate (on the y-axis) against the False Positive Rate (on the x-axis) at different thresholds. In other words, the AUC measures the ability of a model to distinguish between positive signals and negative signals. The higher the AUC score, the better the model is at prediction.
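The threshold sweep behind the ROC curve can be sketched as below. This is a plain Python illustration, not the pROC implementation used in the study; it assumes that both classes are present and that a higher score means a stronger prediction of class 1. The function names and scores are made up.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) points obtained by lowering the threshold over the scores."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear curve through the ROC points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Made-up scores for two positive and two negative samples.
y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
print(roc_points(y_true, scores))
print(auc_trapezoid(roc_points(y_true, scores)))
```

In this toy case three of the four positive-negative pairs are ranked correctly, which matches the area of 0.75 obtained from the curve.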
The summary of a software package’s prediction performance depends on how the data sets are split into folds (or subsets), but this split effect is not what we intended to show in the evaluation of the methods. So, to remove the impact of the random split and generate more reliable results, we repeated the test 100 times, randomizing the matrix each time. We then summarized the performance over the 100 repeats and used these summaries to compare the software.
Algorithm 1 describes the flow of the testing stages after we converted the matrix X into the desired form.
Algorithm 1 Procedure of Evaluating Software Using Cross-validation
 1: $Data\ set$ is a matrix X of two dimensions; 1st dimension for Samples and 2nd dimension for Genes, containing TPM gene abundances.
 2: Separate samples into two groups based on selected phenotypes.
 3: In 1st group: replace $Sample\_id\leftrightarrow$ “1”
 4: In 2nd group: replace $Sample\_id\leftrightarrow$ “0”
 5: for $n\in \{1,...,length(M)\}$ do ▹ Loop over genes
 6:     Run Wilcoxon Test on n
 7:     Extract and Store p-value
 8: end for
 9: Shorten X to include only 1000 genes with the lowest p-value.
10: for $k\in \{1,...,100\}$ do ▹ General loop
11:     Shuffle rows in X
12:     for $x\in \{1,...,10\}$ do ▹ Loop over folds
13:         Set 10% of X as $Testing\_set$
14:         Set 90% of X as $Training\_set$
15:         Measure current $tim{e}_{start}$
16:         Fit $Training\_set$ into the first method.
17:         Measure current $tim{e}_{end}$
18:         Predict model on $Testing\_set$
19:         Calculate method’s Runtime using $tim{e}_{end} - tim{e}_{start}$
20:         Calculate method’s Precision score using $Confusion\_Matrix$
21:         Calculate method’s Recall score using $Confusion\_Matrix$
22:         Calculate method’s ${F}_{1}$ score using method’s Precision and Recall.
23:         Calculate method’s $AUC$ score using $ROC$ function
24:         Calculate method’s $Complexity$ score using model tree length
25:         Repeat steps 15-24 on other packages
26:     end for
27:     Calculate $mean$ of all scores collected
28: end for
29: Plot final scores
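The two nested loops of Algorithm 1 (the general loop and the loop over folds) can be sketched in Python as follows. The classifier here is a hypothetical one-feature stump standing in for the R packages, and the only score computed besides Runtime is Precision; the function names, the seed and the toy data are all assumptions of this sketch.

```python
import random
import time
from statistics import mean


def repeated_cross_validation(rows, labels, fit, predict, score,
                              n_repeats=100, n_folds=10, seed=0):
    """Return one dictionary of fold-averaged scores per repeat."""
    rng = random.Random(seed)
    n = len(rows)
    repeat_means = []
    for _ in range(n_repeats):
        order = list(range(n))
        rng.shuffle(order)  # shuffle the rows before splitting into folds
        fold_scores = []
        for fold in range(n_folds):
            lo, hi = fold * n // n_folds, (fold + 1) * n // n_folds
            test_idx = order[lo:hi]  # adjacent block of about 1/n_folds of the rows
            train_idx = order[:lo] + order[hi:]
            start = time.perf_counter()
            model = fit([rows[i] for i in train_idx], [labels[i] for i in train_idx])
            runtime = time.perf_counter() - start
            predicted = predict(model, [rows[i] for i in test_idx])
            result = score([labels[i] for i in test_idx], predicted)
            result["runtime"] = runtime
            fold_scores.append(result)
        repeat_means.append(
            {key: mean(s[key] for s in fold_scores) for key in fold_scores[0]}
        )
    return repeat_means


def fit_stump(rows, labels):
    """Hypothetical stand-in classifier: one threshold on the first feature."""
    ones = [r[0] for r, y in zip(rows, labels) if y == 1]
    zeros = [r[0] for r, y in zip(rows, labels) if y == 0]
    return (mean(ones) + mean(zeros)) / 2, mean(ones) > mean(zeros)


def predict_stump(model, rows):
    threshold, ones_above = model
    return [int((r[0] > threshold) == ones_above) for r in rows]


def precision_score(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # Assumed convention: Precision is 0 when nothing is predicted positive.
    return {"precision": tp / (tp + fp) if tp + fp else 0.0}


# Made-up data: 20 samples per class, one well-separated feature.
rows = [[float(v)] for v in range(20)] + [[float(v)] for v in range(100, 120)]
labels = [0] * 20 + [1] * 20
results = repeated_cross_validation(rows, labels, fit_stump, predict_stump,
                                    precision_score, n_repeats=3)
print(results)
```

Passing the fit, predict and score functions as arguments keeps the loop identical for every method, which corresponds to step 25 of the algorithm.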
