Nonlinear Random Forest Classification, a Copula-Based Approach

Abstract: In this work, we use a copula-based approach to select the most important features for random forest classification. The feature selection is carried out via the copulas associated with these features. We then feed the selected features into a random forest algorithm to classify a label-valued outcome. Our algorithm selects the most relevant features even when the features are not necessarily connected by a linear function; moreover, the classification can be stopped once a desired level of accuracy is reached. We apply this method to a simulation study as well as to a real COVID-19 dataset and a diabetes dataset.


Introduction
Dimension reduction is a major area of interest within the field of data mining and knowledge discovery, especially in high-dimensional analysis. Machine learning has recently received considerable attention, and a number of researchers have sought to perform more accurate dimension reduction in this setting [1,2]. While dimension reduction reduces the dimension of the data by selecting some functions of the original variables, feature selection is a special case that selects the most important features among all of them. Many areas of statistics and machine learning benefit from feature selection techniques. From the statistical point of view, Han and Liu (2013) [3] and Basabi (2008) [4] have applied feature selection to multivariate time series. Debashis et al. (2008) [5] have investigated feature selection and regression in high-dimensional problems.
Selecting the most important and relevant features is the main aim of decision tree and random forest algorithms. Although many classification approaches have been proposed in the literature, they rarely deal with possible nonlinear relations between attributes. On the other hand, mutual information-based filter methods have gained popularity due to their ability to capture nonlinear associations between the dependent and independent variables in a machine learning setting. Mutual information based on a copula function is a good choice for carrying out a feature selection whose results are stable against noise and outliers [6,7]. Hence, one of the major aims of this work is to use copula-based feature selection in a classification context, especially in random forest classification.
Random forests are a commonly used machine learning algorithm that combines decision trees trained independently on random subsets of the data and uses averaging to improve predictive accuracy and control over-/under-fitting [8-11]. In this work, in order to extract the most important features for the random forest, we use the copulas associated with the features. In this regard, both the copula connecting the explanatory variables and the copula connecting the explanatory attributes with the class-label attribute are considered. The rest of the paper is organized as follows: we review preliminaries and introduce our method in the next section; we illustrate our algorithm on simulated data as well as two real datasets in Section 3; finally, Section 4 is devoted to some concluding remarks.

Preliminaries and Related Works
The application of feature selection in machine learning and data mining techniques has been extensively considered in the literature. Kabir et al. (2020) [12] used a neural network to carry out feature selection, while Zheng et al. (2020) [11] used a feature selection approach in a deep neural network. Li et al. (2017) [13] reviewed feature selection techniques in data mining; see also the book of Liu and Motoda (2012) [14] and, for more information, Hastie et al. (2009). Feature selection has also been used for outcome prediction in the medical sciences [5]. Zhang and Zhou (2010) [20] investigated multi-label dimensionality reduction by maximizing the dependence between the original feature description and the associated class labels; see also Zhong et al. (2018) [21]. Shin and Park (2011) [22] analyzed a correlation-based dimension reduction.
In this work, we use the dependence structure between variables to find the best feature selection and to construct an agglomerative information gain for the random forest. We apply our algorithm to classify influenza and COVID-19 patients. Iwendi et al. (2020) [23] carried out COVID-19 patient health prediction using a random forest algorithm. Li et al. (2020) [13] applied machine learning methods to generate a computational classification model discriminating between COVID-19 patients and influenza patients based only on clinical variables. See also Wu et al. (2020) [24] and Ceylan (2020) [25] and the references therein for more information. Azar et al. (2014) [26] applied a random forest classifier to lymph diseases; see also Subasi et al. (2017) [27] for chronic kidney disease diagnosis using random forests, Açıcı et al. (2017) [28] for a random forest method to detect Parkinson's disease, and Jabbar et al. (2016) [29] for the prediction of heart disease using random forests. Additionally, the review of Remeseiro et al. (2019) [30] may be helpful regarding this subject. Sun et al. (2020) [31] implemented a mutual information-based feature selection.

Assume that $F_{X_1,X_2,\dots,X_d}$ is the joint multivariate distribution function of the random vector $\mathbf{X} = (X_1, X_2, \dots, X_d)$ and that $F_{X_i}$, $i = 1, 2, \dots, d$, are the corresponding marginal distribution functions. A grounded, $d$-increasing function $C : [0,1]^d \to [0,1]$ with uniform marginals is called a copula of $\mathbf{X}$ whenever it couples the multivariate distribution function $F_{X_1,X_2,\dots,X_d}$ to its marginals $F_{X_i}$, $i = 1, 2, \dots, d$, i.e.,
$$F_{X_1,X_2,\dots,X_d}(x_1, x_2, \dots, x_d) = C\big(F_{X_1}(x_1), F_{X_2}(x_2), \dots, F_{X_d}(x_d)\big).$$
Note that if $\mathbf{X}$ is a continuous random vector, then the copula $C_{\mathbf{X}}$ is unique. For more details concerning copulas, their families and association measures, we recommend Nelsen (2006) [32] and Durante and Sempi (2016) [33]. The merits of copulas and dependence measures in dimension reduction have been discussed in the literature; see [40] and Klüppelberg and Kuhn (2009) [41] for copula functions used in dimension reduction.
A well-known measure of uncertainty in a probability distribution is its average Hartley information, called the (Shannon) entropy. For a discrete random variable $X$ with values $x_1, x_2, \dots, x_n$ and mass function $p(\cdot)$, its entropy is defined as
$$H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i),$$
and for a continuous random variable $X$, its (differential) entropy is given by
$$H(X) = -\int_{\mathcal{X}} p(x) \log p(x) \, dx,$$
where $\mathcal{X}$ is the support of $X$. Similarly, for a (continuous) multivariate random vector $\mathbf{X}$ of dimension $k$ with multivariate density $p(\mathbf{x})$, the entropy is defined as
$$H(\mathbf{X}) = -\int_{\mathcal{X}} p(\mathbf{x}) \log p(\mathbf{x}) \, d\mathbf{x},$$
where the integral is a $k$-fold integral over $\mathcal{X}$. For two random variables $X$ and $Y$ with joint distribution $p(x, y)$, the conventional information gain (IG), or mutual information (MI), is defined as
$$I(X; Y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)},$$
which measures the amount of information shared by $X$ and $Y$, with the convention $\frac{0}{0} = 1$. Moreover, one may generalize this concept to a continuous random vector $\mathbf{X} = (X_1, X_2, \dots, X_k)$ as
$$I(\mathbf{X}) = \int_{\mathcal{X}} p(\mathbf{x}) \log \frac{p(\mathbf{x})}{\prod_{i=1}^{k} p(x_i)} \, d\mathbf{x}.$$
Ma and Sun (2011) [42] defined the concept of "copula entropy". Based on their definition, for a multivariate random vector $\mathbf{X}$ associated with copula density $c(\mathbf{u})$, its copula entropy is
$$H_c(\mathbf{X}) = -\int_{[0,1]^k} c(\mathbf{u}) \log c(\mathbf{u}) \, d\mathbf{u}.$$
Additionally, they pointed out that the mutual information is a (negative) copula entropy. Indeed, we have the following lemma.

Lemma 1 (Ref. [32]). For a multivariate random vector $\mathbf{X}$ with multivariate density $p(\mathbf{x})$ and copula density $c_{\mathbf{X}}(\mathbf{u})$,
$$I(\mathbf{X}) = -H_c(\mathbf{X}).$$

Finally, the conditional mutual information expresses the mutual information of two random vectors conditioned on a third. For a $k$-dimensional random vector $\mathbf{X}$, an $m$-dimensional random vector $\mathbf{Y}$ and an $n$-dimensional random vector $\mathbf{Z}$, the mutual information of $\mathbf{X}$ and $\mathbf{Y}$ given $\mathbf{Z}$, referred to as the "conditional information gain" or "conditional mutual information", is obtained as
$$I(\mathbf{X}; \mathbf{Y} \mid \mathbf{Z}) = \int p(\mathbf{x}, \mathbf{y}, \mathbf{z}) \log \frac{p(\mathbf{x}, \mathbf{y} \mid \mathbf{z})}{p(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{y} \mid \mathbf{z})} \, d\mathbf{x}\, d\mathbf{y}\, d\mathbf{z}. \quad (7)$$
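To make these definitions concrete, the standard identity $I(X;Y) = H(X) + H(Y) - H(X,Y)$ can be checked numerically on discrete samples. The following sketch (plain Python; the helper names are ours, not from the paper) estimates entropies from empirical frequencies:

```python
import math
from collections import Counter

def entropy(samples):
    """Empirical Shannon entropy (in nats) of a sequence of outcomes."""
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in Counter(samples).values())

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

xs = [0, 1] * 50          # a fair binary variable
ys = [0, 0, 1, 1] * 25    # independent of xs by construction

print(mutual_information(xs, xs))  # perfectly dependent: I = H(X) = log 2 ≈ 0.693
print(mutual_information(xs, ys))  # independent: I ≈ 0
```

For continuous data, this plug-in estimate would be replaced by a copula-density-based estimator, in line with Lemma 1.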

Copula-Based Random Forest
The connection between mutual information and the copula function has been investigated in the literature. We can also represent the conditional mutual information via the copula function through the following proposition.

Proposition 1. If the random vector $(\mathbf{X}, \mathbf{Y}, \mathbf{Z})$ is associated with the copula $C_{\mathbf{X},\mathbf{Y},\mathbf{Z}}(\mathbf{u}, \mathbf{v}, \mathbf{w})$, then Equation (7) becomes
$$I(\mathbf{X}; \mathbf{Y} \mid \mathbf{Z}) = H_c(\mathbf{X}, \mathbf{Z}) + H_c(\mathbf{Y}, \mathbf{Z}) - H_c(\mathbf{X}, \mathbf{Y}, \mathbf{Z}) - H_c(\mathbf{Z}).$$

Proof of Proposition 1. By an appropriate equivalent modification of the argument of the log function in the integrand of (7), namely
$$\frac{p(\mathbf{x}, \mathbf{y} \mid \mathbf{z})}{p(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{y} \mid \mathbf{z})} = \frac{p(\mathbf{x}, \mathbf{y}, \mathbf{z})\, p(\mathbf{z})}{p(\mathbf{x}, \mathbf{z})\, p(\mathbf{y}, \mathbf{z})},$$
we readily obtain
$$I(\mathbf{X}; \mathbf{Y} \mid \mathbf{Z}) = I(\mathbf{X}, \mathbf{Y}, \mathbf{Z}) - I(\mathbf{X}, \mathbf{Z}) - I(\mathbf{Y}, \mathbf{Z}) + I(\mathbf{Z}) = H_c(\mathbf{X}, \mathbf{Z}) + H_c(\mathbf{Y}, \mathbf{Z}) - H_c(\mathbf{X}, \mathbf{Y}, \mathbf{Z}) - H_c(\mathbf{Z}).$$
The last equality comes from one of the results of Ma and Sun (2011), which proves that for $\mathbf{X} = (X_1, X_2, \dots, X_n)$, $I(\mathbf{X}) = -H_c(\mathbf{X})$.

In order to use the mutual information in decision trees, assume we have a dataset $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$, where $x_i \in \mathcal{X}$ is the $i$-th input or observation and $y_i \in \mathcal{Y}$ is the corresponding outcome variable. In a machine learning approach, the major goal is to construct (or find) a classification map $f : \mathcal{X} \to \mathcal{Y}$ that takes the features $x \in \mathcal{X}$ of a data point as its input and outputs a predicted label. An important special case is a binary class label $y_i \in \{-1, 1\}$, with two possible values such as negative/positive, pathogenic/benign or patient/normal. The general objective function that must be maximized is
$$J_{\mathrm{CMI}}(X_k) = I(X_k; Y) - \beta \sum_{X_i \in S} I(X_i; X_k) + \gamma \sum_{X_i \in S} I(X_i; X_k \mid Y), \quad (9)$$
where $S$ is the set of already-selected features, $I(X_k; Y)$ measures the relation between the candidate feature $X_k$ and the target variable $Y$, $I(X_i; X_k)$ quantifies the redundancy between $X_i$ and $X_k$, and $I(X_i; X_k \mid Y)$ measures the complementarity between $X_i$ and $X_k$ given the class. Similar to Proposition 1, one may state Equation (9) in terms of copulas as
$$J_{\mathrm{CMI}}(X_k) = -H_c(X_k, Y) + \beta \sum_{X_i \in S} H_c(X_i, X_k) + \gamma \sum_{X_i \in S} \big[ H_c(X_i, Y) + H_c(X_k, Y) - H_c(X_i, X_k, Y) \big],$$
where the first term on the right-hand side refers to the relevancy of the new feature $X_k$. Peng et al. (2019) [17] introduced the "Minimum Redundancy Maximum Relevance (mRMR)" criterion, which sets $\beta$ to the reciprocal of the number of selected features and $\gamma = 0$. We generalize their results by simplifying $J_{\mathrm{CMI}}(X_k)$.
In particular, we have the following criterion, which we have to maximize:
$$J_{\mathrm{mRMR}}(X_k) = I(X_k; Y) - \frac{1}{|S|} \sum_{X_i \in S} I(X_i; X_k),$$
where $S$ is the set of already-selected features. From this formula, it can be seen that mRMR tries to select features that are highly related to the target variable while being mutually far from each other; in this case, the copula function plays an important role in connecting the input and class-label variables.
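As an illustration, the mRMR rule can be implemented as a greedy loop. The sketch below uses a plug-in empirical mutual information on discrete data rather than the paper's copula-based estimator, and the function names are ours:

```python
import numpy as np
from collections import Counter

def entropy(*cols):
    """Empirical joint Shannon entropy (in nats) of one or more columns."""
    n = len(cols[0])
    probs = [c / n for c in Counter(zip(*cols)).values()]
    return -sum(p * np.log(p) for p in probs)

def mi(a, b):
    """Plug-in mutual information I(a; b) = H(a) + H(b) - H(a, b)."""
    return entropy(a) + entropy(b) - entropy(a, b)

def mrmr(X, y, k):
    """Greedily pick k columns of X maximizing relevance minus mean redundancy."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in remaining:
            relevance = mi(X[:, j], y)
            redundancy = (np.mean([mi(X[:, i], X[:, j]) for i in selected])
                          if selected else 0.0)
            if relevance - redundancy > best_score:
                best, best_score = j, relevance - redundancy
        selected.append(best)
        remaining.remove(best)
    return selected

# Demo: column 0 is informative, column 1 duplicates it, column 2 is independent of y
y = np.tile([0, 1], 50)
x0 = np.tile([0, 1, 0, 0], 25)            # agrees with y 75% of the time
X = np.column_stack([x0, x0, np.tile([0, 0, 1, 1], 25)])
print(mrmr(X, y, 2))  # → [0, 2]: the redundant duplicate of column 0 is skipped
```

The redundancy penalty is what distinguishes mRMR from a pure relevance ranking, which would pick both copies of the informative feature.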
Since decision trees are prone to overfitting and do not find a globally optimal solution, random forests, their generalization, are suggested to overcome these disadvantages. Our algorithm considers the dependence between attributes to provide the best feature selection set and embeds these selected features into a random forest procedure. In this approach, we first use the dependence between attributes to choose the features that are maximally dependent on and relevant to the class label and to eliminate the maximally redundant features. From the point of view of these three criteria, our approach is equivalent to the method presented by Peng et al. (2019) [17].
The confusion matrix is a metric often used to measure the performance of a classification algorithm. It is also called a contingency table, and in a binary classification it is a 2 × 2 table, as shown in Figure 1.
Based on the confusion matrix, the evaluation measures are
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad \mathrm{Specificity} = \frac{TN}{TN + FP}, \qquad \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$
We use our copula-based random forest to find the most relevant features and carry out a classification task. For this, using the copula function that connects the input variables with each other as well as with the class variable, we find the most important variables by maximizing the three criteria above and then, based on their priorities, embed them into a random forest approach to classify the class-label feature. We continue selecting the most important features until we reach the desired level of the evaluation criterion, typically a prescribed value of accuracy, sensitivity or specificity. Inspired by Snehalika et al. (2020) [34], Algorithm 1 presents pseudo code of this method. Without loss of generality, we take the criterion to be accuracy; the algorithm for sensitivity and specificity is the same.
6: Perform a random forest classification;
7: Compute the accuracy Acc_new of the random forest classification using (13);
8: accuracy ← Acc_new + accuracy;
9: end
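The outer loop of Algorithm 1 (add the next most relevant feature, re-fit the forest, and stop when the target accuracy is reached) can be sketched as follows. Here `evaluate` is a hypothetical callback standing in for training and cross-validating a random forest on the chosen subset:

```python
def select_until_accuracy(ranked_features, evaluate, target_accuracy):
    """Grow the feature set in copula-ranked order until the classifier
    reaches the desired accuracy (a sketch of Algorithm 1's stopping rule)."""
    chosen, accuracy = [], 0.0
    for feature in ranked_features:
        chosen.append(feature)
        accuracy = evaluate(chosen)       # e.g. cross-validated RF accuracy
        if accuracy >= target_accuracy:
            break                         # desired level reached: stop early
    return chosen, accuracy

# Toy stand-in for the evaluator: accuracy grows with each added feature
toy_eval = lambda subset: 0.5 + 0.1 * len(subset)
chosen, acc = select_until_accuracy(["age", "fatigue", "nausea", "fever", "cough"],
                                    toy_eval, 0.85)
print(chosen, acc)
```

In the paper's setting, `ranked_features` would come from the copula-based mRMR ordering and `evaluate` from a random forest run on the training folds.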

Numerical Results
A simulated dataset as well as two real data analyses are presented to illustrate our method.

Simulation Study
In order to carry out a simulation study, we generated data from normal distributions with copula dependence. The copulas considered were the Gaussian, t and Gumbel copulas; see, e.g., Nelsen (2006) [32]. Using the copula library, we first generated n = 10,000 random samples of the 10 variates $x_1, \dots, x_{10}$ from a 10-variate Gaussian copula in which all off-diagonal elements of the correlation matrix equal $\rho = 0.85$ and the marginals follow the standard normal distribution. In a similar fashion, we generated another 10 variates $x_{11}, \dots, x_{20}$, independent of the first 10 variables. Then, to simulate from the t-copula, we generated the 10 variates $x_{21}, \dots, x_{30}$ from a t-copula with all correlation values equal to $\rho = 0.85$ and df = 19 [43], again with standard normal marginals. Finally, a bivariate Gumbel copula with $\theta = 5$ [44] and normal marginals was generated and inserted as $x_{31}, x_{32}$. A schematic heatmap of these 32 features is shown in Figure 2. We then formed the outcome variable as a linear combination of these features. In order to obtain a class-valued variable, we recoded the negative values of the outcome variable to "0" and the other values to "1".
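The Gaussian-copula part of this simulation is straightforward to reproduce: since the marginals are standard normal, sampling from a Gaussian copula with equicorrelation $\rho = 0.85$ reduces to sampling a multivariate normal. The NumPy sketch below covers only the Gaussian and independent blocks; the t- and Gumbel-copula blocks would need a dedicated copula library and are omitted here:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, rho = 10_000, 10, 0.85

# Equicorrelation matrix: ones on the diagonal, rho elsewhere
corr = np.full((d, d), rho)
np.fill_diagonal(corr, 1.0)

# Gaussian copula with standard normal marginals == multivariate normal
x_dep = rng.multivariate_normal(np.zeros(d), corr, size=n)
# Ten further features, independent of the first block and of each other
x_ind = rng.standard_normal((n, d))

X = np.hstack([x_dep, x_ind])
# Outcome: threshold a linear combination of the features at zero
y = (X.sum(axis=1) > 0).astype(int)
```

The empirical pairwise correlations of the first block should be close to 0.85, while those of the second block should be close to zero, matching the heatmap in Figure 2.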
Using Algorithm 1, we started with n = 2 features. The most important features for classifying y were $x_{26}$ and $x_{32}$, with sensitivity = 0.869, specificity = 0.867 and accuracy = 0.875. Continuing the selection of the most relevant features led us to $x_{26}$, $x_{32}$ and $x_{31}$ as the first three relevant features. In order to obtain unbiased results, we performed a 10-fold cross validation; in each fold, we left out 1000 cases as the test group and used the remainder as the training set. Averages of sensitivity, specificity and accuracy were calculated to assess the algorithm. Table 1 shows the most relevant and least redundant features together with the evaluation measures sensitivity, specificity and accuracy. This table helps us assess our algorithm by monitoring its running time as well as by comparison with other algorithms. Since, after selecting the features, we use the traditional random forest approach, it is reasonable to compare our results with those of the traditional random forest; comparing the last two rows of Table 1, we deduce that the results of our algorithm and the traditional random forest algorithm are the same. Additionally, as seen from the last column of the table, for a small number of attributes the running time (in seconds) is negligible, while it increases significantly as the number of attributes grows. Regarding the pros and cons of the proposed approach, the table reveals a design-of-experiment trade-off that physicians may encounter: they can regulate the number of desired attributes to carry out a reasonable random forest classification based on the required percentages of accuracy, specificity and sensitivity. Evidently, as seen from the last column of Table 1, after selecting attributes using the copula, such a classification algorithm runs fast for a small number of attributes; one may view this as an operations research problem.
Specifically, the sample size, the number of attributes and the complexity of the relationships between attributes play important roles in such a classification procedure. From a practical point of view, these results enable researchers to specify the number of attributes based on the desired levels of sensitivity, specificity and accuracy; if the relationships between attributes are not complicated, one can choose a greater number of attributes and achieve higher accuracy, and vice versa.

COVID-19 Dataset
Li et al. (2020) [13], in a meta-analysis, merged 151 COVID-19 datasets including patient symptoms and routine test results. Nineteen clinical variables were included as explanatory inputs: age, sex, serum levels of neutrophils (continuous and ordinal), serum levels of leukocytes (continuous and ordinal), serum levels of lymphocytes (continuous and ordinal), results of CT scans, results of chest X-rays, reported symptoms (diarrhea, fever, coughing, sore throat, nausea and fatigue), body temperature, and underlying risk factors (renal diseases and diabetes) [13]. Applying machine learning methods, they reanalyzed these data, investigated the correlations between the explanatory variables, and generated a computational classification model for discriminating between COVID-19 patients and influenza patients based on clinical variables alone.
In a COVID-19 patient case, an agglomerative testing approach may help in the diagnosis of the illness. We used our copula-based feature selection to identify the most effective attributes for discriminating between COVID-19 patients and influenza patients. We started with two attributes. The most relevant attributes were "age" and "fatigue". We then applied these two attributes to separate the COVID-19 and influenza patients and obtained sensitivity, specificity and accuracy of 0.755, 0.864 and 0.836, respectively. Seeking the three most important classification attributes led us to "age", "fatigue" and "nausea/vomiting", with sensitivity 0.840, specificity 0.886 and accuracy 0.873. Table 2 summarizes the 10 most important features with their classification evaluation scores. As understood from this table, there is a design-of-experiment trade-off that a physician may encounter: the required percentage of information determines the number and types of tests given to patients. For example, if only 85% classification accuracy is required, then it is enough to know the "age", "fatigue" and "nausea/vomiting" of patients, while for 91.4% accuracy, we need to test the 15 most important attributes.

Diabetes 130-US Hospitals Dataset
In this subsection, we assess our approach in a big data analysis by applying our algorithm to classify the Diabetes 130-US hospitals dataset [45]. This dataset represents 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. It comprises 101,721 observations of 50 features representing patient and hospital outcomes. The data contain attributes such as "race", "gender", "age", "admission type" and "time in hospital", along with another 45 attributes. Detailed descriptions of all the attributes are provided in Strack et al. (2014).
We used the "diabetesMed" variable (0 and 1) as our response/target class variable and applied the other attributes to classify patients into two groups: no medical prescription needed and medical prescription needed. Similar to the previous subsection, the results are summarized in Table 3.

Conclusions
A copula-based algorithm has been employed in random forest classification. In this regard, the most important features were extracted based on their associated copulas. The simulation study as well as the real data analyses have shown that the proposed copula-based algorithm may be helpful when the explanatory variables are connected nonlinearly and when we wish to extract the most important features instead of using all of them.
The idea of this paper may be extended in several ways. One may use it in multi-class random forest classification. Additionally, a random forest regression that considers the connecting copula of the features would be useful. Moreover, the associated copula of features in other classification tasks, such as support vector machines, discriminant analysis and naive Bayes classification, will be of interest. Many extensions of the random forest have been investigated by several authors, for example, boosted random forests, deep dynamic random forests and ensemble random forest methods; each of them may be combined with our approach to obtain better results. We also intend to extend these results to longitudinal datasets in which the outcome variables are connected by copulas.