A Biomedical Case Study Showing That Tuning Random Forests Can Fundamentally Change the Interpretation of Supervised Data Structure Exploration Aimed at Knowledge Discovery

Abstract: Knowledge discovery in biomedical data using supervised methods assumes that the data contain structure relevant to the class structure if a classifier can be trained to assign a case to the correct class better than by guessing. In this setting, acceptance or rejection of a scientific hypothesis may depend critically on the ability to classify cases better than randomly, without high classification performance being the primary goal. Random forests are often chosen for knowledge-discovery tasks because they are considered a powerful classifier that does not require sophisticated data transformation or hyperparameter tuning and can be regarded as a reference classifier for tabular numerical data. Here, we report a case where the failure of random forests with the default hyperparameter settings of the standard implementations in R and Python would have led to the rejection of the hypothesis that the data contained structure relevant to the class structure. After tuning the hyperparameters, classification performance increased from 56% to 65% balanced accuracy in R, and from 55% to 67% balanced accuracy in Python. More importantly, the 95% confidence intervals in the tuned versions were to the right of the value of 50% that characterizes guessing-level classification. Thus, tuning provided the desired evidence that the data structure supported the class structure of the data set. In this case, tuning made more than a quantitative difference in the form of slightly better classification accuracy; it significantly changed the interpretation of the data set. This is especially true when classification performance is low and a small improvement raises the balanced accuracy above the 50% level of guessing.


Introduction
In biomedical research, knowledge is increasingly being discovered using machine-learning techniques. Unsupervised and supervised methods are used to analyze whether a data set contains structure relevant to the research subject. Among the supervised methods, classification algorithms are of particular interest and have been described in detail elsewhere for this purpose [1]. The approach assumes that if a classifier can be trained to assign a case to the correct class better than by guessing, then the data contain structure that is relevant to the class structure, and the variables the classifier needs to perform this task contain relevant information about the class structure being addressed. In this setting, high classification performance is not necessarily the primary goal; rather, the focus is on supporting the hypothesis that the collected variables "X" contain relevant information to support the given class structure "y". This hypothesis must be rejected if a reasonably chosen classifier, such as random forests for tabular numerical data, cannot assign cases to classes better than by guessing.
Random forests [2,3] is a tree-based bagging classification algorithm that creates a set of distinct, uncorrelated, and often very simple decision trees [3]. The feature splits are chosen at random, and the classifier takes the majority vote on class membership across hundreds of simple decision trees. It is considered a powerful classifier on tabular numeric data compared with deep learning neural networks [4,5] and outperforms logistic regression [6]. Increasing the number of trees in the forest does not lead to overfitting [7]. Moreover, random forests can be used without prior complex parameter settings [8]; removing uninformative features has been suggested as sufficient tuning [9]. These are key advantages over alternative classifiers, such as k-nearest neighbors [10], which requires a valid distance measure that is often difficult to define [11]; support vector machines [12], for which regularization is a critical hyperparameter [13]; or deep learning layered artificial neural networks, which, while considered universal classifiers [14], require the number of layers and the number of neurons in each layer to be set.
These advantages make random forests a reference classifier for tabular numerical data. Initial investigation of whether the data contain a structure that reflects the class structure and can be used for classification is therefore often performed with random forests. The present case study reports an occasional failure of random forests to classify a simple data set using the default hyperparameter settings of random forests implementations in the R [15] and Python [16] programming languages. The context of this observation was data exploration using supervised methods. As machine learning becomes more prevalent in biomedical research and "out-of-the-box" solutions become widely available, it is important to be aware of their pitfalls. The fact that an important decision about whether to accept or reject a research hypothesis may depend on the voting of random forests is the motivation for describing the following case.

Data Set
A biomedical data set was available from an assessment of pain sensitivity to various experimental noxious stimuli, collected in a quantitative sensory testing study in n = 125 healthy Caucasian volunteers (69 men, 56 women, aged 18 to 46 years, mean 25 ± 4.4 years) [17]. From that study, it was known that blunt pressure pain had a comparatively large effect size with respect to sex differences in pain perception (Figure 1), with an estimate of Cohen's d [18] = 0.83. Reassessing the sex difference verified statistical significance (t-test [19]: t = 4.7025, degrees of freedom = 122.93, p = 6.783 × 10−6). In the present reanalysis of this data set, the focus was on assigning the subjects' sex from the acquired pain information, reversing the widely accepted fact that pain perception is gender specific [20]. Figure 1 was created using R [21] and the library "ggplot2" (https://cran.r-project.org/package=ggplot2 (accessed on 9 October 2022) [22]).
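The reported statistics can be reproduced in outline as follows. This is an illustrative sketch only: the group means and spreads below are assumptions, not the study values, and the non-integer degrees of freedom (122.93) suggest a Welch-type t-test was used.

```python
# Illustrative sketch (synthetic data, assumed group parameters): Cohen's d with a
# pooled standard deviation, and Welch's t statistic, whose non-integer degrees of
# freedom match the kind of df reported for the sex difference.
import numpy as np

rng = np.random.default_rng(3)
men = rng.normal(520, 120, 69)    # assumed mean/SD; n = 69 men as in the study
women = rng.normal(430, 120, 56)  # assumed mean/SD; n = 56 women

n1, n2 = len(men), len(women)
v1, v2 = men.var(ddof=1), women.var(ddof=1)

# Cohen's d with pooled standard deviation
pooled_sd = np.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
d = (men.mean() - women.mean()) / pooled_sd

# Welch's t statistic and Welch-Satterthwaite degrees of freedom
t = (men.mean() - women.mean()) / np.sqrt(v1 / n1 + v2 / n2)
df = (v1 / n1 + v2 / n2) ** 2 / (
    (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
```

With synthetic data the exact numbers differ from the published d = 0.83 and t = 4.7025; the sketch only shows how such values arise.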

Design of the Experiments
Classifiers were trained in a 1000-fold cross-validation scenario on training data subsets comprising 2/3 of the original data, drawn by random Monte-Carlo resampling [31] from the original data set. The trained classifiers were then applied to the remaining 1/3 of the cases that had not been used for training. The success of classifier training was quantified by calculating the balanced accuracy [32], which corresponds to the area under the receiver operating characteristic curve (roc-auc) [33]. These calculations were done using the R libraries "caret" and "pROC".
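This design can be sketched as follows (an illustrative Python version using scikit-learn and synthetic one-feature data; the study itself used R with "caret" and "pROC", and ran 1000 resampling iterations):

```python
# Sketch of the Monte-Carlo cross-validation design: repeated random 2/3 vs. 1/3
# splits, with balanced accuracy computed on the held-out third each time.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for the one-feature, two-class pain-threshold data
X = np.concatenate([rng.normal(400, 120, 60), rng.normal(520, 120, 65)]).reshape(-1, 1)
y = np.array([0] * 60 + [1] * 65)

scores = []
for seed in range(50):  # the study used 1000 resampling runs
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=2 / 3, random_state=seed, stratify=y)
    clf = RandomForestClassifier(random_state=seed).fit(X_tr, y_tr)
    scores.append(balanced_accuracy_score(y_te, clf.predict(X_te)))

# Nonparametric 95% interval of the balanced accuracy over the resampling runs
ci_low, ci_high = np.percentile(scores, [2.5, 97.5])
```

The decisive check in the paper's logic is whether this interval lies entirely above 0.5.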

Random Forests Classification with Default Hyperparameters
An initial attempt to assign subjects to the correct sex based on their pain threshold to blunt pressure stimuli using a random forests classifier was performed using the default parameters of the R library "randomForest", i.e., n = 500 decision trees, the number of features tried at each split set to the square root of the number of features (here sqrt(d) = 1, since only a single feature was available), and no restriction on the number of nodes or the depth of the trees (Listing 1).
Listing 1: R code for classification with random forests using the default settings of the library "randomForest" (https://cran.r-project.org/package=randomForest (accessed on 9 October 2022) [23]).

This led to 95% confidence intervals of both balanced accuracy and roc-auc that included the value of 0.5, i.e., the classification could not be considered better than mere guessing (Table 1). This was unexpected given the significant difference determined by the t-test and the clear separation of the distributions of pain thresholds between the two sexes (Figure 1), and it therefore prompted further investigation.

Table 1. Performance of non-tuned and tuned random forests and of two ad hoc classifiers consisting of a simple brute-force split rule or a Bayesian decision rule. Performance was evaluated as balanced accuracy (BA) and, for random forests, also as the area under the receiver operating characteristic curve (roc-auc). The latter was not calculated for the ad hoc classifiers because the assignment probability was either not implemented (split) or not queried (Bayes). Implementations of random forests in the R ("randomForest") and Python ("RandomForestClassifier") programming languages were both used.

Investigation of the Classification Failure
First, it was checked whether the R library "randomForest" accidentally failed on this data set. Therefore, the classification attempt was repeated using the Python implementation "RandomForestClassifier" from the "sklearn.ensemble" package (Textbox 2). Again, the balanced accuracy and the roc-auc were close to the value of 0.5 (50%), and their confidence intervals spanned the value of 0.5, indicating a failed classification (Table 1).
Second, a simple splitting rule was built by brute force by testing each possible split of the pressure pain threshold in the training data set for the best accuracy in discriminating the sexes. The classification performance of these rules was evaluated on the test data subset. In addition, a Bayesian boundary [34] in the pain threshold between the sexes was computed using our R library "AdaptGauss" (https://cran.r-project.org/package=AdaptGauss (accessed on 9 October 2022) [35]) and used as another simple classifier. These two ad hoc classifiers had no problem assigning the sexes based on the pressure pain threshold, with a balanced accuracy better than guessing, that is, with the lower limit of the 95% confidence interval above 0.5 (Table 1).
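The brute-force split rule can be sketched as follows (an illustrative Python version with synthetic data; the study's rule was implemented in R, and the threshold was always selected on the training subset only):

```python
# Sketch of the ad hoc brute-force split classifier: exhaustively test each
# observed value as a threshold and keep the split (threshold, orientation)
# with the best training accuracy.
import numpy as np

def best_split(x_train, y_train):
    """Return (threshold, orientation, accuracy) of the best 1-D split."""
    best_t, best_sign, best_acc = None, 1, 0.0
    for t in np.unique(x_train):
        for sign in (1, -1):  # class 1 lies above (sign=1) or below (sign=-1) t
            pred = (sign * x_train > sign * t).astype(int)
            acc = float((pred == y_train).mean())
            if acc > best_acc:
                best_t, best_sign, best_acc = t, sign, acc
    return best_t, best_sign, best_acc

# Synthetic stand-in data: one feature, two partially overlapping classes
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(400, 100, 60), rng.normal(520, 100, 65)])
y = np.array([0] * 60 + [1] * 65)
t, sign, acc = best_split(x, y)
```

Because it searches every observed threshold, this rule is guaranteed to find the best single-split separation the training data allow, which is why it served as a useful sanity check against the failing random forests.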

Random Forests Classification with Tuned Hyperparameters
Tuning the hyperparameters of the R package "randomForest" for the number of trees (100 to 1500 in increments of 100) and the number of nodes in each tree (1-8) (Figure 2) indicated that the classifier should be run with n = 1300 trees and a maximum of n = 3 nodes.
A rerun of the classifier training and testing with the tuned hyperparameters (code not shown) provided the desired result of balanced accuracy and roc-auc for sex assignment from the pain threshold variable better than guessing, i.e., with confidence intervals to the right of the 0.5 value (Table 1). For the Python implementation (Listing 2), tuning resulted in n = 700 trees with a maximum depth of only one split. Again, the desired classification better than chance was achieved, almost exactly as with the R implementation "randomForest" (Table 1).
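The Python tuning step can be sketched along these lines (an illustrative scikit-learn grid search over the number of trees and the maximum tree depth, scored by balanced accuracy; the grid and the data below are illustrative assumptions, not the study's exact setup):

```python
# Sketch of hyperparameter tuning for RandomForestClassifier: grid search over
# n_estimators and max_depth, with balanced accuracy as the selection criterion.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(2)
# Synthetic stand-in data: one feature, two classes
X = np.concatenate([rng.normal(400, 120, 60), rng.normal(520, 120, 65)]).reshape(-1, 1)
y = np.array([0] * 60 + [1] * 65)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=2 / 3, random_state=0, stratify=y)

# Illustrative grid; the study scanned 100-1500 trees and tree sizes of 1-8
grid = {"n_estimators": [100, 400, 700], "max_depth": [1, 2, 3]}
search = GridSearchCV(RandomForestClassifier(random_state=0), grid,
                      scoring="balanced_accuracy", cv=5)
search.fit(X_tr, y_tr)

# Evaluate the tuned forest on the held-out test third
tuned_ba = balanced_accuracy_score(y_te, search.best_estimator_.predict(X_te))
```

Note that restricting `max_depth` to 1 turns each tree into a decision stump, which matches the paper's finding that limiting tree growth, not the number of trees, was the decisive adjustment.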

Discussion
In this example case, random forests unexpectedly but consistently failed across programming environments on a seemingly simple classification problem when the default parameters of the software implementations were used. After tuning the hyperparameters, the expected classification success was achieved, i.e., classification performance exceeded guessing. Failure with the default settings can become particularly relevant when supervised learning is used to recognize data structures rather than to create a powerful classifier. A typical problem in biomedical data analysis involves a set of individuals with a particular class label, e.g., a diagnosis, and some measurements for each case. The first question is whether these measurements have a structure that is relevant to the class structure and can be used to classify the subjects. To answer this question, a powerful machine-learned classifier can be trained on a subset of the data. If this classifier is able to classify the cases not used in learning better than guessing, this indicates that there is structure in the data that supports the class structure. Classification accuracy that exceeds guessing is sufficient for this structural assessment. This was thwarted in the present example case, where the failure to classify would have led to a false rejection of the hypothesis that the data contained a structure relevant to the subjects' sex. Only after tuning could the hypothesis be accepted.
Thus, when random forests is used for supervised structure discovery in data sets and for feature selection [37], classification performance better than chance is the first requirement. Knowledge discovery by applying feature selection techniques assumes that if a classifier can be trained to assign a case to the correct class better than by guessing, the variables needed by the classifier to accomplish this task contain relevant information about the class structure being addressed. However, this depends on the success of the classification: it is of no interest in this context which features an unsuccessful classifier used in attempting its classification task. In this case study, tuning provided exactly the required classification success of the random forests classifier.
The most important parameter for tuning the random forests in the present data seemed to be the tree depth, which had to be limited to a few decisions (maxnodes = 3 in the R implementation and max_depth = 1 in the Python implementation). Both parameters limit tree growth. In contrast, the number of trees seemed less important (Figure 2). Changing the number of trees between 100 and 1500 indeed had little effect in additional cross-validated tuning attempts (not shown). This is consistent with previous results assessing the impact of the number of trees in the forest [7]. However, with only one variable, this observation cannot be generalized from the present experiments. Using random forests on only one variable seems unusual, but it is not discouraged, and it may well be that a single variable remains in a data set after feature selection.
In the present analyses, standard implementations of random forests in the R and Python programming languages were used, with consistent results. This is the common data science environment. Less frequently used modifications of random forests might be less affected, but they were not tested here. An example is the combined use of the Kolmogorov-Gabor polynomial [38] and the random forests algorithm to increase classification accuracy [39]. There, each variable vector is represented as polynomial members, and random forests are used to find the coefficients. However, the Python code mentioned by the authors is not freely available, and reimplementing the method would far exceed the purpose of this case report. Further modifications of random forests, such as several proposals summarized in [40], might also avoid the initial classification failure observed here. However, the present work has highlighted a pitfall in standard data analysis when default parameters of common implementations of random forests are used for data exploration. The use of sophisticated modifications likely implies that parameters are carefully chosen rather than left at default values, so the problem presented here may not arise in those particular applications of random forests. For default settings, however, the problem also occurs with other implementations. Using the R package "ranger" (https://CRAN.R-project.org/package=ranger (accessed on 9 October 2022) [41]) instead of the package "randomForest", the default parameters provided a balanced accuracy of only 0.56 (95% confidence interval: 0.43-0.69). This underlines the relevance of the observation reported in this case.

Concluding Remarks
The present case study showed that random forests occasionally need hyperparameter tuning. While tuning is implemented in data science workflows, it may be omitted to save time, encouraged by the reputation of random forests as a robust classifier that does not require complicated parameter settings, or because, in a supervised approach to data structure exploration, high classification performance is not the main goal and it suffices to show that the data contain information relevant to the class structure when classification is better than guessing. In this case study, tuning made more than a quantitative difference in terms of slightly better classification accuracy. Rather, it made a qualitative difference in the interpretation of the data set, namely the difference between judging the input data structure as supporting or not supporting the class structure of the data set. This is especially relevant for data sets where classification performance is low and a small improvement raises the balanced accuracy above the 50% guessing limit. In the present case, tuning the R implementation of random forests increased the balanced accuracy of assigning subjects to their correct sex from 56% to 65%, and in the Python implementation the change was from 55% to 67%; more importantly, the 95% confidence intervals in the tuned versions were to the right of the value of 50% that characterizes guessing-level classification. The tuning effect thus goes beyond a quantitative improvement in classification performance and fundamentally affects the interpretation of a data set based on supervised learning. Therefore, the present report addresses the use of supervised data analysis for knowledge discovery, highlighting that tuning of random forests, when used for supervised knowledge discovery in biomedical data, can decide between the acceptance and the rejection of a research hypothesis.

Institutional Review Board Statement:
The study from which the data set originates followed the Declaration of Helsinki and was approved by the Ethics Committee of the Medical Faculty of the Goethe-University, Frankfurt am Main, Germany (approval number 150/11).

Informed Consent Statement:
Permission for the anonymized reporting of data analysis results obtained from the acquired information was included in the written informed consent.