2.3. Inner Consistency Analysis
The task of assessing the inner consistency of the p-tools faces the problem of the heterogeneity of their outputs. The problem is not simply that the outputs involve different scales, but that the scales carry different semantic meanings. Usually, p-tools provide categories to classify the impact of mutations on the functionality of a gene. However, these categories are not equally distributed across their original numerical scales; thus, the conversions made by the tools are not linear. Furthermore, different numerical scales are used, from probabilities and free-energy values to ad hoc scores. Hence, simple normalization approaches do not make sense. We note, however, that once categories are defined for a p-tool, they naturally induce an internal ranking for numerical predictions. Given a pair of p-tools and a dataset of mutations, the agreement between their internal rankings can be used to assess their inner consistency.
Under this baseline, we first considered the Kendall rank correlation coefficient (τ) [23], which measures the ordinal association between two measured quantities. Briefly, given a pair of p-tools and a set of target mutations, high values of τ are expected whenever target mutations receive similar ranks in both tools. Formally, let M be a set of mutations with |M| = n, n being the number of mutation sites multiplied by the number of allowed mutations per site. Also, let S(m) denote the effect of mutation m predicted by a given p-tool S, expressed in the most informative scale provided by S. In addition, let ≺_S be the less-damaging-than relation induced by S on mutations m_i and m_j, so that m_i ≺_S m_j if S(m_i) < S(m_j), ∀ m_i, m_j ∈ M. Finally, to simplify the notation, for any p-tool S, three orderings are possible for any pair of mutations m_i and m_j, namely, m_i ≺_S m_j, m_j ≺_S m_i, and m_i =_S m_j, ∀ m_i, m_j ∈ M.
A concordant pair of predictions for p-tools S and P is counted whenever m_i ≺ m_j or m_j ≺ m_i occurs for both S and P, ∀ m_i, m_j ∈ M. Conversely, a discordant pair of predictions is counted for p-tools S and P whenever m_i ≺ m_j occurs for S and m_j ≺ m_i occurs for P, ∀ m_i, m_j ∈ M. Alternatively, if m_i = m_j occurs for either S or P, a neither concordant nor discordant pair of predictions is counted, ∀ m_i, m_j ∈ M. Based on these considerations, the Kendall τ coefficient can be defined as follows:

τ = (C − D) / (n(n − 1)/2),

where C and D denote the numbers of concordant and discordant pairs, respectively.
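The pair-counting computation of τ can be sketched as follows (function and variable names are ours, not the paper's; pairs tied in either tool count as neither concordant nor discordant):

```python
from itertools import combinations

def kendall_tau(s_preds, p_preds):
    """Kendall tau from concordant/discordant pair counts.

    Pairs tied in either tool are neither concordant nor discordant;
    the denominator is the total number of pairs, n(n-1)/2.
    """
    assert len(s_preds) == len(p_preds)
    n = len(s_preds)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        ds = s_preds[j] - s_preds[i]   # ordering under tool S
        dp = p_preds[j] - p_preds[i]   # ordering under tool P
        if ds * dp > 0:
            concordant += 1            # same strict ordering
        elif ds * dp < 0:
            discordant += 1            # opposite strict orderings
        # ties in either tool count as neither
    return (concordant - discordant) / (n * (n - 1) / 2)

# identical rankings give tau = 1; one swapped pair lowers it
print(kendall_tau([0.1, 0.5, 0.9], [0.2, 0.6, 0.8]))  # 1.0
print(kendall_tau([0.1, 0.5, 0.9], [0.2, 0.8, 0.6]))  # 1/3
```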
P-tools with native numerical outputs often provide convenient categorical outputs through the adoption of sharp thresholds. This common practice may induce false concordant/discordant pairs in the Kendall τ computation, misleading the comparison of p-tools. For example, let us consider [0, 0.4] being the support of the category label “Benign” for predictions in the [0, 1] range. Intuitively, prediction values of 0.39 and 0.41 are so close that we should not use them to differentiate categories of mutation effects. Hence, although the Kendall τ coefficient can be used with the numerical outputs of p-tools, its value for measuring their inner consistency raises some concerns.
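The borderline-prediction issue can be made concrete with a toy sharp-threshold categorizer (the 0.4 cut mirrors the example above; names are ours):

```python
def categorize(score, benign_upper=0.4):
    """Sharp-threshold categorization on a [0, 1] scale (toy example)."""
    return "Benign" if score <= benign_upper else "Damaging"

# nearly identical predictions fall on opposite sides of the cut
print(categorize(0.39), categorize(0.41))  # Benign Damaging
```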
Furthermore, the numerical outputs of p-tools may differ due to computational precision issues, additionally inducing false concordant/discordant pairs in the Kendall τ computation and further misleading the quantification of the inner consistency of p-tools. In brief, the Kendall τ coefficient appears too “sensitive” to assess the inner consistency of p-tools with numerical outputs. To overcome this problem, let us first define a convenient function δ_S characterizing the specific ordering assigned to mutations m_i and m_j, ∀ m_i, m_j ∈ M, by any p-tool S: δ_S(m_i, m_j) = 1 if m_i ≺_S m_j, δ_S(m_i, m_j) = −1 if m_j ≺_S m_i, and δ_S(m_i, m_j) = 0 if m_i =_S m_j.
We now introduce a novel index, able to properly account for all the different prediction pairs issued by p-tools S and P.
For p-tools involving native categorical outputs, category labels are ordered based on their impact on gene functionality, e.g., for the category labels “Benign” and “Damaging”, the preference relation “Benign” ≺ “Damaging” is assumed. On the other hand, for p-tools involving numerical outputs, equality thresholds are required to avoid the false counting of either concordant or discordant pairs. Let S be a p-tool with a numerical output and an equality threshold ε_S > 0. Hence, the preference of S on mutations m_i and m_j, ∀ m_i, m_j ∈ M, is defined as follows: m_i ≺_S m_j if S(m_j) − S(m_i) > ε_S; m_j ≺_S m_i if S(m_i) − S(m_j) > ε_S; and m_i =_S m_j if |S(m_i) − S(m_j)| ≤ ε_S.
Hence, predictions differing by no more than the equality threshold are treated as equal. Since p-tools generally involve different prediction ranges, their thresholds must be set accordingly. In the absence of prior information, setting these thresholds to some predefined percentage of their prediction ranges appears to be a fair approach. The problem becomes how to set that percentage. At first glance, the thresholds must be large enough to prevent small prediction differences and numerical errors from inducing discordant counts, but also small enough to avoid the false counting of either concordant or discordant pairs.
To shed light on the percentage equality threshold trade-off problem, let us consider the mutations m_i and m_j, m_i, m_j ∈ M, and the predictions issued by the tools S and P. Let us consider first the case where m_i ≺ m_j holds for both tools. Also, let us define the prediction differences d_S = S(m_j) − S(m_i) and d_P = P(m_j) − P(m_i), with d_S < d_P. If ε < d_S, then m_i ≺_S m_j and m_i ≺_P m_j, so that an agreement is counted for the pair. However, if d_S ≤ ε < d_P, then m_i =_S m_j and m_i ≺_P m_j, so that a disagreement is counted for the pair. Finally, if ε ≥ d_P, then m_i =_S m_j and m_i =_P m_j, so that an agreement is counted for the pair again.
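This threshold trade-off can be illustrated with a small sketch (the ordering function follows the equality-threshold idea described here; names and example values are ours):

```python
def ordering(preds, m_i, m_j, eps):
    """Ordering of mutations m_i, m_j under one tool's predictions:
    1 if m_i is less damaging than m_j, -1 for the reverse, 0 if the
    prediction difference is within the equality threshold eps."""
    diff = preds[m_j] - preds[m_i]
    if diff > eps:
        return 1
    if diff < -eps:
        return -1
    return 0

# hypothetical predictions on a common [0, 1] scale
S = {"mA": 0.40, "mB": 0.45}   # small difference d_S = 0.05
P = {"mA": 0.40, "mB": 0.60}   # larger difference d_P = 0.20

# eps below d_S: both tools order mA before mB     -> agreement
# eps between d_S and d_P: S ties, P still orders  -> disagreement
# eps at or above d_P: both tools tie              -> agreement again
for eps in (0.01, 0.10, 0.30):
    o_s, o_p = ordering(S, "mA", "mB", eps), ordering(P, "mA", "mB", eps)
    print(eps, o_s, o_p, "agree" if o_s == o_p else "disagree")
```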
Similar counting arguments can be used to analyze the remaining ordering cases. In all cases, as the percentage equality threshold is increased from 0%, the index first decreases and then increases monotonically until the percentage equality threshold reaches 100%. All mutations then become indistinguishable and the index reaches its maximum value (1). To summarize, the index does not show a monotonic behavior with respect to the percentage equality threshold. Supplementary studies were performed to assess the critical percentage equality threshold at which the index reaches its minimum.
Two independent datasets of mutations, namely, the DM-V dataset comprising reported mutations of the Drosophila melanogaster vermilion (V) gene and the CHKV-E2 dataset comprising reported mutations of the Chikungunya virus E2 gene, were used to evaluate the index with respect to increasing values of the percentage equality threshold. All p-tools were analyzed except Panther, as it only provides a categorical output. As a result (see Figure 1), the percentage equality threshold was set to 5%, an intermediate value between 0% (no threshold) and the value (∼10%) at which the index falls to its minimum.
Users of p-tools might additionally be interested in the identification of pairs of p-tools showing not only a considerable proportion of disagreements but also a particular form of them, namely, disagreements involving opposite predictions, i.e., m_i ≺_S m_j and m_j ≺_P m_i. In this case, a second, stricter index can be used, accounting only for such opposite orderings.
While the first index measures the proportion of prediction pairs for which conflicting orderings are observed, the second index focuses only on extreme conflicting orderings. In practice, users might use the first index for the identification of similar p-tools, looking for values close to one. Conversely, users might use the second index for the identification of different p-tools, looking for values close to zero. Beyond these considerations, the ranges and directions of both indices are similar, so that values close to 1 indicate that a pair of p-tools is likely to order all pairs of mutations in a similar way, while values close to 0 indicate they are likely to order them differently. Counting arguments similar to those used with the first index can be used to assess the effect of percentage equality thresholds on the second index. Differently from the first index, a monotonic decreasing behavior is observed for the second index for increasing values of the percentage equality threshold. However, since we expect that the second index only dissects the inner consistency information already provided by its more general counterpart, practical evaluations were performed with the percentage equality threshold derived from the independent studies above (5%).
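The two indices can be sketched under an explicit assumption about their exact form: below, the first index is taken as the proportion of mutation pairs ordered identically by both tools, and the second as the proportion of pairs free of strictly opposite orderings. These definitions are a plausible reading of the description here, not necessarily the paper's exact formulas; all names are ours.

```python
from itertools import combinations

def ordering(preds, i, j, eps):
    """1 if mutation i is less damaging than mutation j, -1 for the
    reverse, 0 when the prediction difference is within eps."""
    d = preds[j] - preds[i]
    return 1 if d > eps else (-1 if d < -eps else 0)

def agreement_indices(s_preds, p_preds, eps_s, eps_p):
    """First value: proportion of mutation pairs ordered identically.
    Second value: proportion of pairs without strictly opposite orderings."""
    n = len(s_preds)
    same = opposite = 0
    for i, j in combinations(range(n), 2):
        o_s = ordering(s_preds, i, j, eps_s)
        o_p = ordering(p_preds, i, j, eps_p)
        if o_s == o_p:
            same += 1
        if o_s != 0 and o_s == -o_p:   # strictly opposite predictions
            opposite += 1
    total = n * (n - 1) / 2
    return same / total, 1 - opposite / total

# perfectly agreeing tools score (1.0, 1.0); fully reversed tools (0.0, 0.0)
print(agreement_indices([0.1, 0.2, 0.9], [0.1, 0.2, 0.9], 0.05, 0.05))
print(agreement_indices([0.1, 0.2, 0.9], [0.9, 0.2, 0.1], 0.05, 0.05))
```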
Users of these indices are generally interested in the evaluation of the inner consistency of p-tools predictions. In this regard, both indices rely on the consistency of preferences exhibited by pairs of p-tools across pairs of mutations. However, consistent preferences might hide quite different mutation effects. Without loss of generality, let us assume a common output scale for the p-tools S and P, and let us consider the mutations m_i and m_j, m_i, m_j ∈ M. In addition, let us assume pairs of predictions S(m_i) and S(m_j) issued by S, and P(m_i) and P(m_j) issued by P, so that m_i ≺ m_j holds for both tools. Although both S and P predict that m_i is less damaging than m_j, the pairs of predictions may fall in opposite ranges of the scale and involve quite different effects: while m_i and m_j might be benign according to S, they might both be pathogenic according to P. This toy example points out that inner consistency measurements between pairs of p-tools may require the evaluation of multiple aspects, from the consistency of pairwise preferences to the consistency of the semantics behind individual predictions.
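The toy example can be written out explicitly (the prediction values are hypothetical, assuming a common [0, 1] scale in which high values indicate pathogenicity):

```python
# Both tools order mA before mB (mA less damaging), yet their
# individual predictions carry very different semantics.
S = {"mA": 0.05, "mB": 0.10}   # both predictions in the benign range
P = {"mA": 0.90, "mB": 0.95}   # both predictions in the pathogenic range

same_order = (S["mB"] > S["mA"]) and (P["mB"] > P["mA"])
print(same_order)  # True, despite the very different predicted effects
```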
Aiming to shed light on the semantic aspect of p-tools inner consistency measurements, Spearman’s rank correlation coefficient was considered. Briefly, the Spearman correlation [24] between two variables equals the Pearson correlation between the rank values of the two variables. However, while the Pearson correlation assesses only linear relationships, the Spearman correlation assesses general monotonic relationships, whether linear or not. For n distinct mutations, the Spearman rank correlation coefficient (ρ) associated with the predictions issued by p-tools S and P can be computed using the following popular formula:

ρ = 1 − 6 Σ d_i² / (n(n² − 1)),

where d_i is the difference between the ranks assigned to the i-th mutation by S and P, i = 1, …, n. In the case of identical predictions, the average value of their ascending ranking positions is used. Although correlation coefficients are intended to measure the “strength of pairwise relationships”, they might be confused by unclear rankings like those induced by p-tools with numerical outputs. On the other hand, although neither of the two proposed indices considers the absolute position of p-tool predictions, i.e., their semantic aspect, they are not confused by small differences in numerical prediction values, owing to the introduction of the equality threshold for preference relationships. As a result, both indices are good candidates for making productive evaluations of p-tools inner consistency aspects.
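A minimal sketch of the popular formula, using average ranks for identical predictions as described here (helper names are ours; note the simplified formula is exact only in the absence of ties):

```python
def average_ranks(values):
    """Ascending 1-based ranks; tied values get the average of the
    ranking positions they span."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(s_preds, p_preds):
    """Spearman's rho via the popular 1 - 6*sum(d^2)/(n(n^2-1)) formula."""
    n = len(s_preds)
    rs, rp = average_ranks(s_preds), average_ranks(p_preds)
    d2 = sum((a - b) ** 2 for a, b in zip(rs, rp))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

print(spearman_rho([1, 2, 3], [1, 3, 2]))  # 0.5
```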
2.4. Outer Consistency Analysis
Standard information retrieval metrics, including the accuracy, the precision, the recall, the F1-score, and the Matthews correlation coefficient (MCC), were considered to evaluate the outer consistency of p-tools:

Accuracy = (TP + TN) / (TP + TN + FP + FN),
Precision = TP / (TP + FP),
Recall = TP / (TP + FN),
F1-score = 2 · Precision · Recall / (Precision + Recall),
MCC = (TP·TN − FP·FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)),

where TP, TN, FP, and FN stand for the number of true positive, true negative, false positive, and false negative predictions, respectively. It is worth noting that special care should be taken with the above metrics when analyzing highly imbalanced datasets like those induced in experiments involving the high-throughput screening of genetic mutations. Fortunately, the human being is a highly robust system; thus, we expect most of the SNVs to be negative examples (benign mutations). Therefore, the accuracy is not a good metric for measuring the outer consistency of p-tools, as a naive predictor set to predict only negative mutations would achieve a very high accuracy. On the other hand, the precision metric is useful to measure the proportion of mutations predicted as positive examples that are indeed positive ones (pathogenic mutations).
Similarly, the recall metric is useful to measure the proportion of positive examples that were indeed predicted as positive, with respect to the ground truth for positive examples. Both the precision and recall metrics disregard true negative predictions. There is also often an inverse relationship between the precision and recall metrics, so that it is possible to increase one of them at the expense of reducing the other; the F1-score, originally defined for document classification problems where true negative predictions also do not matter, is defined as the harmonic mean of the precision and recall metrics. Finally, the MCC is a statistic robust to differences in the proportion of negative and positive examples that can be more appropriate than the F1-score when negative examples matter in some way. The MCC is called a correlation coefficient because it is −1 when predictions are completely wrong, 1 when they are completely correct, and 0 when they are no better than random predictions.
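These metrics can be sketched directly from the four confusion counts (the function name and the toy counts are ours):

```python
import math

def outer_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, F1-score and MCC from the four
    confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return accuracy, precision, recall, f1, mcc

# imbalanced toy dataset: accuracy looks flattering despite 10 false positives
acc, prec, rec, f1, mcc = outer_metrics(tp=8, tn=80, fp=10, fn=2)
print(round(acc, 2), round(rec, 2))  # 0.88 0.8
```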
In order to analyze the outer consistency of p-tools, their outputs were binarized. Align GVGD predictions in the “C0” and “C15” classes were considered negative examples (benign) and predictions in the “C45”, “C55”, and “C65” classes were considered positive ones (pathogenic). Similarly, Provean predictions in the “Neutral” class were considered negative examples and predictions in the “Deleterious” class were considered positive ones. On the other hand, Panther predictions in the “Benign” class were considered negative examples and predictions in the “Damaging” class were considered positive ones. For Strum and Cupsat, predictions with free-energy values ≥ 0 were considered negative examples, while predictions with values < 0 were considered positive ones. Finally, Polyphen2 predictions in the “Benign” class were considered negative examples and predictions in the “Probably” class were considered positive ones. In all cases, p-tool predictions involving intermediate categories were disregarded for the outer consistency analysis.
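The binarization described here can be captured with simple label maps (a sketch; the dictionary layout and the None convention for disregarded intermediate classes are ours):

```python
# categorical p-tool outputs mapped to 0 (benign) / 1 (pathogenic);
# intermediate categories are omitted and thus disregarded
LABEL_MAPS = {
    "AlignGVGD": {"C0": 0, "C15": 0, "C45": 1, "C55": 1, "C65": 1},
    "Provean":   {"Neutral": 0, "Deleterious": 1},
    "Panther":   {"Benign": 0, "Damaging": 1},
    "Polyphen2": {"Benign": 0, "Probably": 1},
}

def binarize(tool, prediction):
    """Return 0/1 for known classes, None for disregarded ones."""
    if tool in ("Strum", "Cupsat"):          # numerical (free-energy) outputs
        return 0 if prediction >= 0 else 1
    return LABEL_MAPS[tool].get(prediction)  # None -> intermediate class
```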