1. Introduction
Deep convolutional neural networks [1] have emerged as a very powerful type of classification model that is finding applications across diverse fields. A typical neural network has many millions of parameters embedded in a deep multilayer structure, making it a very opaque classifier whose inner workings are poorly understood. This lack of explainability is widely viewed as a major weakness of deep neural networks. One should note, however, that explainability is a multifaceted phenomenon [2,3,4]. Let us consider several examples illustrating distinct viewpoints regarding explainability.
Deep neural networks are being considered for use in scenarios where malfunction may result in fatalities. The severe failure of a system containing a neural network, such as the fatality caused by an Uber autonomous test vehicle in Arizona, negatively affects the public's acceptance of AI technologies. It is therefore important, also for legal reasons, to perform a precise analysis, ideally identifying the root cause, of why such a misprediction occurred. We call this requirement post hoc explicability. Such an analysis may take many forms, e.g.:
One may investigate which particular sample(s) in the training dataset inspired the network to make a prediction, an early example of which is [5], where the authors compared activations with those in the training dataset;
One may be curious which part of the classified image led the CNN to make its classification decision—typically visualized using a heatmap, as in [6,7,8].
Another approach involves analyzing the inner workings of a neural network. An example of this is the interpretation of the receptive fields of neurons, especially in the early layers of deep networks [9]. It would be better still to have explainability in development, when one may use the insights provided about the classification system to change it in order to prevent possible malfunctions (mispredictions) during system deployment. However, the complete retraining of a large network is often too costly, especially since there is no guarantee that it will bring about the desired improvement. It is therefore desirable to perform only an incremental improvement of the system, so that most of the training effort is retained. However, in standard neural networks, it is next to impossible to identify the subset of weights or the substructure of a large network that causes erroneous predictions. One solution would be establishing modularity in the classification system of the kind traditionally provided by ensembles in machine learning [10]. The absence of modularity in neural networks has long been a recognized problem [11], but only recently have ensemble methods (with a mixture of experts [12]) found their place among industrial-strength deep network systems [13].
Given the complex nature of explainability and its varied benefits, one may expect that engineering explainability into classification systems entails costs and poses hurdles during model development. In this paper, we examine these tradeoffs for a class of models constructed by means of pairwise coupling (an ensembling technique) from convolutional neural networks. The modularity of these models is especially suitable for studying explainability facets occurring when one restricts classification to a smaller subset of classes.
Medicine abounds with the need for explainable classification systems [14,15]. Let us describe a novel facet of explainability applicable in the medical field. Consider the case of a physician specialist who is being assisted, in a diagnosis based on patient data (e.g., X-ray, MRI, or histology data), by a system employing a convolutional neural network. The convolutional neural network may arrive at one class prediction, while the specialist may feel that a different class is the correct one. Having narrowed the set of possibilities to this pair of classes (the one preferred by the physician and the one preferred by the network), it would be helpful for the classification system to provide the most precise prediction possible given that only two classification outcomes are under consideration. Such an improvement is not possible in classification systems that exhibit independence of irrelevant alternatives (see Section 2.3), e.g., multinomial (softmax) regression; if an improvement is possible, we say that the system exhibits enhanced pairwise explainability. This explainability is present in pairwise coupling models by design.
As we will show, pairwise coupling models can provide a benefit even in traditional explainability areas, namely uncertainty quantification [16,17]. Neural networks are affected by multiple randomness effects—from initialization through random dropout layers to the random ordering of training samples in batches. This prediction randomness is rarely explained to the end user, mainly because it would be costly to train many networks to obtain meaningful measures of randomness for the predictions. We shall say that systems providing an explanation of the stochasticity of predictions have uncertainty quantification.
The remainder of this paper is organized as follows. In Section 2, we will review the principles of the pairwise coupling methodology. In Section 3, we will provide experimental details, including a description of the convolutional network architectures. In Section 4, we examine pairwise explainability through an evaluation of the pairwise accuracy of networks trained in a pairwise manner. In Section 5, we will evaluate the multi-class accuracy of models built from networks trained in a pairwise manner. In Section 6, we will illustrate explainability in development using the concept of incremental improvement based on errors in the confusion matrix. In Section 7, we will illustrate uncertainty quantification by constructing many classification models in order to estimate the uncertainty of a prediction. Open research questions are discussed in the conclusion (Section 8).
2. Review of Pairwise Coupling
In this section, we outline the pairwise coupling classification methodology, proceeding from general definitions through the historical motivation (multi-class classification with SVMs) to state-of-the-art methods.
2.1. Classification Generalities
A (hard) classification problem can be defined as the search for a classifier function $f \colon X \to C$ mapping classified objects to a finite set $C$ of dependent categories (classes) [18]. When $|C| = 2$, we speak of a binary classifier.
Often, one solves this problem by solving the soft classification problem first. This involves finding a posterior-approximating predictor function $p \colon X \to [0,1]^c$, where $c$ is the cardinality of the set $C$. If $p(x) = (p_1, \dots, p_c)$, then the prediction is
$$\hat{f}(x) = \arg\max_{i \in C} \, p_i(x).$$
The main reason for this detour is that it is convenient to optimize a smooth cost function of a parametric classification model using a gradient descent method. Such a gradient search underlies many machine learning methods, ranging from logistic regression to neural networks.
2.2. Motivation for Pairwise Coupling
A notable exception which does not fit the soft classification approach is the support vector machine (SVM), a non-parametric classification technique that has been very popular since the late 20th century [19,20]. The SVM model divides the feature space (or its higher-dimensional embedding via a kernel) into two subsets using a hyperplane, which is found by solving a quadratic programming problem. The feature-space bisection at the heart of SVMs poses a problem for multi-class classification problems (when $c > 2$), because there is no obvious generalization of the quadratic programming problem to more than two classes. A methodology was thus developed that entails three major steps.
The first step is the adoption of the one-vs.-one classification paradigm. This paradigm requires the creation of all possible pairwise SVM classifiers $f_{i,j}$, each able to distinguish between the two classes $i$ and $j$ only. There are two immediate advantages to doing so. Since each model requires training on only a portion of the overall multi-class data, each pairwise classifier is easier to train. Moreover, each two-class dataset is more likely to be balanced, which would not be the case if one opted for a one-vs.-rest approach [21].
The second step is converting each hard pairwise classifier $f_{i,j}$ into a soft classification model $p_{i,j}$ by fitting a sigmoid function to its outputs. This was proposed in the work of J. Platt [22]. A subtle point in his approach was the adoption of uninformative priors on the labels, which avoids the overfitting problems associated with logistic regression.
The third step is known in the literature as the pairwise coupling approach [23,24], which converts the set of pairwise predictions provided by the classifiers $p_{i,j}$ into a final multi-class posterior prediction.
2.3. Relationship Between Pairwise Likelihoods and Multi-Class Likelihoods
To understand pairwise coupling’s underlying principles, it is worthwhile to consider its “reverse” first. Suppose we are given a soft classifier $p$ which produces a class posterior vector $p(x) = (p_1, \dots, p_c)$ for a sample $x$. Given such a classifier and a pair of classes $i \neq j$, one may construct a set of soft binary (i.e., two-class) classifiers $p_{i,j}$, which we call the IIA restrictions of $p$.
This name is inspired by the axiom of independence of irrelevant alternatives (IIA). In individual choice theory, the IIA axiom states that if an alternative x is preferred from a set T, and x is also an element of a subset S of T, then x should also be the preferred choice from S.
The IIA restriction classifier quantifies this principle. Its likelihoods are such that the relative likelihood ratio of classes $i$ and $j$ is the same as in the presence of the rest of the alternatives. Thus, the IIA restriction classifier $p_{i,j}$ outputs the posterior distribution
$$p_{i,j}(x) = \left( \frac{p_i}{p_i + p_j}, \; \frac{p_j}{p_i + p_j} \right), \tag{1}$$
which is the unique two-class probability distribution on $\{i, j\}$ for which the relative likelihood of the classes is $p_i : p_j$.
The ratios in Equation (1) may produce singular results if both $p_i$ and $p_j$ are simultaneously zero. We note that this singularity is avoided for a typical convolutional neural network, where the output of the softmax layer cannot produce a zero posterior for any class.
Thus, given a multi-class prediction $p(x) = (p_1, \dots, p_c)$, we may construct the matrix of pairwise likelihoods
$$r_{i,j} = \frac{p_i}{p_i + p_j}, \quad i \neq j.$$
In this paper, we adopt the convention that the diagonal of a matrix of pairwise likelihoods always contains zeros. The matrix of pairwise likelihoods then has all entries in the interval $[0, 1]$ and satisfies
$$r_{i,j} + r_{j,i} = 1 \ \text{for } i \neq j, \qquad r_{i,i} = 0. \tag{2}$$
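As an illustration, constructions (1) and (2) can be written in a few lines of NumPy. This is a minimal sketch with our own function names (iia_restriction, pairwise_likelihood_matrix); it is not taken from any implementation referenced in this paper.

```python
import numpy as np

def iia_restriction(p, i, j):
    """IIA restriction (Eq. 1): two-class posterior over {i, j} derived from p."""
    s = p[i] + p[j]
    return np.array([p[i] / s, p[j] / s])

def pairwise_likelihood_matrix(p):
    """Pairwise likelihood matrix (Eq. 2): r[i, j] = p_i / (p_i + p_j), zero diagonal."""
    p = np.asarray(p, dtype=float)
    c = len(p)
    r = np.zeros((c, c))
    for i in range(c):
        for j in range(c):
            if i != j:
                r[i, j] = p[i] / (p[i] + p[j])
    return r

# Example: a 3-class softmax output
p = np.array([0.7, 0.2, 0.1])
r = pairwise_likelihood_matrix(p)
assert np.allclose(r + r.T + np.eye(3), np.ones((3, 3)))  # r_ij + r_ji = 1 off the diagonal
```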
An important fact is that the mapping $p \mapsto r(p)$ does not lose any information.
Lemma 1. The nonlinear mapping $p \mapsto r(p)$ is injective on the set of nonvanishing posteriors. In fact, if $p_i > 0$ for all $i$, then it is possible to reconstruct $p$ from any column or any row of the pairwise likelihood matrix.
Proof. See, e.g., [25] or [26]. □
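The reconstruction claimed in Lemma 1 is elementary to carry out: from column $j$ of the matrix, the ratio $r_{i,j}/(1 - r_{i,j})$ equals $p_i/p_j$, so the posterior is recovered up to normalization. Continuing the sketch above (again with hypothetical naming, not the implementations of [25,26]):

```python
def posterior_from_column(r, j):
    """Recover the multi-class posterior from column j of the pairwise likelihood
    matrix, assuming all posteriors are strictly positive (Lemma 1)."""
    c = r.shape[0]
    q = np.ones(c)                       # work with ratios p_i / p_j, fixing q[j] = 1
    for i in range(c):
        if i != j:
            q[i] = r[i, j] / (1.0 - r[i, j])
    return q / q.sum()

p_rec = posterior_from_column(r, 0)
assert np.allclose(p_rec, p)
```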
2.4. Pairwise Coupling Methods
By a pairwise coupling method, we mean any method mapping the set of non-negative matrices satisfying (2) to the set of probability distributions on $c$ classes. We say that a pairwise coupling method is regular if it inverts the map $p \mapsto r(p)$ from multi-class posteriors to the corresponding pairwise likelihood matrices. Thus, if a regular pairwise coupling method is given a matrix of pairwise likelihoods constructed from a multi-class vector $p$ by (1) and (2), the method should yield the original multi-class distribution.
The requirements for regular pairwise coupling are rather weak, because the mapping is prescribed only on the $(c-1)$-dimensional image of $r$ inside the $\binom{c}{2}$-dimensional parameter space of matrices satisfying (2). Therefore, there exist many different regular coupling methods.
In our work, we opted to use two regular pairwise coupling methods: the Wu–Lin–Weng method [27] and the Bayes covariant method [28]. The former is used in the popular LIBSVM library [29], while the latter has been proven to be the unique method satisfying certain additional hypotheses [28].
2.4.1. Wu–Lin–Weng Method
This method defines an optimization objective
$$\Phi(p) = \sum_{i=1}^{c} \sum_{j \neq i} \left( r_{j,i}\, p_i - r_{i,j}\, p_j \right)^2, \qquad \text{subject to } \sum_{i=1}^{c} p_i = 1, \; p_i \ge 0.$$
From the definition, it is immediately clear that the functional is non-negative. Moreover, it is zero on the image of the map $r$, because from (1) we have
$$r_{j,i}\, p_i - r_{i,j}\, p_j = \frac{p_j}{p_i + p_j}\, p_i - \frac{p_i}{p_i + p_j}\, p_j = 0.$$
Optimizing $\Phi$ is also numerically efficient. Since the functional is a quadratic function of $p$, the optimization can be reduced to solving a set of linear equations.
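A minimal NumPy sketch of this reduction follows. It solves the equality-constrained quadratic problem via its KKT linear system and ignores the non-negativity constraint, which is rarely active in practice; the actual LIBSVM implementation uses an iterative solver, so this is an illustration of the idea rather than the reference algorithm.

```python
import numpy as np

def wu_lin_weng_coupling(r):
    """Couple a pairwise likelihood matrix r (zero diagonal, r_ij + r_ji = 1 off-diagonal)
    into a multi-class posterior by minimizing the quadratic functional above,
    subject to the probabilities summing to one (non-negativity is not enforced here)."""
    c = r.shape[0]
    Q = np.zeros((c, c))
    for i in range(c):
        for j in range(c):
            if i == j:
                Q[i, i] = sum(r[s, i] ** 2 for s in range(c) if s != i)
            else:
                Q[i, j] = -r[j, i] * r[i, j]
    # KKT system for min p^T Q p  subject to  sum(p) = 1
    A = np.zeros((c + 1, c + 1))
    A[:c, :c] = Q
    A[:c, c] = 1.0
    A[c, :c] = 1.0
    b = np.zeros(c + 1)
    b[c] = 1.0
    p = np.linalg.solve(A, b)[:c]
    p = np.clip(p, 0.0, None)            # guard against tiny negative values
    return p / p.sum()
```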
2.4.2. Bayes Covariant Coupling
We proposed this method in our work [28]. The underlying idea is geometric. Let us call the variety of pairwise likelihood matrices (i.e., the image of the map $r$) the Bradley–Terry manifold. This method starts by mapping the matrix of pairwise likelihoods coordinatewise via
$$\sigma(r_{i,j}) = \log\left(\frac{r_{i,j}}{1 - r_{i,j}}\right) = \log\left(\frac{r_{i,j}}{r_{j,i}}\right). \tag{6}$$
In the new coordinate space, the Bradley–Terry manifold becomes a linear subspace, and the method is simply the orthogonal projection onto this subspace.
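The following is a sketch of the Bayes covariant coupling under the assumption that, with the standard Euclidean inner product on the log-odds coordinates, the orthogonal projection onto the linearized Bradley–Terry manifold reduces to row averaging of the antisymmetric log-odds matrix; the published method [28] may differ in details such as weighting.

```python
import numpy as np

def bayes_covariant_coupling(r, eps=1e-12):
    """Sketch of Bayes covariant coupling: map pairwise likelihoods to log-odds
    coordinates (Eq. 6), orthogonally project onto the linear image of the
    Bradley-Terry manifold, and map back to a posterior."""
    c = r.shape[0]
    s = np.zeros((c, c))
    for i in range(c):
        for j in range(c):
            if i != j:
                s[i, j] = np.log((r[i, j] + eps) / (1.0 - r[i, j] + eps))
    # Projection onto {s_ij = v_i - v_j} amounts to averaging each row of s.
    v = s.sum(axis=1) / c
    p = np.exp(v - v.max())              # subtract the maximum for numerical stability
    return p / p.sum()
```

On a pairwise likelihood matrix produced by (1) from a strictly positive posterior, both this sketch and the Wu–Lin–Weng sketch above return that posterior, i.e., they behave as regular coupling methods in the sense of Section 2.4.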
2.4.3. Other Pairwise Coupling Methods
Let us briefly outline other coupling methods which have been proposed in the literature. In their comprehensive study of pairwise coupling, Hastie and Tibshirani [23] introduced a coupling method that optimizes a functional derived from the Kullback–Leibler divergence. The work of Zahorian and Nossair [30] on the classification of vowels using neural networks introduced another coupling method. Regular pairwise methods based on inverting the map $r$ in a columnwise manner have been proposed in [25,26]. Wu, Lin, and Weng also studied another coupling method based on a quadratic functional of the posteriors [27].
2.5. Numerical Stability of Coupling Methods
Some coupling methods have numerically unstable behavior near the boundary of the space of possible pairwise likelihood matrices. For instance, Bayes covariant coupling suffers from this problem, since the mapping $\sigma$ in (6) is singular at the limit points zero and one.
In our work [28], we proposed two ways to deal with such instability (both are sketched in code after this list):
Start by choosing a small threshold $\varepsilon > 0$ and then force individual pairwise likelihoods to lie in the interval $[\varepsilon, 1 - \varepsilon]$ by replacing them with the nearest endpoint if necessary;
Choose a small threshold $\delta > 0$ and remove from consideration any class $i$ for which there is a class $j$ such that $r_{i,j} < \delta$, i.e., remove all rows and columns from the pairwise likelihood matrix that correspond to such classes. Then apply the coupling method to the possibly smaller matrix of pairwise likelihoods. Finally, convert the posterior probability distribution to the full set of classes, for instance by extending it with zeros.
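The two strategies can be sketched as follows; `couple` stands for any coupling routine, such as the sketches in Section 2.4, and the default thresholds are illustrative rather than the values used in [28].

```python
import numpy as np

def clip_pairwise(r, eps=1e-6):
    """Strategy 1: force off-diagonal pairwise likelihoods into [eps, 1 - eps]."""
    r = r.copy()
    off = ~np.eye(r.shape[0], dtype=bool)
    r[off] = np.clip(r[off], eps, 1.0 - eps)
    return r

def couple_with_class_removal(r, couple, delta=1e-3):
    """Strategy 2: drop classes with some pairwise likelihood below delta,
    couple the reduced matrix, and extend the result with zeros."""
    c = r.shape[0]
    off = ~np.eye(c, dtype=bool)
    keep = np.array([r[i][off[i]].min() >= delta for i in range(c)])
    sub = r[np.ix_(keep, keep)]
    p = np.zeros(c)
    p[keep] = couple(sub)
    return p
```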
3. Methods
Models built by means of pairwise coupling are assembled from binary classifiers. In this section, we describe the dataset used, the three classes of convolutional neural networks employed as binary classifiers (micro-models, mini-models, and macro-models), and the baseline multi-class networks used for comparison.
3.1. Dataset
We used the Fashion MNIST (FMNIST) dataset [31], which has been suggested [31] to be a better starting point for examining computer vision classification methods compared to the historically more popular MNIST dataset [32]. Moreover, the small size of this dataset resulted in short training times and thus in a low environmental impact for our experiments, an increasingly important societal consideration [33]. The overall training and test data sizes (60,000 and 10,000, respectively) are identical to those for MNIST. There are 10 classes, as shown in Table 1, evenly distributed in the dataset.
3.2. Baseline Networks
To create our baseline model (model F in Figure 1), we used the Keras example CNN originally designed for the MNIST dataset (model M in Figure 1). The only structural difference is that we replaced dropout layers with batch normalization layers. There are two reasons for this adaptation:
Overall, we trained 32 instances of networks based on the F architecture, whose weights were later used in the initialization of the binary classifiers.
3.3. Architectures for Binary Models
We examined 9 different feed-forward architectures for binary models, which we describe in this section. Roughly in order of increasing size, these architectures were micro-models (Section 3.3.1, see also architectures a–d in Figure 1), mini-models (Section 3.3.2, architectures A–D in Figure 1), and macro-models (Section 3.3.3, architecture E in Figure 1). For all of them, we trained 32 different networks, each partially initialized with the weights of the corresponding baseline network (one of the 32 trained instances of architecture F).
3.3.1. Binary Micro-Models
In Section 5, we will compare models built using pairwise coupling methods with the baseline model. It is desirable to use models with the same number of training parameters. Since a complete pairwise-coupled model requires 45 binary classifiers, individual binary classifiers have to be rather small. Namely, since the baseline network has 1,200,650 parameters, individual binary classifiers should have ≈1,200,650/45 ≈ 26,681 parameters. It is quite challenging to achieve good performance with a convolutional network that small. A natural first step is to reduce the number of neurons (convolutions) in individual layers. However, based on our previous experience with share-none architectures [37], we felt that two additional alterations were needed:
Reducing the number of 2D maps by using 1×1 convolutions;
Sharing weights among individual pairwise classifiers. Thus, the first two layers of the binary classifiers had identical weights copied from the corresponding baseline network. These weights were not trainable (a sketch of this weight-sharing setup is given after this list).
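To make the weight-sharing scheme concrete, here is a hypothetical Keras sketch of a micro binary classifier that reuses and freezes two convolutional layers of a trained baseline network. The function name, the layer sizes, and the use of a 1×1 convolution for reducing the number of maps are our illustrative assumptions and do not reproduce the exact architectures of Figure 1.

```python
from tensorflow import keras
from tensorflow.keras import layers

def make_micro_binary_model(baseline: keras.Model, n_maps: int = 8) -> keras.Model:
    """Hypothetical micro binary classifier: the first two convolutional layers
    of a trained baseline are reused (shared by reference, which is equivalent
    to copying since they are frozen); a 1x1 convolution then reduces the
    number of 2D maps before a small classification head."""
    shared = [l for l in baseline.layers if isinstance(l, layers.Conv2D)][:2]
    for l in shared:
        l.trainable = False                       # shared weights are not trainable

    inputs = keras.Input(shape=(28, 28, 1))
    x = inputs
    for l in shared:
        x = l(x)                                  # reuse the baseline's weights
    x = layers.Conv2D(n_maps, kernel_size=1, activation="relu")(x)  # 1x1 channel reduction
    x = layers.MaxPooling2D()(x)
    x = layers.Flatten()(x)
    x = layers.Dense(16, activation="relu")(x)
    outputs = layers.Dense(2, activation="softmax")(x)  # variants c/d would instead use
                                                        # a complementary log-log output
    return keras.Model(inputs, outputs)
```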
The differences among the four variants of the micro-networks were as follows. Networks a and b were trained with a softmax layer as the final layer. Networks c and d used a generalized linear model (GLM) with a complementary log–log link function, which is considered more suitable in cases of non-symmetric distributions [38].
Networks a and c employed standard 0/1 binary encoding of the dependent variable. In networks b and d, the encoding was softened to values slightly above 0 and slightly below 1, in the spirit of Platt's uninformative priors (see Section 2.2).
3.3.2. Binary Mini-Models
A complete model built using pairwise coupling from binary micro-models has the same number of parameters as the baseline model. However, it is much easier to train, since each pairwise training dataset contains 5 times less data than the full Fashion MNIST dataset (each pair covers 2 of the 10 equally sized classes). We can roughly estimate that the complexity of training one epoch of a network is proportional to the product of the number of weights and the number of training samples. Since each pairwise micro-model has 45 times fewer parameters, the total arithmetical complexity of running the same number of epochs for all 45 networks is about 5 times lower than for the baseline model.
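For completeness, the arithmetic behind this estimate, under the proportionality assumption above and with $W$ denoting the baseline's parameter count and $N$ the training set size, is
$$\underbrace{45}_{\text{pairs}} \times \underbrace{\tfrac{W}{45}}_{\text{parameters per micro-network}} \times \underbrace{\tfrac{N}{5}}_{\text{samples per pair}} \;=\; \tfrac{W N}{5},$$
i.e., one fifth of the $W N$ cost attributed to the baseline.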
We therefore also investigated pairwise coupling models, termed mini-models, that have approximately the same total arithmetical training complexity as the baseline model, although they have more parameters. The underlying binary mini-networks have 5 times more parameters than the micro-networks, or equivalently, 9 times fewer parameters than the baseline networks. Their architecture is shown in Figure 1.
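The factor of 9 follows from requiring the same total training cost as the baseline under the proportionality assumption above: if each mini-network has $W/x$ parameters, then
$$45 \times \frac{W}{x} \times \frac{N}{5} = W N \quad\Longrightarrow\quad x = 9,$$
so each mini-network has $45/9 = 5$ times more parameters than a micro-network.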
Analogously to the case of micro-models, we used 4 flavors of this architecture. Models A and B used softmax as the final layer, whereas models C and D used a complementary log–log GLM output. Models A and C used standard binary encoding of the class variable, whereas models B and D used the same softened encoding as micro-models b and d.
3.3.3. Binary Macro-Models
We also included models of type E, as shown in Figure 1, whose architecture is identical to the baseline model F except for the final layer, which is not a 10-class softmax layer but rather a 2-class softmax layer. These models are initialized with the weights of the corresponding F model, but all weights are left trainable.
3.4. Evaluation Metrics
For evaluation, we used two metrics. First, the multi-class accuracy is defined as
$$\text{accuracy} = \frac{\text{number of correctly classified test samples}}{\text{total number of test samples}}.$$
We also used the pairwise accuracy metric. For a soft classifier which assigns a posterior distribution over the $c$ classes, this is defined as the mean of the (two-class) accuracies of its IIA restrictions (1) over all pairs of classes.
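A sketch of the metric in NumPy; the detail that each pair's accuracy is computed only on test samples belonging to that pair of classes is our reading of the definition.

```python
import numpy as np
from itertools import combinations

def pairwise_accuracy(posteriors, labels):
    """Mean two-class accuracy of the IIA restrictions over all pairs of classes.
    `posteriors` is an (n_samples, n_classes) array of predicted posteriors,
    `labels` the true class indices; samples outside a pair are ignored for it."""
    posteriors = np.asarray(posteriors, dtype=float)
    labels = np.asarray(labels)
    c = posteriors.shape[1]
    accs = []
    for i, j in combinations(range(c), 2):
        mask = (labels == i) | (labels == j)
        if not mask.any():
            continue
        # IIA restriction (Eq. 1): only the posteriors of classes i and j matter
        pred_is_i = posteriors[mask, i] >= posteriors[mask, j]
        true_is_i = labels[mask] == i
        accs.append(np.mean(pred_is_i == true_is_i))
    return float(np.mean(accs))
```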
3.5. Other Details
Multi-class networks were trained using a batch size of 128, whereas for binary models, we decreased the batch size to 32. The number of epochs was 12 in all cases.
In all cases, we used the AdaDelta stochastic gradient optimizer with standard settings and cross-entropy as the optimization criterion.
Finally, Table 2 shows the parameter counts for all networks used.
4. Evaluation of Pairwise Accuracy
In Section 1, we introduced the concept of pairwise explainability. This concept refers to the situation where we are confident that only two predictions are possible and we desire the best possible prediction by a convolutional neural network. Of course, one may obtain a two-class prediction by taking the IIA restriction of a multi-class classifier. But is it possible to obtain better results by training specialized binary networks on only two classes?
In this section, we present the answers to this question for the binary architectures described in Section 3.
4.1. Influence of Architecture
In Figure 2, we plot boxplots of the average pairwise performance over the 32 training runs for each architecture.
From this figure, we see the obvious trend that the convolutional networks perform better with increasing size. The key point is that all mini-architectures (as well as the macro-architecture E) perform better than (the IIA restrictions of) the standard multi-class network (architecture F). On the other hand, all micro-networks (types a–d) perform worse than the multi-class network.
4.2. Detailed Performance by Pairs of Classes
We also plot the pairwise accuracy for each pair of classes in Figure 3. From this figure, it is clear that only a handful of pairs of classes show uneven performance among architectures:
Coat/shirt;
Dress/coat;
Dress/shirt;
Pullover/coat;
Pullover/shirt;
T-shirt/shirt.
Again, as observed above, it is the size of the architecture that is the primary factor affecting the pairwise accuracy. Moreover, the results strongly suggest that not all decision boundaries are equally difficult to find, and thus in multi-class pairwise-coupled models, varying learning capacity (i.e., the number of trainable parameters) is likely needed to construct an optimal two-class classifier.
6. Incremental Improvement
In this section, we investigate the possibility of correcting a poorly performing multi-class convolutional network. A standard way to understand the poor performance of a multi-class classifier is to construct the confusion matrix of its predictions. For example, we trained a baseline classifier whose confusion matrix is shown in Table 5.
From this table, it is clear that the network makes most errors when confusing classes 0 and 6. We may try improving the classification of an image $x$ by applying a pairwise coupling method to a modified pairwise likelihood matrix, constructed as follows. Let $p$ be the multi-class prediction of the baseline classifier and let $q = (q_0, q_6)$ be the prediction of a binary classifier $Q$ trained to distinguish classes 0 and 6. Then, we replace two entries of the pairwise likelihood matrix for $p$ with the values provided by $Q$:
$$r_{0,6} = q_0, \qquad r_{6,0} = q_6.$$
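A sketch of this incremental correction, reusing the helper functions from the earlier sketches; the names and calling convention are ours.

```python
def incremental_predict(p, q_pair, i, j, couple):
    """Sketch of the incremental improvement step: build the pairwise likelihood
    matrix of the multi-class posterior p, overwrite the (i, j) entries with the
    prediction q_pair = (q_i, q_j) of a specialized binary classifier, and
    re-couple. `couple` is any pairwise coupling method, e.g. one of the
    sketches from Section 2.4."""
    r = pairwise_likelihood_matrix(p)    # from the earlier sketch
    r[i, j], r[j, i] = q_pair
    return couple(r)

# Hypothetical usage for the 0/6 confusion discussed above:
# p_new = incremental_predict(p, (q0, q6), 0, 6, bayes_covariant_coupling)
```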
We illustrate the results of such incremental improvements in Figure 6.
We can conclude that even when $Q$ has inferior pairwise accuracy to the IIA restriction of the baseline classifier, the coupling methods are able to increase the multi-class accuracy. The Bayes covariant method (red regression line in Figure 6) is slightly better than the Wu–Lin–Weng method (blue regression line) over the range of pairwise accuracies afforded by mini-networks of type A. However, the latter is more efficient in converting an increase in pairwise accuracy into an increase in multi-class accuracy, as is borne out by the linear regression fits for the two methods.
7. Likelihood Randomness Explainability
The incremental improvement achieved in the previous section was enabled by the inherent modularity of the pairwise coupling models. In this section, we examine another application of this modularity.
Neural networks are inherently random algorithms. Their predictions will vary if they are repeatedly trained. This fact is obvious to specialists, but is rarely conveyed to the end user. The primary obstacle is training cost. If it takes weeks to train a single deep neural network, it is utterly impractical to train, say, 100 different networks [39]. Other alternatives proposed for uncertainty quantification include using dropout [40] and Bayesian neural networks [41].
We will illustrate, based on an example from Fashion MNIST, that the problem of the expensive training of numerous copies of deep networks can be easily overcome using pairwise coupling models.
The image #142 (counted from 1) in the test set belongs to class 0 (t-shirt/top). It is shown in Figure 7.
This image is correctly predicted to belong to class 0 by one of our trained baseline networks, but is incorrectly predicted to belong to class 6 (shirt) by another. Both networks are quite confident, assigning more than a 95% likelihood to their respective predictions, as seen in Table 6.
The pairwise coupling approach allows one to create a vast number of new classifiers (more precisely, $2^{45}$) by bootstrapping pairwise predictions from either of the two pairwise likelihood matrices corresponding to the predictions of these two networks, and then applying a pairwise coupling method. We plot 100 samples for both pairwise coupling methods in Figure 8.
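A sketch of the bootstrap just described, reusing a coupling routine from the earlier sketches; parameter names are ours.

```python
import numpy as np

def bootstrap_posteriors(r1, r2, couple, n_samples=100, rng=None):
    """For each pair of classes, pick the pairwise likelihood from either r1 or r2
    at random, couple the mixed matrix, and collect the resulting posteriors.
    With c = 10 classes there are 2**45 possible combinations."""
    rng = np.random.default_rng(rng)
    c = r1.shape[0]
    out = []
    for _ in range(n_samples):
        r = np.zeros((c, c))
        for i in range(c):
            for j in range(i + 1, c):
                src = r1 if rng.random() < 0.5 else r2
                r[i, j], r[j, i] = src[i, j], src[j, i]
        out.append(couple(r))
    return np.array(out)
```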
The plots show that there is a significant difference between the Wu–Lin–Weng method and the Bayes covariant method. The former is concentrated in two clusters near 0 and 1, whereas the latter is much more uniformly distributed from 0.1 to 0.9. The natural assumption that the binary classifier for the pair 0/6 has a crucial influence on the multi-class posteriors is confirmed by the conditional density plots in Figure 9.
Thus, the Wu–Lin–Weng method is very sensitive to the information provided by the critical binary classifier, whereas the Bayes covariant method tries to balance information from all binary classifiers. Both have their advantages. The former is more post hoc explicable, since a misprediction is likely caused by a single classifier. On the other hand, the latter is likely to be more precise, since it integrates information from many models.
8. Conclusions
In this paper, we have studied aspects of pairwise models built from convolutional neural networks. Let us summarize the key findings:
We proposed a novel way to construct pairwise coupling models by employing new architectures that match the memory and compute requirements of a standard convolutional network while matching its accuracy;
We showed that the models exhibit pairwise explainability;
We demonstrated that pairwise coupling models allow for inexpensive uncertainty quantification when the Bayes covariant coupling method is used.
This establishes pairwise coupling as an additional ensembling method at the disposal of designers of real-world classification systems.
Let us discuss the finer details of our experiments.
8.1. Discussion of Experiments
The first observation is the failure to obtain better accuracy (pairwise or multi-class) with the pairwise coupling methodology using the same number of parameters as the baseline convolutional network. We hypothesize that this may be caused by allocating an equal number of parameters to pairwise classification tasks that vary significantly in difficulty (e.g., the shirt/t-shirt contrast seems to be by far the most difficult; see Figure 3 and Table 5). If that is so, then the success of the standard multi-class architecture (model F) suggests that CNNs are able to automatically allocate more capacity to the more challenging tasks. Another possible explanation is that in the presence of multiple classes, the convolutional network is able to learn multi-class features that go beyond those learnable in a two-class setting.
However, the subpar pairwise accuracy of micro-networks and subpar multi-class accuracy of micro-models should be contrasted with the success of mini-networks and mini-models. All mini-networks achieved higher pairwise accuracy than the baseline, and mini-models A and B (although not C and D) with the Bayes covariant method (but not the Wu–Lin–Weng method) obtained statistically significant improvements in performance over the baseline. Recall that the mini-models were designed to have approximately the same arithmetical complexity in training as the baseline. We would like to point out that there are likely additional performance advantages to using mini-models:
The restricted memory size in graphics accelerators favors smaller models, and binary classifiers have fewer parameters than their multi-class equivalents;
It is possible to train pairwise models in parallel without any communication among the computing nodes.
However, the accuracy improvements were modest for both mini-networks and mini-models, and thus by themselves are unlikely to attract adoption. On the other hand, there are two additional explainability attributes which are not present in commonly used CNN models.
The first is the ability to incrementally improve a multi-class classification system by incorporating specialized binary classifiers. It is not even mandatory that the specialized binary classifier outperform the IIA restriction of the previous system. Our results show that diversity, together with a pairwise coupling methodology, is able to improve performance even in cases of subpar pairwise performance (cf. Figure 6). Thus, we were able to confirm the phenomenon of coupling recovery in a real data setting, which had previously been suggested by synthetic data experiments [23,27].
The second is the ability to gauge the randomness in predicted likelihoods by building many more multi-class systems out of just a couple of pairwise-coupled systems (Section 7). In this case, we started to see a marked difference between the Wu–Lin–Weng coupling method and the Bayes covariant coupling method. The former seems to give too much confidence to a single pairwise decision, whereas the latter seems to take into account information from all decisions, leading to much more evenly distributed posteriors.
We also examined two methodological variations in creating pairwise coupling models. The first involved using a complementary log–log layer instead of softmax. As Figure 2 shows, this alternative led to inferior performance. The second involved encoding the dependent variable according to non-informative priors; this step did not lead to noticeable improvements in classification accuracy.
8.2. Applications
Pairwise coupling models are best suited to situations where the number of classes under consideration changes frequently and it is expensive to retrain the full model. A prototypical example is an employee classification system based on biometrics (face images or voiceprints). A model based on pairwise coupling needs much less computing time to handle the addition of a new employee, and its modularity is well-suited to the reduction in the system size when an employee quits.
The uncertainty quantification provided by pairwise coupling, as demonstrated in Section 7, can be useful in fields such as the following:
Autonomous driving;
Healthcare and medical diagnosis;
Climate science and environmental modeling;
Finance and risk modeling.
8.3. Future Research
The main issue that hinders the adoption of pairwise coupling is the need to train multiple networks. This issue forced us to design specialized architectures of convolutional neural networks so that the number of parameters of the pairwise model was comparable to that of the benchmark network. The severity of this problem increases with the number of classes $c$, since the number of pairwise models grows quadratically with $c$. Resolving this issue is the key step towards broader adoption of pairwise classification models.
One solution is the adoption of so-called arboreal coupling methods, for which the number of required models grows only linearly. These were formalized in the preliminary work [42]. However, arboreal coupling techniques do not exhaust the approaches that achieve linear growth in complexity. Moreover, they are non-canonical, so they require potentially expensive model selection. Therefore, finding optimal coupling techniques in the context of deep learning remains an open problem.