1. Introduction
The assessment techniques used to evaluate the performance of information fusion and inferencing algorithms have a significant influence on the selection of optimal algorithms. The foundational metrics are the misclassification rate and the Shannon and Brier proper scoring rules [1,2,3]. However, these metrics do not fully capture the trade-off between the need for fusion algorithms to make decisions, provide accurate inferences in the form of probabilities, and be robust against anomalies [4]. For an algorithm which classifies based on a threshold probability, the classification cost function is simply a step function at the threshold. Closely related to classification performance is the Brier or mean-square error, which provides a cost function based on a uniform likelihood of the decision boundary [5]. However, the mean-square error does not fully capture the meaning of an accurate probability, which is bounded by impossible (p = 0) and certain (p = 1) events. The Shannon surprisal, or logarithmic average of the inverse probability, accurately reflects the infinite cost of falsely reporting an impossible event. Nevertheless, the surprisal score is in practice often viewed as oversensitive to individual test samples when assessing the average performance of an inference engine. There is a need for an assessment methodology which balances the requirement for accuracy in the reported probabilities with robustness to errors and performance of classification decisions.
By demonstrating the equivalence between maximizing the generalized mean of the reported probability of a true event and minimizing a cost function based on the non-additive entropy [6,7,8,9] originally proposed by C. Tsallis, we have developed a simple methodology to assess the decisiveness, accuracy, and robustness of fusion and inferencing algorithms. The equivalence is facilitated by a physical interpretation of Tsallis entropy reflecting the degree of nonlinear coupling between the statistical states of a system [10]. In this view, Shannon information theory is based on the assumption of mutually exclusive statistical states, while non-additive entropy models systems in which nonlinearity creates strong dependence between the statistical states. The surprisal cost function is equivalent to the geometric mean of the reported probabilities, i.e., the joint probability is the product of the independent probabilities. In assessing inferencing algorithms, the level of risk tolerance is modeled as a source of nonlinear coupling. By varying the degree of nonlinear coupling, ‘Decisive’ and ‘Robust’ risk bounds on the ‘Accuracy’ or ‘Neutral’ risk reflected by the average surprisal are established with positive and negative coupling coefficients for the ‘coupled-Surprisal’. Moreover, minimizing the average ‘coupled-Surprisal’ is equivalent to fusing the reported probabilities using the generalized mean. The three effective probabilities, which we refer to as the ‘Decisive Probability’, the ‘Accurate or Risk Neutral Probability’, and the ‘Robust Probability’, provide a risk profile of the inference algorithm.
The remainder of this paper is organized as follows: Section 2 provides background on a fusion algorithm with two power-law parameters and the concept of nonlinear statistical coupling; the alpha parameter forms the generalized mean and the beta parameter determines the effective independence. Section 2.2 reviews non-additive entropy and nonlinear statistical coupling as a model for risk. Section 3 shows that the generalized mean is the probability-combining rule for the generalized entropies defined by Renyi and Tsallis. Section 4 demonstrates the application of the generalized mean to designing and assessing a fusion algorithm. Conclusions and suggestions for future investigation are provided in Section 5.
3. Relationship between Generalized Mean and Generalized Entropies
The Tsallis and Renyi entropies can be expressed using the generalized mean.
The probabilities are separated into p_i and p_i^−κ to emphasize the relation to the generalized mean, with the weight and the sample equal to p_i and the mean parameter α = −κ. The expression for the Renyi entropy is common with the substitution κ = 1 − q; however, the connection between Equations (8) and (6) requires use of the κ-power and κ-product, following the analysis by Oikonomou [39]. This structure shows that the difference between the Tsallis and Renyi entropies is the use of the κ-logarithm, which Oikonomou describes as an external correlation distinct from the correlation internal to the distribution formed by the generalized mean.
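As a check on this relationship, the following sketch restates the standard Tsallis and Renyi entropies in terms of the weighted generalized mean, using the substitution κ = 1 − q quoted above; the notation (and the exact deformed logarithm) may differ from the paper’s Equations (6)–(9), so this should be read as a reconstruction rather than a reproduction of those equations.

```latex
% Tsallis and Renyi entropies expressed through the weighted generalized mean
% of the probabilities (weights p_i, power -kappa), assuming kappa = 1 - q.
\begin{align*}
  M_{-\kappa}(p) &= \Big(\sum_i p_i\, p_i^{-\kappa}\Big)^{-1/\kappa} \\
  S^{\mathrm{Tsallis}}_q &= \frac{1-\sum_i p_i^{q}}{q-1}
      = \frac{\sum_i p_i\, p_i^{-\kappa} - 1}{\kappa}
      = \frac{M_{-\kappa}(p)^{-\kappa} - 1}{\kappa} \\
  S^{\mathrm{Renyi}}_q &= \frac{1}{1-q}\ln\sum_i p_i^{q}
      = \frac{1}{\kappa}\ln\sum_i p_i\, p_i^{-\kappa}
      = -\ln M_{-\kappa}(p)
\end{align*}
```

In this form the Renyi entropy is exactly the negative natural logarithm of the generalized mean, while the Tsallis entropy maps the same mean through a κ-deformed logarithm, consistent with the distinction drawn above.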
The relationship between the generalized mean and generalized entropy can be used to define a scoring rule for decision algorithms, which reports an effective probability for the algorithm. The effective probability is determined using the normalized version of Tsallis entropy as the kernel for the coupled-surprisal −ln_κ(p_i,true), with equally weighted test samples.
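For concreteness, a hedged reconstruction of this scoring rule follows, assuming N equally weighted test samples and the coupled-surprisal form −ln_κ(p) = (1 − p^κ)/κ suggested by the normalized Tsallis kernel; P_eff is our own symbol and the paper’s equation may differ in notation.

```latex
% Average coupled-surprisal over N equally weighted test samples and the
% equivalent effective probability (generalized mean of the true-class probabilities).
\begin{align*}
  \bar{s}_\kappa &= \frac{1}{N}\sum_{i=1}^{N}\big(-\ln_\kappa p_{i,\mathrm{true}}\big)
      = \frac{1}{\kappa}\Big(1-\frac{1}{N}\sum_{i=1}^{N} p_{i,\mathrm{true}}^{\kappa}\Big), \\
  P_{\mathrm{eff}}(\kappa) &= \Big(\frac{1}{N}\sum_{i=1}^{N} p_{i,\mathrm{true}}^{\kappa}\Big)^{1/\kappa},
  \qquad
  P_{\mathrm{eff}}(0) = \Big(\prod_{i=1}^{N} p_{i,\mathrm{true}}\Big)^{1/N}.
\end{align*}
```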
Again, the normalized form is advantageous because the interpretation of κ as the negative risk or optimism is consistent between the maximum entropy distributions, the coupled-surprisal cost function, and the effective probability. The effective probability is a convenient representation of the average score, since test results are commonly expressed as a percentage. The effective probability can be used as a measure of confidence in an inferencing algorithm, with the different values of κ providing insight into the effect of risk or of other physical models which have a related deformation of the information cost function. This confidence level can serve as a basis for comparing algorithms and as a weighting of inputs to fusion engines.
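As an illustration of how the effective probability can be computed in practice, the following Python sketch evaluates the generalized mean of the true-class probabilities at the three risk levels discussed later (κ = +0.5, 0, −0.5); the function name, the power-mean form, and the toy data are our assumptions, not code from the paper.

```python
import numpy as np

def effective_probability(p_true, kappa):
    """Generalized (power) mean of the probabilities reported for the true class.

    p_true : probabilities assigned to the true class, one per test sample
    kappa  : coupling / risk parameter; 0 gives the geometric mean (Shannon surprisal),
             positive values are 'decisive', negative values are 'robust'.
    """
    p = np.asarray(p_true, dtype=float)
    if abs(kappa) < 1e-12:
        # kappa -> 0 limit: geometric mean, i.e. exponential of the mean log-probability
        return float(np.exp(np.mean(np.log(p))))
    return float(np.mean(p ** kappa) ** (1.0 / kappa))

# Toy example: a mostly confident algorithm with one weak report
p_true = [0.9, 0.8, 0.95, 0.4, 0.99]
for kappa, label in [(0.5, "Decisive"), (0.0, "Neutral"), (-0.5, "Robust")]:
    print(f"{label:8s} (kappa = {kappa:+.1f}): {effective_probability(p_true, kappa):.3f}")
```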
Having established the connection between coupled-entropy and the generalized mean, each of the parameters of the alpha-beta fusion method in Equation (1) can be examined in light of the risk bias or degree of nonlinear statistical coupling. Alpha defines the fusion method and is equal to the nonlinear statistical coupling, thus κ_f = α. The input weights modify the confidence of each input or, in terms of risk, w_i = 1 − κ_i. The output confidence can be split into a portion which is the sum of the input weights and a portion which is the output confidence, wβ−1 = 1 − κ_o. Thus the fusion method can be viewed as controlling the risk bias on the input, fusion, and output of the probabilities.
The relationship between the input and output confidence weights and the nonlinear statistical coupling term is seen more clearly using the coupled-logarithm function. Since the generalized mean includes a sum of coupled-logarithms, consider how the input weights affect this sum. If κ_f = 0, the coupled power term reduces to the standard power term, which is the expression for the coupled-probability. Likewise, the output probability is modified by the coupled-probability term 1 − κ_o.
4. Application to Designing and Assessing a Fusion Algorithm
The relationship between risk, generalized entropy, and the generalized mean provides the foundation for a new method to characterize the performance of a fusion algorithm relative to requirements for robustness, accuracy, and decisiveness. A fusion algorithm designed to minimize the Shannon surprisal, or equivalently to maximize the geometric mean (i.e., the generalized mean with κ = 0), provides a ‘Neutral Probability’. Bounding this measure of ‘Accuracy’ are ‘Decisive’ and ‘Robust’ probabilities measured using positive and negative values of κ, respectively. While a variety of options are available in determining the value of κ to use, we choose as an example the values ±0.5, in part because κ = 0.5 has a cost function similar to the Brier or mean-square average; the exact relationship between the coupled-surprisal and the Brier score is given in Equation (15).
The computational experiment used here is the classification of handwritten digits from a collection of Dutch utility maps. The data set consists of six different feature sets extracted from the same source images. The feature sets are Fourier: 76 Fourier coefficients of the character shapes; Profiles: 216 profile correlations; KL-coef: 64 Karhunen-Loève coefficients; Pixel: 240 pixel averages in 2 × 3 windows; Zernike: 47 Zernike moments; and Morph: 6 morphological features. The data is publicly available from the UCI Machine Learning Repository under the name ‘mfeat’ [40]. A similar experiment with a larger number of fusion techniques and more limited evaluation criteria can be found in [41]. The data set consists of 2000 instances, 200 for each of the 10 numeral classes. We have allocated 100 instances per class for training and 100 for testing.
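For readers who want to reproduce the setup, the sketch below loads the six feature files and forms the 100/100 per-class split. The file names and the assumption that rows are grouped in blocks of 200 per digit class follow the usual UCI ‘mfeat’ distribution and should be verified against the repository page; the particular split is ours and not necessarily the one used in the paper.

```python
import numpy as np

# File names as usually distributed by the UCI repository (assumed; check the 'mfeat' page).
FEATURE_FILES = ["mfeat-fou", "mfeat-fac", "mfeat-kar", "mfeat-pix", "mfeat-zer", "mfeat-mor"]

def load_and_split(prefix="", n_train=100):
    """Load the six feature sets and split 100 train / 100 test instances per digit class."""
    y = np.repeat(np.arange(10), 200)                    # rows assumed grouped 200 per class, digits 0-9
    train_idx = np.concatenate([np.where(y == c)[0][:n_train] for c in range(10)])
    test_idx = np.concatenate([np.where(y == c)[0][n_train:] for c in range(10)])
    feature_sets = [np.loadtxt(prefix + name) for name in FEATURE_FILES]
    train_sets = [X[train_idx] for X in feature_sets]
    test_sets = [X[test_idx] for X in feature_sets]
    return train_sets, test_sets, y[train_idx], y[test_idx]
```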
Figure 2a shows example digits reconstructed from the feature set consisting of sampled pixel values.
A Bayesian classifier is trained for each of the six feature sets. In training our Bayesian classifier, we have assumed all classes to have the same covariance matrix with a different mean vector for each class. This assumption will result in linear decision boundaries. The posterior outputs from each classifier are fused using the alpha-beta fusion equation. Although only one fusion technique is used, through varying the parameters we are able to reconstruct and thus compare our results to standard techniques including naïve Bayes, averaging, and log-averaging.
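A minimal sketch of this pipeline follows, using scikit-learn’s shared-covariance Gaussian classifier (LinearDiscriminantAnalysis) and a weighted power-mean fusion of the posteriors. The exact form of the alpha-beta fusion in Equation (1) is not reproduced here, so the beta scaling below (raising the mean to N^β) is our assumption, chosen so that (α = 0, β = 1), (α = 0, β = 0), and (α = 1, β = 0) recover naïve Bayes, log-averaging, and averaging, respectively.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def train_classifiers(train_sets, y_train):
    """One shared-covariance Gaussian classifier (LDA) per feature set, as described above."""
    return [LinearDiscriminantAnalysis().fit(X, y_train) for X in train_sets]

def alpha_beta_fuse(posteriors, alpha, beta, weights=None, eps=1e-12):
    """Hedged sketch of alpha-beta fusion as a weighted power mean of posteriors.

    posteriors : list of (n_samples, n_classes) posterior arrays, one per feature set
    alpha      : power-mean exponent (alpha -> 0 gives the log-average / geometric mean)
    beta       : effective-independence exponent; the mean is raised to N**beta
                 (beta = 0 -> one effective sample, beta = 1 -> naive-Bayes-like product)
    """
    P = np.clip(np.stack(posteriors), eps, 1.0)          # shape (N, n_samples, n_classes)
    N = P.shape[0]
    w = np.full(N, 1.0 / N) if weights is None else np.asarray(weights, float) / np.sum(weights)
    if abs(alpha) < 1e-9:                                # geometric-mean limit
        mean = np.exp(np.tensordot(w, np.log(P), axes=1))
    else:                                                # weighted generalized (power) mean
        mean = np.tensordot(w, P ** alpha, axes=1) ** (1.0 / alpha)
    fused = mean ** (N ** beta)                          # assumed model of effective independence
    return fused / fused.sum(axis=1, keepdims=True)      # renormalize across classes

# Usage sketch: clfs = train_classifiers(train_sets, y_train)
#               posteriors = [clf.predict_proba(X) for clf, X in zip(clfs, test_sets)]
#               fused = alpha_beta_fuse(posteriors, alpha=0.4, beta=0.6)
```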
Figure 2b and Figure 2c show the classification performance without and with the fusion algorithm. The performance of the six feature sets varies from 26 misclassifications for set 2 to 327 misclassifications for set 6. The classification performance using the generalized mean to fuse the feature sets is shown in Figure 2c. The generalized mean is effective in filtering out the poor performance of the weaker feature sets (1, 5, and 6), but does not improve upon the best feature set (2). Positive values of alpha act as smoothing filters relative to the log-average, or equivalently the geometric mean, at α = 0. The best performance for this example is achieved with α = 0.25. Negative values of alpha accentuate differences, which for α < −0.25 significantly degrades the classification performance. The beta parameter, which models the degree of effective independence, does not modify the relative values of the class probabilities and therefore does not influence the classification performance.
Nevertheless, since the output probabilities of a fusion algorithm may be an input for a higher-level fusion process, the accuracy and robustness of the probabilities are important criteria for assessing the fusion performance. The coupled-surprisal and equivalently the generalized mean provide a method to directly examine the decisiveness, accuracy, and robustness characteristics.
Figure 3a shows the histogram of probability outputs for the alpha-beta fusion method optimized against the Shannon surprisal (α = 0.4, β = 0.6) and three common fusion methods: naïve Bayes (α = 0, β = 1), which assumes independence between the inputs; log-averaging (α = 0, β = 0), which assumes the inputs are correlated but does not smooth errors; and averaging (α = 1, β = 0), which assumes both correlation and error in the inputs.
Figure 3b shows the risk profile for each of the fusion methods, formed by the generalized mean of the true class probabilities versus the coupling parameter κ. The histogram for naïve Bayes is skewed severely toward 0 or 1 and, as shown in Figure 3b, performs well only against the decisive metric (κ = 0.5). The performance drops dramatically near κ = −0.2, indicating a lack of robustness. Against the neutral metric (κ = 0.0), log-averaging outperforms averaging and naïve Bayes; here the modeling of correlation provides improvement. Log-averaging continues to perform well against the robust metric (κ = −0.5), indicating that in this example there is not a high degree of error in the different feature sets. Thus the averaging method, which would be advantageous in a situation where errors need to be smoothed, is unnecessary in this example.
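A curve like the one in Figure 3b can be traced by sweeping κ and evaluating the generalized mean of the probabilities the fused output assigns to the true class. The sketch below reuses the effective_probability and alpha_beta_fuse helpers sketched earlier; the variable names are ours.

```python
import numpy as np

def risk_profile(fused_posteriors, y_true, kappas=np.linspace(-1.0, 1.0, 41)):
    """Generalized mean of the true-class probabilities as a function of kappa."""
    p_true = fused_posteriors[np.arange(len(y_true)), y_true]   # probability assigned to the true class
    return [(float(k), effective_probability(p_true, float(k))) for k in kappas]

# Usage sketch: profile = risk_profile(alpha_beta_fuse(posteriors, alpha=0.4, beta=0.6), y_test)
```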
Figure 2.
(a) Examples of the handwritten numerals used as a classification and inferencing problem. (b) Individual misclassification for the six feature sets. (c) Fusion of the feature sets using the generalized mean with alpha varied between −1 and 2.
The fusion performance can be optimized using the alpha-beta algorithm described in Section 2.1. Figure 4 shows the performance of the alpha-beta algorithm against the Shannon, Brier, Decisive, and Robust metrics. Selection of the alpha-beta parameters could be based on one of these metrics or on a combination of requirements. The optimal value for each metric is circled in each diagram. For the Shannon surprisal the optimal value is (α = 0.4, β = 0.6), and its risk profile is shown in Figure 3b. This shows that the Decisive performance is very similar to that of naïve Bayes, and its Robust performance does not decay rapidly until κ < −0.5. Although the Brier score, as a proper score, provides an unbiased inference assessment, its origin in a linear (κ = 1) metric, see Equation (15), makes it more favorable toward decisive algorithms than the logarithmic metric (κ = 0). In particular, forecasts extremely close to zero are not heavily penalized, which may be adequate in assessing the classification performance of an algorithm, but would lack robustness if the probability is an input to fusion algorithms utilizing Bayesian analysis. In this example, the optimal parameter for the effective independence moves from β = 0.6 to 0.8, which will be more decisive and less robust.
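One way to carry out this parameter selection is a plain grid search over (α, β) that maximizes the effective probability for a chosen risk level. The sketch below assumes the alpha_beta_fuse and effective_probability helpers sketched earlier and is illustrative only, not the optimization used for Figure 4.

```python
import numpy as np

def select_alpha_beta(posteriors, y_true, kappa, alphas, betas):
    """Grid search for the (alpha, beta) pair maximizing the effective probability at a given kappa."""
    best = (None, None, -np.inf)
    for a in alphas:
        for b in betas:
            fused = alpha_beta_fuse(posteriors, alpha=a, beta=b)   # assumed helper (see earlier sketch)
            p_true = fused[np.arange(len(y_true)), y_true]         # probability of the true class
            score = effective_probability(p_true, kappa)           # assumed helper (see earlier sketch)
            if score > best[2]:
                best = (a, b, score)
    return best

# Usage sketch: select_alpha_beta(posteriors, y_test, kappa=0.0,
#                                 alphas=np.linspace(-1, 2, 13), betas=np.linspace(0, 1, 11))
```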
Figure 3.
(a) Histogram of the probabilities assigned to the true class for four fusion methods. (b) A risk profile based on the generalized mean of the true class probabilities versus the coupling parameter κ. The naïve-Bayes is a decisive fusion method which has near perfect score for large positive values of κ, but lacks robustness which is reflected in the sharp drop in the generalized mean for negative values of κ. Averaging and log-averaging are more robust methods; the generalized mean decays slower for negative values of κ but does not achieve as high a value for positive κ. Using the alpha-beta fusion method, the Neutral (Shannon Surprisal) metric is optimized for α = 0.4, β = 0.6 and has improved robustness relative to the naïve-Bayes.
Algorithms like the Dempster-Shafer belief functions [42], in which probabilities based on Bayesian analysis are augmented by more decisive or more robust beliefs, are anticipated. In the Dempster-Shafer methods, the Plausibility and Belief functions are computed by considering power sets of the system states. The power sets provide a means to consider the ambiguity in defining states, relaxing the assumption of mutually exclusive states which is implicit in Bayesian analysis. In much the same way, the nonlinear statistical coupling method relaxes the assumption of mutually exclusive states by considering the coupled-probabilities of a system, which are nonlinear combinations of all the states. While the D-S methods provide a great deal of flexibility in how states are combined for particular beliefs, in practice simplifications are required to maintain computational effectiveness. Although the nonlinear statistical coupling is a global mixing of all the states, the mathematical rigor of the approach provides both computational efficiency and analytical clarity. Furthermore, the weights w_i of the alpha-beta fusion method in Equation (1) can be viewed as an individual risk bias on the inputs equal to κ_i = 1 − w_i. These algorithmic methods will be considered in more detail in a future publication.
Figure 4.
Performance of alpha-beta fusion against effective probabilities based on the cost-function for (a) Shannon surprisal, κ = 0; (b) Brier or mean-square average; (c) Robust Coupled-Surprisal, κ = −0.5; and (d) Decisive Coupled-Surprisal, κ = 0.5. The circles indicate the region of optimal performance.
5. Conclusions
In this paper we have clarified the relationship between the generalized mean of probabilities and the generalized entropy functions defined by Tsallis and Renyi. The relationship is facilitated by defining as a physical property the nonlinear statistical coupling κ, which is a translation of the parameter q used by Tsallis and Renyi for the generalized entropy functions, κ = 1 − q. Equations (8) and (9) show that both the Tsallis and Renyi entropy functions include the generalized mean of the probability states, and their difference is the translation of the mean to an entropy scale using either the deformed logarithm for Tsallis entropy or the natural logarithm for Renyi entropy. Using the generalized mean of probabilities directly as a scoring rule provides useful intuition because the scale from 0 to 1 is simple to associate with the performance of a test.
The coupling parameter κ of the generalized mean is associated with the degree of negative risk or optimism which biases the metric. The association with risk comes from the deformation of the accurate or neutral information cost function defined by the negative logarithm, known as the Shannon surprisal. The performance of an inference algorithm is evaluated relative to the risk bias by measuring the average coupled-surprisal of the reported probabilities of the true class, or equivalently the generalized mean of the true probabilities. As a starting point for exploring applications of this metric, we define Decisive (κ = 0.5), Neutral or Accurate (κ = 0.0), and Robust (κ = −0.5) probability metrics. Together these metrics highlight the balance between decisiveness and robustness needed for a fusion algorithm to be effective.
An effective fusion algorithm must model the potential errors and correlations between input nodes. Further, the algorithm must be able to scale to large numbers of inputs without becoming computationally prohibitive. We have designed a fusion algorithm which uses the generalized mean to model the smoothing of errors and a second parameter N^β to model the effective number of independent samples given N inputs. Together, the alpha-beta fusion algorithm of Equation (1) provides a system-wide model of sensitivity and correlation. The alpha parameter is equivalent to the nonlinear statistical coupling κ. As the coupling between states is increased, the output of the fusion is more uniform between the states, resulting in a less decisive but more robust inference. The beta parameter ranges from one effective sample (β = 0) to N effective samples (β = 1), providing a model of correlation between the samples.