1. Introduction
Soft fusion of classifiers [1,2,3,4,5] consists of combining the individual scores assigned to each class by a set of classifiers to obtain a single score for each of the classes involved. The combination, or fusion, of the individual scores is performed with the hope that a certain complementarity among the different classifiers allows improving the estimation of the correct class. Soft fusion of classifiers is a late fusion, which presents significant advantages with respect to the early fusion of features [6,7,8,9,10]: it does not suffer from the problem of increased dimensionality, it allows a weighted combination according to the quality of each classifier, and it simplifies the synchronization problems that arise when different modalities are to be fused. On the other hand, the so-called hard or decision fusion [2,11] is also a late fusion, in which the decisions made by the individual classifiers are combined to make a final decision. While allowing very simple combination rules (for example, majority voting), hard fusion entails a loss of information that prevents the full exploitation of the complementarity between the individual classifiers.
Soft fusion of classifiers can be part of complex information fusion systems, such as multimodality fusion [12,13,14,15], sensory data fusion [16,17], or image fusion [18,19]. Although soft fusion is considered a consolidated procedure, it still presents certain controversies. For example, if we combine a good classifier with weak classifiers, will we improve or worsen the performance of the good classifier? Or, more generally, what level of improvement can be expected after the fusion, based on certain properties of the individual classifiers? These and other similar questions have mostly been addressed through essentially experimental studies in specific models and/or applications [20,21,22,23,24,25,26]. Thus, the generalization of the conclusions to environments other than those used in the experiments is questionable. In short, there is a significant lack of theoretical foundations, which can only be reached through mathematical developments. This is a problem that often appears in the field of machine learning, since mathematical analysis can be very complex and it is tempting to directly carry out experimental verifications, given the current high availability of data and computational resources.
The main objective of this work is to make analytical contributions toward a better understanding of the soft fusion of classifiers. These contributions are, nevertheless, contrasted with real data from a biomedical application. We focus on linear fusion and the two-class problem, as this is the most analytically tractable setting and leads to closed-form solutions. In any case, it must be taken into account that linear fusion marks a lower bound on the performance that could eventually be achieved through non-linear solutions [27]. On the other hand, the two-class problem is a significant starting point for the extension to the multi-class problem [28], which can be the subject of further work. Therefore, the conclusions may be of general interest.
Regarding the optimal linear combiner that we propose, it presents a novelty with respect to other similar approaches: the cost function is expressed in terms of the mean square error (MSE) of the estimation of the discrete variable that defines the classes, which is decomposed as the sum of the conditional MSEs for each class. Thus, we show that the MSE after fusion can be expressed in terms of the individual MSEs. The latter facilitates the determination of the parameters that influence the potential improvement of the fusion with respect to the best individual classifier. Furthermore, we consider exponential models for the class-conditional probability densities to establish the relationship between the classifier’s error probability and the MSE of the class estimate. This allows us to predict the reduction in the post-fusion error probability relative to that of the best classifier. These theoretical findings are contrasted in a biosignal application for the detection of arousals during sleep from EEG signals. The results obtained are reasonably consistent with the theoretical conclusions.
2. The Optimal Linear Combiner
Let us call $s_i$ the score assigned by the individual classifier $i$ to class 1. Considering $s_i$ as the posterior probability of class 1, the score assigned to class 0 should be $1 - s_i$. Soft fusion of classifiers consists of generating a score, $s$, assigned to class 1 (and so $1 - s$ to class 0) by combining the scores $s_1, \dots, s_D$, respectively, provided by $D$ individual classifiers, as indicated in Figure 1.
We then take an estimation approach. The fused score, $s$, can be interpreted as an estimate of the binary random variable, $d$, from the “observations” $s_1, \dots, s_D$. It is $d = 1$ when the true class is class 1 and $d = 0$ when the true class is class 0. Conditional to $d = 1$, $s$ should have values close to 1, and conditional to $d = 0$, it should be close to 0. It is well known that the optimal estimate of $d$ minimizing the MSE is the conditional mean $E[d \mid s_1, \dots, s_D]$, which is, in general, a complex non-linear function, not expressible in closed form. Therefore, we consider the optimal linear combination of scores. On the other hand, it is quite important to relate the MSE of the fused estimate with the MSEs of the individual estimates $s_i$. This facilitates the analysis of the expected improvement due to the fusion, which is the essential objective of this work. Thus, given the set of coefficients $w_1, \dots, w_D$, we can write the fused score as $s = \sum_{i=1}^{D} w_i s_i$ and its MSE as $E[(d - s)^2]$; the latter can also be expressed as the sum of the class-conditional contributions, $\mathrm{MSE} = P_1 (v_1 + \beta_1^2) + P_0 (v_0 + \beta_0^2)$, where $P_j$, $v_j$, and $\beta_j$, respectively, are the prior, the conditional variance, and the conditional bias corresponding to class $j$. Let us define the vectors $\mathbf{w} = [w_1, \dots, w_D]^T$ and $\mathbf{s} = [s_1, \dots, s_D]^T$, so that $s = \mathbf{w}^T \mathbf{s}$. We can compute the conditional variances and bias as $v_j = \mathbf{w}^T \mathbf{V}_j \mathbf{w}$ and $\beta_j = \mathbf{w}^T \mathbf{b}_j$, $j = 0, 1$,
where $\mathbf{R}_j$ and $\mathbf{V}_j$ are, respectively, the correlation and covariance matrices of the score vector conditional to class $j$, $\mathbf{b}_j$ is the corresponding vector of conditional biases of the individual scores, and where we assumed $\mathbf{w}^T \mathbf{1} = 1$. The latter is a regularization condition to allow for a non-trivial solution in the following and to avoid an anomalous increase in bias in the fused score. Considering (3) in (2), the MSE becomes $\mathrm{MSE} = \mathbf{w}^T \mathbf{Q} \mathbf{w}$,
where $\mathbf{Q} = P_1(\mathbf{V}_1 + \mathbf{b}_1 \mathbf{b}_1^T) + P_0(\mathbf{V}_0 + \mathbf{b}_0 \mathbf{b}_0^T)$.
We next obtain the coefficients that minimize $\mathbf{w}^T \mathbf{Q} \mathbf{w}$ under the constraint $\mathbf{w}^T \mathbf{1} = 1$, where $\mathbf{1}$ is defined as a D-dimensional vector of ones. Notice that additionally requiring $w_i \geq 0$, $i = 1, \dots, D$, would be a sufficient condition to ensure that $0 < s < 1$. We do not explicitly impose this non-negativity in the following, since it would prevent reaching a closed-form solution. Therefore, eventually, some fused scores can fall outside the interval (0,1). This is not critical, as we finally must make a decision by comparing $s$ with a threshold in the interval (0,1): we can simply round down to 1, in case $s > 1$, and round up to 0, in case $s < 0$, thus not affecting the final decision. We used the method of Lagrange multipliers, i.e., we minimize $\mathbf{w}^T \mathbf{Q} \mathbf{w} + \lambda(\mathbf{w}^T \mathbf{1} - 1)$ with respect to $\mathbf{w}$ and $\lambda$. The corresponding linear least-mean-square-error (LLMSE) solution thus obtained is $\mathbf{w}_{\mathrm{opt}} = \mathbf{Q}^{-1}\mathbf{1} / (\mathbf{1}^T \mathbf{Q}^{-1} \mathbf{1})$, with $\mathrm{MSE} = 1 / (\mathbf{1}^T \mathbf{Q}^{-1} \mathbf{1})$ (to ease the notation, we simply indicate $\mathrm{MSE}$ instead of $\mathrm{MSE}(\mathbf{w}_{\mathrm{opt}})$).
Notice that the MSE corresponding to every individual classifier can be written similarly to (2); these individual MSEs are the entries of the main diagonal of $\mathbf{Q}$, i.e., $\mathrm{MSE}_i = [\mathbf{Q}]_{ii}$. So, in conclusion, we derived the expression of the LLMSE estimator in such a way that $\mathbf{w}_{\mathrm{opt}}^T \mathbf{1} = 1$, and the provided MSE is expressed in terms of the matrix $\mathbf{Q}$, whose main diagonal entries are the MSEs provided by the individual classifiers. As we will see in the next section, all this will allow us to analyze the expected improvements due to the fusion of classifiers.
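To make the combiner concrete, the following Python sketch (illustrative names; it assumes that the conditional covariance matrices, bias vectors, and priors have already been estimated) builds the matrix $\mathbf{Q}$ of (5) and returns the optimal weights of (6) together with the fused MSE:

import numpy as np

def llmse_combiner(V1, V0, b1, b0, P1, P0):
    """Optimal linear combiner of D classifier scores for the two-class problem.

    V1, V0 : (D, D) conditional covariance matrices of the score vector
    b1, b0 : (D,) conditional bias vectors
    P1, P0 : class priors
    Returns the weight vector (summing to 1) and the fused MSE.
    """
    Q = P1 * (np.asarray(V1) + np.outer(b1, b1)) + P0 * (np.asarray(V0) + np.outer(b0, b0))
    ones = np.ones(Q.shape[0])
    Qinv_ones = np.linalg.solve(Q, ones)
    w = Qinv_ones / (ones @ Qinv_ones)        # w^T 1 = 1
    mse_fused = 1.0 / (ones @ Qinv_ones)      # minimum achievable MSE
    return w, mse_fused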
3. The Improvement Factor
For notation convenience, let us consider that the classifiers are ordered so that $\mathrm{MSE}_1 \leq \mathrm{MSE}_i$, $i = 2, \dots, D$, so that classifier 1 is the best individual classifier. The improvement in terms of MSE can be computed by comparing $\mathrm{MSE}$ with the MSE provided by the best of all the individual classifiers. So, let us define the MSE improvement factor $F = \mathrm{MSE}_1 / \mathrm{MSE}$. It can be demonstrated by reductio ad absurdum that $F \geq 1$: if $F$ were smaller than 1, then there would be a solution to (6), namely $\mathbf{w} = [1, 0, \dots, 0]^T$, providing a mean square error $\mathrm{MSE}_1 < \mathrm{MSE}$; but, by definition, $\mathrm{MSE}$ is the minimum achievable mean square error. Therefore, the first conclusion is that the optimal linear fusion of classifiers, as previously defined, will be at least as good (in the MSE sense) as the best of the individual classifiers. Furthermore, the specific value of $F$ depends exclusively on $\mathbf{Q}$. Hence, we can obtain sets of scores from previously trained individual classifiers, so that we can use (5) to obtain a sample estimate, $\hat{\mathbf{Q}}$, from sample estimates of the covariance matrices, the bias vectors, and the priors: $\hat{\mathbf{V}}_1, \hat{\mathbf{V}}_0, \hat{\mathbf{b}}_1, \hat{\mathbf{b}}_0, \hat{P}_1, \hat{P}_0$. Then, $\hat{F}$ can be estimated from (7).
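In practice, this estimation can be carried out directly from a labeled set of score vectors. The following sketch (hypothetical variable names; the scores are stacked one row per instance) obtains the sample statistics, builds $\hat{\mathbf{Q}}$, and computes the improvement factor as the ratio between the smallest diagonal entry of $\hat{\mathbf{Q}}$ and the fused MSE:

import numpy as np

def improvement_factor(S1, S0):
    """S1: (N1, D) scores for class-1 instances; S0: (N0, D) scores for class-0 instances."""
    N1, N0 = len(S1), len(S0)
    P1, P0 = N1 / (N1 + N0), N0 / (N1 + N0)                 # sample priors
    V1, V0 = np.cov(S1, rowvar=False), np.cov(S0, rowvar=False)
    b1, b0 = S1.mean(axis=0) - 1.0, S0.mean(axis=0)         # conditional biases
    Q = P1 * (V1 + np.outer(b1, b1)) + P0 * (V0 + np.outer(b0, b0))
    ones = np.ones(Q.shape[0])
    Qinv_ones = np.linalg.solve(Q, ones)
    mse_fused = 1.0 / (ones @ Qinv_ones)                    # MSE of the fused score
    mse_best = np.diag(Q).min()                             # MSE of the best individual classifier
    return mse_best / mse_fused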
With the intention of achieving a better understanding of the essential elements that affect $F$, we now consider some simple yet relevant models, obtained by assuming simplified forms of $\mathbf{Q}$.
3.1. Fusion of Equally Operated Classifiers (EOCs)
Let us consider the case where all individual classifiers have equal variances, equal pairwise covariances, and equal bias. We call this situation “equally operated classifiers” (EOCs). Although, in practice, this may not strictly be the case, it is representative of what might be expected when all classifiers have a similar individual performance and a similar dependence on each other. More formally, this corresponds to conditional covariance matrices $\mathbf{V}_j = \sigma^2[(1-\rho)\mathbf{I} + \rho\,\mathbf{1}\mathbf{1}^T]$ and conditional bias vectors $\mathbf{b}_j = b\,\mathbf{1}$, where $\rho$ is the covariance coefficient. For simplicity, we consider the same models for both classes, i.e., $\mathbf{V}_1 = \mathbf{V}_0 = \mathbf{V}$ and $\mathbf{b}_1 = \mathbf{b}_0 = b\,\mathbf{1}$, so that $\mathbf{Q} = \mathbf{V} + b^2\,\mathbf{1}\mathbf{1}^T$. Then, considering (5) and that $P_1 + P_0 = 1$, matrix $\mathbf{Q}$ can be written in the form $\mathbf{Q} = \sigma^2(1-\rho)\mathbf{I} + (\sigma^2\rho + b^2)\,\mathbf{1}\mathbf{1}^T$.
Let us define $\mathbf{Q} = \mathbf{A} + \mathbf{B}$, with $\mathbf{A} = \sigma^2(1-\rho)\mathbf{I}$ and $\mathbf{B} = (\sigma^2\rho + b^2)\,\mathbf{1}\mathbf{1}^T$, where the rank of $\mathbf{B}$ is 1 and $\mathbf{A}$ is nonsingular. We can apply Miller's lemma to compute the inverse (Equation (12)): $\mathbf{Q}^{-1} = \mathbf{A}^{-1} - \frac{1}{1 + \mathrm{tr}(\mathbf{B}\mathbf{A}^{-1})}\,\mathbf{A}^{-1}\mathbf{B}\mathbf{A}^{-1}$. From (6) and (12), we can write $\mathrm{MSE} = \sigma^2[1 + (D-1)\rho]/D + b^2$. Finally, considering that, in this case, all the individual classifiers have the same MSE, $\mathrm{MSE}_1 = \sigma^2 + b^2$, we may write the expression of the improvement factor as $F = \dfrac{\sigma^2 + b^2}{\sigma^2[1 + (D-1)\rho]/D + b^2}$ (Equation (14)).
Notice that $F \geq 1$, as it should be. For a given $D$ and a given bias $b$, the maximum reduction is achieved if the individual scores are uncorrelated ($\rho = 0$), giving $F = \dfrac{\sigma^2 + b^2}{\sigma^2/D + b^2}$ (Equation (15)), which in turn reaches its maximum value of $D$ if $b = 0$. Thus, bias is a limiting factor for the reduction in the MSE. Actually, we demonstrate below that, in EOCs, it is $\mathbf{w}_{\mathrm{opt}} = \frac{1}{D}\mathbf{1}$, so from (3), the fused score, $s$, has the same bias, $b$, as the individual scores, $s_i$. Therefore, the fusion improvement is essentially due to the reduction in the variance. Other interesting conclusions can be derived from (14). For example, if $\rho = 1$, that is, when the scores are fully correlated, the set of individual classifiers actually behaves like a single classifier ($F = 1$). Equation (14) is also consistent when fusion is not implemented, i.e., $D = 1$ gives $F = 1$.
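The behaviour of the EOC improvement factor is easy to explore numerically. The sketch below evaluates the expression reconstructed above, $F = (\sigma^2 + b^2)/(\sigma^2[1 + (D-1)\rho]/D + b^2)$; the parameter values are only illustrative:

def eoc_improvement(D, rho, sigma2=1.0, b=0.0):
    """Improvement factor for D equally operated classifiers (EOC model)."""
    mse_best = sigma2 + b ** 2
    mse_fused = sigma2 * (1 + (D - 1) * rho) / D + b ** 2
    return mse_best / mse_fused

print(eoc_improvement(D=5, rho=0.0))           # uncorrelated, unbiased: F = D = 5
print(eoc_improvement(D=5, rho=1.0))           # fully correlated: F = 1
print(eoc_improvement(D=5, rho=0.0, b=0.3))    # bias limits the achievable improvement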
We showed that, in the EOC case, the parameters affecting the improvement factor, $F$, are the number of classifiers, $D$; the pairwise correlation, $\rho$; and the bias, $b$. This is illustrated in Figure 2. Thus, in Figure 2a, we show $F$ as a function of $D$ for different values of $\rho$; bias $b$ was assumed to be 0. We see that $F$ increases with $D$, but decreases with $\rho$, from a maximum value of $D$ for $\rho = 0$ to a minimum value of 1 for $\rho = 1$. Similarly, in Figure 2b, we show $F$ as a function of $D$ for different values of $b$; correlation $\rho$ was assumed to be 0. Again, $F$ increases with $D$ but, as expected, decreases for an increasing bias.
Let us finally compute the optimum coefficient vector, $\mathbf{w}_{\mathrm{opt}}$, corresponding to the EOCs. Substituting the above expression of $\mathbf{Q}^{-1}$ into (6) yields $\mathbf{w}_{\mathrm{opt}} = \frac{1}{D}\mathbf{1}$. Thus, as expected, the mean of the scores is the LLMSE solution in the case of EOCs, always achieving the improvement factor given in (14).
3.2. Fusion of Non-Equally Operated Classifiers (NEOCs)
Notice that, in the case of individual classifiers with disparate performances, it is reasonable to consider that the scores they provide are uncorrelated. This simplifies an analysis that would otherwise be intractable. Furthermore, as for the EOCs, we assume for simplicity that the covariance–bias model is the same for both classes; then, $\mathbf{Q} = \boldsymbol{\Lambda} + \mathbf{b}\mathbf{b}^T$, where $\boldsymbol{\Lambda} = \mathrm{diag}(\sigma_1^2, \dots, \sigma_D^2)$ is the diagonal matrix of the individual score variances and $\mathbf{b}$ is the common bias vector. Let us use Miller's lemma again to calculate the inverse of $\mathbf{Q}$: $\mathbf{Q}^{-1} = \boldsymbol{\Lambda}^{-1} - \dfrac{\boldsymbol{\Lambda}^{-1}\mathbf{b}\mathbf{b}^T\boldsymbol{\Lambda}^{-1}}{1 + \mathbf{b}^T\boldsymbol{\Lambda}^{-1}\mathbf{b}}$. This leads to a complicated expression due to the arbitrary combinations of bias and variance in each individual classifier. However, we can impose some simplification to gain insights into the expected improvement. Thus, we may consider that all the individual scores have the same bias, $b$. Then, we can write $F = \dfrac{\sigma_1^2 + b^2}{1/S + b^2}$, where $S = \sum_{i=1}^{D} 1/\sigma_i^2$ (Equation (21)). Considering that $\sigma_1^2$ is the smallest of the individual variances, $1 \leq \sigma_1^2 S \leq D$. If $\sigma_1^2 \ll \sigma_i^2$, $i \neq 1$, we have a very unbalanced scenario; then, $F \approx 1$, i.e., if the best classifier is “much better” than the rest, little improvement can be expected from the fusion. In the opposite case, we have a totally balanced situation if $\sigma_i^2 = \sigma_1^2$ for all $i$, and then $F = \dfrac{\sigma_1^2 + b^2}{\sigma_1^2/D + b^2}$. Moreover, for $b = 0$, $F$ reaches its maximum value $D$, which consistently coincides with (15) (this is actually the EOC case with uncorrelated classifiers). In general, every classifier contributes a term $\sigma_1^2/\sigma_i^2 \leq 1$ to increase $F$.
Let us define the unbalance factor $u = 1 - \frac{1}{D-1}\sum_{i=2}^{D} \sigma_1^2/\sigma_i^2$, such that $0 \leq u \leq 1$, being $u = 0$ in the totally balanced case ($\sigma_i^2 = \sigma_1^2$ for all $i$), and $u \to 1$ for the very unbalanced case ($\sigma_i^2 \gg \sigma_1^2$, $i \neq 1$).
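The NEOC improvement factor only depends on the individual variances and the common bias. A minimal sketch, assuming the uncorrelated-scores model and the fused MSE derived above ($1/S + b^2$), illustrates the balanced and unbalanced cases with illustrative variance values:

import numpy as np

def neoc_improvement(variances, b=0.0):
    """Improvement factor for uncorrelated classifiers with a common bias b (NEOC model)."""
    variances = np.asarray(variances, dtype=float)
    mse_best = variances.min() + b ** 2
    mse_fused = 1.0 / np.sum(1.0 / variances) + b ** 2
    return mse_best / mse_fused

print(neoc_improvement([1.0, 1.0, 1.0, 1.0, 1.0]))   # balanced, unbiased: F = D = 5
print(neoc_improvement([0.1, 1.0, 1.0, 1.0, 1.0]))   # unbalanced: F is about 1.4, far below 5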
We showed that, in NEOCs, the parameters affecting the improvement factor, $F$, are the number of classifiers, $D$; the unbalance factor, $u$; and the bias, $b$. This is illustrated in Figure 3. Thus, in Figure 3a, we show $F$ as a function of $u$ for different values of $D$; the bias was assumed to be 0. According to (21), this implies that $F = D - (D-1)u$, explaining the decreasing straight line varying from a maximum value of $D$ for $u = 0$ to the minimum value of 1 for $u = 1$. Similarly, in Figure 3b, we show $F$ as a function of $b$ for different values of $D$; the unbalance factor $u$ was assumed to be greater than 0. Again, $F$ decreases for an increasing bias. Moreover, notice that Figure 2b and Figure 3b show similar curves, but in the first case, $u = 0$ (all classifiers show an equal performance), while $u > 0$ is assumed in the second case; then, $F$ suffers a certain degradation, reaching the maximum value of $D - (D-1)u$ for $b = 0$.
Let us finally compute the optimum coefficient vector, $\mathbf{w}_{\mathrm{opt}}$, corresponding to the NEOCs. Substituting the above expression of $\mathbf{Q}^{-1}$ into (6), we obtain $\mathbf{w}_{\mathrm{opt}} \propto \boldsymbol{\Lambda}^{-1}\mathbf{1} - \dfrac{\mathbf{b}^T\boldsymbol{\Lambda}^{-1}\mathbf{1}}{1 + \mathbf{b}^T\boldsymbol{\Lambda}^{-1}\mathbf{b}}\,\boldsymbol{\Lambda}^{-1}\mathbf{b}$, normalized so that $\mathbf{w}_{\mathrm{opt}}^T\mathbf{1} = 1$. We see that the optimum linear combiner is the vector $\boldsymbol{\Lambda}^{-1}\mathbf{1}$, whose entries are the inverse variances and which gives more relevance to the classifiers having a smaller variance, minus a linear transformation of this vector, which depends on the bias contributions. Actually, for unbiased classifiers, the optimum coefficients are simply proportional to the inverse variances.
4. From Mean Square Error to Probability of Error
Given a score, $s$, we have to decide between class 1 and class 0. Interpreting $s$ as the posterior probability of class 1, we implement the MAP test: decide class 1 if $s > 1/2$, and class 0 otherwise. Then, the performance of the classifier can be evaluated by the probability of error, $P_e$. Obviously, the lower the MSE, the lower $P_e$ should be. However, this is not demonstrable in a general case, as the MSE and $P_e$ are quite different integrals of the class-conditional probability density functions (pdfs); namely, $P_e = P_0 \int_{1/2}^{\infty} p(s \mid 0)\,ds + P_1 \int_{-\infty}^{1/2} p(s \mid 1)\,ds$, whereas the MSE involves the second-order moments of these same pdfs.
For this reason, we now assume a simple but representative model of the conditional pdfs. Let us consider that $p(s \mid 0)$ and $p(s \mid 1)$ are exponential pdfs in the interval $(0, 1)$, i.e., $p(s \mid 0) = \lambda_0 e^{-\lambda_0 s}\,u(s)$ and $p(s \mid 1) = \lambda_1 e^{-\lambda_1 (1-s)}\,u(1-s)$, where $u(\cdot)$ is the unit step function. In (25), we assumed that $\lambda_0$ and $\lambda_1$ are large enough so that $e^{-\lambda_0} \approx e^{-\lambda_1} \approx 0$, and so the exponentials are appropriate pdfs for scores. We show in Figure 4 two exponential class-conditional pdfs of this type.
We can express $P_e$ in (24) as the combination of the probability of false positives, $P_{FP}$ (type I error), and the probability of false negatives, $P_{FN}$ (type II error). Considering (23) and the exponential models (25), we have $P_e = P_0 P_{FP} + P_1 P_{FN}$, with $P_{FP} = e^{-\lambda_0/2}$ and $P_{FN} = e^{-\lambda_1/2}$. In general, as we did in (2) for the fused score, we can express the MSE as the combination of the MSEs conditional to every possible value of the random variable $d$: $\mathrm{MSE} = P_0\,\mathrm{MSE}^{(0)} + P_1\,\mathrm{MSE}^{(1)}$, where, under the exponential models, $\mathrm{MSE}^{(0)} = 2/\lambda_0^2$ and $\mathrm{MSE}^{(1)} = 2/\lambda_1^2$. Solving for $\lambda_0$ and $\lambda_1$ in (27) and substituting in (26), we arrive at $P_e = P_0\,e^{-1/\sqrt{2\,\mathrm{MSE}^{(0)}}} + P_1\,e^{-1/\sqrt{2\,\mathrm{MSE}^{(1)}}}$ (Equation (28)).
Thus, in (28), we see the definite impact that the minimization of the MSE has on the minimization of the probability of error. Hence, it is of great interest to relate the improvement factor previously defined with the expected improvement of $P_e$ with respect to the probability of error of the best individual classifier. This is straightforward in the case that the conditional MSEs are equal, $\mathrm{MSE}^{(0)} = \mathrm{MSE}^{(1)}$: then, the post-fusion probability of error is $P_{e,\mathrm{fused}} = P_{e,\mathrm{best}}^{\sqrt{F}}$ (Equation (29)). But, in general, we need to define a separate improvement factor for every type of error, $F_0 = \mathrm{MSE}^{(0)}_{\mathrm{best}} / \mathrm{MSE}^{(0)}_{\mathrm{fused}}$ and $F_1 = \mathrm{MSE}^{(1)}_{\mathrm{best}} / \mathrm{MSE}^{(1)}_{\mathrm{fused}}$, so that $P_{e,\mathrm{fused}} = P_0\,P_{FP}^{\sqrt{F_0}} + P_1\,P_{FN}^{\sqrt{F_1}}$ (Equation (30)).
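Under the exponential score model, the mapping from MSE to probability of error, and hence the post-fusion prediction, can be written compactly. The sketch below follows the relations reconstructed above for (28)–(30); it is an illustration under those modelling assumptions, not a general result:

import numpy as np

def error_from_mse(mse0, mse1, P0=0.5, P1=0.5):
    """Probability of error under the exponential score model (relation (28))."""
    p_fp = np.exp(-1.0 / np.sqrt(2.0 * mse0))    # type I error
    p_fn = np.exp(-1.0 / np.sqrt(2.0 * mse1))    # type II error
    return P0 * p_fp + P1 * p_fn

def post_fusion_error(p_fp, p_fn, F0, F1, P0=0.5, P1=0.5):
    """Predicted probability of error after fusion, given the per-class improvement factors."""
    return P0 * p_fp ** np.sqrt(F0) + P1 * p_fn ** np.sqrt(F1)

# Example: a best classifier with 80% accuracy and a fusion giving F0 = F1 = 3
p_e = 1.0 - 0.8
print(1.0 - post_fusion_error(p_e, p_e, 3.0, 3.0))   # about 0.94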
Let us show some illustrative examples of the above. We consider, for simplicity, the case $\mathrm{MSE}^{(0)} = \mathrm{MSE}^{(1)}$ (and, hence, $P_{FP} = P_{FN}$). Let us define the theoretical accuracy $\mathrm{Acc} = 1 - P_e$. Using (29), we represent in Figure 5a the post-fusion accuracy, $\mathrm{Acc}_{\mathrm{fused}}$, as a function of the accuracy, $\mathrm{Acc}$, of the best classifier for different values of $F$. The significant increase in accuracy with the increase in $F$ is notable. Thus, for example, a value of $\mathrm{Acc} = 0.8$ increases to approximately 0.94 for $F = 3$. From the previous analysis, we know that 3 uncorrelated and unbiased classifiers would be enough to provide such an improvement factor. Furthermore, we represent in Figure 5b the accuracy, $\mathrm{Acc}_{\mathrm{fused}}$, as a function of $F$ for different values of $\mathrm{Acc}$. Once again, the significant accuracy improvement that can be achieved with improvement factors that are not excessively high is notable. This is especially true for weak classifiers that provide relatively low accuracy.
5. Real Data Experiments
In this section, we consider some experiments with real data. These involve the automatic classification of sleep stages based on the information provided by different biosignals (polysomnograms). This classification is usually performed manually by a medical expert, resulting in a slow and tedious procedure. In particular, we consider the detection of very short periods of wakefulness [29,30] (also called arousals), as their frequency is related to the presence of apnea and epilepsy. The objective is by no means to present the most precise method for solving this complex problem, among the myriad of options and variants available. Rather, it is to verify, in a real-life context, how the fusion of several classifiers achieves results consistent with the theory outlined above. The data were obtained from the public database Physionet [31]. Ten patients were considered. Every patient was monitored during sleep, thus obtaining the corresponding polysomnogram. In this experiment, we considered the electroencephalograms (EEGs). Then, 8 EEG features were obtained in non-overlapping intervals of 30 s (epochs) from every EEG recording. The number of epochs varied for every patient, ranging from 749 (6 h and 14 min) to 925 (7 h and 42 min). The 8 features were the powers in the frequency bands delta (0–4 Hz), theta (5–7 Hz), alpha (8–12 Hz), sigma (13–15 Hz), and beta (16–30 Hz), and the 3 Hjorth parameters: activity, mobility, and complexity. These features are routinely used in the analysis of EEG signals [32,33].
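For reference, the band powers and Hjorth parameters of an epoch can be computed along the following lines. This is only a sketch of the standard definitions (Welch periodogram for the band powers, variance-based Hjorth parameters), not the exact pipeline used in the experiments; the sampling rate and band edges are those listed above:

import numpy as np
from scipy.signal import welch

BANDS = {"delta": (0, 4), "theta": (5, 7), "alpha": (8, 12),
         "sigma": (13, 15), "beta": (16, 30)}

def epoch_features(x, fs):
    """8 features of one 30 s EEG epoch: 5 band powers + 3 Hjorth parameters."""
    f, pxx = welch(x, fs=fs, nperseg=min(len(x), int(2 * fs)))
    df = f[1] - f[0]
    powers = [pxx[(f >= lo) & (f <= hi)].sum() * df for lo, hi in BANDS.values()]
    dx, ddx = np.diff(x), np.diff(x, n=2)
    activity = np.var(x)                                        # Hjorth activity
    mobility = np.sqrt(np.var(dx) / activity)                   # Hjorth mobility
    complexity = np.sqrt(np.var(ddx) / np.var(dx)) / mobility   # Hjorth complexity
    return np.array(powers + [activity, mobility, complexity])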
As we were interested in the detection of arousals, a two-class problem was considered, where class 1 corresponded to arousal and class 0 corresponded to any of the other possible sleep stages. This is in concordance with the two-class scenario assumed in the previous theoretical analysis.
Five classifiers were considered: Gaussian Bayes, Gaussian Naïve Bayes, Nearest Mean, Linear Discriminant, and Logistic Regression. Table 1 indicates the equations corresponding to the score computation in every method, as well as the parameters estimated during training. The classifiers were separately trained for every patient using the first half of the respective EEG recordings. Then, the second half was used for testing. The scores of all patients were grouped by classes. Thus, a total number of 6706 scores of class 0 and 1708 scores of class 1 were generated by every classifier. Despite this imbalanced scenario, we assumed equal priors, $P_0 = P_1 = 0.5$, in all the experiments, in order not to harm the detection of arousals, which was a priority in this particular context.
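A per-patient protocol of this kind can be sketched with standard library classifiers. The snippet below uses scikit-learn analogues of the methods in Table 1 (the exact score equations of Table 1 are not reproduced here, and a plausible distance-based soft score is assumed for the Nearest Mean); each classifier is trained on the first half of a patient's epochs and produces class-1 scores on the second half:

import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import NearestCentroid
from sklearn.linear_model import LogisticRegression

def patient_scores(X, y):
    """X: (N, 8) epoch features of one patient; y: (N,) labels (1 = arousal)."""
    n_train = len(X) // 2                                  # first half for training
    Xtr, ytr, Xte = X[:n_train], y[:n_train], X[n_train:]
    classifiers = [QuadraticDiscriminantAnalysis(),        # Gaussian Bayes
                   GaussianNB(),                           # Gaussian Naive Bayes
                   NearestCentroid(),                      # Nearest Mean
                   LinearDiscriminantAnalysis(),           # Linear Discriminant
                   LogisticRegression(max_iter=1000)]      # Logistic Regression
    scores = []
    for clf in classifiers:
        clf.fit(Xtr, ytr)
        if hasattr(clf, "predict_proba"):
            s = clf.predict_proba(Xte)[:, 1]               # class-1 score
        else:  # Nearest Mean: soft score from the distances to the two class centroids
            d = np.array([np.linalg.norm(Xte - c, axis=1) for c in clf.centroids_])
            s = d[0] / (d[0] + d[1])                       # near the class-1 centroid -> score near 1
        scores.append(s)
    return np.column_stack(scores)                         # (N_test, 5) score matrix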
Table 2 shows the sample estimates of the score bias and variance for every class and every method, according to the definitions of (2) and (3). We can see that class 1 exhibits a significantly higher bias and variance than class 0 in all five methods. The corresponding MSE and accuracy are also indicated in Table 2 and used to rank the methods from the “best” (top) to the “worst” (bottom). The last row of Table 2 shows the statistics corresponding to the scores of the optimal linear combiner (6). Matrix $\mathbf{Q}$ was estimated from the scores of the training set, using sample estimates in (5) for the covariance matrices and bias vectors. Notice that, as expected, the bias of the fused scores does not show a reduction with respect to the bias of the individual classifiers. However, the variance is reduced, this effect being the main reason for the MSE decrease and the corresponding accuracy increase achieved by the fusion of classifiers. Certainly, the improvement is modest, due in part to the high correlation among the scores of the 5 individual classifiers, as indicated by the correlation coefficient matrices included at the bottom of Table 2. Furthermore, the significant bias, especially in class 1, is also an issue for obtaining greater improvements. In any case, these real-data results are in concordance with the expectations of the theoretical analysis.
With the intention of reinforcing this perception, we show in Figure 6 (solid blue) the variation in the improvement factor with the number of fused classifiers. Starting from $D = 1$ with the Nearest Mean, which is the best individual classifier according to Table 2, we progressively incorporate a new classifier following the order of Table 2. We can see that $\hat{F}$ increases with $D$, as predicted by the theory, but it grows rather slowly due to the high correlation and large bias. We also represent in Figure 6 (blue dot) the improvement factor of the EOC case, computed from (14) for every $D$, where the EOC parameters $\sigma^2$, $\rho$, and $b$ are estimated as indicated in (31), i.e., from the sample variances, correlation coefficient matrices, and bias vectors obtained from the results of Table 2. Thus, the Frobenius norms of the correlation coefficient and bias (constant-entries) matrices of the EOC model, respectively, equal the Frobenius norms of the corresponding (non-constant-entries) real-data matrices. In essence, we define an EOC model that “resembles” the actual real-data model as much as possible. Figure 6 suggests that the EOC model, in spite of its simplicity, can provide a reasonable approximation of the actual $\hat{F}$ value.
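One plausible reading of this Frobenius-norm matching is sketched below: the constant-entry correlation matrix (ones on the diagonal, $\rho$ elsewhere) and the constant-entry bias outer-product matrix of the EOC model are matched in Frobenius norm to their sample counterparts. The exact estimators of (31) are not reproduced here, so this is an assumed reconstruction:

import numpy as np

def eoc_fit(R_hat, b_hat):
    """Fit the EOC parameters rho and b by Frobenius-norm matching to sample statistics.

    R_hat : (D, D) sample correlation coefficient matrix of the scores
    b_hat : (D,) sample bias vector
    """
    D = R_hat.shape[0]
    # ||R_eoc||_F^2 = D + D(D-1)*rho^2 is matched to ||R_hat||_F^2
    rho = np.sqrt(max(np.linalg.norm(R_hat, "fro") ** 2 - D, 0.0) / (D * (D - 1)))
    # ||(b 1)(b 1)^T||_F = D*b^2 is matched to ||b_hat b_hat^T||_F = ||b_hat||^2
    b = np.linalg.norm(b_hat) / np.sqrt(D)
    return rho, b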
We conducted another experiment, where the scores of the different classifiers are much less correlated than in the previous experiment. To simulate a high decorrelation, the inputs to the different classifiers in the interval under test are randomly selected (sampling with replacement) from all the available feature vectors of the training set that belong to the true class of that interval. In this way, the inputs to the different classifiers in each test interval belong to the same class, but are generally different, producing a high level of decorrelation between the scores obtained from them. This procedure serves to simulate decorrelation, but it is not practical, since the class being tested must be known in advance. In a practical situation, decorrelation must be obtained through appropriate training, for example, using bagging-type techniques [34], in which the same classifier is trained with different subsets of instances obtained from an original set and the hard decisions of the differently trained classifiers are combined to obtain a final decision. More generally, in classifier fusion, each classifier can be trained on different subsets of instances to induce some level of score uncorrelation. Another scenario where the scores can be uncorrelated is late multimodal fusion [9], where the input of each individual classifier corresponds to a different modality.
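The resampling used to simulate decorrelation can be summarized as follows (hypothetical names; as noted above, it requires knowing the true class of each test interval, so it is a diagnostic device rather than a practical procedure):

import numpy as np

def decorrelated_inputs(Xtr, ytr, yte, n_classifiers, seed=0):
    """For every test epoch, draw an independent training vector of its true class for each classifier."""
    rng = np.random.default_rng(seed)
    inputs = np.empty((len(yte), n_classifiers, Xtr.shape[1]))
    for c in np.unique(ytr):
        pool = Xtr[ytr == c]                       # training vectors of class c
        idx = np.flatnonzero(yte == c)             # test epochs whose true class is c
        draws = rng.integers(len(pool), size=(len(idx), n_classifiers))
        inputs[idx] = pool[draws]                  # sampling with replacement
    return inputs                                  # inputs[n, i] feeds classifier i at epoch n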
We show in Table 3 the same type of information shown in Table 2. Notice that the statistics of the individual classifiers are quite similar to those in Table 2; however, the post-fusion MSE has been significantly reduced (mainly because of the variance reduction), and the accuracy has improved by 7% with respect to the best classifier. Notice the low level of correlation (especially in class 0), as indicated by the correlation coefficient matrices included at the bottom of Table 3.
The comparative impact of uncorrelation can be better appreciated in Figure 6, where we show (solid red) the variation in the improvement factor in the uncorrelated case, $\hat{F}_u$, with the number of fused classifiers. Notice the significant increase in the improvement factor with respect to the correlated case (solid blue). We also represent in Figure 6 (red dot) the improvement factor of the NEOC case, computed from (21) for every $D$, where $\sigma_1^2$ is the minimum of the sample variances, and $b$ as well as the remaining parameters are computed as in (31). The NEOC prediction shows a similar evolution to $\hat{F}_u$, but with some underestimation. Notice that $\sigma_1^2$ corresponds to the actual variance of the Linear Discriminant classifier, which is the minimum variance among all the 5 classifiers, as deduced from Table 2. However, the $\mathrm{MSE}_1$ value considered to compute $\hat{F}_u$ is that corresponding to the Nearest Mean, which is the minimum MSE among all the 5 classifiers, as deduced from Table 2. This probably poses an additional problem for achieving a good fit of the NEOC model to the actual model, because in the NEOC model the bias is assumed to be the same for all classifiers, and then both the minimum variance and the minimum MSE correspond to the same classifier. In any case, notice that $\hat{F}_u$ is still far from 5, the maximum achievable value if all the classifiers were unbiased. This is evidence of the significant limitation imposed by bias in the fusion of classifiers. Another limitation to achieving the maximum improvement in the NEOC model was the unbalance factor $u$. In our case, $u$ has a value between 0.5 and 0.6 for all values of $D$, which is between the perfectly balanced case ($u = 0$) and the totally unbalanced case ($u = 1$).
Finally, we show in Figure 7 the variation in the estimated accuracy with the number of classifiers for the correlated (blue solid) and the uncorrelated (red solid) cases. Similar to $\hat{F}$, the accuracy slowly increases with $D$ in the correlated case; and, similar to $\hat{F}_u$, the increase is more significant in the uncorrelated case. We also show in Figure 7 (blue dot and red dot) the theoretical accuracies predicted by Equation (30), assuming $F_0 = F_1 = F$, namely, $\mathrm{Acc}_{\mathrm{fused}} = 1 - P_0\,P_{FP}^{\sqrt{F}} - P_1\,P_{FN}^{\sqrt{F}}$, where $P_{FP}$ and $P_{FN}$ are the respective actual probabilities of type I and type II errors of the best classifier, and $F = \hat{F}$ in the correlated case; and similarly, with $F = \hat{F}_u$, in the uncorrelated case.
We can see the good fit between the theoretical prediction and the actually computed accuracies. This suggests that the exponential model assumed to deduce (30) is appropriate both for the fused scores and for the scores of the best classifier. This can be corroborated by looking at Figure 8, which shows histograms normalized to represent non-parametric estimates of the pdfs. We show in Figure 8a the histograms corresponding to the pdf estimates of the scores of the best classifier conditional to class 0 (left) and to class 1 (right). Furthermore, we show in Figure 8b the histograms corresponding to the pdf estimates of the fused scores conditional to class 0 (left) and to class 1 (right). The exponential approximations are superimposed. Every exponential approximation was made from the ML estimate of the corresponding parameter, i.e., $\hat{\lambda}_0 = 1/\hat{m}_0$ and $\hat{\lambda}_1 = 1/(1 - \hat{m}_1)$, where $\hat{m}_j$ is the sample mean of the scores conditional to class $j$.
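The exponential fits of Figure 8 only require the sample means of the class-conditional scores. A minimal sketch, under the model of (25), would be:

import numpy as np

def fit_exponential_pdfs(scores0, scores1):
    """ML estimates of the exponential parameters from class-conditional scores."""
    lam0 = 1.0 / np.mean(scores0)                       # p(s | class 0) = lam0 * exp(-lam0 * s)
    lam1 = 1.0 / np.mean(1.0 - np.asarray(scores1))     # p(s | class 1) = lam1 * exp(-lam1 * (1 - s))
    return lam0, lam1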
In order to gain greater confidence, we performed some additional experiments. Firstly, we took the imbalanced scenario into account. Thus, we conducted various experiments using either oversampling of the least-abundant class or undersampling of the most-abundant class. For oversampling, we used the generative-network-based GANSO [35,36], which has demonstrated its superiority over the classic SMOTE method. To undersample, we simply removed instances of the most-abundant class until we reached the training size of the least-abundant one. The results do not differ significantly from those shown in Table 2 and Table 3. On the other hand, with the intention of gaining statistical consistency, we repeated the previous experiments 100 times, obtaining the mean values and standard deviations of all the parameters shown in Table 2 and Table 3. In each repetition, a different partition of the training and testing sets was considered (always at 50 percent). The mean values are similar to the data in Table 2 and Table 3. The standard deviations never exceed 2% of the mean values. We also performed a significance test on the accuracies obtained by the fusion, characterizing the distribution of the null hypothesis with the 100 accuracy values obtained with the best of the individual classifiers. In all cases, the p-value is less than 0.05, so the null hypothesis is rejected, and we can consider that the accuracy improvements obtained by the fusion are statistically significant. In conclusion, the results in Table 2 and Table 3 are sufficiently illustrative for the purpose of reinforcing the theoretical analysis (the focus of this research) with some experimental verification.
6. Discussion and Conclusions
The main objective of this work is to contribute, through mathematical analysis, to a better understanding of the soft fusion of classifiers. To perform this analysis, a combiner is proposed that provides the LLMSE estimator of the binary random variable associated with the two-class problem. The LLMSE thus obtained is expressed in terms of the properties of the individual classifiers, particularly their MSEs. All this leads to the definition of an improvement factor, which is the ratio of the MSE corresponding to the best classifier (the one with the minimum MSE) to the MSE provided by the fused score. Two simplified, yet relevant, models (EOC and NEOC) are considered to determine the key parameters affecting the expected improvement due to fusion. All in all, we concluded that those key parameters are the number of individual classifiers, the pairwise correlation among the individual scores, the bias, and the unbalance among the individual variances. In summary, we can say that it is interesting to fuse the largest number of classifiers, which preferably provide scores that are uncorrelated, unbiased, and with similar variances.
Ultimately, the classifier probability of error (or the accuracy) is the most important figure of merit. Therefore, we related the improvement in MSE to the expected accuracy improvement. To do so, we considered exponential models for the class-conditional pdfs. Although these are specific models, a clear connection between the MSE of the class estimation and the classifier performance was demonstrated.
The real-data experiments showed consistency with the theoretical analysis, although the actual score models were different from those assumed in the EOC and NEOC models. Regarding accuracy, we verified the good fit between the theoretical predictions and the actual data. This suggests that the exponential score pdf models are appropriate in this case.
Both the theoretical analysis and the experimental results encourage expanding this work in several respects. First, more general models than EOC and NEOC could better fit different real-world data scenarios. Second, it would be of great interest to seek methods capable of providing unbiased and uncorrelated scores. Specifically, the severe limitation on improvement imposed by bias suggests the merit of further investigation into this problem. The extension of the analysis to the multiclass problem and to unsupervised or semi-supervised scenarios [37,38] will be of undoubted interest. Additionally, other non-Bayesian fusion approaches, such as those based on the Dempster–Shafer theory, should be explored [39].