
M-ary Rank Classifier Combination: A Binary Linear Programming Problem

Informatique, Bio-informatique et Systèmes Complexes (IBISC) EA 4526, univ Evry, Université Paris-Saclay, 40 rue du Pelvoux, 91020 Evry, France
Authors to whom correspondence should be addressed.
Entropy 2019, 21(5), 440;
Received: 16 January 2019 / Revised: 11 April 2019 / Accepted: 18 April 2019 / Published: 26 April 2019
(This article belongs to the Special Issue Information Theory Applications in Signal Processing)


The goal of classifier combination can be briefly stated as combining the decisions of individual classifiers to obtain a better classifier. In this paper, we propose a method based on the combination of weak rank classifiers because rankings contain more information than unique choices for a many-class problem. The problem of combining the decisions of more than one classifier with raw outputs in the form of candidate class rankings is considered and formulated as a general discrete optimization problem with an objective function based on the distance between the data and the consensus decision. This formulation uses certain performance statistics about the joint behavior of the ensemble of classifiers. Assuming that each classifier produces a ranking list of classes, an initial approach leads to a binary linear programming problem with a simple and global optimum solution. The consensus function can be considered as a mapping from a set of individual rankings to a combined ranking, leading to the most relevant decision. We also propose an information measure that quantifies the degree of consensus between the classifiers to assess the strength of the combination rule that is used. It is easy to implement and does not require any training. The main conclusion is that the classification rate is strongly improved by combining rank classifiers globally. The proposed algorithm is tested on real cytology image data to detect cervical cancer.

1. Introduction

Using a single classifier has shown limitations in achieving satisfactory recognition performance, and this leads us to use multiple classifiers, which is now a common practice in machine learning. Classifier combination has been studied in many disciplines such as the social sciences, sensor fusion, pattern recognition, etc. Schapire [1] proved that a strong classifier can be generated by combining weak classifiers. It has been accepted as an effective method to improve classification performances. Many examples of ensemble classifier systems can be found in process engineering or medicine. For a survey of the issues and approaches on classifier combination, readers are referred to Woźniak [2] and Oza and Turner [3]. The same type of approach has also been used, for instance, in remote sensing domains (e.g., for land cover mapping with Landsat Multispectral Scanner, elevation) [4], computer security [5], financial risks [6], proteomics [7].
Classifiers can provide as their final decision only a single class, a ranked list of all the classes, or a score associated with each class as a measure of confidence for the class. In this paper, we focus only on rank values to perform combination. Rank data are useful when data cannot be easily reduced to numbers, such as data that are related to concepts, opinions, feelings, values, and behaviors of people in a social context, genes, characters, etc. Ranking also has the advantage of removing scale effects while permitting ranking patterns to be compared. Rank-ordering also has its disadvantages, however: it is difficult to combine data from different rankings, and the information contained in the data is limited [8].
After learning, each classifier of the ensemble has output its own results. Several fusion strategies have been proposed in the literature to combine classifiers at the rank level [9,10]. Among them, one of the most common techniques is certainly the linear combination of the classifier outputs [11,12]. The voting principle is the simplest method of combination, where the top candidate from each classifier constitutes a single vote. The final decisions can be made by majority rule (over half of the votes) [13], plurality (maximum number of votes) [14], weighted sum of significance [15], or other variants. The method of Borda count [16], which sums up the rank values of classifiers, can be considered as a generalization of the voting principle. The Bayesian approach estimates the class posterior probabilities conditioned on classifier decisions by approximating various probability densities [17]. Although decision theory itself does not assume classifiers are independent, this assumption is almost always adopted in practical implementation to reduce the exponential complexity of probability estimation. In summary, classifier combination is an ensemble method that classifies new data by taking a weighted vote of the predictions of a set of classifiers [18]. This is originally a Bayesian averaging, but more recent algorithms include boosting, bagging, random forests, and variants [19,20,21]. Note that Dempster–Shafer formalism for aggregating beliefs based on uncertainty reasoning lends itself to a more flexible model used to combine multiple pieces of evidence and capable of taking uncertainty and ignorance into account [22].
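To make the difference between the voting principle and the Borda count concrete, here is a minimal Python sketch. The four toy rankings, the class names, and the function names `plurality` and `borda` are illustrative assumptions, not data or code from this paper; rank 1 denotes the preferred class.

```python
from collections import Counter

def plurality(rankings):
    """Return the class placed first by the largest number of classifiers."""
    votes = Counter(min(r, key=r.get) for r in rankings)
    return votes.most_common(1)[0][0]

def borda(rankings):
    """Sum the rank values per class; a smaller total means a better class."""
    totals = Counter()
    for r in rankings:
        totals.update(r)  # adds the rank values class by class
    return sorted(totals, key=lambda c: (totals[c], c))

# Hypothetical rankings produced by four classifiers over classes A, B, C.
rankings = [
    {"A": 1, "B": 2, "C": 3},
    {"A": 1, "B": 2, "C": 3},
    {"B": 1, "C": 2, "A": 3},
    {"C": 1, "B": 2, "A": 3},
]

print(plurality(rankings))  # top-choice vote
print(borda(rankings))      # full consensus order from summed ranks
```

On this toy input the two rules disagree: class A collects the most first places, while class B has the smallest rank sum, which illustrates why using the whole ranking can change the decision.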
Finally, a rank classifier provides an ordered list of classes associating each class with a rank integer that indicates its importance in the list. The output of a classifier $\mathcal{K}_k$ is therefore a vector of ranks attributed to the $K$ classes.
An ensemble of classifiers might be a better choice than a single classifier because of the variability of the ensemble errors, which is such that the consensus performance is significantly better than the best individual in the ensemble [23]. This analysis is certainly true when the classifiers of the ensemble “see” different training patterns, and it can be effective even when the classifiers all share the same training set. In a computerized tomography problem illustrating how the ensemble consensus outperformed the best individuals, Anthimopoulos observed that the marginal benefit obtained by increasing the ensemble size is usually low due to correlation among errors: most classifiers will get the right answer on easy inputs, while many classifiers will make mistakes on difficult inputs [24].
In addition, running several searches and combining the solutions produces a better approximation than many learning techniques that use local searches to converge toward a solution, with the risk of staying stuck in local optima (which may not be true in the case of deep learning classifiers, since Kawaguchi has shown that every local minimum is a global minimum [25]). Thus, we might not be capable of producing the optimal classifier using a training set and a given classifier architecture, compared to a set of several classifiers. Since the number of classifiers can be very high (in the thousands), it is difficult to “understand” the classifier ensemble decision characteristics.
Although general performances are often improved when classifiers are combined, it becomes computationally costly to combine well-trained classifiers [26]. Most of the time, it is believed that the combination of independent classifiers will provide greater performance improvement [27], while combiner decisions could be biased toward duplicated outputs. However, this belief stems from the difficulty of using a dependence assumption. In fact, in practical situations, classifier independence is difficult to assess.
How do multiple rank classifiers improve separation performances when individual classification performances are slightly better than random decision making? And what is “classifier independence” ? This term raises several issues that we will address in Section 4, where we come back to the theory of rank aggregation and propose an algorithm to combine classifiers. The main properties of the classifier are discussed. Section 2 exposes the general framework and the notations used. A classifier ensemble dependence measure is then proposed to evaluate the conditional mutual information in Section 5. Experimental results are presented in Section 6 for the detection of cervical cancer. Finally, Section 7 gives conclusions on rank classifier combination and further investigations are discussed.
Sets and regions are indicated by double-trace uppercase letters such as $\mathbb{G}$, $\mathbb{S}$, $\mathbb{R}$, vectors with bold lowercase letters such as $\mathbf{x}$, $\mathbf{y}$, and matrices with uppercase bold letters such as $\mathbf{C}$, $\mathbf{M}$, $\boldsymbol{\Sigma}$. The elements of a matrix $\mathbf{M} = \{m_{ij}\}$ are indexed by the row index $i$ and the column index $j$. Lowercase letters refer to individual elements in a vector whose position in the vector is indicated by the last subscript. Therefore, $x_{ij}$ refers to the $j$th element of vector $\mathbf{x}_i$. $p(C_i)$ is the a priori probability of the random value $X$ belonging to class $C_i$, $1 \le i \le K$, $K$ being the number of classes. $M$ is the number of classifiers used for combination. $|\mathbb{C}|$ denotes the cardinality of set $\mathbb{C}$. ${}^T$ denotes the transpose operator.

2. Problem Statement and Model

We consider a classification dataset $\mathbb{B}_0$ with $n$ observations
$$\mathbb{B}_0 = \{(\mathbf{x}_i, c_i)\}_{i=1}^{n}, \tag{1}$$
obtained from a physical signal, or synonymously, explanatory variables, objects, instances, cases, patterns, t-uples, etc., where each $\mathbf{x}_i$ belongs to class $c_i \in \{C_1, \ldots, C_K\}$. The vector $\mathbf{x}_i$ lies in an attribute space $\mathbb{A} \subseteq \mathbb{R}^p$ and each component $x_{ij}$ is a numerical or nominal categorical attribute, also named feature, variable, dimension, component, field, etc.
The outputs of the $M$ classifiers $\mathcal{K}_1(\mathbf{x}), \ldots, \mathcal{K}_M(\mathbf{x})$ are represented by $K$-dimensional vectors $\mathbf{u}_i(\mathbf{x}) = (u_{i1}, \ldots, u_{iK})^T$, $1 \le i \le M$: each component $u_{ij}$ is a certain value associated with class $C_j$ given by $\mathcal{K}_i(\mathbf{x})$. Depending on the nature of the classifier $\mathcal{K}_i(\mathbf{x})$, $u_{ij}$ can be a rank value that reflects a complete or partial ordering of all classes, or a value in $\{0, 1\}$ where the predicted class is assigned 1 and the others 0, or a score, e.g., a discriminant value, associated with each class $C_j$, which serves as a confidence measure for the class to be the true class. The latter can easily be converted into the two former. Therefore, each classifier $\mathcal{K}_i(\mathbf{x})$ defines a mapping function from the image domain $\mathbb{R}^p$ to a $K$-dimensional vector space defined over a set of values $\mathbb{E}_i$. The general framework is illustrated in Figure 1.
In this paper, $u_{ij}$ is a rank value that reflects a complete or partial ordering of the classes. The objective is to design an optimal combination function $G$ that takes all the $\mathbf{u}_i$ as input and produces as an output the decision vector $\mathbf{z} = (z_1, \ldots, z_K)^T$, where $z_k$ is the rank associated with the decision on class $C_k$, that is, $\mathbf{z} = G(\mathbf{u}_1, \ldots, \mathbf{u}_M)$. Thus, we seek $G$ as a discriminant function defined over $\mathbb{R}^{K \times M}$.
In the following, it is assumed that (i) classifiers have equal individual performance and (ii) classifiers $\mathcal{K}_i$ are treated as “black boxes”. Hence, the combination operator applies only to the real space vectors $\mathbf{u}_i$.
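The remark that a score output can be converted into the two other output types can be sketched in a few lines of Python. The helper names and the tie-breaking rule (lower class index first) are assumptions made for this illustration only.

```python
def scores_to_ranks(scores):
    """Rank 1 goes to the highest score; ties are broken by class index (assumption)."""
    order = sorted(range(len(scores)), key=lambda j: (-scores[j], j))
    ranks = [0] * len(scores)
    for position, j in enumerate(order, start=1):
        ranks[j] = position
    return ranks

def ranks_to_top_choice(ranks):
    """0/1 vector with a 1 on the class ranked first and 0 elsewhere."""
    return [1 if r == 1 else 0 for r in ranks]

scores = [0.2, 0.7, 0.1]          # hypothetical scores for classes C_1, C_2, C_3
ranks = scores_to_ranks(scores)    # rank vector u_i
top = ranks_to_top_choice(ranks)   # single-class decision
print(ranks, top)
```

The conversion is one-way: ranks discard the score magnitudes, and the 0/1 vector discards everything but the first choice.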

3. Conditional Independence Properties

The term “classifier independence” has been used in an intuitive manner, but what is classifier independence? Formally, two classifiers $\mathcal{K}_1$ and $\mathcal{K}_2$ are said to be independent if
$$p(u_1 = c_j, u_2 = c_\ell) = p(u_1 = c_j)\, p(u_2 = c_\ell), \quad 1 \le j, \ell \le K, \tag{2}$$
with $u_1$ and $u_2$ being the decision values of $\mathcal{K}_1$ and $\mathcal{K}_2$. The idea is illustrated in the following example.
Example 1
(Independent classifiers). Consider a binary classification problem (with equiprobable classes $C_1$ and $C_2$) and two classifiers $\mathcal{K}_1$ and $\mathcal{K}_2$ with similar performances and whose outputs are $u_1$ and $u_2$, i.e., their probabilities of correct classification $\alpha_1$ and $\alpha_2$ are equal:
$$\begin{aligned} p(u_1 = c_1 \mid c_1) &= p(u_1 = c_2 \mid c_2) = \alpha_1, \\ p(u_2 = c_1 \mid c_1) &= p(u_2 = c_2 \mid c_2) = \alpha_2, \\ p(u_1 = c_1 \mid c_2) &= p(u_1 = c_2 \mid c_1) = 1 - \alpha_1, \\ p(u_2 = c_1 \mid c_2) &= p(u_2 = c_2 \mid c_1) = 1 - \alpha_2. \end{aligned} \tag{3}$$
Then the total probability rule helps to find the probability of the outputs:
$$\begin{aligned} p(u_1 = c_1) &= p(u_1 = c_1 \mid c_1)\, p(c_1) + p(u_1 = c_1 \mid c_2)\, p(c_2) = \alpha_1/2 + (1 - \alpha_1)/2 = 1/2, \\ p(u_2 = c_1) &= p(u_2 = c_1 \mid c_1)\, p(c_1) + p(u_2 = c_1 \mid c_2)\, p(c_2) = \alpha_2/2 + (1 - \alpha_2)/2 = 1/2. \end{aligned} \tag{4}$$
The two classifiers are independent if the joint probability $p(u_1, u_2)$ factorizes:
$$p(u_1 = c_1, u_2 = c_1) = p(u_1 = c_1)\, p(u_2 = c_1) = \tfrac{1}{2} \times \tfrac{1}{2} = \tfrac{1}{4}. \tag{5}$$
And similarly,
$$p(u_1 = c_1, u_2 = c_2) = p(u_1 = c_2, u_2 = c_1) = p(u_1 = c_1)\, p(u_2 = c_2) = \tfrac{1}{2} \times \tfrac{1}{2} = \tfrac{1}{4}. \tag{6}$$
In Equations (5)–(6), $\alpha_1$ and $\alpha_2$ do not appear anymore. The value of $p(u_1, u_2)$ should be $\tfrac{1}{4}$, independently of the classifier performances. This is possible only if $\alpha_1 = \alpha_2 = \tfrac{1}{2}$. Thus, the ensemble performance does not depend on the performance of the individuals. In other words, independent classifiers in the sense of definition (2) are random classifiers (recognition rate of 50%)!
Suppose now that the classifiers are very efficient and that $\alpha_1$ and $\alpha_2$ are almost identical to 1. In this case, the probability that the two answers are correct is also almost equal to 1 and
$$p(u_1 = c_1, u_2 = c_1) \approx p(c_1) = 1/2 \neq p(u_1 = c_1)\, p(u_2 = c_1), \tag{7}$$
which is far from the value of $\tfrac{1}{4}$ required by the condition of independence.
Example 1 suggests that interesting classifiers (non-random!) cannot be independent in the sense of Equation (2). Making the assumption that decision vectors $\mathbf{u}_1, \ldots, \mathbf{u}_M$ are conditionally independent given $\mathbf{x} \in C_j$, the discriminant function $G$ maximizes the posterior probability $p(C_j) \prod_{i=1}^{M} p(\mathbf{u}_i \mid C_j) = p(C_j) \prod_{i=1}^{M} \prod_{k=1}^{K} p(u_{ik} \mid C_j)$, which can be point estimated from the entries of the $M$ confusion matrices of size $K \times K$, as given, for instance, in Table 1.
Let $𝟙_{j<k}$ be the indicator function for which $𝟙_{j<k} = 1$ if the rank of the class $C_j$ is less than that of the alternative class $C_k$, and 0 otherwise. Then in Table 1, $n_{jk} = 𝟙_{j<k}$ and the row and column marginals are respectively defined by $n_{j\cdot} = \sum_{k=1}^{K} n_{jk}$ and $n_{\cdot k} = \sum_{j=1}^{K} n_{jk}$. If class $C_j$ is the $k$th choice for classifier $\mathcal{K}_i$, then $p(u_{ik} \mid C_j) = n_{jk} / n_{j\cdot}$.
Example 2
(Conditionally independent classifiers). Consider once again the binary classification case introduced in Example 1 and assume that the classifiers are very efficient: $\alpha_1 = \alpha_2 \approx 1$. Then
$$p(u_1 = c_1, u_2 = c_1 \mid c_1) \approx 1 \approx p(u_1 = c_1 \mid c_1)\, p(u_2 = c_1 \mid c_1) = \alpha_1 \alpha_2 \approx 1. \tag{8}$$
We conclude that two classifiers can be conditionally independent even if they are very efficient. Equation (8) does not indicate that the classifiers are independent. It only suggests that they can be conditionally independent or conditionally dependent.
Therefore, conditional independence can be seen as a necessary condition for classifier combination. But the direct use of the confusion matrix as a criterion to derive the optimal combination rule is not feasible since the true classes are unknown.
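The contrast between Examples 1 and 2 can be checked numerically. The sketch below builds the joint distribution of two binary classifiers under the explicit assumption that their outputs are conditionally independent given the true class, with equiprobable classes; the function name `joint_distribution` is hypothetical.

```python
from itertools import product

def joint_distribution(alpha1, alpha2):
    """Joint p(u1, u2) for two equiprobable classes, ASSUMING u1 and u2 are
    conditionally independent given the true class."""
    joint = {}
    for u1, u2 in product((1, 2), repeat=2):
        total = 0.0
        for c in (1, 2):                      # sum over the true class
            p1 = alpha1 if u1 == c else 1 - alpha1
            p2 = alpha2 if u2 == c else 1 - alpha2
            total += 0.5 * p1 * p2
        joint[(u1, u2)] = total
    return joint

for alpha in (0.5, 0.75, 0.99):
    joint = joint_distribution(alpha, alpha)
    # Unconditional independence would require every entry to equal 1/4.
    print(alpha, joint[(1, 1)], joint[(1, 2)])
```

Under this assumption every joint entry equals 1/4 only for a recognition rate of 1/2; as the rate approaches 1, the agreeing entries approach 1/2 and the disagreeing ones approach 0, even though the conditional factorization holds by construction.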

4. Rank Class Combination Problem

4.1. Rank-Order Statistic Model

A rank classifier gives an ordered list of classes associating each class with an integer that indicates its importance in the list; in the case of $K$ classes, it is an integer $k \in \{1, 2, \ldots, K\}$. The output of a classifier $\mathcal{K}_k$ is a vector of ranks attributed to $K$ classes:
$$\mathbf{u}_k(\mathbf{x}) = \mathbf{r}_k = \begin{pmatrix} r_{1k} \\ r_{2k} \\ \vdots \\ r_{Kk} \end{pmatrix}, \tag{9}$$
and $r_{jk} = r_k(C_j)$ is the rank assigned to class $C_j$ by the classifier $\mathcal{K}_k$. By convention, the smaller the rank assigned to a class, the more likely it is. In other words, $r_{ik} < r_{jk}$ if $\mathcal{K}_k$ judges $C_i$ more likely than $C_j$. The vector $\mathbf{r}_k$ is therefore a permutation of the first $K$ integers. The matrix $\mathbf{R} = \{r_{ik}\}$ represents the total order ranking of the $K$ classes attributed by the $M$ classifiers, i.e., $r_{ik} \neq r_{i'k}$, $\forall i \neq i'$ [28]. In the following, for ease of writing, we will denote $r_{ik} = r_i^{(k)}$. Then
$$\mathbf{R} = (\mathbf{r}_1\ \mathbf{r}_2\ \cdots\ \mathbf{r}_M) = \begin{pmatrix} r_{11} & r_{12} & \cdots & r_{1M} \\ r_{21} & r_{22} & \cdots & r_{2M} \\ \vdots & \vdots & & \vdots \\ r_{K1} & r_{K2} & \cdots & r_{KM} \end{pmatrix}, \tag{10}$$
where $(r_{j1}, r_{j2}, \ldots, r_{jM})$ is the set of ranks assigned to class $C_j$ by the $M$ classifiers.
The solution of a rank class combination problem is a total order ranking (TOR) $\mathbf{r}^*$, given by a virtual classifier minimizing the disagreement of opinions between the $M$ classifiers. The optimization problem is defined as follows:
$$\mathbf{r}^* = \arg\min_{\mathbf{r}} \sum_{k=1}^{M} f(\mathbf{r}, \mathbf{r}_k), \quad \text{s.t. } \mathbf{r} \in \mathbb{S}_K, \tag{11}$$
where $\mathbf{r}_k$ is the rank distribution on the $K$ classes proposed by the classifier $\mathcal{K}_k$, $\mathbb{S}_K$ is the symmetric group of the $K!$ permutations [29], and $f: \mathbb{S}_K \times \mathbb{S}_K \to \mathbb{R}^+$ is a metric on $\mathbb{S}_K$. Solving Equation (11) is difficult due to the constraint $\mathbf{r} \in \mathbb{S}_K$. In the following subsections, the search for $\mathbf{r}^*$ leads to a linear optimization program with an exact solution that depends on the metric used, i.e., the disagreement distance or the Condorcet distance.
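For very small $K$, the problem in Equation (11) can be solved by exhaustive enumeration, which is a useful reference when checking other solvers. The sketch below is only such a baseline, not the linear programs derived in the following subsections; the names `consensus` and `disagreement` and the three toy rankings are assumptions.

```python
from itertools import permutations

def disagreement(r, s):
    """Number of classes that receive a different rank in r and s."""
    return sum(1 for a, b in zip(r, s) if a != b)

def consensus(rankings, distance):
    """Search all K! rank vectors for the one minimising the summed distance.
    Only viable for small K, since the search space grows factorially."""
    K = len(rankings[0])
    candidates = permutations(range(1, K + 1))
    return min(candidates, key=lambda r: sum(distance(r, s) for s in rankings))

# rankings[k][i] is the rank given to class i by classifier k (toy data).
rankings = [(1, 2, 3), (1, 3, 2), (2, 1, 3)]
print(consensus(rankings, disagreement))
```

Any metric on rank vectors can be passed as `distance`, so the same baseline applies to both distances considered below.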
The choice of these metrics is motivated by a range of properties: (i) both have an intuitive and plausible interpretation as a number of pairwise choices, (ii) they provide the best possible description of the process of ranking classes as performed by a human, (iii) both have a number of appealing mathematical properties such as counting rather than measuring and providing a very good concordance indicator [30,31].

4.2. Total Order Ranking with Disagreement Distance

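The disagreement between the rankings from classifiers $\mathcal{K}_k$ and $\mathcal{K}_{k'}$ is measured by $f_d(\mathbf{r}_k, \mathbf{r}_{k'}) = \sum_{i=1}^{K} \operatorname{sgn} |r_{ik} - r_{ik'}|$. The $k$th permutation $\mathbf{r}_k$ can be represented by a permutation matrix $\mathbf{P}^{(k)} = \{x_{ij}^{(k)}\}$, $x_{ij}^{(k)} \in \{0, 1\}$, with $x_{ij}^{(k)} = 1$ if class $i$ is positioned in place $j$ and 0 otherwise (see Figure 2). Therefore, the constraint $\mathbf{r} \in \mathbb{S}_K$ in Equation (11) imposes, on the entries $r^*_{ij}$ of the associated permutation matrix, $\sum_{j=1}^{K} r^*_{ij} = \sum_{i=1}^{K} r^*_{ij} = 1$, $\forall i, j$. Let $\phi_d(\mathbf{r}) = \sum_{k=1}^{M} f_d(\mathbf{r}, \mathbf{r}_k) = \langle \mathbf{P}, \mathbf{P}^{(k)} \rangle_d$ with tensor Einstein notation. Equation (11) can then be rewritten:
$$\mathbf{r}^* = \arg\min_{\mathbf{r}} \phi_d(\mathbf{r}) = \arg\min_{\mathbf{r} \in \mathbb{S}_K} \sum_{k=1}^{M} \sum_{i=1}^{K} \operatorname{sgn} |r_i - r_{ik}|, \tag{12}$$
where $r_i$ denotes the rank of the $i$th candidate in the unknown ranking $\mathbf{r}$. As $\mathbf{r}$ can be represented by its permutation matrix $\mathbf{P} = \{x_{ij}\}$, substituting $r_i = \sum_j j\, x_{ij}$ in Equation (12) gives:
$$\phi_d(\mathbf{r}) = \sum_{k=1}^{M} \sum_{i=1}^{K} \operatorname{sgn} \Bigl| \sum_j j\, x_{ij} - r_{ik} \Bigr| \quad \text{s.t. } \sum_j x_{ij} = 1, \tag{13}$$
which is equivalent to:
$$\phi_d(\mathbf{r}) = \sum_{k=1}^{M} \sum_{i=1}^{K} \operatorname{sgn} \Bigl| \sum_j (j - r_{ik})\, x_{ij} \Bigr|. \tag{14}$$
Taking into account the summation on $j$ and the fact that $x_{ij}$ only takes the value 1 once (and 0 elsewhere), only the term $(j - r_{ik})$ corresponding to the value $j$ for which $x_{ij} = 1$ is considered. Then
$$\phi_d(\mathbf{r}) = \sum_{k=1}^{M} \sum_{i=1}^{K} \operatorname{sgn} \Bigl( \sum_{j}^{K} |j - r_{ik}|\, x_{ij} \Bigr) = \sum_{k=1}^{M} \sum_{i=1}^{K} \sum_{j=1}^{K} \operatorname{sgn}(|j - r_{ik}|)\, x_{ij}. \tag{15}$$
Let us define by
$$\kappa_{ij}(\mathbf{r}) = \sum_{k=1}^{M} \operatorname{sgn} |j - r_{ik}| = \sum_{k=1}^{M} \bigl| x_{ij} - x_{ij}^{(k)} \bigr| \tag{16}$$
the cost of attributing the alternative $i$ to position $j$. $\kappa_{ij}$ is also the number of classifiers that do not position the alternative $i$ in place $j$. $\kappa_{ij}(\mathbf{r})$ is equivalent to $M - \pi_{ij}$, where $\pi_{ij}$ is the number of classifiers that do position the alternative $i$ in place $j$. Given that $|x_{ij} - x_{ij}^{(k)}| = (x_{ij} - x_{ij}^{(k)})^2$ because $|x_{ij} - x_{ij}^{(k)}| \in \{0, 1\}$, we obtain
$$\phi_d(\mathbf{r}) = \frac{1}{2} \sum_{k=1}^{M} \sum_{i=1}^{K} \sum_{j=1}^{K} \bigl( x_{ij} - x_{ij}^{(k)} \bigr)^2 = \sum_{k=1}^{M} \Bigl( K - \sum_{i=1}^{K} \sum_{j=1}^{K} x_{ij}\, x_{ij}^{(k)} \Bigr), \tag{17}$$
and then
$$\phi_d(\mathbf{r}) = \sum_{i=1}^{K} \sum_{j=1}^{K} \Bigl( M - \sum_{k=1}^{M} x_{ij}^{(k)} \Bigr) x_{ij}. \tag{18}$$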
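The step from Equation (17) to Equation (18) is compact. One way to spell it out, assuming only that $\mathbf{P}$ and every $\mathbf{P}^{(k)}$ are permutation matrices (binary entries, exactly $K$ ones each, so that $x^2 = x$ entrywise), is the following:

```latex
\begin{aligned}
\tfrac{1}{2}\sum_{i,j}\bigl(x_{ij}-x_{ij}^{(k)}\bigr)^{2}
  &= \tfrac{1}{2}\Bigl(\sum_{i,j}x_{ij}+\sum_{i,j}x_{ij}^{(k)}\Bigr)
     -\sum_{i,j}x_{ij}\,x_{ij}^{(k)}
   = K-\sum_{i,j}x_{ij}\,x_{ij}^{(k)},\\[4pt]
\phi_d(\mathbf{r})
  &= MK-\sum_{i,j}\Bigl(\sum_{k=1}^{M}x_{ij}^{(k)}\Bigr)x_{ij}
   = \sum_{i,j}\Bigl(M-\sum_{k=1}^{M}x_{ij}^{(k)}\Bigr)x_{ij},
\end{aligned}
```

where the last equality uses $\sum_{i,j} x_{ij} = K$ to write $MK$ as $\sum_{i,j} M\, x_{ij}$.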
In Equation (18), considering that $\pi_{ij} = \sum_{k=1}^{M} x_{ij}^{(k)}$ is the number of classifiers that position class $C_i$ in place $j$, the linear objective function associated with Equation (12) is finally formulated as
$$\mathbf{P}^* = \arg\min_{\mathbf{P}} \sum_{i=1}^{K} \sum_{j=1}^{K} (M - \pi_{ij})\, x_{ij} \quad \text{s.t. } \pi_{ij} = \sum_{k=1}^{M} x_{ij}^{(k)},\ \sum_{i=1}^{K} x_{ij} = \sum_{j=1}^{K} x_{ij} = 1, \text{ and } x_{ij} \in \{0, 1\}, \tag{19}$$
constrained by $\sum_{i}^{K} \pi_{ij} = \sum_{j}^{K} \pi_{ij} = M$, since each classifier places exactly one class in each position and each class in exactly one position. The form to be minimized in Equation (19) recodes the classifier combination rule, which is reduced to solving a binary linear programming problem, a class of problems that is NP-hard in general (see [32] for some resolution strategies).
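A small numerical sketch of Equation (19) follows: it builds the counts $\pi_{ij}$ from a set of rank vectors and minimises the linear cost over all permutation matrices by enumeration. The enumeration, the helper names, and the toy rankings are assumptions for illustration; the constraints of Equation (19) have the structure of a linear assignment problem, so dedicated assignment solvers would be a natural replacement for larger $K$.

```python
from itertools import permutations

def placement_counts(rankings):
    """pi[i][j] = number of classifiers that put class i in place j (0-based j)."""
    K = len(rankings[0])
    pi = [[0] * K for _ in range(K)]
    for r in rankings:
        for i, rank in enumerate(r):
            pi[i][rank - 1] += 1
    return pi

def solve_disagreement(rankings):
    """Minimise sum_ij (M - pi_ij) x_ij over permutation matrices by enumeration."""
    M, K = len(rankings), len(rankings[0])
    pi = placement_counts(rankings)

    def cost(places):                 # places[i] = 0-based place of class i
        return sum(M - pi[i][places[i]] for i in range(K))

    best = min(permutations(range(K)), key=cost)
    return [p + 1 for p in best], cost(best)

rankings = [(1, 2, 3), (1, 3, 2), (2, 1, 3)]   # toy data, one tuple per classifier
ranks, value = solve_disagreement(rankings)
print(placement_counts(rankings), ranks, value)
```

Each row and each column of the count matrix sums to the number of classifiers, and the optimal cost equals the summed disagreement distance between the consensus and the individual rankings.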

4.3. Total Order Ranking with Condorcet Distance

To define this distance, we define a new set of matrices $\{\mathbf{Y}^{(1)}, \ldots, \mathbf{Y}^{(M)}\}$, where $\mathbf{Y}^{(k)} = \{y_{ij}^{(k)}\} = 𝟙_{i<j}$ is the indicator matrix of classifier $\mathcal{K}_k$ with the convention $y_{ij}^{(k)} = 1$ if the rank of class $C_i$ is less than that of class $C_j$ and 0 otherwise (see Figure 3).
Using the tables $\mathbf{Y}^{(k)}$ as in Section 4.2, we obtain
$$f_C(\mathbf{r}_k, \mathbf{r}_{k'}) = f(\mathbf{Y}^{(k)}, \mathbf{Y}^{(k')}) = \frac{1}{2} \sum_{i}^{K} \sum_{j}^{K} \bigl| y_{ij}^{(k)} - y_{ij}^{(k')} \bigr|, \quad \forall k, k' = 1, \ldots, M, \tag{20}$$
which can be simplified as follows in the case of total order:
$$f_C(\mathbf{r}_k, \mathbf{r}_{k'}) = \frac{1}{2} \sum_{i}^{K} \sum_{j}^{K} \bigl( y_{ij}^{(k)} - y_{ij}^{(k')} \bigr)^2 = \sum_{i} \sum_{j} y_{ij}^{(k)}\, y_{ji}^{(k')}. \tag{21}$$
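The two expressions of the Condorcet distance for total orders can be compared on small inputs. In this sketch, `pairwise_matrix` builds the indicator table of one ranking, and the two distance functions implement the half-sum of absolute differences and the count of reversed pairs, respectively; all names are illustrative assumptions.

```python
def pairwise_matrix(r):
    """Y[i][j] = 1 if class i is ranked before class j in r, else 0."""
    K = len(r)
    return [[1 if r[i] < r[j] else 0 for j in range(K)] for i in range(K)]

def condorcet_distance(r, s):
    """Half the number of entries on which the two indicator tables differ."""
    Y, Z = pairwise_matrix(r), pairwise_matrix(s)
    K = len(r)
    return sum(abs(Y[i][j] - Z[i][j]) for i in range(K) for j in range(K)) // 2

def reversed_pairs(r, s):
    """Number of ordered pairs (i, j) with i before j in r and j before i in s."""
    Y, Z = pairwise_matrix(r), pairwise_matrix(s)
    K = len(r)
    return sum(Y[i][j] * Z[j][i] for i in range(K) for j in range(K))

r, s = (1, 2, 3), (3, 2, 1)
print(condorcet_distance(r, s), reversed_pairs(r, s))
```

For complete rankings without ties, each reversed pair changes two entries of the indicator table, which is why the factor 1/2 yields the pair count.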
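As $(y_{ij}^{(k)})^2 = y_{ij}^{(k)} = 0$ or 1, the consensus function associated with the Condorcet distance is given by
$$\phi_C(\mathbf{r}) = \frac{1}{2} \Bigl[ \sum_{i=1}^{K} \sum_{j=1}^{K} M\, y_{ij} + \sum_{i=1}^{K} \sum_{j=1}^{K} \sum_{k=1}^{M} y_{ij}^{(k)} - 2 \sum_{i=1}^{K} \sum_{j=1}^{K} y_{ij} \sum_{k=1}^{M} y_{ij}^{(k)} \Bigr]. \tag{22}$$
Let $\delta_{ij} = \sum_{k=1}^{M} y_{ij}^{(k)}$ be the total number of classifiers preferring class $C_i$ to $C_j$. Defining $\boldsymbol{\Delta} = \{\delta_{ij}\}$ as a matrix summing the $M$ matrices $\mathbf{Y}^{(k)}$ associated with the rankings $\mathbf{r}_k$ of the classifiers $\mathcal{K}_k$ allows us to rewrite $\phi_C$ as
$$\phi_C(\mathbf{r}) = \frac{1}{2} \Bigl[ \sum_{i=1}^{K} \sum_{j=1}^{K} M\, y_{ij} + \sum_{i=1}^{K} \sum_{j=1}^{K} \delta_{ij} - 2 \sum_{i=1}^{K} \sum_{j=1}^{K} \delta_{ij}\, y_{ij} \Bigr]. \tag{23}$$
As $\mathbf{r}$ defines a total order, $\sum_{i=1}^{K} \sum_{j=1}^{K} y_{ij} = \frac{K(K-1)}{2}$ and $\sum_{i=1}^{K} \sum_{j=1}^{K} \delta_{ij} < M \frac{K(K-1)}{2}$.
Let $\theta = \frac{1}{2} \Bigl[ M \frac{K(K-1)}{2} + \sum_{i=1}^{K} \sum_{j=1}^{K} \delta_{ij} \Bigr]$. Then $\theta$ is constant and $\phi_C(\mathbf{r})$ is
$$\phi_C(\mathbf{r}) = \theta - \sum_{i=1}^{K} \sum_{j=1}^{K} \delta_{ij}\, y_{ij}. \tag{24}$$
Finally, the search for an optimal rank classifier combination leads to the following binary linear program:
$$\max_{\mathbf{Y}} \sum_{i=1}^{K} \sum_{j=1}^{K} \delta_{ij}\, y_{ij} \quad \text{s.t. } \begin{cases} \delta_{ij} = \sum_{k=1}^{M} y_{ij}^{(k)}, \\ y_{ij} + y_{ji} = 1, & i < j, \\ y_{ii} = 0, & \forall i, \\ y_{ij} + y_{jk} - y_{ik} \le 1, & i \neq j \neq k, \\ y_{ij} \in \{0, 1\}. \end{cases} \tag{25}$$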
From a machine learning perspective, solving Equations (19) and (25) provides deterministic matrix solutions P * and Y * , respectively, from which r * is easily reconstructed, but these solutions are not necessarily identical [28].
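The program in Equation (25) can be illustrated on a toy instance. The sketch below enumerates complete rankings, which satisfy the antisymmetry and transitivity constraints by construction, and keeps the one maximising the agreement with the pairwise preference counts. Enumeration stands in for a binary linear programming solver here, and the function names and data are assumptions.

```python
from itertools import permutations

def preference_counts(rankings):
    """delta[i][j] = number of classifiers ranking class i before class j."""
    K = len(rankings[0])
    return [[sum(1 for r in rankings if r[i] < r[j]) for j in range(K)]
            for i in range(K)]

def condorcet_consensus(rankings):
    """Rank vector maximising sum_ij delta_ij * y_ij, found by enumeration."""
    K = len(rankings[0])
    delta = preference_counts(rankings)

    def agreement(r):                 # y_ij = 1 exactly when r[i] < r[j]
        return sum(delta[i][j] for i in range(K) for j in range(K) if r[i] < r[j])

    return max(permutations(range(1, K + 1)), key=agreement)

rankings = [(1, 2, 3), (1, 3, 2), (2, 1, 3)]   # toy data, one tuple per classifier
print(preference_counts(rankings))
print(condorcet_consensus(rankings))
```

On this instance the result coincides with the consensus found under the disagreement distance for the same rankings, which need not be the case in general.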
Example 3
(Classifier ensemble aggregation rule). The problem selected to illustrate our theory is that of combining four classifiers for recognizing handwritten digits 0 to 9. Binary images from the MNIST database are used [33]. The four classifiers are tested on a sample and proposed rankings are collected in Table 2.
The two rankings are concordant except for the predictions for digits 5 and 6.

5. Classifier Ensemble Information Measure

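Since $\operatorname{sgn} |x| \le |x|$ for integer rank differences $x$, then
$$f_d(\mathbf{r}_k, \mathbf{r}_{k'}) = \sum_{i=1}^{K} \operatorname{sgn} |r_{ik} - r_{ik'}| \le \sum_{i=1}^{K} |r_{ik} - r_{ik'}|. \tag{26}$$
If $r_{ik} = K - \sum_{j=1}^{K} y_{ij}^{(k)}$, then from Equation (26),
$$\sum_{i=1}^{K} |r_{ik} - r_{ik'}| = \sum_{i}^{K} \Bigl| \sum_{j}^{K} \bigl( y_{ij}^{(k)} - y_{ij}^{(k')} \bigr) \Bigr| \le \sum_{i}^{K} \sum_{j}^{K} \bigl| y_{ij}^{(k)} - y_{ij}^{(k')} \bigr| = 2 f_C(\mathbf{r}_k, \mathbf{r}_{k'}). \tag{27}$$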
In summary, combining Equations (26) and (27) gives $f_d(\mathbf{r}_k, \mathbf{r}_{k'}) \le 2 f_C(\mathbf{r}_k, \mathbf{r}_{k'})$, which means that $f_C$ is more uncertain than $f_d$ and could be preferred for a classifier ensemble agreement. The question is, how precisely can we measure this voting conjunction?
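The chain of inequalities in Equations (26) and (27) can be checked exhaustively for a small number of classes. The sketch below does so for $K = 4$; the helper names are assumptions, and `f_C` counts reversed pairs, which equals the Condorcet distance for complete rankings.

```python
from itertools import permutations

def f_d(r, s):
    """Disagreement distance: number of classes with different ranks."""
    return sum(1 for a, b in zip(r, s) if a != b)

def abs_rank_gap(r, s):
    """Sum of absolute rank differences, the middle term of the chain."""
    return sum(abs(a - b) for a, b in zip(r, s))

def f_C(r, s):
    """Condorcet distance for total orders: number of reversed pairs."""
    K = len(r)
    return sum(1 for i in range(K) for j in range(i + 1, K)
               if (r[i] - r[j]) * (s[i] - s[j]) < 0)

perms = list(permutations(range(1, 5)))
chain_holds = all(f_d(r, s) <= abs_rank_gap(r, s) <= 2 * f_C(r, s)
                  for r in perms for s in perms)

# Two rankings that differ by swapping the first two classes.
swap = ((1, 2, 3, 4), (2, 1, 3, 4))
print(chain_holds, f_d(*swap), f_C(*swap))
```

The swap example shows that the factor 2 is attained: exchanging two adjacent classes changes two ranks but reverses a single pair.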
Section 4.3 introduced a matrix representation of the information. By summing over all the tables $\mathbf{Y}^{(k)}$, one obtains the matrix $\boldsymbol{\Delta}$ defined previously. If we arrange the classifiers according to a permutation order $\Sigma = (\sigma(1), \sigma(2), \ldots, \sigma(K))$, $\boldsymbol{\Delta}$ can be represented by the matrix $\boldsymbol{\Delta}(\Sigma)$ obtained by the permutation of rows and columns.
The objective function to be minimized is given in the general case by:
$$F_C(\mathbf{r}) = \theta - (\text{sum of the elements of the upper triangular part of the matrix}), \tag{28}$$
and as follows in the case of total orders:
$$F_C(\mathbf{r}) = (\text{sum of the elements of the lower triangular part of the matrix}). \tag{29}$$
A measure of classifier ensemble agreement is a coefficient between 0 and 1 measuring the intensity of the link between the set of classifier votes. The closer its value is to 1, the more the opinions of the classifiers are in agreement. Conversely, the closer its value is to 0, the greater the disagreement between the votes. Here, we give the coefficients of concordance for the two metrics.

5.1. Disagreement Distance

Theorem 1
(Conjunction coefficient interval for the disagreement metric). Let $\{\mathcal{K}_i\}_{i=1}^{M}$ be an ensemble of conditionally independent classifiers voting on $K$ classes. Then, the interval of variation of the conjunction coefficient $I_d$ is $[0, 1]$.
See Appendix A for the proof.

5.2. Condorcet Distance

If $M$ classifiers vote on $K$ classes with pairwise order comparison matrices $\mathbf{Y}^{(k)}$, the sum of which makes it possible to obtain $\boldsymbol{\Delta} = \{\delta_{ij}\}$ with $\delta_{ij} = \sum_{k=1}^{M} y_{ij}^{(k)}$, as defined in Section 4.3, the conjunction coefficient is defined as
$$I_C = \frac{4 \sum_{i=1}^{K} \sum_{j=1}^{K} \delta_{ij} (\delta_{ij} - 1)}{M (M - 1) K (K - 1)} - 1. \tag{30}$$
Theorem 2
(Conjunction coefficient interval for the Condorcet metric). Let $\{\mathcal{K}_i\}_{i=1}^{M}$ be an ensemble of conditionally independent classifiers voting on $K$ classes. Then the interval of variation of the conjunction coefficient $I_C$ defined by (30) is
$$I_C \in \begin{cases} \bigl[ -\frac{1}{M-1};\ 1 \bigr] & \text{if } M \text{ is even}, \\ \bigl[ -\frac{1}{M};\ 1 \bigr] & \text{otherwise}. \end{cases} \tag{31}$$
See Appendix B for the proof.
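The conjunction coefficient of Equation (30) is straightforward to evaluate from a set of complete rankings. The sketch below assumes at least two classifiers and two classes, recomputes the pairwise counts directly from the rank vectors, and uses hypothetical toy data.

```python
def conjunction_coefficient(rankings):
    """I_C of Equation (30); rankings[k][i] is the rank given to class i by
    classifier k. Requires M >= 2 classifiers and K >= 2 classes."""
    M, K = len(rankings), len(rankings[0])
    total = 0
    for i in range(K):
        for j in range(K):
            if i != j:
                # delta_ij: classifiers preferring class i to class j
                d = sum(1 for r in rankings if r[i] < r[j])
                total += d * (d - 1)
    return 4 * total / (M * (M - 1) * K * (K - 1)) - 1

unanimous = [(1, 2, 3), (1, 2, 3), (1, 2, 3)]
mixed = [(1, 2, 3), (1, 3, 2), (2, 1, 3)]
print(conjunction_coefficient(unanimous), conjunction_coefficient(mixed))
```

Identical rankings give the maximal value 1, while the partially conflicting toy ensemble yields a value close to 0.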

6. Experiments

6.1. The Detection of Cervical Cancer

Many studies have shown evidence that cervical cancer may be imputed to a subset of DNA viruses called human papillomavirus (HPV), which infect cutaneous and mucosal epithelia and in which acute infection causes benign cutaneous lesions [34,35]; infected patients are referred to as risky patients. Some of these viruses infect the genital tract and cause malignant tumors, which are most commonly located in the cervix. Even though most of these infections are controlled by the immune system, some remain persistent and are ascribed to different types of cancers and particularly, to cervical cancer. In 2016, cervical cancer represented the 12th most lethal female cancer in the European Union, accounting for 13,500 deaths a year and 30,400 new cases a year. Therefore, cervical cancer screening continues to play a critical role in the control of cervical cancer. However, the screening of a smear is nowadays mostly done manually: a pathologist inspects each cell of a smear with a microscope to check whether it is atypical or not. Consequently, human error is always possible, in particular mistakenly diagnosing atypical cells as normal. This situation can occur because of the practitioner’s fatigue or a lack of experience or concentration. In addition, diagnosis is also linked to the preparation of cells, and in some situations, atypical cells can be partially hidden by others, which makes their interpretation or classification difficult. Moreover, the presence of atypical cells in the entire studied population is very uncommon (up to 1‰), which makes the detection task even more difficult. Therefore, an error is easily possible. This could have irreversible effects on the evolution of the cancer and can impact treatment. The introduction of an automatic procedure, able to point out the pathological cells, would both help practitioners in their diagnosis and improve or strengthen it.
Depending on the morphology of the nuclei of the cells, the diagnosis varies: if a nucleus is considered normal and all of the cells removed have the same diagnosis, then the cervix is considered normal. On the other hand, if a nucleus is considered abnormal, the diagnosis is not automatically associated with a risky smear.
We propose to test our classifier combination strategy to cluster cells into three different classes (normal cells, atypical cells, and debris) using a certain number of classifiers.

6.2. The Dataset

The cytological dataset consists of smear images from 14 different women. They generally comprise more than one hundred cells characterized by 42 morphological or textural variables. Nine showed a negative HPV test and the other five, a positive test. In addition, a few observations were labeled by an expert who pointed out some atypical cells and noisy objects. The dataset is presented in detail in Table 3. Among the most recurrent patterns of abnormal cells are nucleus regularity or a swollen aspect, nucleus size, important optical density, number of nucleoli, high core/cytoplasm ratio, ratio of minimum/maximum width of the nucleus, etc.
The images were colored with Papanicolaou stain, which is the most widely used reference color for the screening of cervical cancers; it makes it possible to distinguish the different nuclei, which are colored in blue, the mother cells in dark purple to black, and the keratinized and squamous epithelium. The images were then segmented into thumbnail images of 16 × 16 pixels which correspond a priori to objects. Most of the time, these objects are nuclei, but they may sometimes be non-identified objects that we call “noise”. Indeed, they can correspond, for example, to a poor segmentation, superimposed nuclei, etc.
The fact that a nucleus exhibits one of the abnormality patterns listed above does not always imply its malignancy. In fact, a cell can have a singular morphology but not be infected, and others may present abnormalities that correspond to pre-cancerous lesions such as dysplastic cells and in situ carcinomas or to cancerous cells. Figure 4a shows a cluster of abnormal cells (with large nuclei) that are not yet cancerous, because of their low density, unlike Figure 4b, where one can observe a set of abnormal cells with dense nuclei.
Table 4 summarizes the characteristics of the dataset. First, the observed data come from samples of 14 different smears, which supposes the existence of inter-individual variability (confirmed by tests of variance between the HPV negatives, the HPV positives, or between the two types of population; the 5% risk threshold tests rejected the assumption of equality of means for all variables). However, it is possible that this variability is simply relative to the studied dataset, in the sense that the study was done on a small number of smear samples. This assumption remains to be verified on larger databases. It can also be noted in Table 3 that the known population of “abnormal” cells remains very low in proportion to the other classes, and, in contrast, the recognized “default/waste” class represents more than 15% of the data. The low proportion of the target class and the heterogeneity of the debris present obstacles for clustering. This means that, among the cells belonging to risky patient smears, there exists a non-null risk that some nuclei are atypical. In practice, this proportion is usually very low (0.1% to 5%).
From this image segmentation, morphological and photometric features are extracted and computed. In total, the studied dataset has 3857 cell samples belonging to 14 different smears and consists of 42 variables: variables 1 to 19 represent morphological variables, and the rest correspond to textural and photometric characters. The processing chain from the smear image to the dataset is reported in Figure 5.
Each smear was pre-processed according to a standardized protocol: cell collection, spreading a thin layer on slides, and the staining of these slides. Each slide was then scanned, segmented cell by cell, and finally, underwent an extraction of 42 morphological and textural characteristics.

6.3. Experimental Protocol

Two-layer multilayer perceptrons (MLPs) were chosen as classifiers to produce the desired outputs, which were ordered to produce the ranks. Each MLP contains 42 input units, 10 hidden units, and 3 output units. Training was achieved using a learning rate of 0.1 and a momentum of 0.9 for two epochs on the training set. We deliberately trained the MLPs without optimization on a validation set. It is important to stress that the training set for the classifier was not the same set as the test set, to ensure that the experiments would be unbiased. The best results obtained for an MLP were a classification error rate of $0.159 \pm 0.022$ and a false positive rate (FPR) (or false alarm ratio) of $0.133 \pm 0.050$. The FPR is the number of false positives divided by the total number of negatives $N$, i.e., $FP/N$. The false negative rate (FNR) is the number of false negatives divided by the number of real positive cases in the data, i.e., $FN/P$. In practice, a false negative is a test result that indicates that a condition does not hold, while in fact it does.
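The two rates defined above can be computed from paired lists of true and predicted labels. The sketch below treats one class as the positive class and all others as negatives; the label strings and the function name `error_rates` are assumptions chosen to mirror the three classes of this study, and the data are invented for illustration.

```python
def error_rates(y_true, y_pred, positive):
    """Return (FPR, FNR) with `positive` as the positive class.
    FPR = FP / N and FNR = FN / P; both N and P must be non-zero."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    negatives = sum(1 for t in y_true if t != positive)
    positives = len(y_true) - negatives
    return fp / negatives, fn / positives

# Hypothetical labels for six nuclei.
y_true = ["atypical", "normal", "normal", "debris", "atypical", "normal"]
y_pred = ["atypical", "atypical", "normal", "debris", "normal", "normal"]
fpr, fnr = error_rates(y_true, y_pred, positive="atypical")
print(fpr, fnr)
```

With a rare positive class, as is the case for atypical cells, the FNR is estimated from very few samples and should be read with corresponding caution.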
In order to assess the efficiency of the rank classifier combination algorithms, error rates were computed from the nuclei whose labels were known. These represented 70% of the observations in a subsample: we took into account the 20 labeled atypical nuclei, randomly selected, and we also assumed that those coming from control patients (120) were all normal nuclei. We proceeded in the same manner to compute the FPR, which stands for the percentage of actual atypical nuclei mis-classified.
In Table 5 and in Figure 6a, we report the classification error rate computed from the data with known labels and its corresponding FPR, for the two procedures. First of all, we can observe that the Condorcet combination rule shows the best performance in terms of classification error rate and FPR (see also Figure 6c). Indeed, only 4.28% of cells are mis-classified, whereas the disagreement combination rule has a mis-classification rate of 4.84% in the best case, with 765 classifiers. The main conclusion is that the success ratio is strongly improved when combining classifiers. However, it is disappointing to see that the Condorcet algorithm results in a significant number of false negatives (pathological cells classified as normal ones); the FNR also remains relatively high, around 10%, in many simulations. Indeed, the classification risk is not symmetric here: the detection of pathological cells triggers the decision for treatment, and their absence implies an absence of treatment.
We compared the clustering partitions obtained by three competitors: sparse k-means (SkM) proposed by Witten and Tibshirani [36,37], general sparse multi-class linear discriminant analysis (GSM-LDA) [38], and sparse EM (sEM) by Zhong et al. [39]. First, we can observe that among these algorithms, the sEM shows the best performance in terms of clustering accuracy. Only 9% of observations are mis-classified, on average, whereas the GSM-LDA algorithm has a mis-classification rate of 15.9%, and the SkM algorithm mis-classifies 19.2% of nuclei. However, the sparse approaches provide better clustering results from a medical point of view, since their results can be interpreted, unlike those of the LDA-type algorithm, for which the fitted discriminative axis is a linear combination of the original variables. Therefore, SkM and sEM provide information which can be interpreted to better understand both the data and the phenomenon.
The rank classifier combination provides the best classification results. We can observe that the global clustering error rates are considerably reduced (Table 6). Indeed, the best error rate reaches 4.28% with 907 classifiers and a conjunction coefficient of 96.6%.

7. Conclusions and Future Research

In this paper, we show that an exact optimal combination rule for a rank classifier ensemble can be computed as the solution to a binary linear programming problem. This rule can be seen as a total order ranking attributed to K classes by a virtual voter summarizing the points of view of M voters. One could also state the dual problem, i.e., is there a distribution of marks or values that could have been attributed to a virtual class C by the M voters? The first problem is related to the idea of aggregating points of view, the second to the idea of summarizing profiles.
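To give a feel for the kind of binary program involved, the sketch below builds a position-count matrix and looks for the ranking that agrees with it most. It rests on two assumptions that are not confirmed by this section: that $\pi_{ij}$ counts the classifiers placing class $i$ at rank $j$ (which is at least consistent with Appendix A, where $\pi_{ij}$ is 0 or $M$ under unanimity), and that the consensus maximizes the total count of matched positions. The linear programming solver is replaced here by an exhaustive search over permutations, which is only practical for small K. All function names and the toy rankings are hypothetical.

```python
from itertools import permutations


def position_counts(rankings, K):
    """pi[i][j] = number of classifiers placing class i at rank j (assumed definition)."""
    pi = [[0] * K for _ in range(K)]
    for ranking in rankings:            # ranking[j] is the class placed at rank j
        for j, i in enumerate(ranking):
            pi[i][j] += 1
    return pi


def consensus_by_positions(rankings, K):
    """Exhaustive stand-in for the binary program: maximize sum_j pi[r[j]][j].

    Ties are broken by the order in which permutations are enumerated.
    """
    pi = position_counts(rankings, K)
    return max(permutations(range(K)),
               key=lambda r: sum(pi[r[j]][j] for j in range(K)))


# Three toy classifiers ranking K = 3 classes (illustrative data only).
toy_rankings = [(0, 1, 2), (0, 2, 1), (1, 2, 0)]
consensus = consensus_by_positions(toy_rankings, 3)
```

The search space is the set of K! permutation matrices, i.e., binary matrices with exactly one 1 per row and per column, which is what makes the problem a binary linear program.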
We compared disagreement and Condorcet metrics, making it possible to quantify the consensus between the classifiers with a conjunction coefficient. The optimal rankings are not the same, i.e., the solution depends on the metric used. Nevertheless, both have shown their efficiency, in addition to the appealing property of being deterministic algorithms: they improve the classification results and ease the interpretation and the understanding of the results. Another point worth mentioning is the theoretical capability of handling the reject option. A weak point of this technique is that it treats all classifiers equally and does not take into account individual classifier capabilities. This disadvantage can be reduced to a certain degree by applying weights. The weights can be different for every classifier, which in turn requires additional training. This idea deserves to be further explored.
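A pairwise (Condorcet-style) consensus can be sketched in the same spirit. The code assumes that $\delta_{ij}$ is the number of classifiers ranking class $i$ ahead of class $j$, and that the consensus maximizes the number of pairwise preferences it respects; both are assumptions made for illustration, and an exhaustive search again stands in for the binary linear program. The names `pairwise_counts` and `consensus_by_pairs` and the toy rankings are hypothetical.

```python
from itertools import permutations


def pairwise_counts(rankings, K):
    """delta[a][b] = number of classifiers ranking class a ahead of class b (assumed)."""
    delta = [[0] * K for _ in range(K)]
    for ranking in rankings:            # ranking[p] is the class placed at rank p
        for pos, a in enumerate(ranking):
            for b in ranking[pos + 1:]:
                delta[a][b] += 1
    return delta


def consensus_by_pairs(rankings, K):
    """Exhaustively pick the ranking respecting the most pairwise preferences."""
    delta = pairwise_counts(rankings, K)

    def agreement(r):
        return sum(delta[r[p]][r[q]] for p in range(K) for q in range(p + 1, K))

    return max(permutations(range(K)), key=agreement)


# Three toy classifiers ranking K = 3 classes (illustrative data only).
toy_rankings = [(0, 1, 2), (0, 2, 1), (1, 2, 0)]
pairwise_consensus = consensus_by_pairs(toy_rankings, 3)
```

On this toy profile the pairwise rule returns (0, 1, 2), although class 2 is placed at rank 1 more often than class 1, so a rule that counts positions would favor (0, 2, 1). This is a small instance of the solution depending on the metric.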
The role of variable selection appears to be significant, as it enables the improvement of both the clustering partition and the modeling of the atypical cells in the cancer detection smear (see Figure 5). In the future, we propose including a rule to rank the selected features and to investigate how the number and nature of classifiers influence the results of the rank classifier combination.

Author Contributions

In this research paper H.M.’s contributions are in investigation and conceptualization. V.V. made substantial contributions to conception and design and in interpretation of data. Both participated in drafting the article.


Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Conjunction Coefficient Extreme Values for Disagreement Metric

As a proof, consider the matrix $\Pi = \{\pi_{ij}\}$ defined in Section 4.2 in the case when the rankings $r_k$ are total orders. Then
$$\sum_{i=1}^{K} \sum_{j=1}^{K} \pi_{ij} (\pi_{ij} - M) = 0, \tag{A1}$$
if all of the classifiers propose the same rankings, because $\pi_{ij} = 0$ or $\pi_{ij} = M$. Then $\pi_{ij} (\pi_{ij} - M) = 0$, $\forall i, j$. From (A1), $\sum_{i=1}^{K} \sum_{j=1}^{K} \pi_{ij}^2 = K M^2$. Therefore, we define the conjunction coefficient as
$$I_d = \frac{\sum_{i=1}^{K} \sum_{j=1}^{K} \pi_{ij}^2}{K M^2}.$$
$I_d \le 1$ because $\sum_{i=1}^{K} \sum_{j=1}^{K} \pi_{ij}^2 \le \sum_{i=1}^{K} \sum_{j=1}^{K} \pi_{ij} M$ since $\pi_{ij} \le M$, $\forall (i, j)$.
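The coefficient $I_d$ can be evaluated numerically from a set of rankings. The sketch below assumes, as a reading of Section 4.2 that is not restated here, that $\pi_{ij}$ counts the classifiers placing class $i$ at rank $j$; the function name and the toy rankings are hypothetical.

```python
def conjunction_coefficient_disagreement(rankings, K):
    """I_d = sum_ij pi_ij^2 / (K * M^2), with pi_ij assumed to be position counts."""
    M = len(rankings)
    pi = [[0] * K for _ in range(K)]
    for ranking in rankings:            # ranking[j] is the class placed at rank j
        for j, i in enumerate(ranking):
            pi[i][j] += 1
    return sum(v * v for row in pi for v in row) / (K * M * M)


# Unanimous classifiers: every pi_ij is 0 or M, so I_d reaches its maximum of 1.
unanimous = conjunction_coefficient_disagreement([(0, 1, 2)] * 4, 3)

# Three classifiers proposing cyclic shifts: every pi_ij equals 1.
cyclic = conjunction_coefficient_disagreement([(0, 1, 2), (1, 2, 0), (2, 0, 1)], 3)
```

In the cyclic toy case the sum of squares is 9 and $K M^2 = 27$, so the coefficient drops to one third.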

Appendix B. Conjunction Coefficient Extreme Values for the Condorcet Metric

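The minimum value is obtained for a maximum disagreement, i.e., if as many classifiers prefer $i$ to $j$ as the opposite, or, mathematically, if $\delta_{ij} - \delta_{ji} = 0$, that is, $\delta_{ij} = \delta_{ji} = \frac{M}{2}$, $\forall i \neq j$.
In the case of $M$ being even,
$$\sum_{i<j}^{K} (\delta_{ij} - \delta_{ji})^2 = \sum_{i<j}^{K} \delta_{ij}^2 + \sum_{i<j}^{K} \delta_{ji}^2 - 2 \sum_{i<j}^{K} \delta_{ji} \delta_{ij}.$$
As $\sum_{i<j}^{K} (\delta_{ij} - \delta_{ji})^2 = 0$ and $2 \sum_{i<j}^{K} \delta_{ij} \delta_{ji} = \frac{M^2 K (K-1)}{4}$, then
$$\sum_{i}^{K} \sum_{j \neq i}^{K} \delta_{ij}^2 = \frac{M^2 K (K-1)}{4},$$
$$\sum_{i}^{K} \sum_{j \neq i}^{K} \delta_{ij} (\delta_{ij} - 1) = \frac{K (K-1)}{2} \left( \frac{M^2}{2} - M \right),$$
$$I_C = \frac{M-2}{M-1} - 1 = -\frac{1}{M-1}.$$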
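In the case of $M$ being odd, let $M = 2m + 1$. In the case of maximum disagreement, $\delta_{ij} - \delta_{ji} = \pm 1$ and $\delta_{ij} \delta_{ji} = m (m+1)$, $\forall i \neq j$. It follows that $\sum_{i<j}^{K} (\delta_{ij} - \delta_{ji})^2 = \sum_{i<j}^{K} 1 = \frac{K (K-1)}{2}$. Moreover, since $\delta_{ij}^2 + \delta_{ji}^2 = (\delta_{ij} - \delta_{ji})^2 + 2 \delta_{ij} \delta_{ji}$ and $\delta_{ij} + \delta_{ji} = 2m + 1$,
$$\sum_{i}^{K} \sum_{j \neq i}^{K} \delta_{ij}^2 = \frac{K (K-1)}{2} \left[ 2 m (m+1) + 1 \right],$$
$$\sum_{i}^{K} \sum_{j \neq i}^{K} \delta_{ij} (\delta_{ij} - 1) = \frac{K (K-1)}{2} \left[ 2 m (m+1) + 1 - 2m - 1 \right] = K (K-1) m^2,$$
$$I_C = \frac{4 m^2}{2m (2m+1)} - 1 = -\frac{1}{M}.$$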
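The two extreme values can be checked numerically. The definition of $I_C$ is given in the main text and is not restated in this appendix, so the normalization used below, $I_C = \frac{4 \sum_{i \neq j} \delta_{ij} (\delta_{ij} - 1)}{K (K-1) M (M-1)} - 1$, is an assumption inferred from the extreme values $-\frac{1}{M-1}$ and $-\frac{1}{M}$ stated above; $\delta_{ij}$ is likewise assumed to count the classifiers ranking class $i$ ahead of class $j$. The function name and toy rankings are hypothetical.

```python
from fractions import Fraction


def conjunction_coefficient_condorcet(rankings, K):
    """I_C under an inferred normalization (assumption, see lead-in); needs K, M >= 2."""
    M = len(rankings)
    delta = [[0] * K for _ in range(K)]
    for ranking in rankings:            # ranking[p] is the class placed at rank p
        for pos, a in enumerate(ranking):
            for b in ranking[pos + 1:]:
                delta[a][b] += 1
    total = sum(delta[a][b] * (delta[a][b] - 1)
                for a in range(K) for b in range(K) if a != b)
    return Fraction(4 * total, K * (K - 1) * M * (M - 1)) - 1


# Even M = 2 with opposite rankings: every pair of classes is split 1 to 1.
even_case = conjunction_coefficient_condorcet([(0, 1, 2), (2, 1, 0)], 3)

# Odd M = 3, K = 2: the single pair is split 2 to 1.
odd_case = conjunction_coefficient_condorcet([(0, 1), (0, 1), (1, 0)], 2)

# Unanimous classifiers.
unanimous = conjunction_coefficient_condorcet([(0, 1, 2)] * 4, 3)
```

Exact fractions are used so that the results can be compared directly with the closed-form values.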


References

  1. Schapire, R.E. Using output codes to boost multiclass learning problems. In Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, TN, USA, 8–12 July 1997; pp. 313–321.
  2. Woźniak, M.; Graña, M.; Corchado, E. A survey of multiple classifier systems as hybrid systems. Special Issue on Information Fusion in Hybrid Intelligent Fusion Systems. Inf. Fusion 2014, 16, 3–17.
  3. Oza, N.; Tumer, L. Classifier ensembles: Select real-world applications. Inf. Fusion 2008, 9, 4–20.
  4. Han, M.; Zhu, X.; Yao, W. Remote sensing image classification based on neural network ensemble algorithm. Neurocomputing 2012, 78, 133–138.
  5. Raj Kumar, P.A.; Selvakumar, S. Distributed Denial of Service Attack Detection Using an Ensemble of Neural Classifier. Comput. Commun. 2011, 34, 1328–1341.
  6. Bolton, R.J.; Hand, D.J. Statistical Fraud Detection: A Review. Stat. Sci. 2002, 17, 235–255.
  7. Nanni, L. Ensemble of classifiers for protein fold recognition. Neurocomputing 2006, 69, 850–853.
  8. Vigneron, V.; Duarte, L.T. Rank-order principal components. A separation algorithm for ordinal data exploration. In Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN), Rio, Brazil, 8–13 July 2018; pp. 1036–1041.
  9. Altinçay, H.; Demirekler, M. An information theoretic framework for weight estimation in the combination of probabilistic classifiers for speaker identification. Speech Commun. 2000, 30, 255–272.
  10. Yang, S.; Browne, A. Neural network ensembles: Combining multiple models for enhanced performance using a multistage approach. Expert Syst. 2004, 21, 279–288.
  11. Wozniak, M. Hybrid Classifiers: Methods of Data, Knowledge, and Classifier Combination; Number 519 in Studies in Computational Intelligence; Springer: Berlin/Heidelberg, Germany, 2014.
  12. Kuncheva, L.I. Classifier Ensembles for Changing Environments. In Proceedings of the 5th International Workshop on Multiple Classifier Systems, Cagliari, Italy, 9–11 June 2004; pp. 1–15.
  13. Bhatt, N.; Thakkar, A.; Ganatra, A.; Bhatt, N. Ranking of Classifiers based on Dataset Characteristics using Active Meta Learning. Int. J. Comput. Appl. 2013, 69, 31–36.
  14. Abaza, A.; Ross, A. Quality Based Rank-level Fusion in Multibiometric Systems. In Proceedings of the 3rd IEEE International Conference on Biometrics: Theory, Applications and Systems, Washington, DC, USA, 28–30 September 2009; pp. 459–464.
  15. Li, Y.; Wang, N.; Perkins, E.; Zhang, C.; Gong, P. Identification and optimization of classifier genes from multi-class earthworm microarray dataset. PLoS ONE 2010, 5, e13715.
  16. García-Lapresta, J.L.; Martínez-Panero, M. Borda Count Versus Approval Voting: A Fuzzy Approach. Public Choice 2002, 112, 167–184.
  17. Zhang, H.; Su, J. Naive Bayesian Classifiers for Ranking. In Proceedings of the 15th European Conference on Machine Learning, Pisa, Italy, 20–24 September 2004; Volume 3201, pp. 501–512.
  18. Dietterich, T.G. Ensemble Methods in Machine Learning. In Proceedings of the First International Workshop on Multiple Classifier Systems, Cagliari, Italy, 21–23 June 2000; Springer: London, UK, 2000; pp. 1–15.
  19. Denison, D.D.; Hansen, M.; Holmes, C.C.; Mallick, B.; Yu, B. Nonlinear Estimation and Classification; Number 171 in Lecture Notes in Statistics; Springer: New York, NY, USA, 2003.
  20. Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140.
  21. Lee, S.; Kouzani, A.; Hu, E. Random forest based lung nodule classification aided by clustering. Comput. Med. Imaging Graph. 2010, 34, 535–542.
  22. Panigrahi, S.; Kundu, A.; Sural, S.; Majumdar, A. Credit card fraud detection: A fusion approach using Dempster–Shafer theory and Bayesian learning. Inf. Fusion 2009, 10, 354–363.
  23. Hansen, L.; Salamon, P. Neural network ensembles. IEEE Trans. Pattern Anal. Mach. Intell. 1990, 12, 993–1001.
  24. Anthimopoulos, M.; Christodoulidis, S.; Ebner, L.; Christe, A.; Mougiakakou, S. Lung Pattern Classification for Interstitial Lung Diseases Using a Deep Convolutional Neural Network. IEEE Trans. Med. Imaging 2016, 35, 1207–1216.
  25. Kawaguchi, K. Deep Learning without Poor Local Minima. In Advances in Neural Information Processing Systems 29; Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2016; pp. 586–594.
  26. Datta, S.; Pihur, V.; Datta, S. An adaptive optimal ensemble classifier via bagging and rank aggregation with applications to high dimensional data. BMC Bioinform. 2010, 11, 427.
  27. Nadal, J.; Legault, R.; Suen, C. Complementary algorithms for the recognition of totally unconstrained handwritten numerals. In Proceedings of the 10th International Conference on Pattern Recognition, Atlantic City, NJ, USA, 16–21 June 1990; pp. 443–446.
  28. Brüggemann, R.; Patil, G. Ranking and Prioritization for Multi-Indicator Systems: Introduction to Partial Order Applications; Environmental and Ecological Statistics; Springer: New York, NY, USA, 2011.
  29. Benson, D. Representations of Elementary Abelian p-Groups and Vector Bundles, 1st ed.; Cambridge Tracts in Mathematics; Cambridge University Press: Cambridge, UK, 2016.
  30. Vigneron, V.; Duarte, L. Toward Rank Disaggregation: An Approach Based on Linear Programming and Latent Variable Analysis. In Latent Variable Analysis and Signal Separation; Tichavský, P., Babaie-Zadeh, M., Michel, O.J., Thirion-Moreau, N., Eds.; Springer International Publishing: Cham, Switzerland, 2017; pp. 192–200.
  31. Gehrlein, W.; Lepelley, D. Voting Paradoxes and Group Coherence: The Condorcet Efficiency of Voting Rules, 1st ed.; Studies in Choice and Welfare; Springer: Berlin/Heidelberg, Germany, 2011.
  32. Korte, B.; Vygen, J. Combinatorial Optimization: Theory and Algorithms, 4th ed.; Springer Publishing Company, Incorporated: Berlin/Heidelberg, Germany, 2007.
  33. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE 1998, 86, 2278–2324.
  34. Li, G.; Guillaud, M.; Follen, M.; MacAulay, C. Double staining cytologic samples with quantitative Feulgen-thionin and anti-Ki-67 immunocytochemistry as a method of distinguishing cells with abnormal DNA content from normal cycling cells. Anal. Quant. Cytopathol. Histopathol. 2012, 34, 273–284.
  35. Scheurer, M.; Guillaud, M.; Tortolero-Luna, G.; McAulay, C.; Follen, M.; Adler-Storthz, K. Human papillomavirus-related cellular changes measured by cytometric analysis of DNA ploidy and chromatin texture. Cytom. Part B Clin. Cytom. 2007, 72, 324–331.
  36. Witten, D.M.; Tibshirani, R. A framework for feature selection in clustering. J. Am. Stat. Assoc. 2010, 105, 713–726.
  37. Kondo, Y.; Salibian-Barrera, M.; Zamar, R. RSKC: An R Package for a Robust and Sparse K-Means Clustering Algorithm. J. Stat. Softw. Artic. 2016, 72, 1–26.
  38. Safo, S.E.; Ahn, J. General Sparse Multi-class Linear Discriminant Analysis. Comput. Stat. Data Anal. 2016, 99, 81–90.
  39. Zhong, M.; Tang, H.; Chen, H.; Tang, Y. An EM algorithm for learning sparse and overcomplete representations. Neurocomputing 2004, 57, 469–476.
Figure 1. General framework for classifier combination. The classifier $K_i(x)$ produces output vector $u_i$. Finally, from $u_i$ the combination function produces a final decision vector $z$.
Figure 2. Permutation matrix put for the ranking of classifier $K_k$.
Figure 3. Condorcet matrices.
Figure 4. Images of cervical cells colored with Papanicolaou stain. (a) Clumps of abnormal cells with large nuclei. (b) Abnormal cells with dense nuclei.
Figure 5. Overview of the processing chain.
Figure 6. Graphic representations of the classification results for disagreement (blue) and Condorcet (red) distances.
Table 1. Confusion matrix of a classifier K i used to estimate p ( U i k | C j ) in the Bayesian approach. U i = R j denotes the classifier decision on class being ranked jth.
| True classes \ Predicted classes | $R_1$ | $R_j$ | $R_K$ | |
|---|---|---|---|---|
| $C_1$ | $n_{11}$ | $n_{1j}$ | $n_{1K}$ | $n_{1\cdot}$ |
| $C_j$ | $n_{j1}$ | $n_{jj}$ | $n_{jK}$ | $n_{j\cdot}$ |
| $C_K$ | $n_{K1}$ | $n_{Kj}$ | $n_{KK}$ | $n_{K\cdot}$ |
| | $n_{\cdot 1}$ | $n_{\cdot j}$ | $n_{\cdot K}$ | |
Table 2. Proposed rank classifier combination using disagreement and Condorcet distances.
| Digits | Classifier Ranks | | | | Proposed Rank | |
|---|---|---|---|---|---|---|
| | $K_1$ | $K_2$ | $K_3$ | $K_4$ | Disag. | Condorcet |
Table 3. Dataset characteristics.
| HPV Test | Total Number of Cells | Number (or %) of Debris | Number (or %) of Cancer |
|---|---|---|---|
Table 4. Overview of the studied dataset.
| | No. of Patients | No. of Nuclei | No./Type of Data |
|---|---|---|---|
| control patients | 9 | 2165 | 427/noisy objects |
| risky patients | 5 | 1692 | 105/atypical nuclei |
| | | | 228/noisy objects |
| Total | 14 | 3857 | 760 objects |
Table 5. Classification results with disagreement and Condorcet combination rules using a set of $M$ classifiers (with $4 \le M \le 1321$).

| Disagreement Distance | | | | | Condorcet Distance | | | | |
|---|---|---|---|---|---|---|---|---|---|
| $I_d$ | $M$ | Error Rate | FPR | FNR | $I_C$ | $M$ | Error Rate | FPR | FNR |
Table 6. Results obtained for the sparse k-means (SkM), general sparse multi-class linear discriminant analysis (GSM-LDA), and sparse EM (sEM) algorithms: average and standard error of the clustering error rate, false positive rate (FPR), and false negative rate (FNR) over 20 simulations.

| Algorithm | Error Rate | FPR | FNR |
|---|---|---|---|
| SkM [36] | 0.192 ± 0.016 | 0.205 ± 0.044 | 0.165 ± 0.084 |
| GSM-LDA [38] | 0.159 ± 0.022 | 0.133 ± 0.050 | 0.118 ± 0.099 |
| sEM [39] | 0.090 ± 0.047 | 0.077 ± 0.022 | 0.062 ± 0.061 |

Share and Cite

MDPI and ACS Style

Vigneron, V.; Maaref, H. M-ary Rank Classifier Combination: A Binary Linear Programming Problem. Entropy 2019, 21, 440.
