1. Introduction
Information-theoretical concepts are inherently and deeply linked to statistical principles. These concepts play a crucial role in data analysis and modeling. The rich set of tools that information theory offers focuses on quantifying uncertainty in a random variable, divergence, dependence, and information gain.
They facilitate the understanding of how data can be utilized more efficiently, which variables provide greater informational value, and how uncertainty within the data can be quantified. They are used in areas like model selection, estimation, hypothesis testing, learning, and decision-making.
By integrating information-theoretic objectives into statistical inference and learning, researchers can develop models that are both interpretable and data-efficient, making these methods especially valuable in high-dimensional and data-scarce domains such as bioinformatics, neuroscience, and medical diagnostics.
To name a few, tools and notions from information theory such as Shannon entropy, Kullback–Leibler divergence, and mutual information provide rigorous means for assessing the complexity and informativeness of models and data. These measures are central to a wide range of statistical applications, including model selection (e.g., through criteria such as Akaike's Information Criterion, AIC, and the Minimum Description Length, MDL), feature selection, clustering, density estimation, and hypothesis testing.
Additionally, methods based on the principle of maximum entropy allow for the estimation of probability distributions under partial information, yielding minimally biased solutions constrained only by known data characteristics. Cross-entropy and information gain are widely used in classification tasks and decision tree algorithms, respectively, to optimize predictive performance.
To define the core problem addressed in this manuscript, consider a given matrix X of dimension $n \times d$ whose columns are continuous individual predictors, and a given binary outcome vector y of dimension n. Our aim is to find a coefficient vector w such that the linear combination Xw would be a better predictor of y than all individual columns of X.
It is well known that logistic regression maximizes the logarithm of the likelihood function,
\[
\ell(w) = \sum_{i=1}^{n} \left[ y_i \log \sigma(w^{\top} x_i) + (1 - y_i)\log\left(1 - \sigma(w^{\top} x_i)\right) \right],
\]
to estimate the coefficient vector $\hat{w}$, where $\sigma(t) = 1/(1 + e^{-t})$ is the sigmoid function, n is the number of samples, and $x_i$ is the $i$-th row of X. After estimating the coefficient vector $\hat{w}$ from logistic regression, one can easily select $\hat{w}$ as a candidate coefficient vector to be used for the linear combination of the individual predictors.
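For concreteness, the following is a minimal Julia sketch of this baseline using the GLM.jl package (the package used for logistic regression in Section 4); the data shown are synthetic placeholders, not any dataset from this study.

```julia
# A minimal sketch: obtaining the logistic-regression coefficients as a
# candidate vector w for the linear combination X * w.
using GLM, Random

Random.seed!(1)
n, d = 200, 5
X = randn(n, d)                              # hypothetical continuous predictors
y = Float64.(rand(n) .< 0.5)                 # hypothetical binary outcome

model = glm(X, y, Binomial(), LogitLink())   # maximizes the log-likelihood above
w = coef(model)                              # candidate coefficient vector
scores = X * w                               # linear combination used to predict y
```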
The novel approaches we exhibit in the following pages utilize information-theoretical principles to determine the coefficient vector w.
Section 2 reviews previous methods used for linear combinations of biomarkers, all of which are non-information-theoretic in nature.
Section 3 presents the details of our proposed methods.
Section 4 is devoted to details and results of the simulation, and
Section 5 showcases an application to real-world data.
2. Previous Methods Used in Linear Combinations of Biomarkers
A biomarker is typically employed as a diagnostic or evaluative tool in medical research. Identifying a single ideal biomarker that offers both high sensitivity and specificity is challenging, particularly when high specificity is needed for population screening. An effective alternative is the combination of multiple biomarkers, which can deliver superior performance compared to using just one biomarker.
Especially in cases where there is more than one weak biomarker, it is important to use them in combination [
1]. Combining biomarkers has several advantages such as increased diagnostic accuracy, better prediction of disease course and prognosis, and the development of personalized medical applications.
Numerous studies in the literature employ ROC (Receiver Operating Characteristic) analysis to assess the performance of combined biomarkers developed through various methodologies. Studies addressing binary classification have focused on maximizing various non-parametric estimates of the area under the ROC curve (AUC), or the Mann–Whitney U statistic, to derive the optimal linear combination of single biomarkers.
AUC is an essential performance indicator of binary models. AUC values show the model’s ability to distinguish between two classes of the binary outcome (discriminative ability). High AUC values indicate better and more accurate discrimination.
The authors in [
2] used Fisher’s discriminant functions under multivariate normality and (non)proportional covariance settings to produce a linear combination which maximizes AUC. Later, the authors in [
3] enhanced and investigated some other properties of these linear combinations and proposed alternative combinations.
The authors in [
4] proposed a distribution-free, rank-based approach for deriving linear combinations of two biomarkers that maximize the area or partial area under the empirical ROC curve. They compared its performance with that obtained by optimizing the logistic likelihood function and by linear discriminant analysis, and claimed that linear discriminant analysis optimizes the AUC when multivariate normality holds. In another publication [5], they provided further insights, for example that selecting the empirical AUC as the objective function may yield better performance than selecting the logistic likelihood function.
The nonparametric Min–Max procedure proposed in [
6] is claimed to be more robust to distributional assumptions, easier to compute, and better performing. The methodology proposed in [
7] extended combination approaches to the setting where the outcome is not dichotomous but ordinal, with more than two categories. The authors proposed two new methods and compared their performance with that of existing methods.
In the study [
1], the authors compared existing methods in settings where the number of biomarkers is large, the biomarkers are weak, and the number of observations is not an order of magnitude greater than the number of biomarkers. They also proposed a new combination method.
Underscoring certain inadequacies of existing combination methods, the authors in [
8] proposed a new kernel-based AUC optimization method and claimed that it outperformed the smoothed AUC method previously proposed in [
9].
The authors in [
10] proposed a derivative-free, black-box optimization technique called the Spherically Constrained Optimization Routine (SCOR) to identify optimal linear combinations when the outcome is ordinal. The method proposed in [
11], Nonparametric Predictive Inference (NPI), examines the best linear combination of two biomarkers, modeling the dependence between the two biomarkers using parametric copulas. Another copula-based combination method in [
12] utilized different copulas for an optimal linear combination of two biomarkers with a binary outcome.
It is important to note that none of the preceding methods employed information-theoretical concepts in their methodologies.
3. Information Theoretical Methods for Linear Combinations
This section details maximum entropy, minimum cross entropy, minimum relative entropy, and maximum mutual information concepts.
3.1. Entropy Maximization (MaxEnt)
Having historical roots in physics, the maximum entropy principle is an approach for selecting the probability distribution that maximizes entropy under given constraints. It was proposed by Jaynes [
13] as a general principle of inference and has been applied successfully in numerous fields. Maximizing entropy means choosing the distribution that reflects maximum uncertainty. An axiomatic derivation of the maximum entropy and minimum cross-entropy principles was provided in [
14]. The authors in [
15] applied the maximum entropy principle to recover information from multinomial data.
Suppose we know that a system has a set of possible states with unknown probabilities $p_i$, $i = 1, \dots, n$, and we have a given matrix X of dimension $n \times d$ and a binary vector y of dimension n. We want to find a coefficient vector w of dimension d so that the vector Xw would be a better predictor of y.
The maximum entropy principle explains how to select a certain distribution among the different possible distributions that satisfy our constraints. We treat the probabilities $p_i = P(y_i = 1)$ as variables when finding the maximum entropy solution to our problem. Notice that, since we deal with the binary vector y and $P(y_i = 0) = 1 - p_i$, solving for the $p_i$ will suffice and reduces the complexity of the problem.
In mathematical terms, we want to find the vector $p = (p_1, \dots, p_n)$ that maximizes the entropy $H(p)$:
\[
H(p) = -\sum_{i=1}^{n} \left[ P(y_i = 1)\log P(y_i = 1) + P(y_i = 0)\log P(y_i = 0) \right].
\]
Replacing $P(y_i = 1) = p_i$ and $P(y_i = 0) = 1 - p_i$, this becomes
\[
H(p) = -\sum_{i=1}^{n} \left[ p_i \log p_i + (1 - p_i)\log(1 - p_i) \right].
\]
We impose constraints to fit the data, requiring that the expectations computed under $p$ match the corresponding empirical expectations obtained from the data and that each $p_i$ is a valid probability. Since, in general, we have many more observations than predictors, n is much larger than d, and the second requirement can be relaxed. Hence, the final optimization problem is to maximize
\[
-\sum_{i=1}^{n} \left[ p_i \log p_i + (1 - p_i)\log(1 - p_i) \right]
\]
subject to the expectation-matching constraints, which are linear in $p$.
This is a solvable convex optimization problem with linear constraints. After determining $p$, namely the $p_i$ values, w still needs to be found. Since we based the problem on finding the probabilities first rather than the coefficient vector w, we need to glue these probabilities to our given matrix X. We use the logits
\[
l_i = \log\frac{p_i}{1 - p_i}, \qquad i = 1, \dots, n,
\]
obtained from those probabilities and perform least squares estimation for the equation $Xw = l$, where $l$ is the logits vector.
Other ways of estimating w without directly finding the probabilities are possible, as we do below in cross-entropy minimization. However, we deliberately focused on this formulation here to show that the method finds probabilities complying with the maximum entropy principle.
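For illustration, a minimal Julia sketch of the MaxEnt step is given below, using JuMP with the Ipopt solver. The expectation-matching constraints shown (matching the expectation of each predictor under $p$ to its empirical counterpart under the observed outcomes) are one concrete reading of the constraints described above, and the bounds, starting values, and solver settings are illustrative choices; the logit and least-squares step follows the text.

```julia
# A minimal sketch of the MaxEnt step; constraint form, bounds, and solver
# settings are illustrative, not the exact formulation used in the simulations.
using JuMP, Ipopt

function maxent_combination(X::Matrix{Float64}, y::Vector{Float64})
    n, d = size(X)
    model = Model(Ipopt.Optimizer)
    set_silent(model)
    @variable(model, 1e-6 <= p[1:n] <= 1 - 1e-6, start = 0.5)
    # One concrete choice of expectation-matching constraints:
    # sum_i p_i * x_ij == sum_i y_i * x_ij for each predictor j.
    @constraint(model, [j = 1:d],
        sum(p[i] * X[i, j] for i in 1:n) == sum(y[i] * X[i, j] for i in 1:n))
    # Maximize the binary entropy of the probabilities p_i.
    @objective(model, Max,
        -sum(p[i] * log(p[i]) + (1 - p[i]) * log(1 - p[i]) for i in 1:n))
    optimize!(model)
    phat = value.(p)
    l = log.(phat ./ (1 .- phat))   # logits
    return X \ l                    # least-squares solution of X * w ≈ l
end
```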
3.2. Cross-Entropy Minimization (MinCrEnt)
First proposed by Kullback [
14], this principle incorporates the cross entropy of two distributions to find the optimal solution.
Suppose we know that a system has a set of possible states with unknown probabilities $p_i$ for $i = 1, \dots, n$, as described in
Section 3.1.
The principle states that, among the different possible distributions $p$ that satisfy certain constraints, one should select the one which minimizes the cross entropy with $q$, where $q$ is a known a priori distribution. In our case, we select $q$ to be the empirical distribution of the outcome (namely, Y):
\[
H(q, p) = -\sum_{i=1}^{n} \sum_{k \in \{0,1\}} P_q(y_i = k)\log P_p(y_i = k).
\]
Again using $P_p(y_i = 1) = p_i$, $P_p(y_i = 0) = 1 - p_i$, and the empirical probabilities $P_q(y_i = 1) = y_i$, we can simplify the objective function to
\[
H(q, p) = -\sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \right].
\]
The objective function we obtained on the right-hand side indeed coincides with the negative of the log-likelihood function from logistic regression. We know the distribution of the $y_i$ values a priori, but we do not know the $p_i$ values. A natural and inevitable choice for obtaining the $p_i$ values is to use a transformation of the given data matrix X with the coefficient vector w. Thus, we define $p_i = \sigma(w^{\top} x_i)$, where $\sigma$ is the sigmoid function. Hence, the optimization problem becomes
\[
\min_{w} \; -\sum_{i=1}^{n} \left[ y_i \log \sigma(w^{\top} x_i) + (1 - y_i)\log\left(1 - \sigma(w^{\top} x_i)\right) \right],
\]
where the $y_i$ values are known probabilities obtained from the vector y.
We do not impose any further constraints here because the use of the sigmoid function confines the probabilities to between 0 and 1, and we have already made use of X. This is an unconstrained convex optimization problem that can easily be solved.
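Because the problem is unconstrained and smooth, any gradient-based routine can solve it; a minimal Julia sketch using plain gradient descent is given below, where the step size and iteration count are arbitrary choices and the gradient of the objective is $X^{\top}(\sigma(Xw) - y)$.

```julia
# A minimal sketch of the MinCrEnt objective solved by plain gradient descent;
# learning rate and number of steps are illustrative placeholders.
sigmoid(z) = 1 / (1 + exp(-z))

function mincrent_combination(X::Matrix{Float64}, y::Vector{Float64};
                              steps = 5_000, lr = 1e-2)
    n, d = size(X)
    w = zeros(d)
    for _ in 1:steps
        p = sigmoid.(X * w)
        w -= (lr / n) * (X' * (p .- y))   # gradient of the cross entropy
    end
    return w
end
```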
3.3. Relative Entropy Minimization (MinRelEnt)
Relative entropy minimization is a technique that is less well known and partly misidentified with cross-entropy minimization [
16]. Like cross entropy, relative entropy requires a reference distribution. We used the empirical distribution of y as the reference distribution $q$ in cross-entropy minimization, but here we choose the uniform distribution, $q_i = 1/2$. The optimization problem becomes
\[
\min_{p} \; D(p \,\|\, q) = \sum_{i=1}^{n} \left[ p_i \log\frac{p_i}{q_i} + (1 - p_i)\log\frac{1 - p_i}{1 - q_i} \right],
\]
or, with $q_i = 1/2$,
\[
\min_{p} \; \sum_{i=1}^{n} \left[ p_i \log(2 p_i) + (1 - p_i)\log\left(2(1 - p_i)\right) \right].
\]
As we did in entropy maximization, we use the same constraints on empirical expectations to determine the probabilities. This is also a convex optimization problem with linear constraints. After finding the probabilities, we transform them to logits and solve $Xw = l$, where $l$ is the logits vector, using least squares, as we did for entropy maximization. It is worth noting that alternative choices for $q$ are possible, such as the empirical distribution of y or any other prior belief about y.
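A corresponding Julia sketch for MinRelEnt is given below, again using JuMP and Ipopt, with the uniform reference $q_i = 1/2$ and the same illustrative expectation-matching constraints as in the MaxEnt sketch; bounds and solver settings are again placeholders.

```julia
# A minimal sketch of the MinRelEnt step with a uniform reference q_i = 1/2;
# the constraint form and solver settings are illustrative.
using JuMP, Ipopt

function minrelent_combination(X::Matrix{Float64}, y::Vector{Float64})
    n, d = size(X)
    q = 0.5                                   # uniform reference probability
    model = Model(Ipopt.Optimizer)
    set_silent(model)
    @variable(model, 1e-6 <= p[1:n] <= 1 - 1e-6, start = 0.5)
    # Same illustrative expectation-matching constraints as in the MaxEnt sketch.
    @constraint(model, [j = 1:d],
        sum(p[i] * X[i, j] for i in 1:n) == sum(y[i] * X[i, j] for i in 1:n))
    # Minimize the relative entropy (KL divergence) of p from the uniform reference.
    @objective(model, Min,
        sum(p[i] * log(p[i] / q) +
            (1 - p[i]) * log((1 - p[i]) / (1 - q)) for i in 1:n))
    optimize!(model)
    phat = value.(p)
    l = log.(phat ./ (1 .- phat))             # logits
    return X \ l                              # least squares for X * w ≈ l
end
```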
3.4. Mutual Information Maximization (MaxMutInf)
As was previously done in a similar, though not identical, setting by Faivishevsky and Goldberger [
17], we estimate the coefficient vector w through mutual information maximization. We use the following identity for mutual information:
\[
I(X; Y) = H(X) - H(X \mid Y),
\]
where X is the predictor matrix, Y is the empirical distribution of the outcome vector y in our problem, and $\hat{I}$ denotes the mutual information estimator defined by Faivishevsky and Goldberger. Since Y is categorical, we deal with conditional entropies corresponding to the states of Y:
\[
H(X \mid Y) = \sum_{c \in \{0,1\}} \frac{n_c}{n}\, H(X \mid Y = c),
\]
where $H(X \mid Y = c)$ is the entropy of X restricted to the observations where Y takes the value $c$ (also called the in-class entropy). Hence, the optimization problem becomes
\[
\max_{w} \; \hat{I}(Xw; Y),
\]
or, equivalently,
\[
\max_{w} \; \hat{H}(Xw) - \sum_{c \in \{0,1\}} \frac{n_c}{n}\, \hat{H}(Xw \mid Y = c),
\]
where $n_c$ is the number of observations having class value $c$. It is easy to see that $n_0 + n_1 = n$. The smoothness of the MeanNN entropy estimator enables its gradient to be computed analytically; therefore, the gradient of $\hat{I}(Xw; Y)$ with respect to w is available in closed form. Since the gradient of the mutual information is available, we use the gradient ascent method to maximize the mutual information.
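For illustration, a minimal Julia sketch of the gradient ascent is given below. It assumes that the MeanNN entropy of the projected scores $z = Xw$ is, up to an additive constant, the average of $\log|z_i - z_j|$ over pairs, so that the gradient of each entropy term with respect to w is the corresponding average of $(x_i - x_j)/\big(w^{\top}(x_i - x_j)\big)$; the step size, iteration count, and initialization are arbitrary placeholders.

```julia
# Gradient, with respect to w, of the mean log pairwise distance of the
# projected scores z = X * w (the assumed MeanNN entropy term up to constants).
function mean_logdist_gradient(z::Vector{Float64}, X::Matrix{Float64})
    n, d = length(z), size(X, 2)
    g = zeros(d)
    npairs = 0
    for i in 1:n-1, j in i+1:n
        diff = z[i] - z[j]
        abs(diff) < 1e-12 && continue          # skip numerically coincident points
        g .+= (X[i, :] .- X[j, :]) ./ diff     # d/dw of log|w'(x_i - x_j)|
        npairs += 1
    end
    return g ./ max(npairs, 1)
end

function maxmutinf_combination(X::Matrix{Float64}, y::Vector{Float64};
                               steps = 500, lr = 1e-2)
    n, d = size(X)
    w = randn(d) ./ sqrt(d)
    for _ in 1:steps
        z = X * w
        grad = mean_logdist_gradient(z, X)     # gradient of the H(Xw) term
        for c in unique(y)                     # subtract the in-class entropy terms
            idx = findall(==(c), y)
            grad .-= (length(idx) / n) .* mean_logdist_gradient(z[idx], X[idx, :])
        end
        w .+= lr .* grad                       # gradient ascent on the MI estimate
    end
    return w
end
```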
4. Simulation
An extensive simulation study was conducted to compare the efficiencies of these methods for combining continuous variables (or biomarkers). Imitating, but also enriching, the simulation previously performed in [
1] for comparison purposes, we considered normal, gamma, and beta distributions under different settings.
In all settings, the class mean values were chosen to be evenly spaced and dependent on the number of predictors d, with the class 1 means shifted relative to the class 0 means. Candidate biomarkers generated through this approach typically have modest AUC values.
We generated normal, beta, and gamma variates from multivariate normal variables with equal and unequal covariance structures, using a normal copula and inverse transform sampling, as sketched below. In the equal-covariance case, the common covariance matrix was an exchangeable combination of the identity matrix I and the all-ones matrix J; in the unequal-covariance case, the covariance matrix of one class was modified while that of the other class was kept unchanged.
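The following is a minimal Julia sketch of this generation scheme; the marginal distributions, correlation value, sample size, and dimension shown are placeholders rather than the exact simulation settings.

```julia
# A minimal sketch of generating correlated non-normal biomarkers via a normal
# copula and inverse transform sampling; parameters are illustrative.
using Distributions, LinearAlgebra, Random

function copula_sample(margins::Vector{<:UnivariateDistribution},
                       Sigma::Matrix{Float64}, n::Int)
    d = length(margins)
    Z = rand(MvNormal(zeros(d), Sigma), n)'        # n x d multivariate normal draws
    U = cdf.(Normal(), Z)                          # map to uniforms (normal copula)
    X = similar(U)
    for j in 1:d
        X[:, j] = quantile.(margins[j], U[:, j])   # inverse transform to target margins
    end
    return X
end

# Example: three Gamma-distributed biomarkers with exchangeable correlation 0.3.
Random.seed!(1)
Sigma = 0.3 .* ones(3, 3) + 0.7 .* Matrix(I, 3, 3)
X = copula_sample([Gamma(2.0, 1.0) for _ in 1:3], Sigma, 100)
```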
For each setting, 1000 datasets were generated and randomly divided into two sets of equal size. One set was used for training and the other for testing. Coefficient estimates obtained from training datasets were recorded for each method. Using the test datasets, linear combinations were computed based on the previously estimated coefficients, and corresponding performance metrics were evaluated. These metrics included AUC, Area Under the Precision–Recall Curve (AUPRC), and Matthews Correlation Coefficient (MCC). In addition to mean values, 95% confidence intervals and median values were reported for each metric.
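As an illustration of the evaluation step, minimal Julia implementations of two of the reported metrics are sketched below: the AUC via its rank-based (Mann–Whitney) formulation and the MCC at a fixed threshold. The classification threshold used for the MCC here is an arbitrary placeholder, and the AUPRC can be computed analogously from the precision–recall curve.

```julia
# Minimal sketches of two evaluation metrics for scores s = X_test * w and
# binary labels y_test; the MCC threshold below is an illustrative placeholder.

# AUC via the rank (Mann–Whitney) formulation, with ties counted as 0.5.
function auc(scores::Vector{Float64}, y::Vector{Float64})
    pos = scores[y .== 1]
    neg = scores[y .== 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos, q in neg)
    return wins / (length(pos) * length(neg))
end

# Matthews Correlation Coefficient at a fixed threshold on the linear scores.
function mcc(scores::Vector{Float64}, y::Vector{Float64}; thr = 0.0)
    yhat = scores .> thr
    tp = sum((yhat .== 1) .& (y .== 1)); tn = sum((yhat .== 0) .& (y .== 0))
    fp = sum((yhat .== 1) .& (y .== 0)); fn = sum((yhat .== 0) .& (y .== 1))
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return denom == 0 ? 0.0 : (tp * tn - fp * fn) / denom
end
```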
Several sample sizes and numbers of predictors were examined. Unlike the referenced simulation study, we also examined scenarios with unequal class allocation, in which the sample size of one class was reduced relative to the other.
Table 1,
Table 2 and
Table 3 present AUC values obtained from the simulation results. Additional results including those for AUPRC, MCC, and all unequal allocation cases are provided in the
Supplementary Material.
All simulations were conducted using Julia version 1.11.0-rc3. We utilized the Ipopt.jl package version 1.11.0 for nonlinear constrained optimization in entropy-based methods, the GLM.jl package version 1.9.0 for logistic regression, and custom gradient ascent code for the mutual information maximization method.
The code used in the simulations will be released as public Julia and R packages in future publications. Interested researchers may contact the author for access.
5. Application to Real-Life Data
We used publicly available Wisconsin datasets for the prognosis [
18] and diagnosis [
19] of breast cancer. There are 30 continuous predictors in each dataset to be used for the prediction of either prognosis or diagnosis of breast cancer indicated
.
The prognostic dataset has 198 observations while the diagnostic has 569. Frequencies of class are and in those datasets. We considered the first 5 predictors in each dataset for our calculations, namely the variables radius_mean, texture_mean, perimeter_mean, area_mean, and smoothness_mean.
The predictive abilities of these variables are given in
Table 4. As can be seen from
Table 4, these predictors have very low predictive ability for the prognosis of breast cancer but high ability for its diagnosis.
We applied logistic regression and the entropy optimization techniques described in
Section 3 and obtained the coefficients shown in
Table 5 and
Table 6, representing candidate linear combinations for both datasets.
Finally, calculating linear combinations using those coefficients, we obtained AUC, AUPRC, and MCC values given in
Table 7 and
Table 8. Our methods yielded metric values very similar to those derived from logistic regression.
6. Discussion and Conclusions
The primary objective of this study was to explore how information-theoretical methods can be applied to construct linear combinations of continuous variables in the context of a binary classification problem.
We addressed this question by introducing four distinct approaches (MaxEnt, MinCrEnt, MinRelEnt, and MaxMutInf), each grounded in fundamental principles of information theory.
We believe that identifying and formalizing these approaches may guide future research directions by drawing greater attention to information-theoretical concepts and encouraging their broader application in this problem setting. This study also represents the first systematic evaluation of information-theoretic approaches applied in this setting.
Earlier methods were often constrained by strong distributional assumptions about biomarkers, such as normality and equal or proportional covariance structures. Additionally, the number of predictors they could handle was typically limited to just two.
More recent approaches have relaxed these assumptions, allowing for the inclusion of a larger number of biomarkers in linear combinations. However, most of these methods rely on performance metrics like sensitivity, specific segments of the ROC curve, or the AUC (Area Under the Curve) as the optimization objective. Some later methods extended their applicability to multi-class classification problems using metrics such as the Volume Under the Surface (VUS) or the Hypervolume Under the Manifold (HUM).
Among these, the SCOR algorithm [
10] and the method proposed in [
1] have shown promising results in maximizing the AUC or HUM objective.
Notably, methods established in [
11,12] distinguish themselves by incorporating copulas, representing the first known use of copulas in the context of biomarker combination. Despite this innovation, these methods currently remain limited to handling only two biomarkers.
An important distinction of our proposed methods is that, whereas previous methods primarily aimed to maximize a metric such as the AUC, the methods proposed in this article rely solely on information-theoretic criteria and never incorporate the AUC or any other performance metric as an optimization objective.
As another distinction, unlike many previous studies that relied on a single evaluation metric, our assessment of model performance employed a comprehensive set of three metrics: AUC, AUPRC, and MCC. This multi-faceted approach provides a more robust and nuanced comparison of model performance across different aspects of classification quality.
The methods proposed here are straightforward to apply to binary classification problems and can be extended to a multiclass setting. These methods are computationally simpler than most existing approaches, with the exception of MaxMutInf. Due to the need to compute complex gradient functions when optimizing mutual information, MaxMutInf is relatively more challenging to implement. Extending our methods to classification problems involving more than two outcome levels may represent a valuable direction for future research.
The three entropy-based methods (MaxEnt, MinCrEnt, and MinRelEnt) demonstrated test performances that were consistently comparable to those derived from logistic regression, across all simulation conditions and real data applications. The MaxMutInf method yielded slightly greater test AUC, AUPRC, and MCC values compared to the other approaches, particularly in simulation settings involving Beta and Gamma distributed data. The differences were more pronounced in simulations with smaller sample sizes. However, in some settings involving normally distributed data, the MaxMutInf method produced test AUC and AUPRC metrics that were comparable or even marginally lower than those of other methods.
The Maximum Mutual Information method appeared more robust to distributional asymmetries, as well as to smaller and unequal sample sizes, except in a few settings involving normally distributed variables. Exploring and implementing different differential entropy estimators beyond the Kozachenko–Leonenko entropy estimator may be a future research direction that could potentially reveal more insights into the performance of Mutual Information Maximization.