Incorporating a New Summary Statistic into the Min–Max Approach: A Min–Max–Median, Min–Max–IQR Combination of Biomarkers for Maximising the Youden Index

: Linearly combining multiple biomarkers is a common practice that can provide a better diagnostic performance. When the number of biomarkers is sufﬁciently high, a computational burden problem arises. Liu et al. proposed a distribution-free approach (min–max approach) that linearly combines the minimum and maximum values of the biomarkers, involving only a single coefﬁcient search. However, the combination of minimum and maximum biomarkers alone may not be sufﬁcient in terms of discrimination. In this paper, we propose a new approach that extends that of Liu et al. by incorporating a new summary statistic, speciﬁcally, the median or interquartile range (min–max–median and min–max–IQR approaches) in order to ﬁnd the optimal combination that maximises the Youden index. Although this approach is more computationally intensive than the one proposed by Liu et al, it includes more information and the number of parameters to be estimated remains reasonable. We compare the performance of the proposed approaches (min–max–median and min–max–IQR) with the min–max approach and logistic regression. For this purpose, a wide range of different simulated data scenarios were explored. We also apply the approaches to two real datasets (Duchenne Muscular Dystrophy and Small for Gestational Age).


Introduction
In clinical practice, it is common to have information on multiple biomarkers for disease diagnosis. Combining them all into a single marker is common practice and usually offers better diagnostic accuracy than considering each biomarker separately [1,2]. How to increment the performance of a standard marker by adding new markers has also received attention in the literature [3].
The combination of continuous biomarkers provides a continuous value and the dichotomisation of this value is important as it provides the clinician with a classification rule related to a disease and their clinical decisions are usually based on such categorisation of patients into groups [4].
Although different algorithms have been considered to provide predictive models, linear models (linear combination of biomarkers) have been widely proposed for binary classification problems due to their simplicity of interpretation and good performance [5].
The accuracy of the diagnostic marker is usually assessed on the basis of its discriminatory ability as analysed by the Receiver Operating Characteristic (ROC) curve through derived statistics such as sensitivity and specificity pairs, the area or partial area under the ROC curve (AUC/pAUC), or the Youden index [6]. Specifically, the formulation of algorithms for the estimation of binary classification models that maximise the AUC has been widely explored and has been the background and basis for many subsequent algorithms recently published [7][8][9].
Su and Liu [10] provided the estimation of linear models that maximises AUC under multivariate normality. However, this normality assumption is a high-demand hypothesis on real data and is often not easy to meet in clinical reality. This real limitation leads to the need for more flexible statistical approaches that are not subject to distributional assumptions [11,12]. Several statistical methods have been presented in the literature that do not depend on a specific distribution. One example is the statistics of the fractional moments, whose effectiveness and advantages have been demonstrated in real scenarios [13,14]. Pepe et al. [15,16] proposed a distribution-free approach to estimate the linear model that maximise AUC based on the Mann-Whitney U statistic [17], in order to address the Su and Liu limitation. The process that Pepe et al. proposed is a discrete optimisation which is based on an extensive search over the parameter vector. However, this process is not computationally accessible (NP-hard problem) for a number of biomarkers p ≥ 3. In order to make it computationally tractable, Pepe et al. also suggested the use of stepwise algorithms that are an alternative which have shown to perform well in various scenarios. The idea of these algorithms is to include a new biomarker at each step by selecting the best combination of two biomarkers. The approach and suggestions proposed by Pepe et al. have been the source of development of non-parametric and semiparametric approaches in the construction of classifiers under optimality criteria derived from the ROC curve.
Esteban et al. [18] implemented the approach of Pepe et al. providing strategies to handle ties that appear in the sequencing of partial optimisations under AUC optimisation criteria. Kang et al. [19,20] proposed a less computationally demanding stepwise combination approach based on a rated of AUC values that correspond to predictors variables. However, finding the optimal parameters in these stepwise approaches becomes a computational burden problem that is also difficult to tackle when the number of biomarkers is high (p > 5).
Liu et al. [21] proposed an approach, called the min-max approach, which linearly combines the minimum and maximum values of biomarkers, involving a single coefficient search that maximises the Mann-Whitney U statistic of AUC. The advantage of this approach is that it is computationally tractable regardless of the number of biomarkers. However, it has been shown that there are scenarios where the min-max approach generally achieves less optimality compared to other methods that use information from all biomarkers, such as stepwise approaches [19,22]. This may indicate that the combination of minimum and maximum biomarkers alone may not be sufficient in terms of discrimination [5].
Algorithms focused on optimising other parameters derived from the ROC curve have also been developed. Liu et al. [23] analysed the optimal linear combination into diagnostic marker maximising sensitivity over a range of specificity. Yin and Tian [24] proposed approaches to estimate the joint confidence region of sensitivity and specificity at the cut-off point determined by the Youden index. Based on the stepwise combination approach of Kang et al., Yin and Tian [22] conducted a study with the aim of optimising the Youden index. Yin and Tian [25] also studied the optimisation of the AUC and the Youden index simultaneously and presented both parametric and non-parametric approaches that estimate the joint confidence region of the AUC and the Youden index. Based on the min-max approach, Ma et al. [26] adapted and extended the min-max method to the estimation of the pAUC for multiple continuous biomarkers. Yu and Park [27] and Yan et al. [28] also explored methods for the linear combination of multiple biomarkers that optimise the pAUC. The Youden index is a statistical metric widely and successfully used in several clinical studies and, at the same time, serves as an appropriate summary for making the diagnosis and as a criterion for choosing the best cut-off points for dichotomising a biomarker [29], which is key for the diagnosis of the disease. There are different methods for providing the best cut-off point depending on the aim of the classification, but there is consensus that, without a clear reason to provide higher values for sensitivity or specificity, the Youden index provides an optimised balance of sensitivity/specificity.
In order to address the computational burden regardless of the number of biomarkers and to increase the predictive capacity, in this work we propose a new approach that extends that of Liu et al. by incorporating a new summary statistic, in particular, the median or interquartile range (min-max-median and min-max-IQR approaches). Although this approach is more computationally intensive than the one proposed by Liu et al., it includes more information and the number of parameters to estimate remains reasonable. For the choice of the best linear combination of these three variables, we use a stepwise method (two steps) following the empirical search approach of Pepe and Thompson. For the search of the optimal linear combination, we consider the Youden index as an objective function.
We compared the performance of the proposed algorithm (min-max-median and min-max-IQR approaches) with the min-max approach adapted to optimise the Youden index and logistic regression, which are computationally tractable linear approaches. For this purpose, a comprehensive study was carried out considering different simulated data scenarios. The performances of the approaches on real datasets (e.g., Duchenne muscular dystrophy and Small for Gestational Age prediction) were also analysed. The comparison of the performance of the algorithms was carried out by considering the results obtained on the entire dataset (resubstitution method) and using a validation procedure derived from K-fold cross-validation, which evaluates the generalisability of the predictive model and avoids possible overfitting [30].
Thus, the aim of this work is to propose a new approach (min-max-median/minmax-IQR approach) that has the advantage of being computable regardless of the number of biomarkers and that does not depend on any distributional assumptions, and analyse their performance in comparison with the min-max approach and the logistic regression which is a standard in prediction models with good performance, to find optimal scenarios.

Materials and Methods
Firstly, we introduce the formulation of the linear model and estimation suggestions of Pepe et al. [15,16]. This is the basis for the formulation and estimation of the linear model of summary statistic-based approaches. Then, the methods based on Youden index maximisation to be compared are presented: the min-max approach, the min-max-median and min-max-IQR approaches and logistic regression. Finally, the simulated scenarios as well as the real datasets considered are described. All methods were programmed and applied using the free software R [31].

Background: Pepe et al.'s Approach
Pepe and Thompson [15] proposed a non-parametric approach in order to estimate the linear model that maximises the AUC based on the Mann-Whitney U statistic [17]. Thus, this approach allows its use for all types of data without any distribution assumptions. The formulation of the linear model is as follows: where p denotes the number of biomarkers, X i the biomarker i ∈ [1, . . . , p] and β i the parameter to be estimated. If, in addition, we assumed that we have n k individuals of the group k = 1, 2 (diseased and non-diseased, respectively) and X ki is the vector of p biomarkers for the i th individual of group k, then the empirical AUC based on the Mann-Whitney U statistic is given by the following expression: For the estimation of the parameter β i , Pepe et al. suggested a discrete optimisation which is based on a grid search over 201 equally spaced values in the interval [−1, 1]. For simplicity, consider the linear combination of two biomarkers: X i + βX j . Due to the invariant property of the ROC curve for any monotonic transformation, dividing by the β value does not change the value of the sensitivity and specificity pair. Thus, estimating X i + βX j for β > 1 and β < −1 is equivalent to estimating 1 β X i + X j for β ∈ [−1, 1] and, therefore, all possible values of β ∈ R are covered.
However, this optimisation requires a large computational effort for dimensions p ≥ 3. In order to make it computationally accessible, Pepe et al. suggested using stepwise algorithms in which a new biomarker is included at each stage, selecting the best combination of two biomarkers. This turns an unaddressable computational burden problem into an addressable one by applying p − 1 times Pepe et al.'s suggested singleparameter estimate.
Although Pepe et al. originally introduced this approach to optimise the AUC, both the formulation of the linear model (1), the suggested empirical search and the suggested use of stepwise approaches are the basis in the development of the min-max approaches and our proposals: min-max-median and min-max-IQR approaches under Youden index maximisation.
Let X kij be the j th biomarker (j = 1, . . . , p) for the i th individual of group k = 1,2 (disease and non-disease, respectively), X ki the vector of p biomarkers for the i th individual of group k and X k = X k1 , . . . , X kn k where n k is the number of individuals in group k. The Youden Index (J) of the linear combination is defined as: where c denotes the cut-off point and F Y k (c) = P(Y k ≤ c) the cumulative distribution function of random variable Y k = β T X k , k = 1, 2 (linear combination), β = (β 1 , . . . , β p ) T being the parameter vector to be estimated. It is therefore immediate to deduce the following formula from the empirical estimation of the Youden index: where c β = {c : max c F Y 2 (c) − F Y 1 (c) } denotes the optimal cut-off point.

Min-Max Approach
Liu et al. [21] proposed the distribution-free min-max approach. The idea of this approach is based on reducing the order of the linear combination by considering only two markers from the measurements information of the original p biomarkers (maximum value and the minimum value of all the p biomarkers): Therefore, this approach is reduced to estimating a single β parameter of the linear combination that maximises the AUC, as it is based on the linear model proposed by Pepe et al. expressed in Equation (1).
This approach was adapted in our work with in order to maximise the Youden index with the following expression: where X ki,max = max 1≤j≤p X kij and X ki,min = min 1≤j≤p X kij for each i = 1, . . . , n k . k = 1, 2.
The search for the optimal parameter follows the empirical search suggested by Pepe et al.: for each value β of the 201 equally spaced ∈ [−1, 1], the optimal cut-off point (ĉ β ) that maximises Youden index is calculated. The final value chosen (β) is the one with the highest Youden (Ĵ β ) obtained.

Our Proposed Summary Statistics-Based Approaches: Min-Max-Median and Min-Max-IQR
We propose two new approaches based on summary statistics to maximise the Youden index. The underlying idea of the approaches is based on the same idea as the min-max approach (Section 2.2) of reducing the dimension of the problem by considering summary statistic information of the p biomarkers as variables. They could be seen as an extension of the min-max approach by incorporating a new summary statistic. The aim of this proposal was to incorporate more information from the original biomarkers while keeping the approach computationally tractable.
Specifically, we propose two approaches: including the median and IQR at the maximum and minimum of the p biomarkers measurements. Therefore, our approach is reduced to considering the linear combination of three variables (max, min, median-max, min, IQR).
For the estimation of the optimal linear combination of these three variables, a stepwise approach is used, as suggested by Pepe et al., in order to ensure an approachable computational time. Specifically, our proposals implement an adaptation of the stepwise approach proposed by Esteban et al. [18] for Youden Index maximisation, where at each step a new variable is introduced and the optimal linear combination of two variables is selected using the empirical search suggested by Pepe et al. In this way, we turn a time-consuming problem into a computationally tractable problem by considering two steps of optimising the linear combination of two variables.
The proposed approach is explained in detail in the following steps. Consider the min-max-median approach. The min-max-IQR approach is equivalent: 1. Firstly, the problem of estimating the optimal linear combination of p variables is transformed to the estimation of the optimal linear combination of three variables (min, max, median). For this purpose, the values of these new variables are calculated for each i individual from their values of the p original biomarkers: X kij , X ki,median = median 1≤j≤p X kij for each i = 1, . . . , n k ., k = 1, 2. (7) 2. Once the information is transformed, a stepwise approach is used to estimate the optimal linear combination of the three new variables (min-max-median). For simplicity, denote by X kij the values of the new transformed variables (min-max-median, j = 1, 2, 3) for the i th individual of group k = 1, 2. The first step of the stepwise approach consists of choosing the combination of two variables that maximises the Youden index, using empirical search proposed by Pepe et al.: The maximum Youden index can be reached for more than one optimal linear combination. Our approach considers these ties and drags them forward in the next step. For simplicity, consider that it is reached for a single linear combination. The optimal combination chosen (two variables and one parameter β 2 ) that maximises the Youden index (8) is considered as a single variable in the next step (and final) of the stepwise approach. For simplicity, suppose the following optimal linear combination: 3. Finally, the last variable (X ki3 ) not chosen in the previous step is included and the optimal linear combination of the two variables is chosen as in point 2. Specifically, either combination (9) or (10) that maximises the Youden indexĴ β 3 is selected: The code of all approaches is available upon request to Ms. Rocío Aznar-Gimeno (raznar@itainnova.es).
The predictive ability of these approaches based on summary statistics (min-max approach and min-max-median/IQR approach) is also compared with the classical logistic regression approach [32], which is a subtype of generalised linear model to predict the probability of an event (disease or non-disease) given a set of variables through the logistic function. We used for comparisons the logistic regression method in the multivariate binormal setting, for the diseased and non-diseased populations, which showed similar performances in the linear discriminant analysis, which is standard in predictive models, exhibiting a good performance in all scenarios [15]. The objective was to compare computationally tractable linear combination methods, regardless of the number of original biomarkers, dealing with biomarkers' real values.

Performance Comparison
The comparison was carried out by considering both the resubstitution method and a method from 5-fold cross-validation. For this purpose, a range of simulated data were explored. Their application to two real dataset examples was also explored in order to evaluate their performances in real cases.

Simulations
The simulated data analysed cover a wide range of scenarios in terms of the distribution of biomarkers, the ability to discriminate between biomarkers and the correlation between them, considering smaller to larger sample sizes.
In order to observe behaviours with smaller and larger numbers of biomarkers, four biomarkers (p = 4) and ten biomarkers (p = 10) were considered in the simulated scenarios. The peculiarity of statistics-based approaches (in particular our proposed approaches) is that they translate biomarker information into only 3 variables, regardless of the number of initial biomarkers. This has the advantage that they are always computationally feasible. Furthermore, in scenarios where the number of biomarkers to be considered exceeds the sample size, these approaches can avoid overfitting to the data, due to their inherent dimensionality reduction characteristics. In order to analyse the performance of the algorithms in these specific cases, simulated scenarios of p = 100 biomarkers and a smaller sample size (n 1 = n 2 = 30 were also explored. Different joint distributions were considered following multivariate normal (X k ∼ MV N(m k , Σ k )) and also non-normal distributions (in particular, log-normal: e X k ) in both diseased and non-diseased populations (k = 1, 2). In this way, both normal distributions, which is one of the common distributions in statistics, and distributions that allow assessment beyond normality and symmetry, were analysed.
When the multivariate normal distribution for diseased and non-diseased population are assumed X ∼ N(m 1 , Σ 1 ), Y ∼ N(m 2 , Σ 2 ) the coefficients for the optimal linear combi-nation (which coincides with the Linear Discrimination Function (LDF)) and its area under the ROC curve are known and given by β Su and Liu [10]).
Using this formula, we simulated markers by choosing mean vectors and variancecovariance matrices, whose AUC shown a good discrimination ability, with an AUC above 0.8 or at least near 0.8. The reason behind this is that these models are those whose dichotomisation by Youden index is interesting in real practice.
Regarding the vector of means, the null vector was considered for the non-diseased population (m 2 = 0) in all simulated scenarios. For the diseased population, scenarios considering both the same and different abilities to discriminate between biomarkers were explored: (i) same mean for each biomarker: m 1 = (1.0, 1.0, . . . , 1.0) T ; (ii) different means: Regarding the variance-covariance matrix, for simplicity, the variance of each biomarker was set to 1, so that covariances equalled correlations. Scenarios were analysed with the same covariance matrix between populations (Σ 1 = Σ 2 ) and different (Σ 1 = Σ 2 ). Specifically, for the cases with the same covariance matrix, the following scenarios were explored: where I is the identity matrix and J a matrix of all ones, and (iii) negative correlation (ρ = −0.1) for all pairs of biomarkers. As for the scenarios with different covariance matrices between populations, the following scenarios were analysed: For each scenario, 1000 random samples from underlying distribution were considered with the following different samples sizes: (i) n 1 = n 2 = 30; (ii) n 1 = n 2 = 100; (iii) n 1 = n 2 = 500. Each method was applied to each of the simulated scenarios and the maximum Youden index was obtained.

Application to Real Data
The analysed methods were applied to two real data examples with the aim of diagnosing and predicting of Duchenne Muscular Dystrophy and Small for Gestational Age.
Duchenne muscular dystrophy (DMD) is a progressive recessive muscular disorder that is transmitted from mother to child. It is the most common muscular dystrophy diagnosed during childhood and early diagnosis is essential to limit its negative consequences [33]. Percy et al. [34] analysed the effectiveness of screening for female DMD carriers from four biomarkers of blood samples: serum creatine kinase (CK), haemopexin (H), pyruvate kinase (PK), and lactate dehydrogenase (LD). The available data that were analysed contain complete information on these four biomarkers of 67 women who are carriers of the progressive recessive disorder DMD and 127 women who are not carriers.
Small-for-gestational-age (SGA) infants-those with a birth weight below the 10th percentile-have been associated with increased risk of adverse perinatal outcomes [35,36]. Predicting them from the ultrasound percentile at the 35th gestational week as well as other fetal and parental variables may help to avoid these outcomes. The dataset analysed contains information of 4474 pregnancies whose births were assisted at the Miguel Servet University Hospital, between March 2012 and December 2016. The ability of the methods to predict SGA at birth was analysed from information of 7 biomarkers: ultrasound percentile at 35th gestational week, maternal age, parity, maternal body mass index (BMI), Free Beta Human Chorionic Gonatrodopin (FBHCG), Pregnacy-Associated Plasma Protein-A (PAPP-A) and paternal height.

Simulations
The results of each method for each simulated scenario are presented as the mean of the maximum Youden indices obtained in the 1000 random samples as well as the standard de-viation. For simplicity, we denote logistic regression, min-max approach, min-max-median approach and min-max-IQR approach by LR, MM, MMM and MMIQR, respectively.

Normal Distributions
Tables 1 and 2 present the results obtained for the simulated scenarios following multivariate normal distribution with different means and same means between biomarkers (in particular, p = 4 and p = 10), respectively.
Specifically, Table 1 analyses, for each sample size, biomarker dataset (p = 4 and p = 10) and for each validation method (with and without CV), the following simulated scenarios: (i) independence (Σ 1 = Σ 2 = I), (ii) medium correlation (Σ 1 = Σ 2 = 0.5 · I + 0.5 · J), (iii) different covariance matrices: Σ 1 = 0.3 · I + 0.7 · J (high correlation) and Σ 2 = 0.7 · I + 0.3 · J (low correlation) and (iv) negative correlation (ρ = −0.1).   Table 1 shows that in the normal scenarios of independent biomarkers of different means, the MMM and MMIQR approaches show comparable performance, slightly outperforming the former, and outperforming the MM approach especially in larger samples. This behaviour is more pronounced when the number of biomarkers is larger (p = 10). However, they fall short of LR, which is the best performer. This behaviour is also observed in the scenario of negative correlations, although in this case the difference in performance between the summary statistics-based approaches and LR is somewhat smaller. The results show that the resubstitution method is indeed more optimistic than the one derived from cross-validation, although the conclusions on the performance comparison are broadly similar.
As for the simulated scenarios considering four biomarkers with different covariance matrices between populations, the new approaches incorporating a new summary statistic do not reach the performance of the MM approach and could be considered as approaches with similar performances. However, they outperform the MM approach, albeit only slightly when the number of biomarkers is larger (p = 10) and sample sizes are larger. The results show that in the scenarios of biomarkers with medium positive correlations, the new approaches do not outperform the MM approach in any of the cases based on the cross-validation method. In all cases, the LR outperforms all other methods. Table 2 analyses the following simulated scenarios of biomarkers with the same means: (i) independence (Σ 1 = Σ 2 = I), (ii) medium correlation (Σ 1 = Σ 2 = 0.5 · I + 0.5 · J), (iii) different covariance matrices: Σ 1 = 0.3 · I + 0.7 · J (high correlation) and Σ 2 = 0.7 · I + 0.3 · J (low correlation) and (iv) different covariance matrices: Σ 1 = 0.5 · I + 0.5 · J (medium correlation) and Σ 2 = I (independents).   Table 2 shows, as Table 1, that in simulated normal independent biomarker scenarios the MMM and MMIQR approaches outperform the MM approach, although the LR is the best performer based on CV. However, in this case of scenarios of biomarkers with same means, the difference in performance between the MMM and MMIQR approaches and the LR is significantly lower.
The results in Table 2 show that in the simulated scenarios of biomarkers with positive medium correlation, summary statistics-based methods (MM, MMM, MMIQ) could be considered to perform comparably in larger sample sizes considering the cross-validation method. Although LR still dominates over the other methods, the difference is considerably smaller than considering biomarkers with different means (Table 1).
However, the results show that in the simulated scenarios with different covariance matrices between groups, the MM approach outperforms the LR approach. Furthermore, in the scenario where biomarkers within a group are independent, the proposed MMM and MMIQR approaches slightly exceed the MM approach for larger sample sizes. Table 3 shows the results obtained from the following simulated scenarios following a log-normal distribution: (i) biomarkers with different means and medium correlation (Σ 1 = Σ 2 = 0.5 · I + 0.5 · J), (ii) biomarkers with different means and different covariance matrices: Σ 1 = 0.3 · I + 0.7 · J (high correlation) and Σ 2 = 0.7 · I + 0.3 · J (low correlation), (iii) biomarkers with same means and medium correlation (Σ 1 = Σ 2 = 0.5 · I + 0.5 · J), (iv) biomarkers with same means and different covariance matrices: Σ 1 = 0.3 · I + 0.7 · J (high correlation) and Σ 2 = 0.7 · I + 0.3 · J (low correlation), and (v) biomarkers with same means and different covariance matrices: Σ 1 = 0.5 · I + 0.5 · J (medium correlation) and Σ 2 = I (independents). The results in Table 3 show that, under cross-validation, in the simulated scenarios of biomarkers with different means and medium correlation, the MMM and MMIQR approaches do not outperform the MM approach and the LR significantly dominates over the rest. In scenarios with different covariance matrices between groups with a higher number of biomarkers (p = 10), although LR remains the best performer, the MMM and MMIQR approaches outperform the MM approach, with the MMIQR approach being slightly superior.

Non-Normal Distributions
However, in the biomarker scenarios with the same means, considering medium correlation between biomarkers, the results show that for larger sample sizes the MMM and MMIQR approaches outperform the MM approach in this case, with the MMIQR approach being very slightly superior.
For the simulated scenarios of biomarkers with the same means and different covariance matrices between groups, the summary statistics-based approaches outperform logistic regression. For the scenario with independent biomarkers in one group, the MMIQR approach outperforms the others by applying the cross-validation method. Table 4 analyses the following simulated scenarios of p = 100 normal biomarkers with the same means for a smaller sample size (n 1 = n 2 = 30): (i) independence (Σ 1 = Σ 2 = I) and (ii) different covariance matrices: Σ 1 = 0.5 · I + 0.5 · J (medium correlation) and Σ 2 = I (independents).  Table 4 shows that in these scenarios where the size of the variables is larger than the sample size, the LR model is significantly overfitted. The results obtained based on CV show that statistics-based algorithms are superior to LR. This difference is most noticeable in the scenario with different covariance matrices. In this scenario, our approaches (MMM and MMIQR) fail to outperform the MM approach. However, in the scenario of independent biomarkers, our approaches outperform it. These results are in line with the conclusions drawn from the results in Section 3.1.1.

Computational Times
Our proposed approach uses the information provided by a set of predictive markers in three new markers derived from them. This is an automated calculus that it is not time-consuming. From them, we estimated the coefficients of the model by an extensive search using a step by step approach. We want to remark that computational times for this approach are reasonably independent of the sample size. In our simulations, the computational times of our proposed approaches ranged from 8 to 154 seconds using a sample size ranging from 60 to 20,000 individuals showing that our algorithm is computationally tractable in high dimensionality.  As the variances of the four biomarkers differ from each other, it was necessary to normalise the values of each biomarker before applying the methods based on summary statistics. This normalisation is important to avoid different units of measurement and to ensure the consistent use of these approaches.

Real Datasets
The analysed methods were applied to this dataset with the aim of finding the optimal combination of the four biomarkers that achieves the maximum Youden index. The estimates of the Youden index of each biomarker (CK, H, PK, LD) in a univariate way were 0.6124, 0.4172, 0.5079, and 0.5776. Table 5 shows the maximum Youden index achieved as well as their respective sensitivity and specificity values, using both the resubstitution and the cross-validation-based methods. The MMM approach outperforms the MM approach, although it does not reach the performance of the LR, which is the best performer, achieving a Youden index of 0.7948 after cross-validation. The results rounded to four decimal places are displayed. Figure 2 represents the information of each biomarker for each group (SGA and non-SGA group). Before applying the summary statistics-based methods, a normalisation was applied.  Table 6 shows the maximum Youden index achieved for all methods studied as well as their respective sensitivity and specificity values, using both the resubstitution and the cross-validation-based methods. The MMM and MMIQR approaches outperform the MM approach, although they do not match the performance of the LR, which is the best performer, achieving a Youden index of 0.5901 after cross-validation. In this example, the MMM and MMIQR approaches outperform the MM approach more markedly than the previous example, which includes a smaller number of biomarkers (p = 4). These results therefore show that for a larger number of biomarkers, the benefit of using the approaches, incorporating a new summary statistic (MMM, MMIQR vs. MM approach), was greater. These conclusions were also drawn from the simulated data. Specifically, this is a scenario with a predictor, the estimated percentile weight at the 35th week, with a greater discrimination ability than the rest of predictor variables. It seems that in this case, the MM approach cannot capture the added discrimination ability of the set of biomarkers; in this case, the improvement of MMM and MMIQ approaches is more clear. Although for none of the real examples analysed did the summary statistics-based approaches outperform LR, it is clear from this analysis that the proposed approaches outperform the MM approach, which has been shown in the simulation to have superior discrimination ability than the LR for specific scenarios of different covariance matrices between groups and the same univariate predictive ability of biomarkers.

Discussion and Conclusions
In this study, we propose new summary statistics-based approaches and compare their performances with the min-max approach and logistic regression under Youden index optimisation, which are computationally tractable approaches regardless of the problem dimension. The methods were applied to both simulated data and two real datasets.
The results of the comparison using cross-validation (Tables 1-3) suggest that a logistic regression should be used over the other summary statistics-based approaches when the number of biomarkers is smaller than the sample size, except in scenarios of biomarkers with the same predictive ability and different covariance matrices between groups (Table 2), where summary statistics-based approaches are recommended, which outperformed logistic regression. In fact, generally, differences in performance were smaller when considering scenarios with biomarkers with the same means ( Table 2). The study by Kang et al. [20], but under AUC optimisation, led to a similar conclusion regarding optimal scenarios for the min-max approach.
The results showed that the proposed new approaches outperform the min-max approach in general when independent biomarkers are included. This difference increases when the number of biomarkers is larger. These results may be expected as independent information as well as a larger number of biomarkers may contribute to the fact that it is insufficient to consider only the minimum and maximum values of biomarkers. Although no significant differences were found between the MMM and MMIQR approaches, the MMIQR approach slightly outperformed the MMM approach in non-normal distributions and the MMM approach overall slightly outperformed the others in all other scenarios. This could indicate that in asymmetric distributions, the incorporation of a dispersion measure could be a good option.
In scenarios where the number of biomarkers is higher than the sample size (Table 4), the results showed that the logistic regression overfitted the data. Our approaches outperformed the min-max approach when biomarkers were independent. These results are in line with previous results and demonstrate the advantage of our approaches which, in addition to always being computationally feasible, do not overfit the data due to their inherent feature of dimensionality reduction involving all biomarkers across the summary information.
The performance of the analysed methods achieved on the real datasets showed similar conclusions to those deduced from the simulation. Although in none of the real examples analysed did the summary statistics-based approaches outperform the logistic approach, in both cases the proposed methods outperformed the MM approach.
The application of dimensionality reduction techniques before applying a method could be an alternative to avoid overfitting. The selection of biomarkers with high predictive power might seem to be a solution for dimensionality reduction. However, it has been shown that the predictive power alone of each biomarker may not be the best approach for selection. Pinsky and Zhou [37] demonstrated that in fact statistical correlation patterns had a greater implication on the combination of higher performing biomarkers. Therefore, a high number of biomarkers with low predictive ability could be available, yet their combination achieves higher performance. Therefore, methods that use information from a large number of biomarkers, while being computationally tractable, are necessary.
Our proposed distribution-free approach incorporates information from all the original biomarkers through their summary statistics, while the problem remains computationally tractable (p = 3). Pepe and Thomson [15] proposed non-parametric approaches to optimise linear models under AUC optimisation but with reduced dimensionality. Ma and Huang [38] and Wang et al. [39] introduced semiparametric approaches that approximated the empirical AUC by a sigmoid function. Komori et al. [40] proposed a method based on a boosting algorithm for maximisation of the AUC. They used cross-validation techniques to select the best model. None of those methods is fully non-parametric, and their performance relies on a good selection of some parameters.
Liu et al. [21] proposed a non-parametric approach that linearly combines the minimum and maximum values of the biomarkers, involving only a single coefficient search. Therefore, the problem is computable as it is reduced to estimating a single parameter regardless of the number of biomarkers. In addition to being able to take advantage of the information from multiple biomarkers that are measured with the same unit, another advantage of this min-max procedure is that repeated measurements of a single biomarker can be combined as two new markers and can therefore also be an option for longitudinal studies. However, using the information from the minimum and maximum biomarker values may not be sufficient in terms of discrimination, especially when the number of biomarkers is high. The proposed algorithm incorporates more information from the original variables than the min-max approach (three variables). A stepwise algorithm is performed to estimate the linear combination. All this implies that our proposal always remains computationally tractable (p = 3) and has inherent dimensionality reduction properties.
With regard to the strengths of our study, we can list the following. These algorithms are fully non-parametric approaches and do not require distribution assumption for their application. They use information from all biomarkers, incorporate more information than the min-max approach and are also computationally tractable regardless of the number of biomarkers. In addition, we verified their superiority in some simulated scenarios, highlighting that they are good alternatives in predictive models.
However, our study has some limitations. In the comparison analysis we used a unique standard method, the logistic regression. We did not perform a comparison with other methodologies as kernel approaches or machine learning algorithms; therefore, we cannot propose our algorithm as the best method in any scenario.
For future work, a more extensive study comparing more methods could be carried out. Applying dimensionality reduction techniques (such as principal component analysis) before applying the methods could also be explored. Extending the comparison study for a more extensive simulation, e.g., taking into account scenarios with larger sample sizes, is also proposed as a line of future work. It is also proposed to analyse the new methods proposed in other real datasets. An example could be the one analysed by Liu et al. (autism disease), where conditions were optimal for the MM approach, and it could thus be explored as to whether in this case the proposed approaches outperform the MM approach. Readers are also encouraged to adapt our proposed approaches using a different linear combination estimation (a different stepwise algorithm) as well as to explore their performances considering other target metrics.
In summary, in this article, we present to the scientific community a new approach combining summary statistics of biomarkers (min-max-median, min-max-IQR approaches) that has the advantage of being always computable and not being subject to any distributional assumptions. This approach aims to discriminate between two groups (disease and non-disease) and can therefore be a key to and helpful in clinical decision making. The Youden index, which is a widely used metric in the absence of consensus, was considered as an objective function. Comparisons in various scenarios allowed us to discover the scenarios in which the approaches are most optimal. Specifically:

•
Our proposed approaches outperform classical logistic regression using original biomarkers when the number of biomarkers is larger than the sample size. • Additionally, they performed well in scenarios of biomarkers with the same predictive ability and different covariance matrices between groups. • Our approaches performed better than the min-max approach when biomarkers are independent and in the real scenarios.