A Quantile General Index Derived from the Maximum Entropy Principle

We propose a method for linearly separating multivariate quantitative data into two groups such that the average of each variable in the positive group is larger than that in the negative group. The coefficients of the separating hyperplane are restricted to be positive. The method is derived from the maximum entropy principle, and the resulting composite score is called the quantile general index. The method is applied to the problem of determining the top 10 countries in the world based on the 17 scores of the Sustainable Development Goals (SDGs).


Introduction
Consider a data matrix in which each row corresponds to a case and each column to a variable. Suppose that every variable is oriented so that a larger value is better. For example, ref. [1] investigated the efforts of countries to attain the Sustainable Development Goals (SDGs) and reported 17 SDG scores, each ranging from 0 to 100, for every country. The report ranked 163 countries by the average of the 17 scores. We call this ranking procedure the simple sum method.
However, the simple sum method sometimes exhibits a paradoxical phenomenon: the average of a particular variable in the higher-score group is smaller than that in the remaining group. See Table 1 for an illustration, where we separate the SDGs data into two groups: the 10 top countries under the simple sum method and the remaining 153 countries. The average values of each variable for the two groups are compared. On almost all the variables, the 10 top countries have larger averages than the remaining countries, as expected. However, the relation is reversed for SDGs 12 and 13: on these two goals, the 10 top countries have lower averages than the remaining countries.
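The comparison behind Table 1 can be sketched in a few lines of Python. This is a minimal illustration on made-up toy scores, not the SDG data; the function name `reversed_variables` and the toy matrix are hypothetical.

```python
def reversed_variables(scores, k):
    """Indices of variables whose top-k average is below the average of the rest.

    scores: list of rows (one per case), each a list of variable values.
    The top-k group is chosen by the simple sum (row total).
    """
    order = sorted(range(len(scores)), key=lambda t: sum(scores[t]), reverse=True)
    top, rest = order[:k], order[k:]
    d = len(scores[0])

    def mean(group, i):
        return sum(scores[t][i] for t in group) / len(group)

    return [i for i in range(d) if mean(top, i) < mean(rest, i)]


# Toy data (hypothetical): 4 cases, 3 variables scored from 0 to 100.
toy = [
    [90, 90, 10],
    [80, 85, 20],
    [40, 30, 60],
    [30, 20, 70],
]
print(reversed_variables(toy, k=2))
```

In this toy example the two cases with the largest totals score lowest on the third variable, so index 2 is reported as reversed.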
In this paper, we propose a linear weighting method that can avoid the reversal relation (in a random-decision sense). The higher-score group separated by the linear weight has average values greater than the remaining group with respect to all the variables. The idea behind the method is the objective general index (OGI; [2]), which is constructed to have a positive correlation with all the variables. The purpose of the OGI is the ranking and not the separation. The OGI is interpreted as a minimization problem of a free energy functional [3,4], which is the sum of the negative entropy and an internal energy functional. This interpretation also works in the current setting; see Section 2.
The problem of determining weights is unsupervised in the sense that no one knows the correct weights and classifications; this issue has been discussed repeatedly (e.g., [5,6]). There are many weighting methods for such purposes. Among them, principal component analysis (PCA) is widely used. PCA, however, does not always give positive weights, so some modifications are necessary. It is known that a nonnegative version of PCA is a nonconvex and NP-hard optimization problem [7]. Another approach is factor analysis, where a factor model refers to a set of multivariate distributions that have common latent factors (e.g., [8]). Although the factor analysis is [...]
The name of the quantile general index comes from the quantile regression developed by [11]. Indeed, the objective function we use is similar to that of the quantile regression; see the explicit form in Section 3. The essential difference is that our problem is unsupervised, whereas regression problems are supervised.
The general indices determine an ordering of the data. The problem of well ordering multivariate data was discussed by [12], where methods of ordering were classified into four categories: marginal ordering, reduced ordering, partial ordering, and conditional ordering. Our method is considered as marginal ordering on the weighted sum.
The paper is organized as follows. In Section 2, we define the quantile general index for continuous distributions and show that it is characterized by the maximum entropy principle. In Section 3, a finite-sample counterpart of the quantile general index is derived. In Section 4, a practical method that avoids the ambiguity of data lying on the separating hyperplane is proposed. We apply the method to the SDG data in Section 5, and we conclude in Section 6.

Quantile General Index for Continuous Distributions
The quantile general index for continuous probability distributions is defined first. The assumption of continuity avoids the difficulty caused by the non-smoothness of the objective function. The sample counterpart of the index is constructed in the subsequent section.
Suppose that we have a random vector $x = (x_1, \ldots, x_d)^\top$ following a probability distribution $P$ on $\mathbb{R}^d$, where $\top$ denotes the vector transpose. We assume that $P$ has a probability density function $p(x)$, so that $P(A) = \int_A p(x)\,dx$ for an event $A \subset \mathbb{R}^d$. For [...] We deal with a class of general indices of $x$,

$$g(x) = \sum_{i=1}^d w_i x_i - c = w^\top x - c, \tag{1}$$

where $w = (w_1, \ldots, w_d)^\top \in \mathbb{R}_+^d$ and $c \in \mathbb{R}$ are called the weight vector and the threshold, respectively. Here $\mathbb{R}_+$ denotes the set of positive numbers. The quantities $w$ and $c$ may depend on the underlying distribution $P$ but do not depend on $x$ itself.
For a given $g$ of the form (1), the half spaces separated by the hyperplane $g(x) = 0$ are denoted by

$$H_g^+ = \{x \in \mathbb{R}^d : g(x) > 0\}, \qquad H_g^- = \{x \in \mathbb{R}^d : g(x) < 0\}.$$

The quantile general index is defined as follows.

Definition 1.
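A general index $g(x) = \sum_i w_i x_i - c$ is called the quantile general index of $x$ if it satisfies the following two equations:

$$P(H_g^+) = \alpha \tag{2}$$

and

$$E[x_i \mid x \in H_g^+] - E[x_i \mid x \in H_g^-] = \frac{1}{w_i}, \quad i = 1, \ldots, d. \tag{3}$$

The weight $w$ is called the optimal weight.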
Let us call $H_g^+$ and $H_g^-$ the positive and negative group, respectively. Equation (2) means that the fraction of the positive group is $\alpha$. The threshold $c$ is the upper $\alpha$-quantile of the weighted sum $w^\top x$ because $P(w^\top x > c) = \alpha$ by (2). We call $\alpha$ the acceptance ratio. Equation (3) implies that the average of each variable $x_i$ on the positive group is greater than that on the negative group. Therefore, the reversal relation observed in Table 1 does not occur if we adopt the quantile general index.
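We now state the existence and uniqueness theorem of the quantile general index. For $0 < \alpha < 1$, we define the "check" loss function $\ell_\alpha : \mathbb{R} \to \mathbb{R}$ by

$$\ell_\alpha(u) = \frac{u_+}{\alpha} + \frac{u_-}{1-\alpha}, \tag{4}$$

where $u_+ = \max(u, 0)$ and $u_- = \max(-u, 0)$ are the positive and negative parts of $u$, respectively. See Figure 1 for the graph of $\ell_\alpha$. The function $\ell_\alpha$ is used in quantile regression [13]. The derivative of $\ell_\alpha(u)$ for $u \neq 0$ is

$$\ell_\alpha'(u) = \frac{I_{\{u>0\}}}{\alpha} - \frac{I_{\{u<0\}}}{1-\alpha},$$

where $I_{\{u>0\}}$ is 1 if $u > 0$ and 0 otherwise, and $I_{\{u<0\}}$ is defined similarly. The subgradient (e.g., [14]) at $u = 0$ can also be defined but is not used here. We define a convex function $F : \mathbb{R}_+^d \times \mathbb{R} \to \mathbb{R}$ by

$$F(w, c) = E\Bigl[\ell_\alpha\Bigl(\sum_{i=1}^d w_i x_i - c\Bigr)\Bigr] - \sum_{i=1}^d \log w_i. \tag{5}$$

The main theorem is stated as follows.
Theorem 1. [...] The optimal $w$ is unique, whereas $c$ may not be unique. Furthermore, the general index $g(x) = \sum_i w_i x_i - c$ based on the minimizer $(w, c)$ of $F$ satisfies the conditions (2) and (3) of the quantile general index.
Proof. The proof of existence and uniqueness is given in Appendix A. We prove that the stationary condition of $F$ is given by (2) and (3). The partial derivatives of $F$ with respect to $c$ and $w_i$ are

$$\frac{\partial F}{\partial c} = -\frac{P(H_g^+)}{\alpha} + \frac{P(H_g^-)}{1-\alpha}$$

and

$$\frac{\partial F}{\partial w_i} = \frac{E[x_i I_{\{x \in H_g^+\}}]}{\alpha} - \frac{E[x_i I_{\{x \in H_g^-\}}]}{1-\alpha} - \frac{1}{w_i}.$$

Note that $P(H_g^-) = 1 - P(H_g^+)$, since $P(g(x) = 0) = 0$ from the assumption that $x$ has a continuous distribution. Then, the equations $\partial F/\partial c = 0$ and $\partial F/\partial w_i = 0$ ($i = 1, \ldots, d$) are equivalent to (2) and (3).
Example 1. Let $x_1$ and $x_2$ be independent and identically distributed according to a continuous distribution. By the uniqueness of the optimal weight and symmetry, we have $w_1 = w_2$ ($= w$). We denote the upper $\alpha$-quantile of $x_1 + x_2$ by $y_\alpha$. Then, we have $c/w = y_\alpha$ from (2) and (3). For example, if $x_i$ has the standard normal distribution and $\alpha = 1/2$, then $c = 0$ and $w = \sqrt{\pi}/2$.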
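The normal case of Example 1 can be checked by simulation. The sketch below assumes that condition (3) states that the gap between the two group means of $x_i$ equals $1/w_i$, as the stationarity condition of $F$ suggests; under that assumption the gap for $x_1$ should be close to $1/w = 2/\sqrt{\pi}$. The function name, sample size, and seed are arbitrary choices.

```python
import math
import random


def mean_gap_example1(n=200_000, seed=0):
    """Estimate E[x1 | x1 + x2 > 0] - E[x1 | x1 + x2 < 0] for iid N(0, 1) inputs."""
    rng = random.Random(seed)
    pos_sum = neg_sum = 0.0
    pos_n = neg_n = 0
    for _ in range(n):
        x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        if x1 + x2 > 0:          # positive group for w1 = w2 and c = 0
            pos_sum += x1
            pos_n += 1
        else:                    # negative group
            neg_sum += x1
            neg_n += 1
    return pos_sum / pos_n - neg_sum / neg_n


gap = mean_gap_example1()
w_stated = math.sqrt(math.pi) / 2      # weight stated in Example 1
print(gap, 1.0 / w_stated)             # the two numbers should be close
```

The agreement is only up to Monte Carlo error, so the comparison should be read with a tolerance of a few hundredths.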
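The quantile general index is derived from the maximum entropy principle in line with [4]. The entropy of a density function $p$ is defined by

$$H(p) = -\int_{\mathbb{R}^d} p(x) \log p(x)\,dx.$$

Consider a class of transformations

$$T(x) = (w_1 x_1 - c_1, \ldots, w_d x_d - c_d)^\top, \qquad w_i > 0,\ c_i \in \mathbb{R}.$$

The push-forward density of $p$ by $T$ is defined by

$$(T_\sharp p)(y) = p(T^{-1}(y)) \prod_{i=1}^d w_i^{-1}.$$

This is the distribution of $T(x)$ when the random variable $x$ follows the distribution $P$. It is shown that the entropy of the push-forward density is

$$H(T_\sharp p) = H(p) + \sum_{i=1}^d \log w_i.$$

We also define an internal energy by

$$U(q) = \int_{\mathbb{R}^d} \ell_\alpha\Bigl(\sum_{i=1}^d y_i\Bigr) q(y)\,dy,$$

where $\ell_\alpha$ is the check loss function in (4). The following theorem characterizes the quantile general index in terms of entropy. The proof is straightforward.
Theorem 2. The minimization problem of (5) is equivalent to

$$\mathop{\mathrm{minimize}}_{T} \quad U(T_\sharp p) - H(T_\sharp p).$$

The threshold $c$ in (5) is given by $c = \sum_{i=1}^d c_i$.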

Quantile General Index for Finite Samples
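The quantile general index defined in the preceding section is valid only for continuous distributions. It is useful to define the index also for finite samples. Let $x^{(1)}, \ldots, x^{(n)} \in \mathbb{R}^d$ be a sample of size $n$. We denote the $i$-th coordinate of $x^{(t)}$ by $x_{ti}$. We deal with a class of general indices

$$g_t = \sum_{i=1}^d w_i x_{ti} - c, \quad t = 1, \ldots, n.$$

The empirical counterpart of the objective function (5) is

$$F(w, c) = \frac{1}{n} \sum_{t=1}^n \ell_\alpha\Bigl(\sum_{i=1}^d w_i x_{ti} - c\Bigr) - \sum_{i=1}^d \log w_i \tag{6}$$

for $(w, c) \in \mathbb{R}_+^d \times \mathbb{R}$.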

Remark 1.
As described in Section 1, the objective function (6) is similar to that of the quantile regression defined by

$$\frac{1}{n} \sum_{t=1}^n \ell_\alpha\Bigl(y_t - \sum_{i=1}^d w_i x_{ti}\Bigr),$$

where $y_t$ is a response variable and $w_1, \ldots, w_d$ are regression coefficients. See [13] for a comprehensive study of the quantile regression.
The following theorem is proved in a similar way to Theorem 1. See Appendix A.
Theorem 3. Suppose that there is no hyperplane of $\mathbb{R}^d$ that contains all $x^{(t)}$. Then, the objective function $F$ in (6) admits a minimizer $(w, c)$. The weight vector $w$ is unique. The threshold $c$ is unique if $n\alpha$ is not an integer.
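The statement about the threshold can be illustrated on a toy example. The sketch below assumes the check loss has the form $u_+/\alpha + u_-/(1-\alpha)$ and treats the numbers 1 to 10 as hypothetical weighted sums $w^\top x^{(t)}$ for a fixed $w$. It searches for the best $c$ among the sample points only, which relies on the observation in Appendix A that the optimal $c$ for a fixed $w$ is one of the weighted sums when $n\alpha$ is not an integer.

```python
def check_loss(u, alpha):
    """Check loss u_+/alpha + u_-/(1 - alpha) (assumed form)."""
    return u / alpha if u > 0 else -u / (1.0 - alpha)


def best_threshold(scores, alpha):
    """Return the sample point c that minimizes the mean check loss of (score - c)."""
    def mean_loss(c):
        return sum(check_loss(s - c, alpha) for s in scores) / len(scores)
    return min(sorted(scores), key=mean_loss)


scores = [float(k) for k in range(1, 11)]   # toy weighted sums, n = 10
c = best_threshold(scores, alpha=0.25)      # n * alpha = 2.5 is not an integer
above = sum(s > c for s in scores)          # size of the positive group
print(c, above)
```

Here the minimizer is the single point $c = 8$, with two scores strictly above it and one score exactly on the threshold, which is the kind of tie that Section 4 resolves.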
Each case $x^{(t)}$ is classified into the positive or the negative group according to $g_t > 0$ or $g_t < 0$, respectively. If no case has $g_t = 0$, then the fraction of the positive (resp. negative) group is $\alpha$ (resp. $1 - \alpha$), and the conditional expectation of $x_{ti}$ on the positive group is greater than that on the negative group. This is the desired dominance relation.
However, it is not always possible to classify the data into positive and negative groups, because g t may become 0 in some cases. Furthermore, the minimization of F(w, c) is not straightforward, since the function is not differentiable. In order to avoid these issues, we modify the method in Section 4.
For illustration, we calculate the quantile general index for the following examples.

Example 2.
Consider the bivariate data [...] of sample size 4. Let the acceptance ratio be $\alpha = 1/2$. In this data, no three points lie on a straight line. Therefore, the quantile general index exists by Theorem 3. We show that the solution is $w_1 = 2/3$, $w_2 = 4/3$, and $c = 8/3$. We consider three disjoint subsets of $\mathbb{R}_+^2$: [...] Let $w \in A$. Then, we have $w^\top x^{(1)} > w^\top x^{(2)} > w^\top x^{(3)} > w^\top x^{(4)}$. Hence, the optimal $c$ is between $w^\top x^{(2)}$ and $w^\top x^{(3)}$, since $c$ is the upper $1/2$-quantile of $\{w^\top x^{(t)}\}$. For such $c$, the objective function (6) becomes [...] If $F$ is minimized at some $w \in A$, then it must be $w_1 = 1/2$ and $w_2 = 2$ by the stationary condition, but this point does not belong to $A$. Hence, the optimal point does not exist in $A$. If $w \in B$, then we have $w^\top x^{(1)} > w^\top x^{(3)} > w^\top x^{(2)} > w^\top x^{(4)}$ and the objective function is [...] where $w^\top x^{(2)} \le c \le w^\top x^{(3)}$. It is shown again that the optimal point does not exist in $B$. Therefore, the optimal point should be located in $C$, the boundary of $A$ and $B$. The objective function is [...] where $c = w^\top x^{(2)} = w^\top x^{(3)} = 4w_1$. The optimal solution is $w_1 = 2/3$, $w_2 = 4/3$, and $c = 8/3$. The quantile general index is given by $g_t = (2/3)x_{t1} + (4/3)x_{t2} - 8/3$. The index does not provide a separation of the data because $g_2 = g_3 = 0$. In this case, however, the group $\{x^{(1)}, x^{(2)}\}$ dominates $\{x^{(3)}, x^{(4)}\}$ in the sense that the difference of averages is a positive vector.
If we set the acceptance ratio to $\alpha = 1/4$, then it is proved in a similar way that the optimal $w$ is $w_1 = 3/4$ and $w_2 = 1$. In this case, $c$ is not unique: [...] Therefore, $g_1 > 0$ and $g_2, g_3, g_4 < 0$ as long as $5/2 < c < 7/2$. The separation provides a dominance relation: [...]
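The optima quoted in Example 2 can be sanity-checked numerically. The four data points below are an assumption: they are one data set that reproduces every number quoted in the example and are not necessarily the author's data. The sketch also assumes the check loss $u_+/\alpha + u_-/(1-\alpha)$ and an objective equal to the mean loss of $g_t$ minus $\sum_i \log w_i$. Because such an objective is convex, a point that is not beaten by any nearby point is consistent with being a global minimizer, although a grid comparison is not a proof.

```python
import math

# Hypothetical data, chosen to be consistent with the numbers quoted in Example 2.
DATA = [(2.0, 2.0), (2.0, 1.0), (0.0, 2.0), (0.0, 0.0)]


def check_loss(u, alpha):
    """Check loss u_+/alpha + u_-/(1 - alpha) (assumed form)."""
    return u / alpha if u > 0 else -u / (1.0 - alpha)


def objective(w, c, data, alpha):
    """Mean check loss of g_t = w.x_t - c minus the sum of log-weights."""
    loss = sum(check_loss(sum(wi * xi for wi, xi in zip(w, x)) - c, alpha)
               for x in data) / len(data)
    return loss - sum(math.log(wi) for wi in w)


def is_local_min(w, c, data, alpha, h=0.05, steps=4):
    """Compare F at (w, c) with F on a small grid of perturbed points."""
    f0 = objective(w, c, data, alpha)
    offsets = [h * k / steps for k in range(-steps, steps + 1)]
    return all(objective((w[0] + a, w[1] + b), c + d, data, alpha) >= f0 - 1e-12
               for a in offsets for b in offsets for d in offsets)


print(is_local_min((2 / 3, 4 / 3), 8 / 3, DATA, 0.5))   # alpha = 1/2
print(is_local_min((3 / 4, 1.0), 3.0, DATA, 0.25))      # alpha = 1/4, c inside (5/2, 7/2)
```

With these assumed data, the second case also shows the dominance relation directly: $x^{(1)}$ minus the average of the other three points is $(4/3, 1)$, the vector of reciprocals of $w = (3/4, 1)$.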

Example 3. Consider the bivariate data [...] of sample size 4. Let $\alpha = 1/2$. In a similar manner to the preceding example, the optimal parameters are shown to be $w = (1, 1)^\top$ and $c = 4$. The quantile general index is $g_t = x_{t1} + x_{t2} - 4$. In this case, no separation of the sample into two groups provides a dominance relation. Indeed, all the possible combinations are [...] which are not positive.

Practical Implementation
The quantile general index defined in the preceding section has the following two drawbacks.

• The minimization is not straightforward since $F$ is not differentiable.
• The cases with $g_t = 0$ are not assigned to positive or negative groups.
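To overcome these issues, we approximate $F$ as

$$F_\varepsilon(w, c) = \frac{1}{n} \sum_{t=1}^n \ell_{\alpha,\varepsilon}(g_t) - \sum_{i=1}^d \log w_i, \qquad g_t = \sum_{i=1}^d w_i x_{ti} - c, \tag{8}$$

where $\varepsilon$ is a positive constant, and the function $\ell_{\alpha,\varepsilon} : \mathbb{R} \to \mathbb{R}$ is defined by

$$\ell_{\alpha,\varepsilon}(u) = \min_{v \in \mathbb{R}} \Bigl\{ \ell_\alpha(v) + \frac{(u - v)^2}{2\varepsilon} \Bigr\}.$$

The function is called the Moreau envelope of $\ell_\alpha$. See Figure 2 for the graph of $\ell_{\alpha,\varepsilon}$. It is shown that $\ell_{\alpha,\varepsilon}$ uniformly converges to $\ell_\alpha$ as $\varepsilon \to 0$.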
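The derivative of $\ell_{\alpha,\varepsilon}$ is $\ell_{\alpha,\varepsilon}'(u) = \max\{-1/(1-\alpha), \min\{1/\alpha, u/\varepsilon\}\}$. For $g_t = \sum_i w_i x_{ti} - c$, we define

$$J_t = \min\Bigl\{1, \max\Bigl\{0, \alpha + \frac{\alpha(1-\alpha)}{\varepsilon}\, g_t\Bigr\}\Bigr\}, \tag{9}$$

so that $\ell_{\alpha,\varepsilon}'(g_t) = J_t/\alpha - (1 - J_t)/(1-\alpha)$. The partial derivatives of $F_\varepsilon$ are then

$$\frac{\partial F_\varepsilon}{\partial c} = -\frac{1}{n}\sum_{t=1}^n \Bigl(\frac{J_t}{\alpha} - \frac{1 - J_t}{1-\alpha}\Bigr), \qquad \frac{\partial F_\varepsilon}{\partial w_i} = \frac{1}{n}\sum_{t=1}^n x_{ti}\Bigl(\frac{J_t}{\alpha} - \frac{1 - J_t}{1-\alpha}\Bigr) - \frac{1}{w_i}.$$

These formulas prove the second part of the following theorem. See Appendix A for the proof of the first part.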

Theorem 4.
Suppose that there is no hyperplane of $\mathbb{R}^d$ that contains all $x^{(t)}$. Then, the objective function $F_\varepsilon$ in (8) admits a minimizer $(w, c)$, and the optimal weight vector $w$ is unique. Furthermore, the optimal $(w, c)$ and $J_t \in [0, 1]$ defined in (9) satisfy

$$\frac{1}{n}\sum_{t=1}^n J_t = \alpha \tag{10}$$

and

$$\frac{\sum_{t} J_t x_{ti}}{\sum_{t} J_t} - \frac{\sum_{t} (1 - J_t) x_{ti}}{\sum_{t} (1 - J_t)} = \frac{1}{w_i}, \quad i = 1, \ldots, d. \tag{11}$$

Equations (10) and (11) correspond to (2) and (3) for continuous distributions. The quantity $J_t$ is interpreted as the probability of assigning the case $x^{(t)}$ to the positive group. We call $J_t$ the optimal random decision. If the general index $g_t$ is greater than the threshold $\varepsilon/\alpha$, the case $t$ is definitely assigned to the positive group because $J_t = 1$. Similarly, if the general index is less than $-\varepsilon/(1 - \alpha)$, it is definitely assigned to the negative group.
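The smoothed loss and the random decision can be written in closed form. The sketch below assumes the check loss $u_+/\alpha + u_-/(1-\alpha)$ and the Moreau envelope $\min_v \{\ell_\alpha(v) + (u-v)^2/(2\varepsilon)\}$; these choices reproduce the cut-offs $\varepsilon/\alpha$ and $-\varepsilon/(1-\alpha)$ mentioned in the text. The expression used for $J_t$ is likewise an assumption, picked so that the slope of the smoothed loss equals $J_t/\alpha - (1-J_t)/(1-\alpha)$. All function names are hypothetical.

```python
def check_loss(u, alpha):
    """Check loss u_+/alpha + u_-/(1 - alpha) (assumed form)."""
    return u / alpha if u > 0 else -u / (1.0 - alpha)


def smoothed_loss(u, alpha, eps):
    """Closed form of min_v { check_loss(v) + (u - v)^2 / (2 eps) }."""
    if u > eps / alpha:                       # linear part on the right
        return u / alpha - eps / (2.0 * alpha ** 2)
    if u < -eps / (1.0 - alpha):              # linear part on the left
        return -u / (1.0 - alpha) - eps / (2.0 * (1.0 - alpha) ** 2)
    return u * u / (2.0 * eps)                # quadratic part around zero


def smoothed_slope(u, alpha, eps):
    """Derivative of the envelope: u/eps clipped to [-1/(1-alpha), 1/alpha]."""
    return max(-1.0 / (1.0 - alpha), min(1.0 / alpha, u / eps))


def random_decision(g, alpha, eps):
    """J in [0, 1] such that the slope equals J/alpha - (1 - J)/(1 - alpha)."""
    return alpha + alpha * (1.0 - alpha) * smoothed_slope(g, alpha, eps)


alpha, eps = 0.25, 0.1
for g in (-1.0, 0.0, 0.2, 1.0):
    print(g, smoothed_loss(g, alpha, eps), random_decision(g, alpha, eps))
```

Under these assumptions the gap between the check loss and its envelope never exceeds $\varepsilon / (2 \min(\alpha, 1-\alpha)^2)$, which is one way to see the uniform convergence as $\varepsilon \to 0$.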
For numerical computation, we used a general-purpose optimization solver optim in R [15] with the L-BFGS method.
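For readers without R, the minimization of the smoothed objective can be sketched in plain Python. This is not the `optim` call used for the reported results: it swaps in gradient descent with a backtracking line search, and it assumes the closed-form smoothed loss and random decision derived from the check loss $u_+/\alpha + u_-/(1-\alpha)$. The function names and the starting point are hypothetical choices.

```python
import math


def slope(u, alpha, eps):
    """Derivative of the smoothed loss: u/eps clipped to [-1/(1-alpha), 1/alpha]."""
    return max(-1.0 / (1.0 - alpha), min(1.0 / alpha, u / eps))


def smoothed(u, alpha, eps):
    """Moreau envelope of the check loss in closed form (assumed form)."""
    if u > eps / alpha:
        return u / alpha - eps / (2.0 * alpha ** 2)
    if u < -eps / (1.0 - alpha):
        return -u / (1.0 - alpha) - eps / (2.0 * (1.0 - alpha) ** 2)
    return u * u / (2.0 * eps)


def index(w, c, x):
    return sum(wi * xi for wi, xi in zip(w, x)) - c


def f_eps(w, c, data, alpha, eps):
    if min(w) <= 0.0:
        return math.inf                       # outside the domain w > 0
    loss = sum(smoothed(index(w, c, x), alpha, eps) for x in data) / len(data)
    return loss - sum(math.log(wi) for wi in w)


def gradient(w, c, data, alpha, eps):
    s = [slope(index(w, c, x), alpha, eps) for x in data]
    n = len(data)
    gw = [sum(s[t] * data[t][i] for t in range(n)) / n - 1.0 / w[i]
          for i in range(len(w))]
    return gw, -sum(s) / n


def fit(data, alpha, eps, iters=20000):
    """Gradient descent with Armijo backtracking on (w, c)."""
    w = [1.0] * len(data[0])
    c = sum(sum(x) for x in data) / len(data)
    f, step = f_eps(w, c, data, alpha, eps), 1.0
    for _ in range(iters):
        gw, gc = gradient(w, c, data, alpha, eps)
        sq = sum(g * g for g in gw) + gc * gc
        if sq < 1e-18:
            break
        while True:
            w_new = [wi - step * gi for wi, gi in zip(w, gw)]
            c_new = c - step * gc
            f_new = f_eps(w_new, c_new, data, alpha, eps)
            if f_new <= f - 1e-4 * step * sq:
                break
            step *= 0.5
            if step < 1e-20:                  # no further progress possible
                return w, c
        w, c, f = w_new, c_new, f_new
        step *= 2.0
    return w, c


def random_decisions(w, c, data, alpha, eps):
    """J_t = alpha + alpha (1 - alpha) * slope(g_t), a value in [0, 1]."""
    return [alpha + alpha * (1.0 - alpha) * slope(index(w, c, x), alpha, eps)
            for x in data]
```

A possible usage is `w, c = fit(data, alpha=0.5, eps=0.1)` followed by `random_decisions(w, c, data, 0.5, 0.1)`. At a stationary point the average of the returned values should be close to $\alpha$, in line with (10). For a small tolerance such as $\varepsilon = 0.001$, plain gradient descent can be slow, which is one reason to prefer a quasi-Newton solver in practice.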

Application to the SDGs Index
We finally compute the quantile general indices of the SDGs data provided by [1], as introduced in Section 1. According to [1], countries with a fraction of missing values greater than 20% were removed from the data, and then the missing values were imputed by regional averages. We applied the quantile general index with the acceptance ratio $\alpha = 10/163$ and tolerance $\varepsilon = 0.001$. The result is summarized in Table 2. The optimal weight $w$ is shown in the second column of the table. The threshold was $c = 178.2$. The other columns of Table 2 show the average of each variable in the 10 top countries and the remaining countries, respectively. In contrast to Table 1, we do not observe the reversal relation. Table 3 shows the general index $g_t$ and the optimal random decision $J_t$ of the 10 top countries.
We must be careful in interpreting the result. In particular, the optimal weights varied widely: the ratio of the largest weight (SDG 12) to the smallest weight (SDG 1) was about 0.49/0.049 = 10.0, which means that SDG 1 had only 10% of the impact of SDG 12 under the quantile general index. This may discourage people or governments from contributing to SDG 1. Our main message in this paper is that there were reversal relations in SDGs 12 and 13 under the simple sum method, as observed in Table 1, and that such a phenomenon can be avoided by the proposed method. Further discussion is needed on the use of the quantile general index.

Table 2. For the SDGs data, the optimal weight $w_i$, the average $\bar{x}_i^+$ of each score in the 10 top countries determined from the quantile general index (Cuba, Romania, Finland, Kyrgyz Republic, Ukraine, Chile, Poland, Georgia, Vietnam, Hungary), the average $\bar{x}_i^-$ on the remaining 153 countries, and the scaled differences [...]
SDGs | Weights | Average of the 10 Top Countries | Average of the Remaining Countries | Scaled Difference
As a reviewer suggested, we also computed the Hirsch index [9] (or h-index) of the countries based on the original SDG scores. In the current setting, the h-index is defined as the fixed point of the graph $\{(i, s_i)\}_{i=1}^{17}$, where the $s_i$ are the 17 SDG scores in descending order (normalized into the range [0, 17]). The 10 top countries based on the h-index are shown in Table 4. The top three were not changed from the original SDG ranking. We also observed the reversal relations in SDGs 12 and 13 when we adopted the h-index for separation. See [10] for a study of the scaling behavior of the h-index.
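A minimal sketch of such an h-index computation is given below. It uses the usual integer version of the Hirsch index (the largest $h$ such that the $h$-th largest normalized score is at least $h$), which is an assumption: the fixed-point definition in the text may interpolate between integers. The function name and the example scores are hypothetical.

```python
def h_index(scores, scale_max=100.0):
    """Largest h such that the h-th largest normalized score is at least h.

    Scores in [0, scale_max] are first rescaled to [0, len(scores)],
    mirroring the normalization into [0, 17] for 17 SDG scores.
    """
    d = len(scores)
    s = sorted((v * d / scale_max for v in scores), reverse=True)
    # s is descending while the rank i is ascending, so s[i-1] >= i holds on a prefix.
    return sum(1 for i, v in enumerate(s, start=1) if v >= i)


print(h_index([80.0] * 17))   # every normalized score equals 13.6
```

For a country scoring 80 on all 17 goals, each normalized score is 13.6, so the index is 13.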

Discussion
We proposed a quantile general index that avoids reversal relations in the separated groups. The weight was defined by the solution of the convex optimization problem (6) or (7) for given data. In Section 5, we applied the proposed method to the SDG data and obtained the 10 top countries based on it. The result actually satisfies the desired properties (10) and (11). A side effect is that the obtained weights sometimes had large variation, which may be controversial.
Various applications of our method are expected. For example, one could construct a regional competitive index (e.g., [16]) based on the quantile general index if it is necessary to select a given number of top regions. The method is also applicable to admission decisions based on entrance examinations in schools or companies, where a fixed fraction of candidates are supposed to pass. Further case studies are needed to support the validity of our approach.
The quantile general index (without approximation) introduced in Section 3 was reduced to a minimization problem of a nondifferentiable objective function. It is theoretically of interest to develop an exact algorithm and also to estimate the accuracy of the practical method developed in Section 4. Another problem is to find an algorithm that decides whether the data can be separated into two groups without reversal relations. In Example 3, we enumerated all possible combinations to prove that the data were not separable. However, this enumeration requires a large amount of computational time when the sample size is large. Faster algorithms would be welcome. Finally, the relation between the quantile general index and the h-index is also completely unknown.

Data Availability Statement:
The SDGs data used in Section 1 and Section 5 is provided in [1].

Acknowledgments:
The author thanks Kentaro Minami for the insightful comments on numerical optimization, such as the concept of Moreau's envelope. He also thanks the associate editor and the two reviewers for their constructive comments.

Conflicts of Interest:
The author declares no conflict of interest.

Appendix A. Proofs
We give a key lemma for the proof of Theorems 1, 3, and 4. In general, we define for any real u and v.
We consider the following condition on the nondegeneracy of the distribution $P$ of $x$. We denote the set of nonnegative numbers by $\mathbb{R}_{\ge 0}$.
(C1) $P(\sum_i w_i x_i = c) < 1$ for any $(w, c) \in \mathbb{R}_{\ge 0}^d \times \mathbb{R}$ with $w \neq 0$.
This condition holds if $P$ is absolutely continuous with respect to the Lebesgue measure on $\mathbb{R}^d$, as assumed in Section 2.
Theorem 1 is immediate from the following lemma.
Lemma A1. Suppose that $E[\ell(w_i x_i)] < \infty$ for all $i = 1, \ldots, d$ and $w_i > 0$. If the condition (C1) is satisfied, then the function $F$ in (A1) admits a minimizer, and the optimal $w$ is unique. Conversely, if (C1) does not hold, then $F$ is not bounded from below.
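To prove the existence, we show that the sublevel set $\{(w, c) \in \mathbb{R}_+^d \times \mathbb{R} : F(w, c) \le a\}$ is compact for each $a \in \mathbb{R}$. We define a function $R : \mathbb{R}_{\ge 0}^d \times \mathbb{R} \to \mathbb{R}_{\ge 0}$ by

$$R(w, c) = E\Bigl[\ell\Bigl(\sum_{i=1}^d w_i x_i - c\Bigr)\Bigr].$$

Then, $R$ is continuous and strictly positive unless $(w, c) = (0, 0)$. Indeed, the continuity of $R$ is a consequence of Lebesgue's dominated convergence theorem, and the strict positivity follows from the condition (C1). Let $\gamma := \inf_{\sum_i w_i + |c| = 1} R(w, c) > 0$.
Since $R$ is convex and $R(0, 0) = 0$, we have $R(w, c) \ge \gamma \bigl(\sum_i w_i + |c|\bigr)$ whenever $\sum_i w_i + |c| \ge 1$. For any $(w, c) \in \mathbb{R}_+^d \times \mathbb{R}$, we have

$$F(w, c) \ge \sum_{i=1}^d \bigl(-\log w_i + \gamma w_i\bigr) + \gamma |c| - \gamma.$$

Since the functions $w_i \mapsto (-\log w_i) + \gamma w_i$ and $c \mapsto |c|$ have compact sublevel sets, the sublevel set of $F$ is also compact.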
In order to prove Theorems 3 and 4, it is enough to replace the distribution $P$ by the empirical distribution $P_n = n^{-1} \sum_{t=1}^n \delta_{x^{(t)}}$, where $\delta_a$ is the Dirac measure at a point $a \in \mathbb{R}^d$. In Theorem 3, the uniqueness of $c$ when $n\alpha$ is not an integer follows from the observation that the optimal $c$ for a fixed $w$ must be $w^\top x^{(t)}$ for some $t$.