Abstract
We introduce a novel entropy minimization approach for the solution of constrained linear regression problems. Rather than minimizing the quadratic error, our method minimizes the Fermi–Dirac entropy, with the problem data incorporated as constraints. In addition to providing a solution to the linear regression problem, this approach also estimates the measurement error. The only prior assumption made about the errors is analogous to the assumption made about the unknown regression coefficients: specifically, the size of the interval within which they are expected to lie. We compare the results of our approach with those obtained using the disciplined convex optimization methodology. Furthermore, we address consistency issues and present examples to illustrate the effectiveness of our method.
Keywords: constrained linear regression; Fermi–Dirac entropy; convex optimization; ill-posed inverse problems
MSC: 62J99; 15A29; 62B99
1. Introduction
In the statistical analysis of input–response systems, many prediction or modeling problems involve establishing a functional relationship between input data and output (response) measurements. The input data can be viewed as the control variables of the experiments or as appropriately chosen functions derived from the control data. These inputs are organized into a design matrix, in which each row contains the components of one (transposed) input vector. In this context, the responses to the inputs are treated as real numbers. The linear regression problem then consists of the following.
Find a coefficient vector and an intercept scalar such that the design matrix applied to the coefficient vector, plus the intercept times the vector of ones, reproduces the observed responses, with each coefficient and the intercept constrained to lie within a prescribed interval.
Here, we temporarily use the standard notations of the statistical literature; below, we switch to generic algebraic notations to solve the inverse problem. In this problem, the coefficient vector explains how the different inputs are coupled to produce the outputs. Thus, the data of the problem consist of the observed outputs, collected in the response vector, and the design matrix. The vector of ones is the N-vector with all components equal to 1 and, as noted, the intercept is a scalar.
The intuitive meaning of the constraint upon the j-th coefficient amounts to prior knowledge of a range for the sensitivity of the response to a change in the j-th input (or control) variable, that is, in the j-th column of the design matrix. The specification of a range for the intercept of the regression is analogous. These values are either obtained from an underlying model or are to be inferred from the data, as shown in the first example in Section 3.
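For concreteness, the problem can be written in generic symbols; the names used below (design matrix $\mathbb{X}$, response $\boldsymbol{y}$, coefficients $\boldsymbol{\beta}$, intercept $a$, and the bound labels) are our illustrative choices rather than fixed notation.

```latex
% Problem (1) in schematic form: box-constrained linear regression
\text{find } \boldsymbol{\beta}\in\mathbb{R}^{K},\ a\in\mathbb{R} \ \text{ such that }\
\boldsymbol{y} = \mathbb{X}\boldsymbol{\beta} + a\,\mathbf{1},
\qquad \beta_j \in [\beta_j^{-},\beta_j^{+}],\ j=1,\dots,K,
\qquad a \in [a^{-},a^{+}].
```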
Let us begin by rewriting the problem as follows. We think of vectors as column vectors; for any vector or matrix, a superscript will denote its transpose; the usual Euclidean scalar product of two vectors, and the Euclidean norm derived from it, are used throughout; and square brackets will denote the concatenation of the indicated objects, so that, for example, a vector can be appended as an extra column to a matrix. Let us introduce the augmented matrix obtained by appending the vector of ones as an extra column to the design matrix, together with the augmented unknown obtained by appending the intercept to the coefficient vector.
Note that the augmented matrix is now a given N × (K + 1) matrix. Then, problem (1) can be restated as the problem of solving the linear system in which the augmented matrix acts on the augmented unknown to produce the response vector, with the unknown constrained to the box determined by the coefficient and intercept ranges.
This is a standard constrained inverse problem, and there are two standard methods to solve it. Since the problem may have infinitely many solutions (the kernel of the matrix may be nontrivial), one way to choose among them is to solve the following variational problem: minimize the norm of the unknown over the constraint set, subject to the linear system.
Here, the norm is the standard Euclidean norm. A related approach, motivated by the fact that the observed output may not lie in the range of the matrix, consists of minimizing, over the constraint set, a distance between the image of the unknown and the observed output.
Here, the objective involves a distance function on the data space. Usually, it is the distance derived from the standard Euclidean norm; however, we stress that arbitrary norms may be used. For such approaches in inverse problems and applications, see [1,2], and, for statistics and econometrics, see [3,4].
The consistency condition for problems (3) and (4) is that the data vector lie in (the interior of) the image of the constraint set, which may not be the case due to measurement errors. This leads to the statement of the problem as (5). Instead, we propose an extension of (1), or equivalently of (3), which has interesting interpretations: the extended version augments the linear system with an unknown error vector, also constrained to a box, so that the data equal the image of the unknown plus the error.
Two interpretations can be drawn. From the perspective of mathematical programming, the error vector can be viewed as a slack variable that absorbs the discrepancy between the system’s response and the observed signal. Put differently, the observed signal now lies within the range of the extended operator, which acts on the correspondingly augmented domain.
From a statistical standpoint, Equation (6) can be interpreted as simultaneously determining both the regression coefficients and the measurement noise. This is particularly relevant because it can help to uncover systematic errors in the measurement process, an important consideration when data are scarce or expensive to obtain. This approach is worth examining alongside the Bayesian framework for linear regression, as described in [5,6], which include numerous applications to classification and machine learning. However, unlike the standard Bayesian approach, which is tied to the methodology in (5) and relies on a parametric (Gaussian) noise model, our proposal is model-free and non-parametric. Nevertheless, the noise vector estimated through our method can serve as a starting point for constructing a model of the noise present in the measurements.
To simplify the notations for the mathematical procedure, consider the extended matrix obtained by appending the N-dimensional identity matrix as extra columns to the augmented matrix, let the unknown of the problem be the concatenation of the regression coefficients, the intercept, and the error vector, and denote the resulting box of constraints by a single set. With this, we restate (6) as the problem of solving the extended linear system for an unknown lying in the constraint set.
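In the illustrative symbols introduced above (again our own labels: augmented matrix $\boldsymbol{A}=[\mathbb{X},\mathbf{1}]$, identity $\boldsymbol{I}_N$, error vector $\boldsymbol{e}$, constraint box $\mathcal{K}$), the extended problem reads as follows.

```latex
% Extended problem (7) in schematic form
\boldsymbol{G} = [\,\boldsymbol{A},\ \boldsymbol{I}_N\,] \in \mathbb{R}^{N\times(K+1+N)}, \qquad
\boldsymbol{\xi} = (\boldsymbol{\beta}^{\top}, a, \boldsymbol{e}^{\top})^{\top}, \qquad
\mathcal{K} = \prod_{j} [\,\xi_j^{-},\,\xi_j^{+}\,]:
\quad \text{find } \boldsymbol{\xi}\in\mathcal{K} \ \text{ such that } \ \boldsymbol{G}\boldsymbol{\xi} = \boldsymbol{y}.
```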
Observe that the dimension of the unknown vector is the number of regression parameters plus N, whereas that of the data vector is only N, so we have a truly ill-posed problem to solve. Two standard approaches to solving this class of problems were already outlined in (4) and (5). From the abstract point of view, they consist of minimizing a convex function subject to linear and domain constraints, and standard quadratic optimization methods, like disciplined convex optimization or CVX [7], can be used.
Our proposal is similar in essence but different in detail. We propose to minimize a convex function (the Fermi–Dirac entropy) defined on the constraint set, subject to a linear constraint, which automatically yields an optimizer in the interior of the constraint set. It is in this detail that the difference from quadratic minimization lies.
We also mention that the combination of linear regression and the minimization of the quadratic distance is used in the context of reproducing kernel Hilbert spaces for learning and classification problems, as in [8,9]. Besides the standard applications in statistical analysis and the newer applications in statistical learning already mentioned, there is also the application of linear regression to decoding, as considered in [10].
It is worth noting that our approach provides an alternative to the maximum entropy in the mean methods proposed in [11], as well as its measure-theoretic formulation in [12]. The key difference between our approach and these methods lies in their foundational principles. The maximum entropy in the mean method involves transforming the algebraic problem into one of determining a probability distribution over the set of constraints. In this framework, the solution to the algebraic problem corresponds to the expected value of a random variable with respect to an unknown probability distribution, which is determined using the standard maximum entropy method. In contrast, our approach starts with a Fermi–Dirac-type entropy defined directly on the set of constraints. By minimizing this entropy, we obtain a solution that lies directly within the constraint set.
For optimization purposes, it is useful to note that the entropy function is the Lagrange–Fenchel dual of the logarithm of the Laplace transform of a measure whose support is the constraint space.
The remainder of this paper is organized as follows. In the next section, we collect the mathematical details of the method. Some geometric aspects of the related inverse problem are examined in [13]. At the end of Section 2, we explicitly explain how we measure the quality of the solution. Although the measure is similar to that used in the quadratic optimization procedure, its origin is quite different. The main difference is that our proposal yields a solution (when it exists) in the interior of the constraint set. In Section 3, we consider several variations on the theme of two toy examples: a textbook example and a simulated example. The first of them has a small number of data points, which we use to exemplify how to determine the constraint set. We also examine the performance of the method for different sizes of a simulated design matrix and data vector, and we compare the results to the solutions of the same problems obtained by applying the CVX method. In Section 4, we address a consistency issue: we verify that our method is consistent with the obvious algebraic solution when the matrix is invertible, or when it is of full rank and a generalized inverse is available. We conclude with some remarks.
2. The Entropy Minimization Approach
Clearly, problem (7) is an ill-posed linear inverse problem with convex constraints. Instead of using the traditional least squares methodology, our approach consists of devising a smooth convex function on the constraint set and solving the problem of minimizing it subject to the linear constraints of (7).
To begin with, to avoid excessive notations, we relabel the components of the unknown vector and of the constraint box with a single running index. The correspondence with the prior notation is clear: the first labels correspond to the regression coefficients, the next to the intercept, and the last N to the error components. On the Borel sets of the constraint box, we define a measure that charges its “corners” as follows:
We use the standard symbol for the unit point mass at a point (that is, the Dirac delta measure at that point). There are three reasons for this proposal. First, the convex hull of the support of the measure is, in each coordinate, the corresponding constraint interval. Second, the function introduced in Equation (10) is defined everywhere. Third, below, we need the invertibility of the gradient of the logarithmic Laplace transform. As the coordinates are separated, this follows from the fact that the corresponding one-dimensional equation has a solution if and only if the prescribed value lies in the open constraint interval. Having mentioned these preliminaries, the Laplace transform of the measure is easy to compute:
Moreover, the moment generating function is defined by
Then, the function that we seek is defined to be the Lagrange–Fenchel dual of the logarithm of the Laplace transform, which is given as
Making use of the preliminaries mentioned above, a calculation shows that
A good reference for these matters is [14]. It is known and standard to verify that the logarithmic Laplace transform is convex and infinitely differentiable on the whole space, and that the entropy is strongly convex and infinitely differentiable in the interior of the constraint set, where it reaches its minimum value. Moreover, for any point in the interior of the constraint set, the corresponding first-order equation has a unique solution, which is easy to determine analytically. Furthermore, the Lagrange–Fenchel dual of the entropy is the logarithmic Laplace transform itself, and the gradient maps of the two functions are inverse to each other.
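As a concrete one-coordinate illustration, assume (as described above) that the reference measure places a unit point mass at each endpoint of the j-th constraint interval, here labeled $a_j < b_j$ (our own generic labels). Then the logarithmic Laplace transform, its gradient, and its Fenchel dual can be written explicitly:

```latex
% Reference measure, log-Laplace transform, and its gradient, per coordinate
\mu_j = \delta_{a_j} + \delta_{b_j}, \qquad
\psi_j(\tau) = \ln\!\big(e^{a_j\tau} + e^{b_j\tau}\big), \qquad
\psi_j'(\tau) = \frac{a_j e^{a_j\tau} + b_j e^{b_j\tau}}{e^{a_j\tau} + e^{b_j\tau}} \in (a_j,b_j),

% Fenchel dual: a Fermi--Dirac-type entropy on the constraint interval
\psi_j^{*}(x) = \sup_{\tau\in\mathbb{R}}\{x\tau - \psi_j(\tau)\}
             = p\ln p + (1-p)\ln(1-p), \qquad p = \frac{b_j - x}{b_j - a_j}, \quad x\in(a_j,b_j).
```

Summing these one-dimensional duals over the coordinates produces a Fermi–Dirac-type entropy: it is strictly convex on the open box, attains its minimum value of $-\ln 2$ per coordinate at the midpoints, and its slope blows up at the endpoints, which is what keeps the minimizer in the interior of the constraint set.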
Having made explicit the objects that we need, the solution to problem (8) can be obtained as stated in the following theorem.
Theorem 1.
Let the entropy be as in (13) and let the logarithmic Laplace transform be its Lagrange–Fenchel dual. Suppose that the data vector lies in the interior of the image of the constraint set under the extended matrix. Then, the solution to (8) is given by
Here, the multiplier vector is the point at which the dual objective achieves its maximum value. Moreover,
Proof.
To obtain the solution (15), we use the standard Lagrange multiplier technique to minimize the entropy subject to the linear constraints, for which we first define the Lagrangian function
Then, equating the gradients of the Lagrangian with respect to the unknown and to the multipliers to zero, one obtains the following system:
Keep in mind that the first K components of the optimizer solve problem (3) and the last N are the estimated measurement error. Moreover, notice that, since the negative of the dual objective is a strictly convex, infinitely differentiable function, the maximizer of the dual occurs at the point satisfying
with the primal solution given by (15). In addition, since most numerical software packages are written to solve a minimization problem by default, instead of maximizing the dual objective, it is convenient to minimize its negative (see https://metrumresearchgroup.github.io/bbr/ (accessed on 23 January 2025) for an example). This combines the usual gradient method with a step reduction procedure at each iteration, which is convenient because the objective function may be very flat near the minimum.
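A minimal sketch of this dual approach in R is given below. The names are placeholders of our own (G for the extended matrix, y for the data, lo and hi for the per-coordinate bounds of the constraint box), the dual objective is the one arising from the construction sketched after (13), and the optimizer is the spg() routine of the BB package used in Section 3; this is an illustration of the scheme, not the exact code behind the numerical results.

```r
library(BB)  # provides spg(), the spectral projected gradient optimizer

# Placeholders: G is the extended matrix, y the N-vector of data,
# lo and hi the per-coordinate lower and upper bounds of the constraint box.
solve_entropy <- function(G, y, lo, hi) {

  # Primal point x(lambda): gradient of the log-Laplace transform at G' lambda.
  primal <- function(lambda) {
    u <- drop(crossprod(G, lambda))       # u = G' lambda
    w <- 1 / (1 + exp((hi - lo) * u))     # weight on the lower corner of each interval
    w * lo + (1 - w) * hi                 # always in the interior of the box
  }

  # Negative dual objective: sum_j log(exp(lo_j u_j) + exp(hi_j u_j)) - <lambda, y>.
  neg_dual <- function(lambda) {
    u <- drop(crossprod(G, lambda))
    m <- pmax(lo * u, hi * u)             # stable log-sum-exp of the two corner terms
    sum(m + log(exp(lo * u - m) + exp(hi * u - m))) - sum(lambda * y)
  }

  # Gradient of the negative dual: G x(lambda) - y.
  neg_dual_grad <- function(lambda) drop(G %*% primal(lambda)) - y

  fit <- spg(par = rep(0, length(y)), fn = neg_dual, gr = neg_dual_grad, quiet = TRUE)
  primal(fit$par)  # leading entries: regression parameters; last N entries: estimated errors
}
```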
The Reconstruction Error
When one solves problem (8) or its variational version (15) numerically, the solution need not satisfy the linear constraints exactly. The reconstruction error simply measures how large this offset is with respect to the problem data. In methods based on minimizing a quadratic distance, the value of the objective function at the optimum is, at the same time, a measure of the reconstruction error. In our approach, the minimum value of the entropy does not measure the quality of the reconstruction. Nevertheless, we know from Theorem 1 that
So, in the end, the quantitative measure of the reconstruction is the same. Moreover, our method allows us to estimate the additive noise, which is given by the last N components of the optimizer; calling this vector the estimated error, an estimate of the size of the additive measurement error is given by
Although we do not make any assumption about the statistical nature of the measurement noise, if the components of the estimated error vector are regarded as a sample of some random variable and the number of data points is large, one could use the output of our method, combined with standard statistical methodology, to determine the distribution of the noise affecting the measurement process.
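For instance, if x_hat denotes the output of the sketch above (a placeholder name), the estimated noise and the reconstruction offset can be extracted and inspected with standard R tools:

```r
# x_hat: output of solve_entropy(); G, y, N as in the previous sketch
e_hat  <- tail(x_hat, N)                    # last N components: estimated measurement errors
offset <- sum((y - drop(G %*% x_hat))^2)    # squared reconstruction offset
hist(e_hat, main = "Estimated measurement errors")
qqnorm(e_hat); qqline(e_hat)                # informal check against a Gaussian noise model
```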
3. Numerical Examples
Here, we consider several examples. The first set of examples is built from one taken from the textbook by Stone ([4], Ch. 8), which we examine from several points of view to exemplify the difference between our method and the least squares optimization procedure. We consider various cases depending on whether or not we assume that there is experimental error and on whether the number of data points is larger or smaller than the number of unknown parameters. After these variations, we present a simulated example along the lines suggested in the introduction of the CVXR package by [15] (R v. 4.4.1).
All examples are programmed in R. We use CVXR to implement CVX with the ECOS_BB solver (https://github.com/embotech/ecos, accessed on 23 January 2025), which is a branch-and-bound procedure for the solution of mixed-integer convex second-order cone problems. For our optimization method, we use the spectral projected gradient method for large-scale optimization with simple constraints and Barzilai–Borwein step length strategies [16] (R package BB [17]).
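For reference, the quadratic (CVX) baseline used in the comparisons can be set up along the following lines; this is only a sketch with placeholder names (A for the matrix of the problem, y for the data, lb and ub for the box bounds), not the exact script behind the tables below.

```r
library(CVXR)

# Placeholder problem data: A (N x p matrix), y (length-N vector), box bounds lb, ub
x      <- Variable(ncol(A))
obj    <- Minimize(sum_squares(y - A %*% x))   # quadratic reconstruction criterion
constr <- list(x >= lb, x <= ub)               # box constraints on the unknowns
prob   <- Problem(obj, constr)
sol    <- solve(prob, solver = "ECOS_BB")      # solver used in our comparisons
x_cvx  <- sol$getValue(x)
```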
3.1. A Textbook Example
To motivate the least squares method, Ref. [4] presents data from a chemical engineering process for polymerization. The data consist of 6 runs of coded temperatures and the corresponding values of the process, given in the following vector:
The transpose of the matrix is
In fact, the original linear regression problem has a design matrix without the middle row. Since the solution obtained is poor, a non-linear regression of order two is suggested. So, suppose that the underlying model is a second-order polynomial in the coded temperature. (We have changed the notations relative to [4] to follow ours.) Our first task is to estimate the ranges for the unknown coefficients.
Here, we consider the following possibility: the mid-points for the ranges of the slope and curvature coefficients are taken to be the averages of numerical approximations of these quantities computed from the data. The consecutive incremental quotients of y are
Note that the quotients decrease, suggesting that the curve bends down (that it may be concave), except that there could be a measurement error at the fourth or fifth data point. The average of the incremental quotients of the incremental quotients yields a mid-point for the range of the curvature coefficient.
The mid-point for the range of the intercept is taken to be the average of two of the registered values. To be conservative, we allow for some miscalculation of the quantities above and obtain the ranges for the slope and curvature coefficients shown in Table 1, whereas the range for the intercept is obtained by adding to (and subtracting from) its mid-point the length of one step times the estimated slope there.
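In R, these mid-points can be computed in a few lines; the vectors h and y below stand for the coded temperatures and the registered responses (placeholders, since only the construction matters here), and h is assumed to be roughly equally spaced.

```r
# h: coded temperatures, y: registered responses (placeholders)
d1 <- diff(y) / diff(h)          # incremental quotients: local slope estimates
d2 <- diff(d1) / diff(h)[-1]     # quotients of the quotients: local curvature estimates
mid_slope <- mean(d1)            # mid-point of the range for the slope coefficient
mid_curv  <- mean(d2)            # mid-point of the range for the curvature coefficient
```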
Table 1.
Ranges for the unknowns.
After this preamble, we are ready to consider several possible cases and comparisons.
3.1.1. Data Without Errors
We solve problem (3), with the data given by the vector (20), the matrix whose transpose is shown in (21), and the solution constrained to the box of Table 1. Our convex optimization method gives a solution that, not surprisingly, is similar to the one obtained by the least squares method in [4] and also similar to the solutions obtained by optimization methods such as CVXR [7,15], which are based on minimizing a quadratic criterion (see our comments on consistency in Section 4 below). Substituting the estimated coefficients into the second-order polynomial model and evaluating it at the different values of h, we can compare our estimated values to the given data values y in Table 2.
Table 2.
Estimated values vs. registered values y.
3.1.2. Data with Errors
We assume now that the data have measurement errors and set an ample box for the error components. So, this time, we use our method to solve problem (7), with the data as before and the extended matrix. The regression coefficients are constrained to the same box as in the previous section and, after solving, we obtain the estimated coefficients. Inserting these into the polynomial model, we obtain the estimated values for the different h. Our method also gives an estimate of the error, and we show all of these quantities, together with the data values, in Table 3.
Table 3.
Estimated values and errors vs. registered values y.
One can see that the estimated values plus the estimated errors add up to the registered data values.
Note that the extended problem is under-determined and always ill-posed, since the number of unknowns exceeds the number of equations (in our example, 6 data points and 3 regression parameters, hence 9 unknowns). Moreover, the square block matrix obtained by multiplying the extended matrix on the left by its transpose has a vanishing determinant, since the corresponding Schur complement is the zero matrix.
Therefore, as mentioned in Section 4, this enhances the convenience of our approach, since it requires neither inverting matrices nor the quest for generalized inverses. We note that, in this case, a method like CVX also succeeds in finding a solution by applying an interior point method, which works well for small and medium-sized problems; for larger problems, however, it switches to a first-order solver, which can be slow if the problem is not well conditioned [15]. In this particular example, CVXR yields its own set of estimated coefficients; however, the mean square error of the values of y estimated with these coefficients, with respect to the observed values, is larger than that obtained with our method. We shall give further evidence of the superior precision of our method with respect to CVX in Section 3.2.
3.1.3. Case in Which the Data Are Scarce
We now assume that it is very difficult or costly to measure different values of the response and that we only have two registered values, corresponding to two of the coded temperatures. The transpose of the matrix is
We keep the bounds for the solution the same as in the previous problems. Note that the problem is ill-posed, since there are fewer equations than unknowns (two data points for three regression parameters). Then, solving while disregarding errors with our method, we obtain one set of estimated coefficients; solving the same problem with the CVXR method yields another. In Table 4, we present the original data values (for all h), the estimations given by the solution of our method, and the estimations obtained with the solution given by CVXR.
Table 4.
Estimated values (our method and CVXR) and registered values y.
Here, our method performs better than CVX when the data are scarce: our estimated coefficients yield a better estimate of the unobserved values of y. This is confirmed by comparing the mean square errors of both methods with respect to the observed data.
3.2. A Simulated Example
For this set of examples, we generated Gaussian data with N observations and K predictors, for different values of N and K. Thus, our design matrix is of size N × K, with entries obtained by generating random numbers from a Gaussian random variable. The outcome values are generated by the linear model plus an additive noise term sampled from a Gaussian random variable, with the true regression coefficients forming an arithmetic progression of common difference 1. The goal is to recover the values of these coefficients. We assume that the experimenter has some knowledge of the range of values of the coefficients and of the error, so she sets constraint boxes for the coefficients and for the error accordingly.
We tested this linear system for various values of N and K. Note that, due to the random nature of the design matrix and of the noise, for each pair of values of N and K, we repeat the reconstruction experiment 100 times and report the average of the square of the norm of the respective reconstruction errors, for our method and for the CVX reconstruction. Moreover, for large values of N and K, it is convenient to scale down the design matrix and the error (e.g., to multiply them by a small constant in the largest cases). This prevents the exponentials in the objective function from exploding, which could disrupt the convergence of the underlying solver.
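A sketch of the data-generating step in R is shown below and can be fed directly to the two solvers sketched earlier; the sizes, the noise scale, the range of the true coefficients, and the widths of the boxes are illustrative placeholders, not the exact settings behind Table 5.

```r
set.seed(1)                                # illustrative settings only
N <- 100; K <- 10; sigma <- 1
X <- matrix(rnorm(N * K), N, K)            # Gaussian design matrix
beta_true <- seq_len(K)                    # arithmetic progression of difference 1
y <- drop(X %*% beta_true) + rnorm(N, sd = sigma)

# Extended system of (7): unknowns = (coefficients, errors), G = [X, I_N]
G  <- cbind(X, diag(N))
lo <- c(rep(-2 * K, K), rep(-5 * sigma, N))   # constraint box for coefficients and errors
hi <- c(rep( 2 * K, K), rep( 5 * sigma, N))

x_hat    <- solve_entropy(G, y, lo, hi)    # entropy reconstruction (sketch in Section 2)
beta_hat <- head(x_hat, K)
err_ours <- sum((beta_hat - beta_true)^2)  # squared norm of the reconstruction error
```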
Table 5 shows the squared norms of the reconstruction errors resulting from the application of our method and of the CVX method, for various combinations of N and K. We see that, in all cases, our method surpasses CVX in accuracy, with around a 70–90% improvement in most cases.
Table 5.
Comparison of the norms of the reconstruction errors for CVX and for our method, for different values of N and K. The results are averaged over 100 runs.
4. A Consistency Issue
Here, we add a few consistency remarks. A natural question is the following: what does the method yield when the matrix of the problem is invertible? Let us apply the standard method of Lagrange multipliers to solve problem (8):
We form the Lagrangian and equate its gradient with respect to the unknown to 0 to obtain
We know that this gradient map is invertible; therefore,
Consider now the dual problem
The first-order condition for a point to be the maximizer is that
Now, invoking (14) and combining the two identities, we obtain that
Similarly, when the matrix is of full rank and the relevant square product with its transpose is invertible, a variation of the previous theme yields that the vector obtained by applying the corresponding generalized inverse to the data is a solution to the linear system.
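In the illustrative symbols used earlier (matrix $\boldsymbol{A}$, data $\boldsymbol{y}$, entropy $\Phi$, logarithmic Laplace transform $\psi$, multiplier $\boldsymbol{\lambda}$), the chain of identities behind these remarks can be spelled out as follows; the reading of the full-rank case in terms of a right inverse is our own gloss.

```latex
% Stationarity of the Lagrangian and the dual first-order condition
\nabla\Phi(\boldsymbol{x}) = \boldsymbol{A}^{\top}\boldsymbol{\lambda}
\;\Longrightarrow\;
\boldsymbol{x} = \nabla\psi(\boldsymbol{A}^{\top}\boldsymbol{\lambda}),
\qquad
\boldsymbol{A}\,\nabla\psi(\boldsymbol{A}^{\top}\boldsymbol{\lambda}^{*}) = \boldsymbol{y}.

% If A is invertible, the two identities force the algebraic solution; if A A^T
% is invertible, the generalized (right) inverse supplies a solution of A x = y.
\boldsymbol{A} \text{ invertible } \Longrightarrow \boldsymbol{x}^{*} = \boldsymbol{A}^{-1}\boldsymbol{y};
\qquad
\boldsymbol{A}\boldsymbol{A}^{\top} \text{ invertible } \Longrightarrow
\boldsymbol{A}\big(\boldsymbol{A}^{\top}(\boldsymbol{A}\boldsymbol{A}^{\top})^{-1}\boldsymbol{y}\big) = \boldsymbol{y}.
```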
Clearly, this is not surprising, but it is useful to know that consistency is maintained in these particular cases. Once more, a strong feature of the duality approach, even when the matrix is invertible but ill-conditioned, is that there is no need to invert matrices to numerically find the maximizer of the dual problem.
5. Final Remarks
To summarize, the main features of our approach are as follows: we define a strictly convex function whose domain is the constraint set and whose minimization provides an explicit representation of the solution to problem (3) or to its extended version (7). The extended version can be viewed as a regularization of the original problem, designed to address cases where the data do not lie in the range of the operator. Additionally, the approach ensures the tractability of the dual problem for the Lagrange multipliers, which is particularly advantageous from a numerical perspective, as highlighted in the final paragraph of the previous section.
Pending issues to address are as follows. Our approach does not impose any specific assumptions about the measurement errors, besides the fact that they lie within a bounded interval. It would be interesting to explore scenarios where a large number of measurements is feasible, using the estimated noise vector to determine the statistical properties of the measurement error. As demonstrated in Section 3.1.3, the method performs effectively even when data are scarce, but it would be worthwhile to investigate the asymptotic properties of the estimators.
Finally, by employing appropriate vectorization, we can derive a stylized version of the problem addressed in [18] using the method of maximum entropy in the mean. The problem consists of finding two matrices of prescribed sizes that satisfy a linear matrix equation of the following type:
Here, the coefficient matrices are given, and it is also required that the components of the two unknown matrices satisfy box constraints. It can be proven that, after vectorization, this problem reduces to a problem of the form (7).
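The vectorization step rests on the standard identity relating matrix products and the Kronecker product (stated here for generic conformable matrices; the letters are not tied to the notation of [18]):

```latex
\operatorname{vec}(AXB) = (B^{\top}\otimes A)\,\operatorname{vec}(X).
```

Thus, a linear matrix equation in the unknown matrices becomes a single linear system in the stacked vector of their entries, and the box constraints on the individual entries carry over verbatim, which is exactly the structure of (7).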
Author Contributions
Conceptualization, methodology, formal analysis, investigation, A.A. and H.G.; software, data curation, A.A.; validation, A.A. and H.G.; writing—original draft preparation, H.G.; writing—review and editing, A.A. and H.G. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
The software and data used in this study are available upon request from the first author.
Acknowledgments
We thank the referees for their comments and suggestions that improved the paper. A. Arratia is affiliated with the Soft Computing Research Group (SOCO) at the Intelligent Data Science and Artificial Intelligence Research Center and with the Institute of Mathematics of UPC-BarcelonaTech (IMTech).
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Bertero, M.; Boccacci, P. Inverse Problems and Imaging; Institute of Physics Publishing: Philadelphia, PA, USA, 1998. [Google Scholar]
- Engl, H.W.; Hanke, M.; Neubauer, A. Regularization of Inverse Problems; Kluwer Academic Publishers: Dordrecht, The Netherlands, 1996. [Google Scholar]
- Mittelhammer, R.C.; Judge, G.G.; Miller, D.J. Econometric Foundations; Cambridge University Press: Cambridge, UK, 2000. [Google Scholar]
- Stone, C.J. A Course in Probability and Statistics; Duxbury Press: Belmont, CA, USA, 1996. [Google Scholar]
- Rasmussen, C.E.; Williams, C.K.I. Gaussian Processes for Machine Learning; The MIT Press: Cambridge, MA, USA, 2006. [Google Scholar]
- Deisenroth, M.P.; Faisal, A.; Ong, C.S. Mathematics for Machine Learning; Cambridge University Press: Cambridge, UK, 2020. [Google Scholar]
- Grant, M.; Boyd, S.; Ye, Y. Disciplined Convex Programming. In Global Optimization: From Theory to Implementation; Springer: New York, NY, USA, 2006; pp. 155–210. [Google Scholar]
- Cucker, F.; Zhou, D.-X. Learning Theory: An Approximation Theory Point of View; Cambridge University Press: Cambridge, UK, 2006. [Google Scholar]
- Paulsen, V.I.; Raghupathi, M. An Introduction to the Theory of Reproducing Kernel Hilbert Spaces; Cambridge University Press: Cambridge, UK, 2017. [Google Scholar]
- Candes, E.; Tao, T. Decoding by Linear Programming. IEEE Trans. Inf. Theory 2005, 51, 4203–4215. [Google Scholar] [CrossRef]
- Golan, A.; Judge, G.G.; Miller, D. Maximum Entropy Econometrics: Robust Estimation with Limited Data; John Wiley & Sons: New York, NY, USA, 1996. [Google Scholar]
- Golan, A.; Gzyl, H. A Generalized Maxentropic Inversion Procedure for Noisy Data. Appl. Math. Comput. 2002, 127, 249–260. [Google Scholar] [CrossRef]
- Gzyl, H. A Geometry in the Set of Solutions to Ill-Posed Linear Problems with Box Constraints: Applications to Probabilities on Discrete Sets. J. Appl. Anal. 2024. [Google Scholar] [CrossRef]
- Borwein, J.M.; Lewis, A.S. Convex Analysis and Nonlinear Optimization, 2nd ed.; CMS-Springer: Berlin/Heidelberg, Germany, 2006. [Google Scholar]
- Fu, A.; Narasimhan, B.; Boyd, S. CVXR: An R Package for Disciplined Convex Optimization. J. Stat. Softw. 2020, 94, 1–34. [Google Scholar]
- Raydan, M. Barzilai-Borwein Gradient Method for Large-Scale Unconstrained Minimization Problem. SIAM J. Optim. 1997, 7, 26–33. [Google Scholar] [CrossRef]
- Varadhan, R.; Gilbert, P. BB: An R Package for Solving a Large System of Nonlinear Equations and for Optimizing a High-Dimensional Nonlinear Objective Function. J. Stat. Softw. 2009, 32, 1–26. [Google Scholar] [CrossRef]
- Marsh, T.L.; Mittelhammer, R.; Scott Cardell, N. Generalized Maximum Entropy Analysis of the Linear Simultaneous Equation Model. Entropy 2014, 16, 825–853. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).