dCATCH: a numerical package for d-variate near G-optimal Tchakaloff regression via fast NNLS

Abstract: We provide a numerical package for the computation of a d-variate near G-optimal polynomial regression design of degree m on a finite design space X ⊂ R^d, by a few iterations of a basic multiplicative algorithm followed by a Tchakaloff-like compression of the discrete measure keeping the reached G-efficiency, via an accelerated version of the Lawson-Hanson algorithm for Non-Negative Least Squares (NNLS) problems. This package can solve on a personal computer large-scale problems where card(X) × dim(P^d_{2m}) is up to 10^8-10^9, with dim(P^d_{2m}) = (2m+d choose d) = (2m+d choose 2m). Several numerical tests are presented on complex shapes in d = 3 and on hypercubes in d > 3.


In this paper we present the numerical software package dCATCH [1] for the computation of a d-variate near G-optimal polynomial regression design of degree m on a finite design space X ⊂ R^d. Below we recall the relevant theoretical background and describe the content (subroutine names) of the dCATCH software package.

In what follows we deal with (weighted) Least Squares (LS) polynomial regression. Throughout, µ denotes a probability measure on a compact set Ω ⊂ R^d whose support is P^d_m-determining, i.e., polynomials in P^d_m vanishing on supp(µ) vanish everywhere on Ω.

In the theory of optimal designs, a key role is played by the diagonal of the reproducing kernel for µ in P^d_m(Ω) (also called the Christoffel polynomial of degree m for µ)

K^µ_m(x, x) = ∑_{j=1}^{N_m} p_j^2(x),  N_m = dim(P^d_m(Ω)),   (1)

where {p_j} is any µ-orthonormal basis of P^d_m(Ω). Recall that K^µ_m(x, x) can be proved to be independent of the choice of the orthonormal basis. Indeed, a relevant property is the following estimate of the L^∞-norm in terms of the L²_µ-norm of polynomials:

‖p‖_{L^∞(Ω)} ≤ (max_{x∈Ω} K^µ_m(x, x))^{1/2} ‖p‖_{L²_µ(Ω)},  ∀p ∈ P^d_m(Ω).   (2)

Now, by (1) and µ-orthonormality of the basis we get

∫_Ω K^µ_m(x, x) dµ = ∑_{j=1}^{N_m} ∫_Ω p_j^2(x) dµ = N_m,   (3)

which entails that max_{x∈Ω} K^µ_m(x, x) ≥ N_m.
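As a concrete illustration of (1)-(3) (a univariate sketch, not part of dCATCH: we take the uniform probability measure dµ = dx/2 on Ω = [-1, 1], for which scaled Legendre polynomials give a µ-orthonormal basis), one can verify numerically that the Christoffel diagonal integrates to N_m and that its maximum is at least N_m:

```python
import numpy as np
from numpy.polynomial import legendre as leg

# Univariate sketch: d(mu) = dx/2 on [-1,1]; p_j = sqrt(2j+1) P_j is
# mu-orthonormal, so K_m(x,x) = sum_j (2j+1) P_j(x)^2, cf. (1).
m = 5
N_m = m + 1                      # dim of degree-<=m polynomials for d = 1

def christoffel_diag(x):
    x = np.atleast_1d(x)
    vals = np.array([np.sqrt(2*j + 1) * leg.legval(x, [0]*j + [1])
                     for j in range(N_m)])
    return np.sum(vals**2, axis=0)

# check (3): the integral of K_m(x,x) in d(mu) equals N_m
nodes, weights = leg.leggauss(2*m + 2)           # exact up to degree 2m
integral = np.sum(weights / 2 * christoffel_diag(nodes))
print(integral)                                  # ≈ 6 = N_m
print(christoffel_diag(1.0)[0] >= N_m)           # max over Omega >= N_m
```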

A probability measure µ* = µ*(Ω) is then called a G-optimal design for polynomial regression of degree m on Ω if

µ* ∈ argmin_µ max_{x∈Ω} K^µ_m(x, x).   (4)

Observe that, since ∫_Ω K^µ_m(x, x) dµ = N_m for every µ, an optimal design also has the following property:

K^{µ*}_m(x, x) = N_m, µ*-a.e. in Ω.

Now, the well-known Kiefer-Wolfowitz Equivalence Theorem (a cornerstone of optimal design theory) asserts that the difficult min-max problem (4) is equivalent to the much simpler maximization problem

µ* ∈ argmax_µ det(G^µ_m),   (5)

where G^µ_m = (∫_Ω φ_i φ_j dµ)_{ij} is the Gram matrix (or information matrix in statistics) of µ in a fixed polynomial basis {φ_i} of P^d_m(Ω). Such optimality is called D-optimality, and it ensures that an optimal measure always exists, since the set of Gram matrices of probability measures is compact and convex; see e.g. [5,12] for a general proof of these results, valid for continuous as well as discrete compact sets.

Notice that an optimal measure is neither unique nor necessarily discrete (unless Ω is discrete itself). Nevertheless, the celebrated Tchakaloff Theorem ensures the existence of a positive quadrature formula for integration in dµ* on Ω, with cardinality not exceeding N_2m = dim(P^d_{2m}(Ω)), which is exact for all polynomials in P^d_{2m}(Ω). Such a formula is then a design itself, and it generates the same orthogonal polynomials and hence the same Christoffel polynomial of µ*, preserving G-optimality (see [15] for a proof of Tchakaloff Theorem with general measures).

We recall that G-optimality has two important interpretations in terms of statistical and deterministic polynomial regression.

From a statistical viewpoint, it is the probability measure on Ω that minimizes the maximum prediction variance by polynomial regression of degree m; cf. e.g. [5].

On the other hand, from an approximation theory viewpoint, if we call L^{µ*}_m the corresponding weighted Least Squares projection operator L^∞(Ω) → P^d_m(Ω), namely

‖f − L^{µ*}_m f‖_{L²_{µ*}(Ω)} = min_{p ∈ P^d_m(Ω)} ‖f − p‖_{L²_{µ*}(Ω)},

by (2) we can write for every f ∈ L^∞(Ω)

‖L^{µ*}_m f‖_{L^∞(Ω)} ≤ N_m^{1/2} ‖L^{µ*}_m f‖_{L²_{µ*}(Ω)} ≤ N_m^{1/2} ‖f‖_{L²_{µ*}(Ω)} ≤ N_m^{1/2} ‖f‖_{L^∞(Ω)}

(where the second inequality comes from µ*-orthogonality of the projection), which gives

‖L^{µ*}_m‖ := sup_{f ≠ 0} ‖L^{µ*}_m f‖_{L^∞(Ω)} / ‖f‖_{L^∞(Ω)} ≤ N_m^{1/2},   (6)

that is, a G-optimal measure minimizes (the estimate of) the weighted Least Squares uniform operator norm.

We stress that in this paper we are interested in the fully discrete case of a finite design space Ω = X = {x_1, ..., x_M}, so that any design µ is identified by a set of positive weights (masses) summing up to 1, and integrals are weighted sums. Since in the present context we have a finite design space, we may think of a design µ as a vector of non-negative weights u = (u_1, ..., u_M) attached to the points, such that ‖u‖_1 = 1 (the support of µ being identified by the positive weights). Then, a G-optimal (or D-optimal) design µ* is represented by the corresponding non-negative vector u*. We write K^u_m(x, x) for the Christoffel polynomial, and similarly for other objects (spaces, operators, matrices) corresponding to a discrete design. At the same time, L^∞(Ω) = ℓ^∞(X) and L²_µ(Ω) = ℓ²_u(X).

In order to compute an approximation of the desired u*, we resort to the basic multiplicative algorithm proposed by Titterington in the '70s (cf. [16]), namely

u_i(k+1) = u_i(k) K^{µ_k}_m(x_i, x_i) / N_m,  1 ≤ i ≤ M,  k = 0, 1, 2, ...,   (7)

with initialization u(0) = (1/M, ..., 1/M). Such an algorithm is known to converge sublinearly to a D-optimal (or G-optimal, by the Kiefer-Wolfowitz Equivalence Theorem) design, with an increasing sequence of Gram determinants

det(G^{u(k)}_m) = det(V^T diag(u(k)) V),

where V is a Vandermonde-like matrix in any fixed polynomial basis of P^d_m(X); cf., e.g., [7,10]. Observe that u(k+1) is indeed a vector of positive probability weights if such is u(k). In fact, the Christoffel polynomial K^{µ_k}_m is positive on X, and, calling µ_k the probability measure on X associated with the weights u(k), we immediately get ∑_i u_i(k+1) = ∑_i u_i(k) K^{µ_k}_m(x_i, x_i)/N_m = 1 by (3) in the discrete case Ω = X.
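A minimal self-contained sketch of iteration (7) follows (illustrative Python with hypothetical names, not the dCATCH code; a plain monomial basis is adequate at such a low degree):

```python
import numpy as np

# Titterington's multiplicative algorithm (7) on M random points in the
# square, degree m = 2; V is a Vandermonde-like matrix for P^d_m(X).
rng = np.random.default_rng(0)
M, m = 300, 2
X = rng.uniform(-1, 1, (M, 2))
V = np.column_stack([X[:, 0]**a * X[:, 1]**b
                     for a in range(m + 1) for b in range(m + 1 - a)])
N_m = V.shape[1]                     # dim P^2_2 = 6

def christoffel(u, V):
    # K^u_m(x_i, x_i) as squared row norms of the u-orthogonalized V
    _, R = np.linalg.qr(np.sqrt(u)[:, None] * V)
    U = np.linalg.solve(R.T, V.T).T  # U = V R^{-1}
    return np.sum(U**2, axis=1)

u = np.full(M, 1.0 / M)              # u(0): uniform design
for _ in range(500):
    u = u * christoffel(u, V) / N_m  # iteration (7)

G_eff = N_m / christoffel(u, V).max()    # G-efficiency G_m(u)
```

After a few hundred iterations the G-efficiency is typically well above 90%, consistent with the sublinear convergence of the scheme.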

Our implementation of (7) is based on two functions, dCHEBVAND and dORTHVAND. The first computes a Chebyshev-Vandermonde matrix C = dCHEBVAND(n, X) of degree n on X, in a suitably ordered total-degree product Chebyshev basis of the minimal box [a_1, b_1] × ··· × [a_d, b_d] containing X; see [17] for the construction and enumeration of the required "monomial" degrees. Though the initial basis is then orthogonalized, the choice of the Chebyshev basis is dictated by the necessity of controlling the conditioning of the matrix. This would on the contrary be extremely large with the standard monomial basis, already at moderate regression degrees, preventing a successful orthogonalization.
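A sketch of such a Chebyshev-Vandermonde construction (illustrative code, not the actual dCHEBVAND routine), together with a conditioning comparison against the monomial basis:

```python
import numpy as np
from itertools import product

def chebvand(n, X):
    # total-degree product Chebyshev basis of the minimal box containing X
    X = np.asarray(X, dtype=float)
    M, d = X.shape
    a, b = X.min(axis=0), X.max(axis=0)
    T = np.clip(2 * (X - a) / (b - a) - 1, -1, 1)   # map box -> [-1,1]^d
    degs = sorted((k for k in product(range(n + 1), repeat=d)
                   if sum(k) <= n), key=sum)
    cols = []
    for k in degs:
        col = np.ones(M)
        for i, ki in enumerate(k):
            col *= np.cos(ki * np.arccos(T[:, i]))  # T_k(t) = cos(k arccos t)
        cols.append(col)
    return np.column_stack(cols)

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, (400, 2))
C = chebvand(8, X)
mono = np.column_stack([X[:, 0]**k[0] * X[:, 1]**k[1]
                        for k in product(range(9), repeat=2) if sum(k) <= 8])
# the Chebyshev basis is dramatically better conditioned than the monomials
print(np.linalg.cond(C) < np.linalg.cond(mono))     # True
```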
Indeed, the second function dORTHVAND computes a Vandermonde-like matrix in a u-orthogonal polynomial basis on X, where u is the probability weight array. This is accomplished essentially by numerical rank evaluation for C = dCHEBVAND(n, X) and QR factorization

diag(√u) C_0 = QR   (8)

(with Q orthogonal rectangular and R square invertible), where √u = (√u_1, ..., √u_M). The matrix C_0 has full rank and corresponds to a selection of the columns of C (i.e., of the original basis polynomials) via QR with column pivoting, in such a way that these form a basis of P^d_n(X), since rank(C) = dim(P^d_n(X)). A possible alternative, not yet implemented, is the direct use of a rank-revealing QR factorization. The in-out parameter "jvec" allows one to pass directly the column index vector corresponding to a polynomial basis after a previous call to dORTHVAND with the same degree n, avoiding numerical rank computation and allowing a simple "economy size" QR factorization of C_0.

Summarizing, U = C_0 R^{-1} is a Vandermonde-like matrix for degree n on X in the required u-orthogonal basis of P^d_n(X), that is

(p_1, ..., p_{N_n}) = (φ_{j_1}, ..., φ_{j_{N_n}}) R^{-1},   (9)

where jvec = (j_1, ..., j_{N_n}) is the multi-index resulting from pivoting. Indeed, by (8) we can write the scalar product (p_h, p_k)_{ℓ²_u(X)} as

(p_h, p_k)_{ℓ²_u(X)} = ∑_{i=1}^{M} u_i p_h(x_i) p_k(x_i) = (Q^T Q)_{hk} = δ_{hk}

for 1 ≤ h, k ≤ N_n, which shows orthonormality of the polynomial basis in (9).

We stress that rank(C) = dim(P^d_n(X)) could be strictly smaller than dim(P^d_n) = (n+d choose d), when there are polynomials in P^d_n vanishing on X that do not vanish everywhere. In other words, X lies on a lower-dimensional algebraic variety (technically, one says that X is not P^d_n-determining [13]).
This certainly happens when card(X) is too small, namely card(X) < dim(P^d_n), but think also, for example, of the case when d = 3 and X lies on the 2-sphere S² (independently of its cardinality): then rank(C) ≤ dim(P^3_n(S²)) = (n+1)² < dim(P^3_n), since the sphere equation vanishes on X.

Iteration (7) is implemented within the third function dNORD, whose name stands for d-dimensional Near G-Optimal Regression Designs, and which calls dORTHVAND with n = m. Near optimality is here twofold, namely it concerns both the G-efficiency of the design and the sparsity of the design support.
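The u-orthogonalization (8)-(9), with pivoted-QR column selection and numerical rank detection, can be sketched as follows (hypothetical names, not the actual dORTHVAND code); the example places X on the unit circle, so that only 5 of the 6 degree-2 monomials survive, as just discussed:

```python
import numpy as np
from scipy.linalg import qr

def orthvand(C, u):
    # select a polynomial basis on X by QR with column pivoting, then
    # factor diag(sqrt(u)) C0 = Q R, cf. (8), so that U = C0 R^{-1}, cf. (9)
    W = np.sqrt(u)[:, None] * C
    _, Rp, piv = qr(W, mode='economic', pivoting=True)
    r = int(np.sum(np.abs(np.diag(Rp)) > 1e-12 * np.abs(Rp[0, 0])))
    jvec = piv[:r]                      # column indices of the selected basis
    C0 = C[:, jvec]
    _, R = np.linalg.qr(np.sqrt(u)[:, None] * C0)
    U = np.linalg.solve(R.T, C0.T).T    # U = C0 R^{-1}
    return U, jvec

t = np.linspace(0, 2 * np.pi, 100, endpoint=False)
X = np.column_stack([np.cos(t), np.sin(t)])        # X on the unit circle
C = np.column_stack([X[:, 0]**a * X[:, 1]**b       # monomials of degree <= 2
                     for a in range(3) for b in range(3 - a)])
u = np.full(100, 1.0 / 100)
U, jvec = orthvand(C, u)
G = (np.sqrt(u)[:, None] * U).T @ (np.sqrt(u)[:, None] * U)
print(len(jvec))          # 5: X is not P^2_2-determining (x^2 + y^2 = 1 on X)
```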

We recall that G-efficiency is the percentage of G-optimality reached by a (discrete) design, measured by the ratio

G_m(u) = N_m / max_{x∈X} K^u_m(x, x),

knowing that G_m(u) ≤ 1 by (3) in the discrete case Ω = X. Notice that G_m(u) can be easily computed after the construction of the u-orthogonal Vandermonde-like matrix U by dORTHVAND, as G_m(u) = N_m / max_i ‖row_i(U)‖²_2.
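In code, the G-efficiency evaluation is a one-liner once U is available (sketch with illustrative names; the toy check builds an exactly u-orthonormal U from random data):

```python
import numpy as np

def g_efficiency(U):
    # G_m(u) = N_m / max_i ||row_i(U)||_2^2, where the squared row norms of
    # the u-orthogonal Vandermonde U are the Christoffel values K^u_m(x_i,x_i)
    N_m = U.shape[1]
    return N_m / np.max(np.sum(U**2, axis=1))

rng = np.random.default_rng(3)
A = rng.normal(size=(50, 6))             # any full-rank Vandermonde-like matrix
u = np.full(50, 1.0 / 50)
_, R = np.linalg.qr(np.sqrt(u)[:, None] * A)
U = np.linalg.solve(R.T, A.T).T          # u-orthonormalized columns
g = g_efficiency(U)
print(0.0 < g <= 1.0)                    # True: G_m(u) <= 1 always holds
```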

In the multiplicative algorithm (7), we then stop iterating when a given threshold of G-efficiency (the input parameter "gtol" in the call to dNORD) is reached by u(k), say

G_m(u(k)) ≥ gtol.   (10)

Since convergence is sublinear, and in practice we see that 1 − G_m(u(k)) = O(1/k), for a 90% G-efficiency the number of iterations is typically in the tens, whereas it is in the hundreds for 99% and in the thousands for 99.9%. When a G-efficiency very close to 1 is needed, one could resort to more sophisticated multiplicative algorithms; see, e.g., the references quoted in [7,10].

In many applications, however, a G-efficiency of 90-95% can be sufficient (then we may speak of near G-optimality of the design). On the other hand, though in principle the multiplicative algorithm converges to an optimal design µ* on X with weights u* and cardinality N_m ≤ card(supp(µ*)) ≤ N_2m, such sparsity is far from being reached after the iterations that guarantee near G-optimality, in the sense that a still large percentage of the weights in the near optimal vector u(k) is non-negligible.

Following [18,19], we can however effectively compute a design which has the same G-efficiency as u(k), but a support with cardinality not exceeding N_2m = dim(P^d_{2m}(X)), where in many applications N_2m ≪ card(X), obtaining a remarkable compression of the near optimal design. The key result is a discrete version of the Tchakaloff Theorem on positive quadratures, which asserts that for every measure on a compact set Ω ⊂ R^d there exists an algebraic quadrature formula exact on P^d_n(Ω), with positive weights, nodes in Ω, and cardinality not exceeding N_n = dim(P^d_n(Ω)). In the present discrete case, i.e. where the designs are defined on Ω = X, this theorem implies that for every design µ on X there exists a design ν, whose support is a subset of X, which is exact for integration in dµ on P^d_n(X). In other words, the design ν has the same basis moments (indeed, for any basis {p_j} of P^d_n(Ω))

∫_X p_j dν = ∫_X p_j dµ,  1 ≤ j ≤ N_n.   (11)

In matrix terms this can be seen as the fact that the underdetermined {p_j}-moment system

U^T w = U^T u = b,   (12)

where U is a Vandermonde-like matrix of degree n in the basis {p_j} on X and u the design weights, admits a non-negative solution w = (w_1, ..., w_M) with at most N_n nonzero components. Such a solution can be computed by the Lawson-Hanson algorithm applied to the NNLS problem

min { ‖U^T w − b‖_2 : w ≥ 0 },

for which we have developed an accelerated version, named LHDM, suited to large-scale d-variate problems (see the next subsection for a brief description and discussion).
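A minimal sketch of this compression step via NNLS (using scipy.optimize.nnls, whose backend is the Lawson-Hanson scheme; illustrative names and data):

```python
import numpy as np
from scipy.optimize import nnls

# compress a design u on M points to at most N_n support points by matching
# the degree-n moments, cf. (11)-(12)
rng = np.random.default_rng(4)
M, n = 2000, 4
X = rng.uniform(-1, 1, (M, 2))
V = np.column_stack([X[:, 0]**a * X[:, 1]**b
                     for a in range(n + 1) for b in range(n + 1 - a)])
u = np.full(M, 1.0 / M)              # initial (uniform) design weights

b = V.T @ u                          # moments of u, right-hand side of (12)
w, momerr = nnls(V.T, b)             # min ||V^T w - b||_2  s.t.  w >= 0
support = np.nonzero(w)[0]
print(len(support) <= V.shape[1])    # True: at most N_n = 15 nodes survive
```

The residual momerr plays the role of the final moment residual reported in the numerical tests.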

We observe that working with an orthogonal polynomial basis of P^d_n(X) allows one to deal with the well-conditioned matrix U in the Lawson-Hanson algorithm.

The overall computational procedure is implemented by the function dCATCH, where dCATCH stands for d-variate CAratheodory-TCHakaloff discrete measure compression.

It works for any discrete measure on a discrete set X. Indeed, it could be used, other than for design compression, also for the compression of d-variate quadrature formulas, to give an example. In the present framework we call dCATCH with n = 2m and u = u(k), cf. (10), i.e. we solve the moment system (12) at degree 2m by NNLS. In such a way the compressed design generates the same scalar product of u(k) in P^d_m(X), and hence the same orthogonal polynomials and the same Christoffel function on X, thus keeping the G-efficiency invariant,

G_m(w) = G_m(u(k)) ≥ gtol,   (13)

with a (much) smaller support. Denoting by L^{u(k)}_m and L^w_m the weighted Least Squares projection operators corresponding to the near G-optimal weights u(k) and w, respectively, with the same reasoning used to obtain (6) and by (13) we can write the operator norm estimates

‖L^{u(k)}_m‖, ‖L^w_m‖ ≤ (N_m / gtol)^{1/2}.

Moreover, since L^w_m p = p for any p ∈ P^d_m(X), we can write the near optimal estimate

‖f − L^w_m f‖_{ℓ^∞(X)} ≤ (1 + (N_m/gtol)^{1/2}) min_{p∈P^d_m(X)} ‖f − p‖_{ℓ^∞(X)}.

Notice that L^w_m f is constructed by sampling f only at the compressed support {ξ} ⊂ X. The error depends on the regularity of f on D ⊃ X, with a rate that can be estimated whenever D admits a multivariate Jackson-like inequality, cf. [23].
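The invariance (13) can be checked numerically: compressing with n = 2m preserves all scalar products of degree-m polynomials, hence the Gram matrix in P^d_m(X) (an illustrative sketch, not the dCATCH code):

```python
import numpy as np
from scipy.optimize import nnls

def vander(X, n):
    return np.column_stack([X[:, 0]**a * X[:, 1]**b
                            for a in range(n + 1) for b in range(n + 1 - a)])

rng = np.random.default_rng(5)
M, m = 1500, 2
X = rng.uniform(-1, 1, (M, 2))
u = rng.uniform(0.5, 1.5, M)
u /= u.sum()                                 # a probability design on X

V2m = vander(X, 2 * m)                       # basis of P^2_{2m}
w, _ = nnls(V2m.T, V2m.T @ u)                # Tchakaloff-like compression

Vm = vander(X, m)
G_u = Vm.T @ (u[:, None] * Vm)               # Gram matrices in P^2_m:
G_w = Vm.T @ (w[:, None] * Vm)               # they coincide, since products of
print(np.allclose(G_u, G_w, atol=1e-8))      # degree-m polys lie in P^2_{2m}
```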
Let A ∈ R^{N×M} and b ∈ R^N. The NNLS problem consists in seeking x ∈ R^M that solves

min_{x ≥ 0} ‖Ax − b‖²_2.   (14)

This is a convex optimization problem with the linear inequality constraints x ≥ 0 defining the feasible region. Recall that, for a given point x in the feasible region, the index set {1, ..., M} can be partitioned into two sets: the active set Z, containing the indices of the active constraints x_i = 0, and the passive set P, containing the remaining indices of the inactive constraints x_i > 0. Observe that, if x is an optimal solution of (14) and P and Z denote the corresponding passive and active sets, then x also solves in a least squares sense the unconstrained subproblem

x_P = argmin_y ‖A_P y − b‖²_2,   (15)

where A_P is the submatrix containing the columns of A with index in P, and similarly x_P is the subvector made of the entries of x whose index is in P.
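For illustration, here is a bare-bones version of this active-set scheme (a sketch of the classical Lawson-Hanson method; real applications should prefer a tuned implementation such as scipy.optimize.nnls):

```python
import numpy as np

def lawson_hanson(A, b, tol=1e-12):
    # min ||Ax - b||_2 subject to x >= 0, classical active-set iteration
    M = A.shape[1]
    x = np.zeros(M)
    P = np.zeros(M, dtype=bool)          # passive set; active set Z is ~P
    while True:
        w = A.T @ (b - A @ x)            # dual vector (negative gradient)
        if P.all() or w[~P].max() <= tol:
            return x                     # KKT conditions: optimum reached
        j = np.argmax(np.where(P, -np.inf, w))
        P[j] = True                      # move j from Z to P
        while True:
            z = np.zeros(M)
            z[P] = np.linalg.lstsq(A[:, P], b, rcond=None)[0]  # subproblem (15)
            if z[P].min() > tol:
                x = z
                break
            mask = P & (z <= tol)        # backtrack to stay feasible
            alpha = np.min(x[mask] / (x[mask] - z[mask]))
            x = x + alpha * (z - x)
            P &= x > tol

rng = np.random.default_rng(6)
A = rng.normal(size=(20, 8))
xtrue = np.maximum(rng.normal(size=8), 0)     # nonnegative ground truth
b = A @ xtrue                                 # consistent right-hand side
x = lawson_hanson(A, b)
```

On a consistent, well-conditioned problem like this one the scheme recovers the exact non-negative solution; production variants accelerate the column selection and the factorization updates, not this outer active-set logic.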

The Lawson-Hanson algorithm, starting from the null initial guess x = 0 (which is feasible), incrementally builds an optimal solution by moving indices from the active set Z to the passive set P, while keeping all iterates feasible.

Numerical tests

In this section we perform several tests on the computation of d-variate near G-optimal Tchakaloff designs, from low to moderate dimension d. In practice, we are able to treat, on a personal computer, large-scale problems where card(X) × dim(P^d_{2m}) is up to 10^8-10^9, with dim(P^d_{2m}) = (2m+d choose d) = (2m+d choose 2m).
Recall that the main memory requirement is given by the N_2m × M matrix U^T in the compression process solved by the LHDM algorithm, where M = card(X) and N_2m = dim(P^d_{2m}(X)) ≤ dim(P^d_{2m}).
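A back-of-the-envelope check of this memory requirement in double precision (8 bytes per entry; the sizes are illustrative):

```python
from math import comb

def mem_gb(d, m, M):
    # memory (GB) of the N_2m x M matrix U^T, N_2m = dim P^d_{2m} = C(2m+d, d)
    N_2m = comb(2 * m + d, d)
    return N_2m * M * 8 / 1e9

print(round(mem_gb(3, 10, 100000), 2))   # d=3, m=10, card(X)=1e5 -> 1.42 GB
```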

All the tests are performed on a workstation with 32 GB RAM and an Intel Core i7-8700 CPU @ 3.20 GHz.

We perform a test at regression degree m = 10 on the 5-bubble shown in Figure 1b. The initial support X consists of the M = 18915 points, among 64000 low-discrepancy Halton points, that fall in the closure of the multibubble. Results are shown in Figure 1 and Table 3.

Hypercubes: Chebyshev grids
In the recent paper [19], a connection has been studied between the statistical notion of G-optimal design and the approximation-theoretic notion of admissible mesh for multivariate polynomial approximation, deeply studied in the last decade after [13] (see, e.g., [27,28] with the references therein). In particular, it has been shown that near G-optimal designs on admissible meshes of suitable cardinality have a G-efficiency on the whole d-cube that can be made convergent to 1.

Table 3. Results for the multibubble numerical test: compr = M/mean(cpts) is the mean compression ratio obtained by the three methods listed; t_LH/t_Titt is the ratio between the execution time of LH and that of the Titterington algorithm; t_LH/t_LHDM (t_LHI/t_LHDM) is the ratio between the execution time of LH (LHI) and that of LHDM; cpts is the number of compressed Tchakaloff points and momerr is the final moment residual.

We perform three tests in different dimensions and at different regression degrees. Results are shown in Figure 2 and Table 4, using the same notation as above.

Figure 2. The evolution of the cardinality of the passive set P along the iterations of the three LH algorithms for the Chebyshev nodes' tests; panel (c): d = 5, n = 2, M = 1048576.

Table 4. Results of the numerical tests on M = (2km)^d Chebyshev nodes, with k = 4, at different dimensions and degrees: compr = M/mean(cpts) is the mean compression ratio obtained by the three methods listed; t_LH/t_Titt is the ratio between the execution time of LH and that of the Titterington algorithm; t_LH/t_LHDM (t_LHI/t_LHDM) is the ratio between the execution time of LH (LHI) and that of LHDM; cpts is the number of compressed Tchakaloff points and momerr is the final moment residual.

Hypercubes: low-discrepancy points
The direct connection with near G-optimal designs discussed in the previous subsection is peculiar to Chebyshev grids; here we take instead, as initial support X, low-discrepancy Halton points of the hypercube (cf. (12)).

Results are shown in Figure 3 and Table 5, using the same notation as above.

Recall that the computational complexity of a QR factorization of a matrix of size n_r × n_c, with n_c ≤ n_r, is high, namely 2(n_c² n_r − n_c³/3) ≈ 2 n_c² n_r (see, e.g., [30]). The Titterington algorithm performs a QR factorization of an M × N_m matrix at each iteration, with the following overall computational complexity:

k̄ · 2 N_m² M,

where k̄ is the number of iterations necessary for convergence, which depends on the desired G-efficiency.
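For concreteness, the QR-based flop counts of one full Titterington run and of one Lawson-Hanson least squares subproblem can be tallied in a few lines (sketch; the sizes are illustrative):

```python
from math import comb

def titterington_flops(d, m, M, kbar):
    # kbar iterations, each a QR of an M x N_m matrix: ~ 2 N_m^2 M flops each
    N_m = comb(m + d, d)
    return kbar * 2 * N_m**2 * M

def lh_iteration_flops(d, m, card_P):
    # one LS subproblem (15): ~ QR of an N_2m x |P| matrix
    N_2m = comb(2 * m + d, d)
    return 2 * N_2m * card_P**2

flops_titt = titterington_flops(3, 5, 10**5, 100)   # d=3, m=5, M=1e5, 100 iters
flops_lh = lh_iteration_flops(3, 5, 200)            # passive set of size 200
```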

On the other hand, the computational cost of one iteration of the Lawson-Hanson algorithm, for a fixed passive set P, is given by the solution of an LS problem of type (15), which costs approximately 2 N_2m |P|², i.e., the cost of a QR factorization of a matrix of size N_2m × |P|. However, as the experimental results confirm, the evolution of the set P along the execution of the algorithm may vary significantly depending on the experiment settings, so that the exact overall complexity is hard to estimate. Lower and upper bounds are available, but they may lead to heavy under- and over-estimations, respectively; cf. [31] for a discussion of complexity issues.

Table 5. Results of the numerical tests on Halton points: compr = M/mean(cpts) is the mean compression ratio obtained by the three methods listed; t_LH/t_Titt is the ratio between the execution time of LH and that of the Titterington algorithm; t_LH/t_LHDM (t_LHI/t_LHDM) is the ratio between the execution time of LH (LHI) and that of LHDM; cpts is the number of compressed Tchakaloff points and momerr is the final moment residual.

Conclusions

In this paper we have presented dCATCH [1], a numerical software package for the computation of d-variate near G-optimal Tchakaloff regression designs, which makes large-scale problems tractable on a personal computer.

The present package, dCATCH, works for any discrete measure on a discrete set X. Indeed, it could be used, other than for design compression, also in the compression of d-variate quadrature formulas, even on lower-dimensional manifolds, to give an example.

We may observe that with this approach we can compute a d-variate compressed design starting from a high-cardinality sampling set X that discretizes a continuous compact set (see Sections 4.2 and 4.3). This design allows an m-th degree near optimal polynomial regression of a function on the whole X, by sampling on a small design support. We stress that the compressed design is function-independent and thus can be constructed "once and for all" in a pre-processing stage. This approach is potentially useful, for example, for the solution of d-variate parameter estimation problems,

where we may think of modeling a nonlinear cost function by near optimal polynomial regression on a discrete d-variate parameter space X; cf., e.g., [32,33] for instances of parameter estimation problems from mechatronics applications (Digital Twins of controlled systems) and references on the subject.