An Adaptive Procedure for the Global Minimization of a Class of Polynomial Functions

: The paper deals with the problem of global minimization of a polynomial function expressed through the Frobenius norm of two-dimensional or three-dimensional matrices. An adaptive procedure is proposed which applies a Multistart algorithm according to a heuristic approach. The basic step of the procedure consists of splitting the runs of different initial points in segments of ﬁxed length and to interlace the processing order of the various segments, discarding those which appear less promising. A priority queue is suggested to implement this strategy. Various parameters contribute to the handling of the queue, whose length shrinks during the computation, allowing a considerable saving of the computational time with respect to classical procedures. To verify the validity of the approach, a large experimentation has been performed on both nonnegatively constrained and unconstrained problems.


Introduction
Global optimization over continuous spaces is an important field of the modern scientific research and involves interesting aspects in computer science.Given an objective function f modeling a system of points which depends on a set of parameters, the task is to determine the set of parameters which optimizes f .For simplicity, in most cases the optimization problem is stated as a minimization problem.
For high dimensional problems, numerical methods are suggested which construct converging sequences of points by employing local optimizers, for example search methods.In general, procedures of this type are stochastic and rely on intensive computation to explore the solution space, without too deep enquires concerning convergence proofs.Roughly speaking, there are two main approaches: a sampling approach and an escaping approach.Examples of the sampling approach are Pure Random Search algorithms [1], where an extensive sampling of independent points is considered, and Multistart algorithms [2,3], which apply local optimizers to randomly generated starting points; then the best among the found local optima is considered the global optimum.Sampling algorithms have been shown [3] to converge in infinite time with probability 1, but in practice there is no guarantee that the best local solution found in a finite amount of computation is really the global optimum.In the escaping approach, various methods are devised for escaping from an already found local optimum, reaching other feasible points from which to restart a new local optimizer (see for example the Simulated Annealing [4]).
In practice, most algorithms succeed for small and medium-sized problems, but the chance of finding the global optimum worsens considerably when the number of the dimensions increases, due to the necessary limitation of the sampling set size.In any case, the computation of a local solution can be very expensive and the convergence of differently started local procedures to the same local optimum is frequent and should be avoided.
In this paper we propose an adaptive strategy for finding the global minimum of a polynomial function expressed through the Frobenius norm of matrices and tensors.The considered problem is outlined in Section 2. The Multistart algorithm, on which the adaptive strategy is based, is described in Section 3. It implements a particularly tailored local minimization procedure with an ad-hoc stopping condition.The basic step of the adaptive strategy is implemented through a priority queue and consists in splitting the runs starting from different initial points in segments of fixed length.The processing order of the various segments is interlaced, discarding those which appear less promising.The adaptive procedure is described in Section 4. Finally, in Section 5 the validity of the adaptive procedure is verified through a large experimentation on both nonnegatively constrained and unconstrained problems.

The Problem
A widely studied problem in literature is the one of nonlinear global minimization where D ⊆ R N is a region determined by a set of constraints and f : D → R + is a polynomial function of degree d ≥ 2 which models the problem's objective.Classical examples are D = R N + of nonnegativity constraints, and D = R N when no constraint is given.Problem (1) is NP-hard [5].In this paper we consider specifically objective functions expressed through the Frobenius norm of an array A whose elements are low degree polynomials in the entries of x: We examine in particular the cases where A(x) is a two-dimensional or a three-dimensional matrix.For example, in the two dimensional case, a problem considered in the experimentation looks for two low-rank matrices W ∈ R m×k given a nonnegative matrix M ∈ R m×n + and an integer k < n.Problem (3), known as Nonnegative Matrix Factorization (NMF), was first proposed in [6] for data analysis and from then on has received much attention.It is a particular case of Problem (1) with f (x) as in (2), N = (m + n)k, D = R N + and x a vectorization of the pair (W, H).In a columnwise vectorization, A(x) is the matrix of elements Since f is a polynomial, the gradient ∇ f and the Hessian ∇ 2 f of f are available.From a theoretical point of view, a simple way to solve Problem (1) in the unconstrained case would be to look for the stationary points x where ∇ f (x) = 0 and choose whichever x minimizes f , but this way does not appear to be practical when N is large, even if d is not too large, because finding a global minimum can be a very difficult task when f has many local minima.For this reason, numerical methods have been proposed which do not try to solve ∇ f (x) = 0. Anyway, many numerical methods take advantage of derivative information to improve the computation.
When the dimension N is very large, also numerical methods which require at the same time O(N) active points might be impracticable.This is the case, for example, of methods such as Nelder-Mead algorithm [7] which works on a polytope of N + 1 vertices, or Differential Evolution [8] which generates at each iteration a new population of more than N + 1 points, or Simulated Annealing [9] which for the continuous minimization uses a downhill simplex of N + 1 points.In this paper we propose a method which allows solving Problem (1) also when N is very large.To validate its performances against other methods, it is necessary to restrict our experimentation to dimensions where the other methods could still compute reliable solutions.

A Multistart Algorithm
Multistart algorithms are widely used heuristics for solving Problem (1) [2,3].Over the years, many variants and generalizations have been produced.We consider here its basic form which consists of the following steps: • start with a fixed number q of independent initial points x (r) 0 , r = 1, . . ., q, randomly generated.

•
For r = 1, . . ., q, a local optimizer, say L, is applied to x (r) 0 to find an approximation of a local minimum.We assume L to be an iterative procedure which, in the constrained case, takes on the nonnegativity of x and denote by x (r) ν , ν = 1, 2, . .., the sequence of points computed until a suitable stopping condition is satisfied.The corresponding function values f (x (r) ν ) estimate the quality of the approximation.Both the chosen procedure L and the stopping condition employed by L are of crucial importance.Ideally, the stopping condition should verify whether the local minimum has been reached within a given tolerance , but of course more practical tests must be used.

•
When all the q points x (r) 0 have been processed, the point x (r) ν with the smallest function value f (x (r) ν ) is selected as an approximation of the global minimum.It is evident that the minimum found in this way is not guaranteed to suitably approximate the global minimum, even if a large number of starting points is considered.
In the experimentation we have implemented a version of the basic Multistart algorithm, here denoted by MS, as reference.The chosen local optimizer L is particularly suitable for the problems we are considering and exploits the differentiability of the function f (x).Its execution time turns out to be much smaller than what is generally expected from a library method, which relies on general purpose procedures as local optimizers and does not exploit the differentiability information.The stopping condition we have implemented monitors the flatness of the approximation by assuming x (r) ν as an acceptable approximation of a local minimum when (max g − min g)/mean g ≤ , where g = f (x (r) Of course, if an index ν exists such that f (x (r) ν ) = 0, the algorithm stops and condition (4) is not tested.The tolerance is set equal to some power of the machine eps.As usual, a maximum number d max of allowed iterations is imposed to each run.
As an example, we have applied MS to one of the problems described in Section 5.3, namely problem 10 × 5 × 15 (5,3), generating q = 5 initial points.Figure 1a shows the log plot of the histories f (x (r) ν ), ν = 1, 2, . .., computed by MS for r = 1, . . ., q using the stopping condition (4) with = 10 −12 .All the sequences converge to the same local minimum, but one (plotted with black solid line) requires 300 iterations and appears to be faster than the others which require from 400 to 500 iterations.The number of iterations of the overall execution is 2100.
Letting the local algorithm run until convergence for all the starting points is a time-consuming procedure, in particular when the iterates obtained from many different starting points lead to the same local minimum.A reasonable approach, that we presently propose, modifies this procedure adaptively.It is based on an efficient premature recognition of the more promising runs of the local algorithm and interrupts with high probability the runs which may lead to already found local minima or to local minima of larger value than those already found.

An Adaptive Procedure
Because of the large computational cost of the strategy described in the previous section, the straightforward implementation can be profitably applied only to small problems.For this reason we propose here a heuristic approach, which we denote by the name AD, suitable for large problems whose basic step consists in the splitting of the runs in segments of a fixed length λ.Each segment, identified by the index r, r = 1, . . ., q, stores the partial solution produced at the end of the λ iterations, which will be used as the starting point for the next run with the same index r.The processing order of the various segments is modified by interlacing the computation of segments with different indices r.This strategy requires segments corresponding to different r's to be available at any moment of the computation.The segments, which appear to be the more promising ones depending on the behavior of the computed function values, are carried on and the other segments are discarded.
The strategy is accomplished using a priority queue Q whose items have the form where • r ∈ {1, . . ., q} is the index of the segment; it identifies the seed chosen for the random generation of the initial point following a uniform distribution between 0 and 1.
is the number of iterations computed until now by L in all the segments with the same index r, i.e., starting with x (r) h is the partial solution computed by L at the end of the current rth segment.
h ) is the array containing the function values of the last three points computed by L in the current rth segment.
) is an estimate of the decreasing rate of the function.
is the priority of the item, which rules the processing order of the various segments according to the minimum policy.
The belonging of an element to an item Y is denoted by using the subscript (Y), for example r (Y) denotes the index of item Y.
During the computation three global quantities, g min , x min , and r min , play an important role: • g min is the minimum of all the function values computed so far relative to all the items already processed • x min is the computed point corresponding to g min and r min is the index of the item where g min has been found.
The priority is defined by χ = γ + ξ, where γ = log 10 ( f (x)/g min ) measures the accuracy of the last computed iterate in the current segment and ξ = h/λ measures the age of the computation of the rth seed.Between two items which have the same age, this expression favors the one with the better accuracy and, between two items which have comparable accuracies, it favors the younger one, that is the quicker one.The inclusion of the age into the priority is important to avoid that only the best item is processed and all the other items are ignored.In fact, with the increase of the age, also items which had been left behind take their turn.
At the beginning the initial points x (r) 0 , r = 1, . . ., q, are randomly generated and g min = min r=1,...,q f (x (r) 0 ) is set.The initial priorities are defined by χ (r) = log 10 ( f (x (r) 0 )/g min ) and the queue is formed by the q items {r, h, where h = δ = 0 and g is a void array.
The length of the queue, i.e., the number of items in Q, is found by calling the function Length.At the beginning the length is q and it is modified by the functions: A further function Maxdelta(Q) returns the largest value among the quantities δ of all the items in Q.
During the computation the following quantities deriving from g = g (Y) , h = h (Y) and δ = δ (Y) are referred (their expressions have been tested by an ad hoc experimentation): • ϑ = (max g − min g)/mean g measures the flatness of the last computed values of f according to (4), • ϕ = δ/mean g measures the decreasing rate of f .If ϕ < 0 the last computed iteration increases over the preceding one.Only small increases, corresponding to limited oscillations, are tolerated, measures the discarding level.The larger c, the higher the probability of discarding the item.Among items which have comparable values f (x), the older one has greater probability to be discarded.
Three other functions are needed: The local minimization algorithm L is applied for the predefined number λ of iterations (in the experiments we assume λ = 10) and returns x, g, the minimum g of the function values of the current segment, its corresponding point x, the number ν of performed iterations and the decreasing rate δ.Since algorithm AD is based on the splitting into segments of the run corresponding to a single starting point, the local optimizer L needs to be stationary, in order to get a computation which gives the same numerical values as if a single run had been performed without segmentation.

•
The following Boolean function Control verifies whether an item is promising (i.e., must be further processed) or not.For notes (a)-(e) see the text.
return True; The decision on which course to choose depends mainly on the flatness, but other conditions are also taken into consideration as described in the following notes: (a) large oscillations are not allowed, (b) the flatness level has been reached, (c) the selected item decreases more quickly than other items in the queue, (d) the selected item has the same index r of the item where g min has been found, (e) if none of the preceding conditions holds, the selected item is discarded with probability given by its discarding level.
After the initialization of the queue, the adaptive process AD evolves as follows: an item Y is extracted from the queue according to its priority χ (Y) .When enough iterations (6λ iterations in the experiments) have been performed for stabilizing the initial situation, the function Control is applied, and if Y is recognized promising, the function L is applied to point x (Y) .A new item is so built and inserted back into Q.If the computed f ( x) is smaller than g min , then g min , x min and r min are updated.Otherwise, if Y is not promising, no new item is inserted back into Q, i.e., Y is discarded.At the end, x min and f (x min ) are returned.A bound d max is imposed on the number of iterations of the overall execution.
Our proposed algorithm is implemented in the following function Chi.
function Chi q, λ, , d max ; for r = 1, . . ., q let x (r) 0 = Random (N); initialize r min , x min , g min ,; initialize Q according to (5); end if end while return x min and f (x min ); Figure 1b shows the log plots of the histories f (x (r) ν ) obtained by applying AD to the same problem of the example considered in Section 3 with MS, using also the same q = 5 initial points.Only the sequence plotted with the black solid line survives in the queue, while all the other sequences are discarded from the queue at different steps of the computation.At the end 800 overall iterations have been performed, compared with the 2100 of MS, as it is evident from the two figures, pointing out the outperformance of AD on MS.

Experimentation
The experimentation is performed with a 3.2 GHz 8-core Intel Xeon W processor machine using IEEE 754-64 bit double precision floating point format (with eps = 2.22 × 10 −16 ) and is carried out on the three nonnegatively constrained test problems described in Sections 5.1-5.3 and on the unconstrained test problem described in Section 5.4.The following methods are applied to each test problem.

•
Class 1 methods: library versions of Nelder-Mead method (denoted in the tables by the name NM), Differential Evolution (denoted DE) and Simulated Annealing (denoted SA).These methods are run for at most 200 iterations and are applied for comparison purpose.

•
Class 2 methods: the Multistart procedure described in Section 3 (denoted MS) and the adaptive procedure described in Section 4 (denoted AD).Two values for the tolerance in (4) are used by MS and AD, namely 1 = 10 −8 ∼ eps 1/2 and 2 = 10 −12 ∼ eps 3/4 .The names MS1 and AD1 refer to 1 and the names MS2 and AD2 refer to 2 .The number of randomly generated initial points is q = 10.
The bound to the number of iterations for each run of MS is d max = 1000, while the overall number of iterations of AD is bounded by d max = 5000.
With Class 2 methods the procedure L used as local minimizer differs according to the presence of constraints or not.In the test problems two different situations occur:

•
The problem has a region D related to nonnegativity constraints.In this case, a block nonlinear Gauss-Seidel scheme [10] is implemented as L, in order to find constrained local minima of f (x).The vector x is decomposed into µ disjoint subvectors, cyclically updated until some convergent criterium is satisfied.The convergence of this scheme, more specifically called in our case Alternating Nonnegative Least Squares (ANLS), is analyzed in [10].We apply it for the particular case of Problem (1) with f (x) as in (2) in the two-dimensional case (µ = 2 in Sections 5.1 and 5.2) and in the three-dimensional case (µ = 3 in Section 5.3).

•
The problem is unconstrained (Section 5.4).In this case the local minimizer L is a library procedure based on Levenberg-Marquardt method.
The aim of the experimentation is: • To analyze, in terms of both execution time and accuracy, the effects on the performance of Class 2 methods of the two chosen values for the tolerance used in the stopping condition (4).Of course, we expect that a weaker request for the tolerance ( 1 = 10 −8 ) would result in lower accuracy and time saving than a stronger request ( 2 = 10 −12 ).This issue is experimentally analyzed by comparing the performances of method MS1 with those of method MS2 and the performances of method AD1 with those of method AD2.

•
To compare the performances of the adaptive and non adaptive procedures of Class 2. We wonder whether the adaptive procedure, with its policy of discarding the less promising items, can overlook the item which eventually would turn out to be the best one.This issue is experimentally analyzed by comparing the performances of method MS1 with those of method AD1 and the performances of method MS2 with those of method AD2.

•
To compare the performances of Class 1 and of Class 2 methods.Naturally, we expect that in the case of constrained problems the execution times of Class 1 methods, which implement general purpose procedures, would be larger than those of Class 2, specifically tailored to functions of type (2).Beforehand, it is not clear whether the latter ones would pay the time saving with a lower accuracy.
For each problem we assume that an accurate approximation of the solution f * of (1) is known.In the tables two performance indices are listed: • the execution time (denoted time) in seconds.In order to get a more reliable measure for this index, the same problem is run 5 times with each starting point and the average of the measures is given in the tables as time.
• the quantity lerr = − log 10 ( f (x min ) − f * ), rounded to the first decimal digit, as a measure of accuracy.Approximately, lerr gives the number of exact decimal digits obtained by the method.The larger lerr, the more accurate the method.When a method produces a very good value f (x min ) such that f (x min ) − f * ∼ eps, the case is marked by the symbol +.
We give now a description of the test problems together with their numerical results.

The Nonnegative Matrix Factorization
The first test problem is the NMF problem (3) already outlined in Section 2, i.e., given M ∈ R m×n + and k < n, we look for the basis matrix W ∈ R m×k + and the coefficient matrix for ν = 1, 2, . .., are computed.
Although the original problem (3) is nonconvex, subproblems ( 6) are convex and can be easily dealt with.In the experimentation we use for (6) the Greedy Coordinate Descent method (called GCD in [11]), which is specifically designed for solving nonnegative constrained least squares problems.The attribute "Greedy" refers to the selection policy of the elements updated during the iteration, based on the largest decrease of the objective function.In our tests this method has shown to be fast and reliable.The corresponding code can be found as Algorithm 1 in [11].
For the experimentation, given the integers m, n and h ≤ min{m, n}, two matrices A ∈ R m×h The dimension of the solution space is N = (m + n) k.In the cases k = h we expect f * = 0.The results are shown in Table 1.In the table we note that Class 1 methods, which use general purpose local optimizers, are more time consuming than Class 2 methods, as expected.Moreover, this gap increases with the dimension N.
Averagely, Class 1 methods are more accurate than Class 2 methods.Within the Class 2, each adaptive method is faster than the corresponding non adaptive one, as can be seen by comparing the times in column AD1 with those in column MS1 and the times in column AD2 with those in column MS2.Obviously, MS1 and AD1 give lower accuracies than MS2 and AD2 respectively, due to a weaker request on the flatness imposed by (4).By comparing the measures lerr in column AD2 with those in column MS2, we see that AD2 does not lose accuracy with respect to MS2.

The Symmetric NMF
In some applications, matrix M ∈ R n×n + is symmetric, then the objective function to be minimized becomes F , under the constrain W ≥ O.This problem is known as symmetric NMF (SymNMF).As suggested in [12], the solution W ∈ R n×k + can be found through a nonsymmetric penalty problem of the form min α being a positive parameter which acts on the violation of the symmetry.Hence, applying ANLS, we solve alternatively the two subproblems In [12], the sequence of penalizing parameters is constructed by setting α ν = β ν max M, with β 0 = 1 and β ν is modified according to a geometric progression of the form β ν = ζ ν with the fixed ratio ζ = 1.01.We suggest instead to let β ν be modified adaptively, as shown in [13].Symmetric NMF problems arise naturally in clustering.Given n points p i in an Euclidean space and the number k of the required clusters, the clustering structure is captured by a symmetric matrix M, called similarity matrix, whose elements m i,j represent the similarity between p i and p j .These similarity values are computed through the Gaussian kernel σ 2 , for i = j, and e i,i = 0, (7) where σ is a suitable scaling parameter, followed by the normalized cut The experimentation deals with two sets C 1 and C 2 of n points randomly generated in R 2 (see Figure 2 for n = 50).
Set C 1 is generated with n = 10, 50, 100 and set C 2 is generated with n = 50, 100, 200.The number of the required clusters is fixed to k = 5 for all the cases.The dimension of the solution space is N = 2nk.The results are shown in Table 2. On these problems, Class 2 methods outperform Class 1 methods in both the computational time and the accuracy.Moreover, by comparing the measures lerr in column AD1 with those in column MS1 and the measures lerr in column AD2 with those in column MS2, we see that, for each value of the tolerance , the adaptive method and the corresponding non adaptive one share the same accuracy.

The Nonnegative Factorization of a 3rd-Order Tensor
Let T be a 3rd-order tensor, i.e., a three-dimensional array whose elements are t i,j,k , with i = 1, . . ., m, j = 1, . . ., n and k = 1, . . ., .A common way to represent graphically T is through its (frontal) slices T k , k = 1, . . ., , where T k is the matrix obtained by fixing to k the third index.
If T can be expressed as the outer product of three vectors u, v and z, i.e., T = u • v • z, where • denotes the vector outer product, T is called rank-one tensor or triad.
Among the many factorizations of a tensor defined in the literature, we consider here the CANDECOMP/PARAFAC (in the following CP) decomposition (see [14]) into the sum of triads, which extends in a natural way the decomposition of matrices in sum of dyads.Given an integer ρ, three matrices U ∈ R m×ρ , V ∈ R n×ρ and Z ∈ R ×ρ are computed such that where u s , v s and z s are the sth columns of U, V and Z (see [14,15]).Elementwise, (8) is written as The objective function of the CP decomposition of T is The smallest integer ρ for which (8) holds with equality, i.e., min f (U, V, Z) = 0, is the tensor rank of T.
By flattening its slices, the elements of T can be arranged into a matrix.For example, one can leave the ith index and linearize the two other indices, obtaining a matrix which is denoted by T (i) .Thus, T (1) ∈ R m×n , T (2) ∈ R n×m and T (3) ∈ R ×m n are the matrices whose elements are These matrices are written in a compact way by using the notation of the Khatri-Rao product, defined as follows.Given two matrices A ∈ R m×h and B ∈ R n×h , the Khatri-Rao product A B is the where a i and b i are the ith columns of A and B respectively, and ⊗ denotes the Kronecker product of two vectors.With this notation (8) becomes In the experimentation we consider the nonnegatively constrained CP decomposition and minimize (9) by applying ANLS as the local minimizer to the three matrices given in (10).Having fixed nonnegative U 0 and V 0 , the computation proceeds alternating on three combinations as follows The dimension of the solution space is N = (m + n + ) ρ.In the cases ρ = h we expect f * = 0.The results are shown in Table 3. From the point of view of the computational time, the performances of Class 1 and Class 2 methods agree with those seen in the other tables.From the point of view of the accuracy, Class 2 methods with the stronger request 2 for the tolerance are competitive with Class 1 methods.

Tensor Factorization for the Matrix Product Complexity
The theory of matrix multiplication is strictly related to tensor algebra.In particular, the concept of tensor rank plays an important role in the determination of the complexity of the product of matrices [16].We consider here the case of square matrices.
Given A, B ∈ R n×n , the elements of C = A B can be written in the form and K r,s ∈ R n 2 ×n 2 is expressed by the Kronecker product K r,s = I n ⊗ H r,s where the n × n matrix H r,s has elements h i,j = δ r,i δ s,j .The associated 3rd order tensor T ∈ R n 2 ×n 2 ×n 2 is the one whose slides are K r,s , i.e., the elements of T are t i,j,k = (K r,s ) i,j , with k = n(r − 1) + s.Most elements of T are null.A reduction of the representation is obtained when A is triangular.For example, in the case of an upper triangular matrix A of size n, a can be represented by the n(n + 1)/2 elements a T = [a 1,1 , a 1,2 , a 2,2 , . . ., a 1,n , . . ., a n,n ], with a resulting reduction in the first dimension of the matrices K r,s .If B is a full matrix, the associated tensor has n 2 × n(n + 1)/2 × n 2 elements.
A fact of fundamental relevance is that the complexity of the computation of the product A B, i.e., the minimum number of nonscalar multiplications sufficient to compute the product, is equal to the tensor rank of T. In the experimentation we consider problems of the form (9) for a given integer ρ.If the minimum is equal to zero, then the tensor rank of T is lower than or equal to ρ.The difficulty which characterizes these problems is due to the presence of large regions of near flatness of f , leading to a slow convergence to local minima.The difficulty increases when the parameter ρ decreases.
The dimension of the solution space is N = 3n 2 ρ when both A and B are full, and is N = (2n 2 + n(n + 1)/2)ρ when one of the matrix is triangular.In the tests we consider the following cases with matrices A and B of size n = 3 and ρ such that f * = 0. Problems TF3(ρ): A is a triangular matrix and B is a full matrix, with ρ = 16, 17, 18, Problems FF3(ρ): A and B are full matrices, with ρ = 24, 25, 26, 27.
The results are shown in Table 4.The low accuracies of Class 1 methods point out the effective difficulty of the problems.In one case a method of Class 1 fails to give an acceptable solution and this fact is marked by the symbol −.In this case there is not a great discrepancy between the computational times of Class 1 methods and non adaptive Class 2 methods, as can be seen by comparing the times in columns NM, DE and SA with those in columns MS1 and MS2.Typically, the adaptive methods lower the computational time while maintaining better accuracies.Only for the first problem, a much better accuracy of AD2 is paid by a much larger computational time.
From the point of view of the accuracy, Class 2 methods definitely outperform Class 1 methods.

Summary of the Results
First, we analyze the effect of the chosen tolerance on Class 2 methods.The experiments confirm the expectation that stronger tolerance requests, which pay in terms of larger execution time, in general lead to more accurate results.The comparison of the performances of Class 1 and Class 2 methods shows that the accuracy of MS2 and AD2 does not appear to suffer too much with respect to the accuracy of Class 1 methods, in spite of the fact that they enjoy a much smaller execution time.
In order to summarize the results of the experimentation, we plot in Figure 3 the average of all the values of lerr, taken as a measure of the methods accuracy on all the problems, versus the averaged time.The larger the ordinate value, the more accurate the method.From the figure we see that, on average, the adaptive procedure AD2 not only is faster, but also gives comparable accuracy with MS2.This is due to the fact that the computational effort of AD2 is focused where it is more useful and is not wasted.
To study how time depends on N, we need data corresponding to large values of N. A problem for which it is possible to obtain these data is the symmetric NMF applied to clustering.In fact, in this case it is easy to construct problems of the same type with increasing dimensions.
Figure 4 gives an idea of the time performance of the methods when they are applied to the cluster problem C 2 .Class 2 methods have been run with increasing dimensions until n = 2000, corresponding to size N = 20,000.The execution times of AD2 (dashed line) and MS2 (solid line) are plotted together with the times of NM (solid line), chosen as the representative of Class 1 for N ≤ 2000 (larger dimensions for Class 1 methods could not be considered).For the lower dimensions, the use of the adaptive method does not give advantages, because the initialization overhead (due to the 6λ iterations performed for stabilization before starting the adaptive strategy).This is shown by the overlap of the plots of AD2 and MS2 in the initial tract.The lines which appear in the log-log plot suggest that the asymptotical behavior of the computational cost for increasing N is of the form α N β .Very similar values of β suggest that the costs of MS2 and NM have roughly the same order, while the cost of AD2 has a lower order.It is evident that the multiplicative constant α, whose log is readable from the vertical offset of the lines, of MS2 and AD2 is very lower than the one of NM.

Conclusions
In this paper an adaptive strategy, called AD, has been introduced for finding the global minimum of a polynomial function expressed through the Frobenius norm of matrices and tensors.The strategy is based on a heuristic approach to the Multistart algorithm.To compare the proposed method with some classical library methods, an extensive experimentation has been performed on both nonnegatively constrained and unconstrained problems, which include the nonnegative factorization of matrices and tensors and the determination of the tensor rank associated to the matrix product complexity.The proposed method allows a considerable saving of the computational time and comparable accuracy with respect to classical procedures.

Figure 1 .
Figure 1.(a) History plots of MS; (b) History plots of AD.

+
and B ∈ R n×h are generated with random elements uniformly distributed over [0, 1] and the matrix M = A B T is constructed.The test problem (3) requires computing the NMF of M, i.e., given an integer k ≤ h, two matrices W ∈ R m×k + and H ∈ R n×k + such that M ≈ W H T are to be found.The following cases are considered m = 20, n = 10, h = 5, with k = 3 and k = 5, m = 40, n = 30, h = 10, with k = 5 and k = 10.

Figure 3 .
Figure 3. Average lerr (taken as a measure of accuracy) versus time.

Figure 4 .
Figure 4. Log-log plot of the execution times of AD2, MS2, and NM versus N.

Table 1 .
Execution times and accuracies for nonsymmetric Nonnegative Matrix Factorization (NMF) problems.

Table 2 .
Execution times and accuracies for symmetric Nonnegative Matrix Factorization (NMF) problems.

Table 4 .
Execution times and accuracies for the tensor factorization associated to matrix product complexity.