1. Introduction
Multistart methods [1,2,3,4,5] are approaches used to increase the probability of finding a global optimum instead of a local optimum in global optimization problems. This family of methods consists of comparing many local optima obtained by running a chosen optimization procedure from different starting configurations. Alternatively, multistart is adopted when the optimization task requires a set of local optima. For these reasons, multistart methods can be applied both to unconstrained and constrained optimization, both to derivative-free and derivative-based procedures, and both to metaheuristic and theoretically grounded optimization methods.
In global optimization problems, other approaches include population-based methods, like Genetic Algorithms (see [6,7,8,9]) and Swarm-Intelligence-based methods (e.g., Particle Swarm Optimization methods; see [10,11,12]). These nature-inspired methods are somewhat similar to multistart methods, since they are based on a set of starting guesses; however, this set is used as a swarm of interacting agents that move in the domain of the loss function, looking for a global minimizer. Even if these latter methods are very efficient, they suffer from a lack of understanding of their convergence properties, and it is not clear how much of their efficiency is preserved when they are applied to large-scale problems [13]. Therefore, in high-dimensional domains, multistart methods based on optimization procedures with well-known convergence properties can be preferred.
In the literature, the term multistart is sometimes used to denote general global search methods; e.g., a random search or a restart method can be defined as multistart (see [14]), or vice versa (see [15]). Nonetheless, in this work and according to [5], we denote as multistart methods only those methods that consist in running the same optimization procedure with respect to a set of $N\in \mathbb{N}$ distinct starting points ${\mathit{x}}_{1}^{\left(0\right)},\dots ,{\mathit{x}}_{N}^{\left(0\right)}\in {\mathbb{R}}^{n}$. Obviously, this approach can be refined into more specialized and sophisticated methods (e.g., [15,16,17,18]), but it is also useful in its basic form; indeed, it is implemented even in the most valuable computational frameworks (e.g., see [19]). However, the main difficulty of using a multistart approach is that the number N of starting points typically must be quite large, due to the unknown number of local minima. For this reason, parallel computing is very important in this context, and the real exploitation of multistart seems to be restricted to specific algorithms that are able to take advantage of the computer hardware for parallelization [16,17,20,21]. According to [21,22], three main parallelization approaches can be identified: (i) parallelization of the loss function computation and/or the derivative evaluations; (ii) parallelization of the numerical methods performing the linear algebra operations; and (iii) modifications of the optimization algorithms in a manner that improves the intrinsic parallelism (e.g., the number of parallelizable operations). In this work, we focus on the parallelization schemes of the first case (i), specifically on the parallelization of the derivative evaluations, because they are typically used for generating general-purpose parallel software [21].
The main drawback of parallelization for multistart methods is the difficulty of finding a trade-off between efficiency and easy implementation, especially for solving optimization problems of moderate dimensions on non-High-Performance-Computing (non-HPC) hardware. Typically, the simplest parallelization approach consists in distributing the computations among the available machine workers (e.g., via routines such as [23,24]); however, this is very rarely also the most efficient method. Alternatively, a parallel program specifically designed for the optimization problem and for the hardware can be developed, but the time spent writing this code may not be worth it. Of course, the cost of the parallelization of the derivative evaluations also depends on the computation methods used; when the gradient of the loss function is not available, Finite Differences are typically adopted in the literature but, as observed in [22], the advent of Automatic Differentiation in recent decades has presented new interesting possibilities, allowing for the adoption of this technique for gradient computation in optimization procedures (e.g., see [25,26,27,28,29]).
The reverse Automatic Differentiation (AD), see [30] (Ch. 3.2), was originally developed by Linnainmaa [31] in the 1970s, to be rediscovered by Neural Network researchers (who were not aware of the existence of AD [32]) in the 1980s under the name of backpropagation [33]. Nowadays, reverse AD characterizes almost all the training algorithms for Deep Learning models [32]. AD is a numerical method for computing the derivatives of a function through its representation as an augmented computer program made of a composition of elementary operations (arithmetic operations, basic analytic functions, etc.) for which the derivatives are known. Then, using the chain rule of calculus, the method combines these derivatives to compute the overall derivative [32,34]. In particular, AD is divided into two subtypes: forward and reverse AD. Forward AD, developed in the 1960s [35,36], is conceptually the simplest one; it computes the partial derivative $\partial f/\partial {x}_{i}$ of a function by recursively applying the chain rule of calculus with respect to the elementary operations of f. On the other hand, reverse AD [31,33] “backwardly” reads the composition of elementary operations constituting the function f; then, still exploiting the chain rule of calculus, it computes the gradient $\nabla f$. Both AD methods can be extended to vector functions $\mathit{F}:{\mathbb{R}}^{n}\to {\mathbb{R}}^{m}$ and used for computing the Jacobian. Due to the nature of the two AD methods, reverse AD is usually more efficient if $n>m$, because it can compute the whole gradient of the j-th output of $\mathit{F}$ at once; otherwise, if $n<m$, forward AD is the more efficient method, because it computes the partial derivatives with respect to ${x}_{i}$ for all the outputs of $\mathit{F}$ at once (see [30] (Ch. 3)). The characteristic of reverse AD described above is the basis for the new multistart method proposed in this work, where we focus on the unconstrained optimization problem of a scalar function $f:{\mathbb{R}}^{n}\to \mathbb{R}$.
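The mechanics of reverse AD described above can be sketched in a few lines of code. The following toy tape-based implementation is an illustrative assumption of ours (it is not the framework used in the paper): each `Var` records its parent nodes and the local partial derivatives of the elementary operation that produced it, and `backward()` sweeps the computational graph in reverse topological order, accumulating adjoints via the chain rule. One backward sweep yields the whole gradient of a scalar function, regardless of the number of inputs.

```python
# Minimal scalar reverse-mode AD sketch (hypothetical toy code):
# each Var stores its parents together with the local partial derivatives;
# backward() replays the graph in reverse, accumulating adjoints.
class Var:
    def __init__(self, value, parents=()):
        self.value = value          # result of the elementary operation
        self.parents = parents      # tuples (parent Var, local derivative)
        self.adjoint = 0.0          # d(output)/d(self), set by backward()

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    __radd__ = __add__

    def __sub__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value - other.value, ((self, 1.0), (other, -1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    __rmul__ = __mul__

    def backward(self):
        # topological order of the computational graph, then reverse sweep
        order, seen = [], set()
        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p, _ in v.parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.adjoint = 1.0
        for v in reversed(order):
            for p, d in v.parents:
                p.adjoint += v.adjoint * d

# One backward sweep gives the whole gradient of f(x, y) = x^2 y + y:
x, y = Var(3.0), Var(2.0)
f = x * x * y + y
f.backward()
print(x.adjoint, y.adjoint)   # df/dx = 2xy = 12.0, df/dy = x^2 + 1 = 10.0
```

Note that both partial derivatives are obtained with a single reverse sweep, which is exactly the property exploited in the rest of the paper.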
To improve the efficiency in exploiting the chain rule, the computational frameworks focused on AD (e.g., [37]) are based on computational graphs, i.e., they compile the program as a graph where operations are nodes and data “flow” through the network [38,39,40]. This particular configuration guarantees high parallelization and high efficiency (both for function evaluations and for AD-based derivative computations). Moreover, the computational graph construction is typically automatic and optimized for the available hardware, keeping the implementation relatively simple.
In this paper, we describe a new efficient multistart method where the parallelization is not explicitly defined, thanks to reverse Automatic Differentiation (see [30] (Ch. 3.2)) and the compilation of a computational graph representing the problem [39,40]. The main idea behind the proposed method is to define a function $G:{\mathbb{R}}^{nN}\to \mathbb{R}$ such that $G({\mathit{x}}_{1},\dots ,{\mathit{x}}_{N}):=f\left({\mathit{x}}_{1}\right)+\cdots +f\left({\mathit{x}}_{N}\right)$, for any set of $N\in \mathbb{N}$ vectors in ${\mathbb{R}}^{n}$, where $f:{\mathbb{R}}^{n}\to \mathbb{R}$ is the loss function of the optimization problem and N is fixed. Since the gradient of G with respect to $({\mathit{x}}_{1},\dots ,{\mathit{x}}_{N})\in {\mathbb{R}}^{nN}$ is equivalent to the concatenation of the N gradients $\nabla f\left({\mathit{x}}_{1}\right),\dots ,\nabla f\left({\mathit{x}}_{N}\right)$, by applying reverse AD to G we can compute the N gradients of the loss function very efficiently; therefore, we can implicitly and efficiently parallelize a multistart method of N procedures by running one AD-based optimization procedure for the function G. The main advantage of this approach is the good trade-off between efficiency and easy implementation. Indeed, nowadays the AD frameworks (e.g., [37]) compile functions and programs as computational graphs that automatically but very efficiently exploit the available hardware; then, with the proposed method, the user just needs to define the function G and the differentiation through reverse AD, obtaining a multistart optimization procedure that (in general) is more time-efficient than a direct parallelization of the processes, especially outside an HPC context. Moreover, the method can be extended by implementing it with tailored shallow Neural Networks, taking advantage of the built-in optimization procedures of the Deep Learning frameworks.
The work is organized as follows. In Section 2, we briefly recall and describe the AD method. In Section 3, we introduce a new formulation of the multistart problem that is useful for the exploitation of reverse AD, and we illustrate the corresponding time complexity estimates. Then, we show numerical results illustrating a context where the proposed method is advantageous with respect to a classic parallelization. In Section 4, we show how to implement the AD-based multistart approach using a tailored shallow Neural Network (NN) in those cases where the user wants to take advantage of the optimization procedures available in most Deep Learning frameworks. In particular, we illustrate an example where the new multistart method is implemented as a shallow NN and used to find three level set curves of a given function. Finally, some conclusions are drawn in Section 5.
3. Reverse Automatic Differentiation for Multi-Start
Let us consider the unconstrained optimization problem
$\underset{\mathit{x}\in {\mathbb{R}}^{n}}{\mathrm{min}}\phantom{\rule{4pt}{0ex}}f\left(\mathit{x}\right),\qquad (3)$
where $f:{\mathbb{R}}^{n}\to \mathbb{R}$ is a given function, and let us approach the problem with a gradient-based optimization method (e.g., the steepest descent). Moreover, we assume that f is a composition of elementary operations, so that it is possible to use reverse AD to compute its gradient.
The main idea behind the reverse AD-based multistart method is to define a function $G:{\mathbb{R}}^{nN}\to \mathbb{R}$ that is a linear combination of loss function evaluations at N vectors, i.e.,
$G\left(\mathit{\xi}\right)=G({\mathit{x}}_{1},\dots ,{\mathit{x}}_{N}):={\lambda}_{1}f\left({\mathit{x}}_{1}\right)+\cdots +{\lambda}_{N}f\left({\mathit{x}}_{N}\right),$
for each ${\mathit{x}}_{i}\in {\mathbb{R}}^{n}$, with ${\lambda}_{i}\in {\mathbb{R}}^{+}$ fixed, $i=1,\dots ,N$. Then, for any set of points $\{{\widehat{\mathit{x}}}_{1},\dots ,{\widehat{\mathit{x}}}_{N}\}\subset {\mathbb{R}}^{n}$ where f is differentiable, the gradient of G at $\widehat{\mathit{\xi}}={[{\widehat{\mathit{x}}}_{1}^{T},\dots ,{\widehat{\mathit{x}}}_{N}^{T}]}^{T}$ is equal to the concatenation of the vectors ${\lambda}_{1}\nabla f\left({\widehat{\mathit{x}}}_{1}\right),\dots ,{\lambda}_{N}\nabla f\left({\widehat{\mathit{x}}}_{N}\right)$. Therefore, using reverse AD to compute the gradient ${\nabla}_{\mathit{\xi}}G\left(\widehat{\mathit{\xi}}\right)$, we obtain a very fast method to evaluate the N exact gradients $\nabla f\left({\widehat{\mathit{x}}}_{1}\right),\dots ,\nabla f\left({\widehat{\mathit{x}}}_{N}\right)$, where $\mathit{\xi}$ denotes the domain variable of G in ${\mathbb{R}}^{nN}$. Hence, we can apply the steepest descent method to G, which actually corresponds to applying the steepest descent method N times to f. In the following, we formalize this idea.
Definition 1 ($\mathit{N}$-concatenation). For each fixed $N\in \mathbb{N}$ and any function $\mathit{f}:\Omega \subseteq {\mathbb{R}}^{n}\to {\mathbb{R}}^{m}$, a function $\mathit{F}:{\Omega}^{N}\subseteq {\mathbb{R}}^{nN}\to {\mathbb{R}}^{mN}$ is an N-concatenation of $\mathit{f}$ if
$\mathit{F}({\mathit{x}}_{1},\dots ,{\mathit{x}}_{N})=(\mathit{f}\left({\mathit{x}}_{1}\right),\dots ,\mathit{f}\left({\mathit{x}}_{N}\right))$
for each set of N vectors $\{{\mathit{x}}_{1},\dots ,{\mathit{x}}_{N}\}\subset \Omega $.
Notation 1. For the sake of simplicity, from now on vectors ${[{\mathit{x}}_{1}^{T},\dots ,{\mathit{x}}_{N}^{T}]}^{T}\in {\mathbb{R}}^{nN}$ will be denoted by $({\mathit{x}}_{1},\dots ,{\mathit{x}}_{N})$. Analogously, vectors returned by an N-concatenation of a function $\mathit{f}$ will be denoted by $(\mathit{f}\left({\mathit{x}}_{1}\right),\dots ,\mathit{f}\left({\mathit{x}}_{N}\right))$.
Definition 2 ($\mathit{\lambda}$-concatenation). For each fixed $\mathit{\lambda}\in {\mathbb{R}}^{N}$ and any function $\mathit{f}:\Omega \subseteq {\mathbb{R}}^{n}\to {\mathbb{R}}^{m}$, a function $\mathit{F}:{\Omega}^{N}\subseteq {\mathbb{R}}^{nN}\to {\mathbb{R}}^{mN}$ is a λ-concatenation of $\mathit{f}$ if
$\mathit{F}({\mathit{x}}_{1},\dots ,{\mathit{x}}_{N})=({\lambda}_{1}\mathit{f}\left({\mathit{x}}_{1}\right),\dots ,{\lambda}_{N}\mathit{f}\left({\mathit{x}}_{N}\right))$
for each set of N vectors $\{{\mathit{x}}_{1},\dots ,{\mathit{x}}_{N}\}\subset \Omega $.
Remark 1. Obviously, an N-concatenation of $\mathit{f}$ is a λ-concatenation of $\mathit{f}$ with $\mathit{\lambda}=\mathit{e}={[1,\dots ,1]}^{T}\in {\mathbb{R}}^{N}$.
Given these definitions, the idea for the new multistart method is based on the observation that the steepest descent method for G, starting from ${\mathit{\xi}}^{\left(0\right)}$ and with step-length factor $\alpha \in {\mathbb{R}}^{+}$, represented by
${\mathit{\xi}}^{\left(k+1\right)}={\mathit{\xi}}^{\left(k\right)}-\alpha {\nabla}_{\mathit{\xi}}G\left({\mathit{\xi}}^{\left(k\right)}\right),\qquad (6)$
is equivalent to
${\mathit{x}}_{i}^{\left(k+1\right)}={\mathit{x}}_{i}^{\left(k\right)}-\alpha \nabla f\left({\mathit{x}}_{i}^{\left(k\right)}\right),\quad i=1,\dots ,N,$
if the gradient of G is an N-concatenation of the gradients of f, i.e.,
${\nabla}_{\mathit{\xi}}G\left(\mathit{\xi}\right)=(\nabla f\left({\mathit{x}}_{1}\right),\dots ,\nabla f\left({\mathit{x}}_{N}\right)).$
Analogously, if the gradient of G is a $\mathit{\lambda}$-concatenation of the gradients of f, with $\mathit{\lambda}\in {\mathbb{R}}^{N}$, $\mathit{\lambda}>\mathit{0}$, then (6) is equivalent to
${\mathit{x}}_{i}^{\left(k+1\right)}={\mathit{x}}_{i}^{\left(k\right)}-\alpha {\lambda}_{i}\nabla f\left({\mathit{x}}_{i}^{\left(k\right)}\right),\quad i=1,\dots ,N,$
i.e., it is equivalent to applying the steepest descent method N times to f, with step-length factors $\alpha {\lambda}_{1},\dots ,\alpha {\lambda}_{N}$.
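The equivalence between one steepest-descent step on G, Equation (6), and N independent scaled steps on f can be checked numerically. The following is a minimal pure-Python sketch of ours, assuming for illustration only a quadratic $f(\mathit{x})=\Vert \mathit{x}\Vert^{2}$ with known analytic gradient (the paper's method would obtain the gradients via reverse AD instead).

```python
# Sketch: one steepest-descent step on G(ξ) = λ1 f(x1) + ... + λN f(xN)
# reproduces N independent steps on f with step-lengths α·λi.
# f(x) = Σ x_j^2, so ∇f(x) = 2x (analytic gradient, an assumption made
# only for this illustration).

def grad_f(x):
    return [2.0 * xj for xj in x]

def grad_G(xi_concat, lam, n):
    # λ-concatenation of the gradients of f, block by block
    out = []
    for i, l in enumerate(lam):
        out += [l * g for g in grad_f(xi_concat[i * n:(i + 1) * n])]
    return out

alpha, n = 0.1, 2
lam = [1.0, 0.5, 2.0]                             # λ > 0, one per procedure
points = [[1.0, -2.0], [0.5, 0.5], [-1.0, 3.0]]   # x1, x2, x3
xi = [c for p in points for c in p]               # ξ = (x1, x2, x3)

# One step of (6): ξ ← ξ − α ∇G(ξ)
g = grad_G(xi, lam, n)
xi_new = [x - alpha * gj for x, gj in zip(xi, g)]

# Equivalent multistart steps: x_i ← x_i − (α λ_i) ∇f(x_i)
points_new = [[x - alpha * l * gx for x, gx in zip(p, grad_f(p))]
              for p, l in zip(points, lam)]
flat = [c for p in points_new for c in p]

print(all(abs(a - b) < 1e-12 for a, b in zip(xi_new, flat)))   # True
```

The single update of ξ and the three per-point updates coincide, which is exactly the equivalence exploited by the proposed method.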
Remark 2 (Extension to other gradient-based methods). It is easy to see that we can generalize these observations to gradient-based optimization methods other than the steepest descent (e.g., momentum methods [41,42]). Let $\mathcal{M}$ be the function characterizing the iterations of a given gradient-based method, i.e., such that
${\mathit{x}}^{\left(k+1\right)}=\mathcal{M}\left({\mathit{x}}^{\left(k\right)};\alpha ,g\right)$
for each $k\in \mathbb{N}$, each step-length $\alpha \in {\mathbb{R}}^{+}$, and each objective function $g:{\mathbb{R}}^{n}\to \mathbb{R}$. Then, the iterative process for G, starting from ${\mathit{\xi}}^{\left(0\right)}$ and with respect to a gradient-based method characterized by $\mathcal{M}$, is
${\mathit{\xi}}^{\left(k+1\right)}=\mathcal{M}\left({\mathit{\xi}}^{\left(k\right)};\alpha ,G\right),\qquad (11)$
which is equivalent to the multistart approach with respect to f:
${\mathit{x}}_{i}^{\left(k+1\right)}=\mathcal{M}\left({\mathit{x}}_{i}^{\left(k\right)};\alpha ,f\right),\quad i=1,\dots ,N.\qquad (12)$
Moreover, we can further extend the generalization if we assume that, for each $\lambda \in {\mathbb{R}}^{+}$ and each $\alpha \in {\mathbb{R}}^{+}$, the following holds:
$\mathcal{M}\left(\mathit{x};\alpha ,\lambda g\right)=\mathcal{M}\left(\mathit{x};m\left(\lambda \right)\alpha ,g\right),$
where $m:{\mathbb{R}}^{+}\to {\mathbb{R}}^{+}$ is a fixed function. Indeed, if the gradient of G is a λ-concatenation of the gradients of f, Equation (12) changes into
${\mathit{x}}_{i}^{\left(k+1\right)}=\mathcal{M}\left({\mathit{x}}_{i}^{\left(k\right)};{\alpha}_{i},f\right),\quad i=1,\dots ,N,$
where ${\alpha}_{i}:=m\left({\lambda}_{i}\right)\alpha $, for each $i=1,\dots ,N$. For example, for the steepest descent we have $\mathcal{M}\left(\mathit{x};\alpha ,g\right)=\mathit{x}-\alpha \nabla g\left(\mathit{x}\right)$ and $m\left(\lambda \right)=\lambda $.
3.1. Using the Reverse Automatic Differentiation
To actually exploit reverse AD for a multistart steepest descent (or another gradient-based method), we need to define a function G whose gradient is a $\mathit{\lambda}$-concatenation of the gradients of f, the objective function of problem (3).
Proposition 1. Let us consider a function $f:{\mathbb{R}}^{n}\to \mathbb{R}$. Let $\mathit{F}:{\mathbb{R}}^{nN}\to {\mathbb{R}}^{N}$ be an N-concatenation of f, for a fixed $N\in \mathbb{N}$, and let ${L}_{\mathit{\lambda}}:{\mathbb{R}}^{N}\to \mathbb{R}$ be the linear function
${L}_{\mathit{\lambda}}\left(\mathit{y}\right):={\mathit{\lambda}}^{T}\mathit{y}={\lambda}_{1}{y}_{1}+\cdots +{\lambda}_{N}{y}_{N},$
for a fixed $\mathit{\lambda}\in {\mathbb{R}}^{N}$, $\mathit{\lambda}>\mathit{0}$. Then, for each set of $N\in \mathbb{N}$ points where f is differentiable, the gradient of the function $G:={L}_{\mathit{\lambda}}\circ \mathit{F}$ is a λ-concatenation of the gradient of f.
Proof. The proof is straightforward, since
${\nabla}_{\mathit{\xi}}G\left(\mathit{\xi}\right)={\nabla}_{\mathit{\xi}}\left({L}_{\mathit{\lambda}}\circ \mathit{F}\right)\left(\mathit{\xi}\right)=({\lambda}_{1}\nabla f\left({\mathit{x}}_{1}\right),\dots ,{\lambda}_{N}\nabla f\left({\mathit{x}}_{N}\right)),\qquad (14)$
for each set of $N\in \mathbb{N}$ points $\{{\mathit{x}}_{1},\dots ,{\mathit{x}}_{N}\}\subset {\mathbb{R}}^{n}$ where f is differentiable. □
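Proposition 1 can also be verified numerically. The sketch below, a toy check of ours, builds $G={L}_{\mathit{\lambda}}\circ \mathit{F}$ for an assumed simple $f({x}_{1},{x}_{2})={x}_{1}^{2}+3{x}_{2}$ and compares a central finite-difference gradient of G against the stacked blocks ${\lambda}_{i}\nabla f\left({\mathit{x}}_{i}\right)$.

```python
# Numerical check of Proposition 1 (toy sketch): the gradient of
# G = L_λ ∘ F stacks the blocks λ_i ∇f(x_i).
# f(x) = x1^2 + 3 x2 is an assumed toy choice for this illustration.

def f(x):
    return x[0] ** 2 + 3.0 * x[1]

def grad_f(x):                      # analytic gradient, for comparison only
    return [2.0 * x[0], 3.0]

def G(xi, lam, n):                  # L_λ(F(ξ)) = Σ_i λ_i f(x_i)
    return sum(l * f(xi[i * n:(i + 1) * n]) for i, l in enumerate(lam))

def fd_grad(func, x, h=1e-6):       # central finite differences
    g = []
    for j in range(len(x)):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        g.append((func(xp) - func(xm)) / (2 * h))
    return g

n, lam = 2, [1.0, 0.5, 2.0]
xi = [1.0, -1.0, 0.25, 2.0, -0.5, 0.75]          # ξ = (x1, x2, x3)

numeric = fd_grad(lambda z: G(z, lam, n), xi)
expected = [l * g for i, l in enumerate(lam)
            for g in grad_f(xi[i * n:(i + 1) * n])]
print(max(abs(a - b) for a, b in zip(numeric, expected)) < 1e-5)   # True
```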
Assuming that the expression of ${\nabla}_{\mathit{x}}f$ is unknown, a formulation such as (14), obtained from $G={L}_{\mathit{\lambda}}\circ \mathit{F}$, seems to give no advantages without AD and/or tailored parallelization. Indeed, the computation of the gradient ${\nabla}_{\mathit{\xi}}G$ through classical gradient approximation methods (e.g., Finite Differences) needs N different gradient evaluations, one for each of the points ${\mathit{x}}_{1}^{\left(k\right)},\dots ,{\mathit{x}}_{N}^{\left(k\right)}$, and each one of them needs $O\left(n\right)$ function evaluations (assuming no special structure for f); then, e.g., the time complexity of the gradient approximation with Finite Differences for G (denoted by ${\nabla}_{\mathit{\xi}}^{\mathrm{FD}}G$) is
$\mathrm{T}\left({\nabla}_{\mathit{\xi}}^{\mathrm{FD}}G\right)=O\left(N\cdot n\cdot \mathrm{T}\left(f\right)\right),\qquad (16)$
where $\mathrm{T}\left(\cdot \right)$ denotes the time complexity.
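The evaluation count behind estimate (16) is easy to make concrete. The sketch below, a toy illustration of ours with an assumed quadratic f, simply counts the calls to f made by a forward-difference gradient repeated over N points: each gradient costs $n+1$ evaluations, hence $N\left(n+1\right)=O\left(N\cdot n\right)$ evaluations in total.

```python
# Sketch of the cost behind (16): approximating ∇G by finite differences
# requires one f-evaluation per coordinate and per starting point.
# The counter below counts calls to a toy f (an assumed choice).

calls = 0

def f(x):
    global calls
    calls += 1
    return sum(v * v for v in x)

def fd_grad_f(x, h=1e-7):           # forward differences: n + 1 evaluations
    f0 = f(x)
    g = []
    for j in range(len(x)):
        xh = list(x)
        xh[j] += h
        g.append((f(xh) - f0) / h)
    return g

n, N = 5, 4
points = [[0.1 * (i + j) for j in range(n)] for i in range(N)]

for p in points:                    # ∇^FD G needs N separate FD gradients
    fd_grad_f(p)

print(calls)                        # N * (n + 1) = 24 evaluations of f
```

In contrast, the reverse-AD bound (17) below removes the factor n entirely.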
On the other hand, reverse AD permits the efficient execution of the gradient-based method (11), equivalent to the N gradient-based methods, even for large values of n and N. Indeed, from [30] (Ch. 3.3), for any function $g:{\mathbb{R}}^{n}\to \mathbb{R}$, the time complexity of a gradient evaluation of g with reverse AD (at a point where the operation is defined) is such that
$\mathrm{T}\left({\nabla}^{\mathrm{AD}}g\right)=O\left(\mathrm{T}\left(g\right)\right),\qquad (17)$
where ${\nabla}^{\mathrm{AD}}$ denotes the gradient computed with reverse AD. The following lemma characterizes the relationship between the time complexity of ${\nabla}_{\mathit{\xi}}^{\mathrm{AD}}G$ and the time complexity of f.
Lemma 1. Let f, $\mathit{F}$, and ${L}_{\mathit{\lambda}}$ be as in Proposition 1. Let $G:{\mathbb{R}}^{nN}\to \mathbb{R}$ be such that $G:={L}_{\mathit{\lambda}}\circ \mathit{F}$, and let us assume that
$\mathrm{T}\left(G\right)=O\left(N\cdot \mathrm{T}\left(f\right)\right).\qquad (18)$
Then, the time complexity of ${\nabla}_{\mathit{\xi}}^{\mathrm{AD}}G$ is
$\mathrm{T}\left({\nabla}_{\mathit{\xi}}^{\mathrm{AD}}G\right)=O\left(N\cdot \mathrm{T}\left(f\right)\right).\qquad (19)$
Proof. The proof is straightforward, due to (17) and (18). □
Comparing (19) with the time complexity (16) of the Finite Differences gradient approximation, we observe that reverse AD is much more convenient, both because it computes the exact gradient and because of the lower time complexity. In the following example, we illustrate the concrete efficiency of using ${\nabla}_{\mathit{\xi}}^{\mathrm{AD}}G$ to compute N gradients of the n-dimensional Rosenbrock function, assuming its gradient is unknown.
Example 1 (Reverse AD and the $\mathit{n}$-dimensional Rosenbrock function). Let $f:{\mathbb{R}}^{n}\to \mathbb{R}$ be the n-dimensional Rosenbrock function [43,44,45]:
$f\left(\mathit{x}\right)={\sum}_{i=1}^{n-1}\left[100{\left({x}_{i+1}-{x}_{i}^{2}\right)}^{2}+{\left(1-{x}_{i}\right)}^{2}\right].\qquad (20)$
Assuming a good implementation of f, we have that the time complexity of the function is $\mathrm{T}\left(f\right)=O(\mathrm{log}\,n)$. Then, it holds that $\mathrm{T}\left({\nabla}_{\mathit{\xi}}^{\mathrm{AD}}G\right)=O(N\,\mathrm{log}\,n)$. In this particular case, we observe that the coefficient $C\in {\mathbb{R}}^{+}$, such that
$\mathrm{T}\left({\nabla}_{\mathit{\xi}}^{\mathrm{AD}}G\right)\le C\cdot N\,\mathrm{log}\,n$
for sufficiently large N and n, is very small; specifically, for the given example, we have $C<{10}^{-4}$; moreover, we observe that the reverse AD can compute $N={2}^{12}=4096$ gradients of f in a space of dimension $n={2}^{12}=4096$ in less than one second (see Figure 3). All the computations are performed using a Notebook PC with an Intel Core i5 dual-core processor (2.3 GHz) and 16 GB DDR3 RAM.
3.2. An Insight into AD and Computational Graphs
Equation (19) of Lemma 1 tells us that the time complexity of computing N times the gradient ${\nabla}_{\mathit{x}}^{\mathrm{AD}}f$ is $O(N\cdot \mathrm{T}\left(f\right))$; then, with a proper parallelization of these N gradient computations, the construction of G seems to be meaningless. However, the special implementation of G as a computational graph, performed by the frameworks used for AD (see Section 1), guarantees an extremely efficient parallelization, both for the evaluation of G and for the computation of ${\nabla}_{\mathit{\xi}}^{\mathrm{AD}}G$. For example, during the graph compilation, the framework TensorFlow [37] identifies separate subparts of the computation that are independent and splits them [40], optimizing the code execution “aggressively” [39]; in particular, this parallelization does not consist only in sending subparts to different machine workers but also in optimizing the data reading procedure so that each worker can perform more operations almost at the same time (similarly to implementations that exploit vectorization [46]). Then, excluding a High-Performance Computing (HPC) context, the computational graphs representing G and ${\nabla}_{\mathit{\xi}}^{\mathrm{AD}}G$ make the AD-based multistart method (almost always) more time-efficient than a multistart method where the code explicitly parallelizes the N gradient computations of ${\nabla}_{\mathit{x}}^{\mathrm{AD}}f$ or the N optimization procedures (see Section 3.3 below); indeed, in the parallelized multistart case, each worker can compute only one gradient/procedure at a time, and the number of workers is typically less than N. Moreover, the particular structure of the available frameworks makes the implementation of the AD-based method very easy.
As written above, the time-efficiency properties of a computational graph are somewhat similar to those given by the vectorization of an operation; we recall that the vectorization (e.g., see [46]) of a function $f:{\mathbb{R}}^{n}\to \mathbb{R}$ is the code implementation of f as a function $\tilde{\mathit{F}}:{\bigcup}_{N=1}^{\infty}{\mathbb{R}}^{N\times n}\to {\bigcup}_{N=1}^{\infty}{\mathbb{R}}^{N}$ such that, without the use of for/while-cycles, the following holds:
$\tilde{\mathit{F}}\left(X\right)=(f\left({\mathit{x}}_{1}\right),\dots ,f\left({\mathit{x}}_{N}\right)),$
for each matrix $X\in {\mathbb{R}}^{N\times n}$, $N\in \mathbb{N}$, where ${\mathit{x}}_{i}$ denotes the transposed i-th row of X. In this case, especially when $N\gg 1$, using $\tilde{\mathit{F}}$ is much more time-efficient than parallelizing N computations of f, thanks to the particular data structure read by the machine workers, which is optimized with respect to memory allocation, data access, and workload reduction. Therefore, we easily deduce that the maximum efficiency for the AD-based method is obtained when G is built using $\tilde{\mathit{F}}$ instead of $\mathit{F}$ (if possible).
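The vectorized form $\tilde{\mathit{F}}$ can be sketched with NumPy (an assumed choice here; the paper's experiments use an AD framework). The implementation below evaluates an assumed toy f on all N rows of X with array operations and no Python-level for/while-cycles, and checks it against the row-by-row loop.

```python
# Vectorization sketch: F̃(X) = (f(x_1), ..., f(x_N)) computed with array
# operations only, no explicit cycles over the N rows of X.
import numpy as np

def f(x):                           # f(x) = Σ x_j^2, an assumed toy choice
    return float(np.sum(x ** 2))

def F_tilde(X):                     # vectorized implementation of f
    return np.sum(X ** 2, axis=1)   # one reduction per row, no Python loop

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 50))         # N = 1000 points in R^50

vectorized = F_tilde(X)
looped = np.array([f(X[i]) for i in range(X.shape[0])])
print(np.allclose(vectorized, looped))      # True
```

Both computations agree; the vectorized one delegates the loop over the N rows to the underlying array library, which is the effect the computational graph achieves automatically for G.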
Summarizing the content of this subsection, the proposed AD-based multistart method exploits formulation (6) to build an easy and handy solution for an implicit and highly efficient parallelization of the procedure, assuming access to AD frameworks.
3.3. Numerical Experiments
In this section, we illustrate the results of a series of numerical experiments that compare the computational costs (in time) of three multistart methods for problem (3):
 (i) AD-based multistart steepest descent, exploiting vectorization (see Section 3.2);
 (ii) AD-based multistart steepest descent, without exploiting vectorization;
 (iii) Parallel multistart steepest descent, distributing the N optimization procedures among all the available workers. In this case, the gradient of f is computed using reverse AD, for a fair comparison with cases (i) and (ii).
The experiments have been performed using a Notebook PC with an Intel Core i5 dual-core processor (2.3 GHz) and 16 GB DDR3 RAM (see Example 1); in particular, each multistart method has been executed alone, as the unique running application on the PC (excluding mandatory background applications). The methods are implemented in Python, using the TensorFlow module for reverse AD; the parallelization of method (iii) is based on the multiprocessing built-in module [24].
Since we want to analyze the time complexity behavior of the three methods while varying the domain dimension n and the number of starting points N, we perform the experiments considering the Rosenbrock function (see (20), Example 1). This function is particularly suitable for our analyses since its expression is defined for any dimension $n\ge 2$, and it always has a local minimum at $\mathit{e}:={\sum}_{i=1}^{n}{\mathit{e}}_{i}$, the vector of all ones.
According to the different natures of the three methods considered, the implementation of the Rosenbrock function changes. In particular, for method (i), in the AD framework we implement a function G that highly exploits vectorization, avoiding any for-cycle (see Algorithm 1); for method (ii), we implement G cycling over the set of N vectors at which the function must be computed (see Algorithm 2); for method (iii), we do not implement G but only the Rosenbrock function (see Algorithm 3).
Algorithm 1 G-Rosenbrock implementation—Method (i)
Input: X, matrix of $n\in \mathbb{N}$ columns and $N\in \mathbb{N}$ rows; Output: y, scalar output value of G built with respect to (20).
 1: ${X}_{\cdot ,2:n}\leftarrow $ submatrix of X given by all the columns except the first one;
 2: ${X}_{\cdot ,1:n-1}\leftarrow $ submatrix of X given by all the columns except the last one;
 3: $Y\leftarrow 100\times {({X}_{\cdot ,2:n}-{X}_{\cdot ,1:n-1}^{2})}^{2}+{(1-{X}_{\cdot ,1:n-1})}^{2}$ (element-wise operations)
 4: $y\leftarrow $ sum up all the values in Y
 5: return: y

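Algorithm 1 translates almost line by line into array code. The following NumPy sketch of ours (the paper uses TensorFlow, but the slicing logic is identical) builds the vectorized G-Rosenbrock and checks it against a per-point sum of the Rosenbrock function (20).

```python
# NumPy sketch of Algorithm 1 (vectorized G-Rosenbrock): column slicing
# replaces the for-cycles, and a single reduction returns the scalar G.
import numpy as np

def G_rosenbrock(X):
    A = X[:, 1:]                    # X_{·,2:n}: all columns but the first
    B = X[:, :-1]                   # X_{·,1:n-1}: all columns but the last
    Y = 100.0 * (A - B ** 2) ** 2 + (1.0 - B) ** 2
    return float(np.sum(Y))        # sum over the N rows and n-1 terms

def rosenbrock(x):                  # Equation (20), for a single point
    return float(np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2
                        + (1.0 - x[:-1]) ** 2))

rng = np.random.default_rng(1)
X = rng.uniform(2.0, 3.0, size=(20, 10))    # N = 20 starting points, n = 10

print(np.isclose(G_rosenbrock(X), sum(rosenbrock(x) for x in X)))   # True
```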
We test the three methods varying all the combinations of N and n, with $N\in \{20\cdot i\phantom{\rule{0.166667em}{0ex}}:\phantom{\rule{0.166667em}{0ex}}i=1,\dots ,10\}$ and $n\in \{20\cdot i\phantom{\rule{0.166667em}{0ex}}:\phantom{\rule{0.166667em}{0ex}}i=1,\dots ,5\}$, for a total number of 50 experiments. The parameters of the steepest descent methods are fixed: $\alpha =0.01$, maximum number of iterations ${k}_{max}={10}^{4}$, stopping criterion tolerance $\tau ={10}^{-6}$ for the gradient norm, and fixed starting points ${\mathit{x}}_{1}^{\left(0\right)},\dots ,{\mathit{x}}_{N}^{\left(0\right)}$ sampled from the uniform distribution $\mathcal{U}\left({[2,3]}^{n}\right)$; computations are performed in single precision, the default precision of TensorFlow. The choice of a small value for $\alpha $, together with the sampled starting points and the “flat” behavior of the function near the minimum $\mathit{e}$, ensures that the steepest descent methods always converge toward $\mathit{e}$ using all the allowed iterations, i.e., ${\mathit{x}}_{i}^{\left({k}_{max}\right)}\simeq \mathit{e}$ but $\Vert \nabla f\left({\mathit{x}}_{i}^{\left({k}_{max}\right)}\right)\Vert >\tau ={10}^{-6}$, for each $i=1,\dots ,N$.
Algorithm 2 G-Rosenbrock implementation—Method (ii)
Input: X, matrix of $n\in \mathbb{N}$ columns and $N\in \mathbb{N}$ rows; Output: y, scalar output value of G built with respect to (20).
 1: $y\leftarrow 0$
 2: for $i=1,\dots ,N$ do
 3: ${\mathit{x}}_{1:n-1}\leftarrow $ sub-row ${X}_{i,1:n-1}$
 4: ${\mathit{x}}_{2:n}\leftarrow $ sub-row ${X}_{i,2:n}$
 5: ${\mathit{y}}_{i}\leftarrow 100\times {({\mathit{x}}_{2:n}-{\mathit{x}}_{1:n-1}^{2})}^{2}+{(1-{\mathit{x}}_{1:n-1})}^{2}$
 6: ${y}_{i}\leftarrow $ sum up the values in ${\mathit{y}}_{i}$
 7: $y\leftarrow y+{y}_{i}$
 8: end for
 9: return: y

Algorithm 3 Rosenbrock implementation—Method ($iii$)
Input: x, vector in ${\mathbb{R}}^{n}$; Output: y, scalar output value of (20).
 1: ${\mathit{x}}_{1:n-1}\leftarrow $ sub-vector given by the first $n-1$ components of x
 2: ${\mathit{x}}_{2:n}\leftarrow $ sub-vector given by the last $n-1$ components of x
 3: $\mathit{y}\leftarrow 100\times {({\mathit{x}}_{2:n}-{\mathit{x}}_{1:n-1}^{2})}^{2}+{(1-{\mathit{x}}_{1:n-1})}^{2}$
 4: $y\leftarrow $ sum up the values in $\mathit{y}$
 5: return: y

Looking at the computation times of the methods, reported in Table 1, Table 2 and Table 3 and illustrated in Figure 4, the advantage of using the AD-based methods is evident, in particular when combined with vectorization; indeed, we observe that in its slowest case, $(N,n)=(200,100)$, method (i) is still faster than the fastest case, $(N,n)=(20,20)$, of both method (ii) and method (iii).
The linear behavior of the methods' time complexity with respect to N is confirmed (see Figure 4); however, the parallel multistart method's behavior is affected by higher noise, probably caused by the machine's management of the job queue. Nonetheless, the linear behaviors of the methods are clearly different: the AD-based multistart methods are shown to be faster than the parallel multistart method, with a speedup factor between $\times 3$ and $\times 6$ for method (ii) and between $\times 40$ and $\times 100$ for method (i).
3.4. Scalability and Real-World Applications
The experiments illustrated in the subsection above are a good representation of the general behavior that we can expect to observe in a real-world scenario, even when further increasing the number n of dimensions and/or the number N of optimization procedures.
We recall that we are assuming that the loss function f is a composition of known elementary operations, so that it is possible to use AD frameworks for the implementation of G (built on f). Under these assumptions, the experiments with the Rosenbrock function are a good representation of the general behavior of the proposed method with respect to a “classically parallelized” multistart approach. Indeed, the main difference lies only in the elementary operations necessary for implementing f and G, which changes the time complexity $\mathrm{T}\left(f\right)$. Therefore, even in the worst case of no vectorization of f and/or G, the experiments show that the AD-based method is faster than the parallelized multistart (see method (ii) and method (iii)). Moreover, the implementation of the AD-based method is much simpler; indeed, the user only needs to run one optimization procedure on the function G instead of writing code for optimally parallelizing N procedures with respect to f.
Our AD-based approach is intended to be an easy, efficient, and implicit parallelization of a general-purpose multistart optimization procedure, applicable to problems where f is relatively simple to define/implement and typically solved outside an HPC context. On the other hand, for highly complex and expensive optimization problems and/or loss functions, a tailored parallelization of the multistart procedure is probably the most efficient approach, possibly designed for that specific optimization problem and run on an HPC system. Of course, the AD-based method cannot be applied to optimization problems with a loss function that is not implementable in an AD framework.
3.5. Extensions of the Method and Integration in Optimization Frameworks
The proposed AD-based multistart method has been developed for general gradient-based optimization methods for unconstrained optimization (see Remark 2). Nonetheless, it has the potential to be extended to second-order methods (e.g., via matrix-free implementations [47] (Ch. 8)) and/or to constrained optimization. Theoretically, the method is also relatively easy to integrate into many existing optimization frameworks or software tools, especially if they already include AD for evaluating derivatives (e.g., see [29]); indeed, the user just needs to apply these tools/frameworks to the function G, instead of f, asking them to compute the gradient with reverse AD. Moreover, as we will see in the next section, the AD-based method can also take advantage of the efficient training routines of the Deep Learning frameworks for solving optimization problems.
However, the extensions mentioned above need further analysis before being implemented, and some concrete limitations of the method still exist. For example, line-search methods (e.g., see [47,48,49,50]) for the step-lengths of the N multistart procedures have not been considered yet; indeed, the current formulation has only one step-length $\alpha $ shared by all the N procedures. Similarly, distinct stopping criteria for the procedures are not defined. In future work, we will extend the AD-based multistart method to the cases listed in this subsection.
4. Multi-Start with Shallow Neural Networks
In this section, we show how to exploit the available Neural Network (NN) frameworks to build a shallow NN that, trained on fake input–output pairs, performs a gradient-based optimization method with respect to N starting points and a given loss function f. Indeed, the usage of NN frameworks (typically coincident with, or part of, AD frameworks) lets us exploit the gradient-based optimization methods already implemented for NN training, also taking advantage of their highly optimized code implementation. Therefore, from a practical point of view, this approach is useful to easily implement a reverse AD-based multistart method with respect to the built-in optimizers of the many available NN frameworks.
In the following definition, we introduce the archetype of such an NN. Then, we characterize its use with two propositions, explaining how to use the NN for reverse AD-based multistart.
Definition 3 (Multi-Start Optimization Neural Network). Let $f:{\mathbb{R}}^{n}\to \mathbb{R}$ be a loss function, and let $\mathcal{N}$ be an NN with the following architecture:
 1.
One input layer of n units;
 2.
One layer of N units, fully connected with the input layer. In particular, for each $i=1,\dots ,N$, the unit ${u}_{i}$ of the layer returns the scalar output
$$f\left({\mathit{w}}_{i}\right),\qquad(21)$$
where ${\mathit{w}}_{i}={[{w}_{1i},\dots ,{w}_{ni}]}^{T}\in {\mathbb{R}}^{n}$ is the vector of the weights of the connections between the input-layer units and ${u}_{i}$.
Then, we define $\mathcal{N}$ as a multistart optimization NN (MSONN) of N units with respect to f.
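A minimal Keras sketch of this definition follows (our code, not the paper's repository; the class name, the column-wise sphere function, and the sizes are illustrative assumptions). Unit i outputs $f({\mathit{w}}_{i})$ regardless of the layer input.

```python
import tensorflow as tf

tf.random.set_seed(0)

# Minimal Keras sketch of Definition 3 (our code, not the paper's repository;
# the sphere function and the sizes are illustrative assumptions).
class MSONN(tf.keras.layers.Layer):
    """Layer of N units; unit i outputs f(w_i), ignoring the layer input."""

    def __init__(self, f, starts, **kwargs):
        super().__init__(**kwargs)
        self.f = f                            # f applied to columns of W
        n, N = starts.shape
        self.W = self.add_weight(shape=(n, N), trainable=True, name="w")
        self.W.assign(starts)                 # w_i^(0) = x_i^(0)

    def call(self, inputs):
        out = self.f(self.W)                  # shape (N,): [f(w_1),...,f(w_N)]
        return tf.tile(out[None, :], [tf.shape(inputs)[0], 1])

sphere = lambda W: tf.reduce_sum(W ** 2, axis=0)   # stand-in loss function f
starts = tf.random.uniform((2, 5), -1.0, 1.0)      # N = 5 starts in R^2
layer = MSONN(sphere, starts)
y = layer(tf.zeros((3, 2)))               # any input: the output ignores it
```

Each output row repeats $[f({\mathit{w}}_{1}),\dots ,f({\mathit{w}}_{N})]$, so the layer output is indeed independent of the input, as discussed in the next remark.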
Remark 3. We point the reader to the fact that (21) does not depend on the NN inputs but only on the layer weights. Hence, an MSONN does not change its output when varying the inputs. The reasons behind this apparently inefficient property are explained in the next propositions.

Proposition 2. Let $\mathcal{N}$ be an MSONN of N units with respect to f. Let us endow $\mathcal{N}$ with a training loss function ℓ such that, for any input–output pair $(\mathit{x},\mathit{y})\in {\mathbb{R}}^{n}\times {\mathbb{R}}^{N}$,
$$\ell \left(\mathcal{N}\left(\mathit{x}\right),\mathit{y}\right)=\lambda \sum_{i=1}^{N}f\left({\mathit{w}}_{i}\right),\qquad(22)$$
where $\lambda \in {\mathbb{R}}^{+}$ is fixed, and $\mathit{w}=({\mathit{w}}_{1},\dots ,{\mathit{w}}_{N})\in {\mathbb{R}}^{nN}$ denotes the vector collecting all the weights of $\mathcal{N}$. Moreover, for the training process, let us endow $\mathcal{N}$ with a gradient-based method that exploits backpropagation, defined as in Remark 2. Given N vectors ${\mathit{x}}_{1}^{\left(0\right)},\dots ,{\mathit{x}}_{N}^{\left(0\right)}\in {\mathbb{R}}^{n}$, let us initialize the weights of $\mathcal{N}$ such that ${w}_{ji}^{\left(0\right)}$ is equal to the j-th component of ${\mathit{x}}_{i}^{\left(0\right)}$, for each $j=1,\dots ,n$ and $i=1,\dots ,N$ (i.e., ${\mathit{w}}_{i}^{\left(0\right)}={\mathit{x}}_{i}^{\left(0\right)}$). Then,
 1.
For any training set, the updating of the weights of $\mathcal{N}$ at each training epoch is equivalent to the multistart step (14), computed with reverse AD and such that ${\lambda}_{1}=\cdots ={\lambda}_{N}=\lambda $;
 2.
For each $k\in \mathbb{N}$, the weights ${\mathit{w}}^{(k+1)}=({\mathit{w}}_{1}^{(k+1)},\dots ,{\mathit{w}}_{N}^{(k+1)})$ of $\mathcal{N}$, after $k+1$ training epochs, are such that
$${\mathit{w}}_{i}^{(k+1)}={\mathit{x}}_{i}^{(k+1)},\quad i=1,\dots ,N,$$
where ${\mathit{x}}_{i}^{(k+1)}$ are the vectors defined by (14) and ${\lambda}_{1}=\cdots ={\lambda}_{N}=\lambda $.
Proof. Since the second item is a direct consequence of the first one, we prove only item 1.
The optimization method for the training of $\mathcal{N}$ is characterized by a function $\mathcal{M}$ that satisfies (10) and (13). Then, the training of $\mathcal{N}$ consists of the iterative weight-updating process (23), independently of the data used for the training (see (22)). We recall that ${\nabla}^{\mathrm{AD}}$ denotes the gradients computed with reverse AD and that it is used in (23) because, by hypothesis, the optimization method exploits backpropagation.
Now, by construction, we observe that $\ell ={L}_{\lambda}\circ \mathit{F}$, with ${L}_{\lambda}$ and $\mathit{F}$ defined as in Proposition 1. Then, due to Proposition 1 and Remark 2, the thesis holds.
□
Proposition 3. Let $\mathcal{N}$ be as in the hypotheses of Proposition 2, with the exception of the loss function ℓ, which is now defined as
$$\ell \left(\mathcal{N}\left(\mathit{x}\right),\mathit{y}\right)=\lambda \sum_{i=1}^{N}{\left(f\left({\mathit{w}}_{i}\right)-{y}_{i}\right)}^{2},\qquad(24)$$
for any input–output pair $(\mathit{x},\mathit{y})\in {\mathbb{R}}^{n}\times {\mathbb{R}}^{N}$ and where $\lambda \in {\mathbb{R}}^{+}$ is fixed. Let $\mathcal{T}$ be a training set where the input–output pairs have fixed output $\mathit{y}={[{y}^{*},\dots ,{y}^{*}]}^{T}\in {\mathbb{R}}^{N}$. Then,
 1.
Given the training set $\mathcal{T}$, the updating of the weights of $\mathcal{N}$ at each training epoch is equivalent to the multistart step (14) applied to the merit function ${(f\left(\mathit{x}\right)-{y}^{*})}^{2}$, computed with reverse AD and such that ${\lambda}_{1}=\cdots ={\lambda}_{N}=\lambda $;
 2.
For each $k\in \mathbb{N}$, the weights ${\mathit{w}}^{(k+1)}=({\mathit{w}}_{1}^{(k+1)},\dots ,{\mathit{w}}_{N}^{(k+1)})$ of $\mathcal{N}$, after $k+1$ training epochs, are such that
$${\mathit{w}}_{i}^{(k+1)}={\mathit{x}}_{i}^{(k+1)},\quad i=1,\dots ,N,$$
where ${\mathit{x}}_{i}^{(k+1)}$ are the vectors defined by the multistart process of item 1.
Proof. The proof follows straightforwardly from the proof of Proposition 2. □
With the last proposition, we introduced the minimization of a merit function. The reason is that multistart methods can be useful not only for finding one global optimum but also when a set of global/local optima is sought. An example is the level curve detection problem where, for a given function $f:{\mathbb{R}}^{n}\to \mathbb{R}$ and a fixed ${y}^{*}\in \mathbb{R}$, we look for the level curve set ${Y}^{*}=\{\mathit{x}\in {\mathbb{R}}^{n}\phantom{\rule{0.166667em}{0ex}}:\phantom{\rule{0.166667em}{0ex}}f\left(\mathit{x}\right)={y}^{*}\}$ by minimizing a merit function, e.g., ${(f\left(\mathit{x}\right)-{y}^{*})}^{2}$. In this case, a multistart approach is very important since the detection of more than one element of ${Y}^{*}$ is typically sought (when ${Y}^{*}$ is not empty and not given by one vector only). In the next section, we consider this problem for the numerical tests.
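As a toy instance of this level-curve problem, the following sketch (our code; the circular field f and all parameters are illustrative assumptions) minimizes the merit function $(f(\mathit{x})-y^*)^2$ from N random starts simultaneously, so the iterates spread along the level curve $f(\mathit{x})={y}^{*}$.

```python
import tensorflow as tf

tf.random.set_seed(1)

# Toy level-curve detection (our sketch): f has circular level curves, and we
# minimize the merit function (f(x) - y*)^2 from N starts simultaneously.
def f(X):                                   # stand-in field on R^2, row-wise
    return tf.reduce_sum(X ** 2, axis=1)

y_star = 4.0                                # target level: circle of radius 2
N = 256
X = tf.Variable(tf.random.uniform((N, 2), -3.0, 3.0))
lam, alpha = 1.0 / N, 0.01                  # shared weights and step-length

for _ in range(500):
    with tf.GradientTape() as tape:
        G = lam * tf.reduce_sum((f(X) - y_star) ** 2)   # merit-function G
    X.assign_sub((alpha / lam) * tape.gradient(G, X))   # N gradient steps

mae = tf.reduce_mean(tf.abs(f(X) - y_star)).numpy()     # distance to level
```

Since the starts are scattered over the domain, the converged iterates describe many distinct points of the level set, rather than a single solution.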
We conclude this section with remarks concerning the practical implementation of MSONNs.
Remark 4 (GitHub Repository and MSONNs).

Remark 5 (Avoiding overflow). In some cases, if the objective function f is characterized by large values, an overflow may occur while using an MSONN to minimize f with an AD-based multistart approach. In particular, the overflow may occur during the computation of $G\left({\mathit{\xi}}^{\left(k\right)}\right)=({L}_{\lambda}\circ \mathit{F})\left({\mathit{\xi}}^{\left(k\right)}\right)={\sum}_{i=1}^{N}{\lambda}_{i}f\left({\mathit{x}}_{i}^{\left(k\right)}\right)$ if the parameters ${\lambda}_{i}$ are not small enough. One of the possible (and easiest) approaches is to select ${\lambda}_{1}=\cdots ={\lambda}_{N}=1/N$; then, in the case illustrated in Proposition 3, this is equivalent to selecting ℓ as the Mean Squared Error (MSE) loss function.
4.1. Numerical Example
In this section, we report the results of using an MSONN to find level curve sets of the Himmelblau function
$$f\left(\mathit{x}\right)={\left({x}_{1}^{2}+{x}_{2}-11\right)}^{2}+{\left({x}_{1}+{x}_{2}^{2}-7\right)}^{2}.\qquad(25)$$
The Himmelblau function [51] is characterized by nonnegative values, with lower bound $M=0$. In particular, f has four local minima:
$${\mathit{x}}_{1}^{*}=(3,2),\quad {\mathit{x}}_{2}^{*}\approx (-2.805118,3.131312),\quad {\mathit{x}}_{3}^{*}\approx (-3.779310,-3.283186),\quad {\mathit{x}}_{4}^{*}\approx (3.584428,-1.848126);\qquad(26)$$
these local minima are also global minima since $f\left({\mathit{x}}_{i}^{*}\right)=0$, for each $i=1,\dots ,4$.
For our experiments, we consider an MSONN $\mathcal{N}$ defined as in Proposition 3, with $\lambda =1/N$ (see Remark 5). The NN is implemented using the TensorFlow (TF) framework [37], and we endow $\mathcal{N}$ with the Adam optimization algorithm [52], the default algorithm for training NNs in TF, with a small and fixed step-length $\alpha ={10}^{-3}$. We use Adam to show examples with a generic gradient-based method, and we set $\alpha ={10}^{-3}$ to emphasize the efficiency of the AD-based multistart even when a large number of iterations is required to obtain the solutions.
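Because ℓ scales the unscaled loss by $\lambda =1/N$, Adam's invariance to this scaling (up to the division-safety parameter, as detailed in Remark 6) can be verified numerically; the following NumPy check (our code; `adam_steps` is an illustrative textbook implementation of Adam with bias correction) confirms that running Adam on $\lambda {\ell}^{\prime}$ with ϵ matches running it on ${\ell}^{\prime}$ with ${\epsilon}^{\prime}=\epsilon /\lambda $.

```python
import numpy as np

# Numeric check (our code) that Adam is invariant to scaling the loss by
# lambda, up to the division-safety parameter: running Adam on lambda * l'
# with eps equals running it on l' with eps' = eps / lambda. adam_steps is
# an illustrative textbook implementation of Adam with bias correction.
def adam_steps(grad, w0, alpha=1e-3, b1=0.9, b2=0.999, eps=1e-7, K=50):
    w, m, v = w0.copy(), np.zeros_like(w0), np.zeros_like(w0)
    for k in range(1, K + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g            # first moment estimate
        v = b2 * v + (1 - b2) * g ** 2       # second moment estimate
        m_hat = m / (1 - b1 ** k)            # bias-corrected moments
        v_hat = v / (1 - b2 ** k)
        w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w

lam = 1.0 / 100                              # lambda = 1/N with N = 100
base_grad = lambda w: 2.0 * w                # gradient of l'(w) = |w|^2
w0 = np.array([3.0, -2.0])

wa = adam_steps(lambda w: lam * base_grad(w), w0, eps=1e-7)   # loss lam*l'
wb = adam_steps(base_grad, w0, eps=1e-7 / lam)                # loss l'
```

The two trajectories coincide (up to floating-point rounding), which is exactly the equivalence exploited in the experiments.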
Remark 6 (Adam satisfies (13)). In these experiments, we can use the Adam optimization algorithm because Equation (13) holds for Adam, with $m\left(\lambda \right)=1$. Actually, (13) holds only from a theoretical point of view, due to a small parameter ϵ introduced in the algorithm to avoid zero-division errors during the computations. Nonetheless, in practice, Proposition 3 and Remark 2 still hold for Adam if we set ϵ such that $\epsilon /\lambda =\epsilon N$ is sufficiently small (e.g., $\epsilon N={10}^{-7}$). Specifically, the Adam optimization algorithm updates the NN weights according to the rule
$${\mathit{w}}^{(k+1)}={\mathit{w}}^{\left(k\right)}-\alpha \phantom{\rule{0.166667em}{0ex}}\frac{{\widehat{\mathit{m}}}^{(k+1)}}{\sqrt{{\widehat{\mathit{v}}}^{(k+1)}}+\epsilon },$$
where all the operations are intended to be element-wise; ${\widehat{\mathit{m}}}^{(k+1)}$ and ${\widehat{\mathit{v}}}^{(k+1)}$ are the first and second bias-corrected moment estimates, respectively; and $\epsilon >0$ is a small constant value (typically $\epsilon ={10}^{-7}$) to avoid divisions by zero [52]. Then, by construction (see [52]), we observe that the bias-corrected moment estimates ${\widehat{\mathit{m}}}^{(k+1)},{\widehat{\mathit{v}}}^{(k+1)}$ with respect to a loss function ℓ are such that
$${\widehat{\mathit{m}}}^{(k+1)}=\lambda \phantom{\rule{0.166667em}{0ex}}{\widehat{\mathit{m}}}^{\prime \phantom{\rule{0.166667em}{0ex}}(k+1)},\qquad {\widehat{\mathit{v}}}^{(k+1)}={\lambda}^{2}\phantom{\rule{0.166667em}{0ex}}{\widehat{\mathit{v}}}^{\prime \phantom{\rule{0.166667em}{0ex}}(k+1)},$$
where ${\widehat{\mathit{m}}}^{\prime \phantom{\rule{0.166667em}{0ex}}(k+1)},{\widehat{\mathit{v}}}^{\prime \phantom{\rule{0.166667em}{0ex}}(k+1)}$ are the bias-corrected moment estimates with respect to another loss function ${\ell}^{\prime}$ such that $\ell =\lambda {\ell}^{\prime}$, $\lambda \in {\mathbb{R}}^{+}$. Therefore, assuming $\epsilon =0$ and considering the effective step, the following holds:
$$\frac{{\widehat{\mathit{m}}}^{(k+1)}}{\sqrt{{\widehat{\mathit{v}}}^{(k+1)}}}=\frac{\lambda \phantom{\rule{0.166667em}{0ex}}{\widehat{\mathit{m}}}^{\prime \phantom{\rule{0.166667em}{0ex}}(k+1)}}{\lambda \sqrt{{\widehat{\mathit{v}}}^{\prime \phantom{\rule{0.166667em}{0ex}}(k+1)}}}=\frac{{\widehat{\mathit{m}}}^{\prime \phantom{\rule{0.166667em}{0ex}}(k+1)}}{\sqrt{{\widehat{\mathit{v}}}^{\prime \phantom{\rule{0.166667em}{0ex}}(k+1)}}};$$
on the other hand, assuming $\epsilon >0$, the two methods are equivalent but characterized by different parameters to avoid divisions by zero, i.e., ${\epsilon}^{\prime}=\epsilon /\lambda $.

Finding Level Curve Sets of the Himmelblau Function
Given the MSONN $\mathcal{N}$ described above, we show three cases of level-curve search: the case with ${y}^{*}=100$ (set denoted by ${Y}_{100}^{*}$), the case with ${y}^{*}=10$ (set denoted by ${Y}_{10}^{*}$), and the case with ${y}^{*}=0$ (set denoted by ${Y}_{0}^{*}=\{{\mathit{x}}_{1}^{*},\dots ,{\mathit{x}}_{4}^{*}\}$, see (26)); in particular, the latter case is equivalent to the global minimization of the function. In all the cases, for the multistart method we select as starting points the $N={10}^{4}$ points ${\mathit{x}}_{1}^{\left(0\right)},\dots ,{\mathit{x}}_{N}^{\left(0\right)}$ of a regular grid of $100\times 100$ points with spacing $h=15/99$ (see Figure 5). Then, for each ${y}^{*}=100,10,0$, we train $\mathcal{N}$ for $K=\mathrm{25,000}$ epochs (i.e., K multistart optimization steps), initializing the weights with ${\mathit{x}}_{1}^{\left(0\right)},\dots ,{\mathit{x}}_{N}^{\left(0\right)}$. We recall that, for each $k=0,\dots ,K-1$, the $(k+1)$-th training epoch of $\mathcal{N}$ is equivalent to $N={10}^{4}$ Adam optimization steps with respect to the vectors ${\mathit{x}}_{1}^{\left(k\right)},\dots ,{\mathit{x}}_{N}^{\left(k\right)}\in {\mathbb{R}}^{2}$. The training is executed on a Notebook PC with a dual-core Intel Core i5 processor (2.3 GHz) and 16 GB of DDR3 RAM (the same PC as in Example 1 and Section 3.3).
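A reduced form of this experiment can be reproduced with a short TF script; the sketch below (our code, not the paper's repository) uses a coarser $10\times 10$ grid instead of the $100\times 100$ one, treats only the ${y}^{*}=0$ case, and replaces the MSONN layer with a plain $(2,N)$ variable, which is equivalent by Proposition 3.

```python
import tensorflow as tf

# Reduced sketch of the experiment (our code, not the paper's repository):
# y* = 0 case, 10 x 10 grid instead of 100 x 100, and a plain (2, N)
# variable instead of the MSONN layer (equivalent by Proposition 3).
def himmelblau(W):                       # applied column-wise to (2, N)
    x, y = W[0], W[1]
    return (x**2 + y - 11.0)**2 + (x + y**2 - 7.0)**2

g = tf.linspace(-7.5, 7.5, 10)           # grid coordinates
gx, gy = tf.meshgrid(g, g)
W = tf.Variable(tf.stack([tf.reshape(gx, [-1]), tf.reshape(gy, [-1])]))
y_star = 0.0                             # global-minimization case
opt = tf.keras.optimizers.Adam(learning_rate=1e-3)   # alpha = 10^-3

@tf.function
def train_step():                        # one multistart Adam step (MSE loss)
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean((himmelblau(W) - y_star) ** 2)
    opt.apply_gradients([(tape.gradient(loss, W), W)])

mae0 = tf.reduce_mean(tf.abs(himmelblau(W) - y_star)).numpy()
for _ in range(25000):                   # K = 25,000 epochs
    train_step()
mae = tf.reduce_mean(tf.abs(himmelblau(W) - y_star)).numpy()
```

The `tf.reduce_mean` in the loss corresponds to the MSE choice $\lambda =1/N$ of Remark 5, and the whole grid is updated by each single optimizer call.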
Now, for each k = 5000, 15,000, 25,000, in Table 4 we report the computation time and the Mean Absolute Error (MAE) of the points ${\mathit{x}}_{i}^{\left(k\right)}$ with respect to the target value ${y}^{*}$, i.e.,
$$\mathrm{MAE}^{\left(k\right)}=\frac{1}{N}\sum_{i=1}^{N}\left|f\left({\mathit{x}}_{i}^{\left(k\right)}\right)-{y}^{*}\right|.$$
Looking at the values in the table and at Figure 6, we observe the very good performance of the new multistart method. Indeed, not only do all the N sequences converge toward a solution, but the average time of one minimization step, characterized by the computation of $N={10}^{4}$ gradients in ${\mathbb{R}}^{2}$, is equal to $1.6\times {10}^{-3}$ s. The result is interesting because the method is efficient without the need for defining any particular parallelization routine to manage the N optimization procedures at the same time.
5. Conclusions
We presented a new multistart method for gradient-based optimization algorithms, based on reverse AD. In particular, we showed how to write N optimization processes for a function $f:{\mathbb{R}}^{n}\to \mathbb{R}$ as one optimization process for the function $G={L}_{\mathit{\lambda}}\circ \mathit{F}$, computing the gradient with reverse AD. Then, assuming no HPC availability, this problem formulation provides an easy and handy solution for an implicit and highly efficient parallelization of the multistart optimization procedure.
Specifically, the method is not meant to be applied to complex and expensive optimization problems, where detailed and tailored parallelized multistart methods are probably the best choice; on the contrary, the AD-based multistart method is intended to be an easy and efficient alternative for parallelizing general purpose gradient-based multistart methods.
The efficiency of the method has been positively tested on a standard personal computer with respect to 50 cases, of increasing dimension and number of starting points, obtained from an n-dimensional test function. These experiments have been performed using a naive steepest-descent optimization procedure.
We observed that the method has the potential to be extended to second-order methods and/or to constrained optimization. In the future, we will focus on extending the method to these cases and on implementing custom line-search methods and/or distinct stopping criteria for the N optimization procedures.
Finally, we presented a practical implementation of the AD-based multistart method as a tailored shallow Neural Network, and we tested it on the level curve set identification problem for the Himmelblau function with three different target values, the last of which is equivalent to the global minimization problem. This example highlights another advantage of the new method: the possibility of using NNs to exploit the efficient gradient-based optimization methods already implemented in the NN frameworks for model training.