An Analysis of the Directional-Modifier Adaptation Algorithm Based on Optimal Experimental Design

Academic Editor: Dominique Bonvin
Received: 1 November 2016; Accepted: 15 December 2016; Published: 22 December 2016

Abstract: The modifier approach has been extensively explored and offers a theoretically sound and practically useful method to deploy real-time optimization. The recent directional-modifier adaptation algorithm offers a heuristic to tackle the modifier approach. Supported by strong theoretical properties and ease of deployment in practice, it proposes a meaningful compromise between process optimality and quickly improving the quality of the estimation of the gradient of the process cost function. This paper proposes a novel view of the directional-modifier adaptation algorithm, as an approximation of the optimal trade-off between the underlying experimental design problem and the process optimization problem. It moreover suggests a minor modification in the tuning of the algorithm, so as to make it a more genuine approximation.


Introduction
Real-Time Optimization (RTO) aims at improving the performance and safety of industrial processes by continually adjusting their inputs, i.e., the degrees of freedom defining their operating conditions, in response to disturbances and process variations. RTO makes use of both model-based and model-free approaches. The model-free approaches have the clear advantage of being less labor intensive, as a model of the process is not needed, but the increasing number of inputs that can be adjusted when running the process has made them decreasingly attractive.
Model-based techniques have received increasing interest as the capability of running a large number of computations online has become standard. Arguably, the most natural approach to model-based RTO is the two-step approach, where model parameter estimation and model-based optimization are alternated so as to refine the process model and adapt the operational parameters accordingly [1,2]. Unfortunately, the two-step approach requires the process model to satisfy very strict criteria in order for the scheme to reach optimality [3,4]. This issue is especially striking in the case of structural mismatch between the model and the process and can make the two-step scheme ineffective or even counterproductive [5-7].
The idea of not only adapting the model parameters, but also the gradient of the cost function can be traced back to [8] and allows for guaranteeing that the resulting scheme reaches optimality upon convergence [7,9,10]. Unlike the two-step approach, adapting the gradient of the cost function allows one to tackle structural model-plant mismatches efficiently, which cannot be efficiently addressed via the adaptation of the model parameters alone. The original idea has been further improved; see, e.g., [7,9-13]. These contributions have converged to the modern Modifier Adaptation (MA) approach, which has been successfully deployed on several industrial processes; see [14-17]. The MA approach has been recently further developed along a number of interesting directions; see [16,18-21].
In a run-to-run scenario where estimations of the uncertain parameters are carried out after every run, the input for any run does not only maximize the process performance for the coming run, but also influences the performance of the subsequent runs through the estimation of the process parameters. This observation is generally valid when parameter estimation is performed between runs and pertains to the MA approach. Taking this influence into account leads one to possibly depart from applying to the process an input that is optimal according to the best available estimation of the parameters at the time and adopt an input that strikes a compromise between process optimality and gathering relevant information for the next parameter estimation. In that sense, the MA approach can be construed as a mix of an optimization problem and an experimental design problem. The problem of tailoring experimental design specifically for optimization in a computationally tractable way has been recently studied in [22], where the problem of designing inputs for a process so as to gather relevant information for achieving process optimality is tackled via an approximate optimality loss function.
The recently proposed Directional-Modifier Adaptation (DMA) algorithm [23,24] and its earlier variant, the dual-modifier adaptation approach [25], offer a practical way for the MA approach to deal with the compromise between process optimality and gaining information. Indeed, at each process run, the DMA algorithm delivers an input that seeks a compromise between maximizing the process performance and promoting the quality of the estimation of the process gradients. The DMA approach handles this compromise by adopting inputs that depart from the nominal ones in directions corresponding to the largest covariance in the estimated gradients of the process Lagrange function. The DMA algorithm is easy to deploy and has strong theoretical properties, e.g., it converges rapidly and with guarantees to the true process optimum. The directional-modifier adaptation algorithm additionally makes use of iterative schemes to update the modifiers used in the cost model, so as to reduce the computational burden of performing classical gradient estimations.
In this paper, we propose to construct the DMA algorithm from a different angle, based on a modification of the optimality loss function [22]. This construction delivers new theoretical insights into the DMA algorithm and suggests minor modifications that make the DMA algorithm a more genuine approximation of the optimal trade-off between process optimality and excitation. For the sake of simplicity, we focus on the unconstrained case, though the developments can arguably be naturally extended to constrained problems.
The paper is structured as follows. Section 2 proposes some preliminaries on the selection of an optimality loss function for the considered experimental design problem and proposes a computationally tractable approximation, following similar lines as [22]. Section 3 investigates the MA approach as a special case of the previous developments, proposes to tackle it within the proposed theoretical framework and shows that the resulting algorithm has the same structure as the DMA algorithm, but with some notable differences. Simple examples are presented throughout the text to illustrate and support the concepts presented.

Optimal Experimental Design
In this paper, we consider the problem of optimizing a process in a run-to-run fashion. The process is described via the cost function $\phi(u, p)$, where $u$ gathers the set of inputs, or degrees of freedom, available to steer the process, and $p$ gathers the parameters available to adjust the cost function using the measurements gathered on the plant. Function $\phi$ is assumed to be everywhere defined and smooth. This assumption is arguably not required, but will make the subsequent analysis less involved. The $N$-run optimization problem can then be formulated as:

$$\min_{u_0,\ldots,u_{N-1}} \; \frac{1}{N}\sum_{k=0}^{N-1} \phi(u_k, p) \qquad (1)$$

where $u_k$ is the vector of decision variables applied at run $k$. Here, we seek the minimization of the average process performance over the $N$ runs. The cost function $\phi(u, p)$ associated with the process is not available in practice, such that at any run $k$, the input $u_k$ is typically chosen according to the best parameter estimation $\hat{p}_k$ available at that time. It is important to observe here that the parametric cost function in (1) encompasses parametric mismatch between the plant and the model, but also any structure adjusting the cost function according to the data, such as the MA approach; see Section 3.1. Ideally, one ought to solve the optimization problem:

$$\min_{u_0,\ldots,u_{N-1}} \; \frac{1}{N}\sum_{k=0}^{N-1} \mathbb{E}_{\hat{p}_k}\left[\phi(u_k, \hat{p}_k)\right] \qquad (2)$$

where $\mathbb{E}_{\hat{p}_k}$ stands for the expected value over $\hat{p}_k$. For the sake of simplicity, we will focus in this paper on the two-run problem, i.e., using $N = 2$ in Problem (2). In the following, we will assume that there exists a parameter vector $p_{\text{real}}$ for which $\phi(u_k, p_{\text{real}})$ effectively captures the cost function of the real process. This assumption is locally fulfilled, up to a constant term, by the MA approach. When estimations of the parameters $\hat{p}_k$ are conducted between the runs using the latest measurements gathered on the process, a difficulty in using (2) stems from the fact that it can yield an inadequate sequence of decisions $u_{0,\ldots,N-1}$. We motivate this statement next, via a simple example.

Failure of Problem (2): An Example
Consider the optimization model $\phi(u, p) = u^2 + p^2$, yielding the two-run problem: where $\Sigma_k$ is the covariance of the estimation of parameter $\hat{p}_k$ and $\mu_k$ its expected value. If the distribution of the estimated parameter $\hat{p}_1$ is independent of the input $u_0$, then Problem (3) takes the trivial solution $u_{0,1} = 0$, which yields the best performance on the real cost function $\phi(u, p_{\text{real}})$, regardless of the actual parameter value $p_{\text{real}}$ or of its estimated value $\hat{p}_0$ available for deciding the input $u_0$. However, since the estimated parameter $\hat{p}_1$ is obtained from the run based on $u_0$, it is in fact not independent of the decision variables. Indeed, let us assume that the estimation of $\hat{p}_1$ is provided between the two runs via the least-squares fitting problem: where $y_0^{\text{meas}} \in \mathbb{R}^m$ gathers the measurements taken on the process during or after the run based on $u_0$, $y(u, p)$ is the corresponding measurement model, $\Sigma_{\text{meas}}$ is the covariance of the measurement noise and $\Sigma_0$ the covariance associated with the parameter estimation $\hat{p}_0$. Consider then the measurement model: The solution to (4) is then explicitly given by: Assuming that $\mu_{\hat{p}_0} = p_{\text{real}} = 0$ and $\mathbb{E}[y_0^{\text{meas}}] = 0$, we then observe that if the measurement noise is independent between the various runs, we have: After removing the constant terms, Problem (3) becomes: An interesting situation occurs for $\Sigma_{\text{meas}} \le \Sigma_0^2$, i.e., when the covariance of the measurements is sufficiently low; see Figure 1. The solution to (8) then reads as: while the sequence $u_0 = u_1 = 0$ should clearly be used in order to minimize the cost of the real two-run process, even in the sense of the expected value. This trivial example illustrates a fundamental limitation of Problem (2) in successfully achieving the goal of minimizing the cost over a two- or N-run process.
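The mechanism behind this example can be checked numerically. The sketch below is a reconstruction under stated assumptions rather than the paper's own computation: it assumes a scalar measurement model y(u, p) = u p (a choice made here for illustration), a zero-mean prior with variance Σ_0 and zero-mean measurement noise with variance Σ_meas. Under these assumptions, the variance of the least-squares estimate is Σ_1 = Σ_0 Σ_meas / (Σ_meas + Σ_0 u_0²), and with u_1 = 0 and the constant terms removed, the expected two-run cost reduces to u_0² + Σ_1(u_0). Setting its derivative to zero gives non-zero minimizers u_0² = √Σ_meas − Σ_meas/Σ_0 exactly when Σ_meas < Σ_0², which agrees with the threshold quoted in the text. All function names are illustrative.

```python
import math


def posterior_variance(u0, sigma0, sigma_meas):
    """Variance of the estimate p1 for the ASSUMED measurement model y = u * p."""
    return sigma0 * sigma_meas / (sigma_meas + sigma0 * u0 ** 2)


def expected_cost(u0, sigma0, sigma_meas):
    """Expected two-run cost with u1 = 0 and constants removed (assumed form)."""
    return u0 ** 2 + posterior_variance(u0, sigma0, sigma_meas)


def stationary_input(sigma0, sigma_meas):
    """Non-negative minimizer obtained by setting the derivative of the cost to zero."""
    u0_squared = math.sqrt(sigma_meas) - sigma_meas / sigma0
    return math.sqrt(u0_squared) if u0_squared > 0.0 else 0.0


def grid_minimizer(sigma0, sigma_meas, n=4001, u_max=2.0):
    """Brute-force cross-check of the minimizer on the grid [0, u_max]."""
    grid = [u_max * i / (n - 1) for i in range(n)]
    return min(grid, key=lambda u: expected_cost(u, sigma0, sigma_meas))


if __name__ == "__main__":
    sigma0 = 1.0
    for sigma_meas in (0.25, 1.0, 4.0):
        print(sigma_meas,
              stationary_input(sigma0, sigma_meas),
              grid_minimizer(sigma0, sigma_meas))
```

For low measurement noise, the minimizer moves away from zero even though u_0 = 0 is optimal for the real cost, which is the spurious excitation the example warns about.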

Modified Optimality Loss Function
A sensible approach inspired by the work presented in [22] consists of selecting the input $u_0$ according to: where $\Delta_0$ is labeled the optimality loss function and $e$ gathers the noise on the estimation of the process parameters and the measurement noise, i.e.: Problem (10) seeks a compromise between the expected process performance at the coming run via the first term in (10) and the expected process performance at the subsequent run via the second term. The performance of the second term depends on the input selected in the first run via the parameter estimation performed between the two runs. We assume hereafter that $e$ follows a normal, centered distribution, and we use for the estimated parameter $\hat{p}_1$ the least-squares fitting problem: where $\hat{p}_1$ is the parameter estimation following the first run. The optimality loss function $\Delta_0$ proposed in [22] was designed for the specific purpose of performing experimental design dedicated to capturing the process parameters most relevant for process optimization. However, it was not designed to be used within the two-run problem considered here. In this paper, we propose to use a slightly modified version of (10), so as to avoid a potential difficulty it poses. For the sake of brevity and in order to skip elaborate technical details, let us illustrate this difficulty via the following simple example. Consider the cost function and measurement model: The least-squares problem (12) reads as: which takes the explicit form: The optimality loss function $\Delta_0$ then reads as: and has the expected value: It is worth observing that a similar optimality loss function has also been used in [25] in order to quantify the loss of optimality resulting from uncertain parameters. Problem (10) can then be equivalently written as: However, since in practice $p_{\text{real}}$ is not available to solve Problem (18), a surrogate problem must be solved, using $p_{\text{real}} \approx \hat{p}_0$. It reads as: An issue occurs here, which is illustrated in Figure 2.
Because the expected value of the optimality loss function computed in a stand-alone fashion in (17) misses the correlation between the control input $u_0$ and the initial estimation error $e_0$ that arises via the optimization problem (19), using (19) as a surrogate for (18) can be counterproductive, in the sense that the performance obtained from Problem (19) can be worse than that of the nominal problem.
In this paper, we address this issue by taking an approach to the optimality loss function that departs slightly from (16).

Problem Formulation
For a given initial estimation $\hat{p}_0$, initial estimation error $e_0$ and measurement error $e_1$, and using $p_{\text{real}} = \hat{p}_0 - e_0$, the estimation problem solved after the first run can be formulated as: $\hat{p}_1(u_0, \Sigma, \hat{p}_0, e) = \arg\min \left\| y(u_0, p) - \left( y(u_0, \hat{p}_0 - e_0) + e_1 \right) \right\|^2$ where we use the notation: and consider $e_0$, $e_1$ to be uncorrelated. Defining: the modified optimality loss function can be formulated as: This reformulation allows for construing the optimality loss function from the point of view of the experimenter, by considering $\hat{p}_0$ as a fixed variable arising as a realization of the estimation of the unknown parameter $p_{\text{real}}$ rather than a stochastic one. In (20) and (23), the actual parameter $p_{\text{real}}$ is then, from the experimenter's point of view, a stochastic variable, reflecting the uncertainty of the experimenter concerning the real parameter. The resulting two-run problem reads as: We observe here that the cost function proposed in (24) is different from the original one in (19).
From the optimality principle, Problem (24) delivers an expected performance that is no worse than the expected performance yielded by applying the nominal input $u_0 = u^*(\hat{p}_0)$. A simple example of the proposed optimality-loss approach is provided in Section 2.5. Unfortunately, solving Problem (24) is in general difficult. In the next section, we consider a second-order approximation instead, following a line also adopted in [22].

Second-Order Approximation of the Modified Optimality Loss Function
The optimality loss function (23) is difficult to use in practice. A second-order approximation of (23) can be deployed as a tractable surrogate problem in (24). We develop this second-order approximation next. We observe that the following equality trivially holds: The sensitivity of the parameter estimations $\hat{p}_1$ to the errors $e$ can be obtained via the implicit function theorem applied to the fitting problem (20); it reads as: where: is the Fisher information matrix of (20), and: We note from optimality that $\Delta \ge 0$ always holds and: which motivates a second-order approximation of $\Delta$ at $e = 0$. The Taylor expansion of $\Delta$ in $e$ reads as: We can then form the second-order approximation of the modified optimality loss function $\Delta$.
Proof. We observe that: where, for the sake of clarity, the arguments are omitted when unambiguous. Using the fact that $\phi_u(u^*(\hat{p}_0), \hat{p}_0) = 0$, it follows that: where all functions are evaluated at $e = 0$. We then use the equality $u^*_p = -(\phi^*_{uu})^{-1}\phi^*_{up}$ to get (31).
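For readers who want the intermediate steps, the standard expansion behind this kind of result can be sketched as follows. This is a generic derivation for an optimality loss of the form $\Delta = \phi(u^*(p + \delta p), p) - \phi(u^*(p), p)$, where $\delta p$ is the parameter estimation error; the modified loss (23) is evaluated at different arguments, so the block is meant as a guide to the mechanics rather than a restatement of (31). The symbol $S$, denoting the covariance of a zero-mean $\delta p$, is introduced here for illustration only.

```latex
\begin{align*}
u^*(p+\delta p) &= u^*(p) + u^*_p\,\delta p + O(\|\delta p\|^2),
  \qquad u^*_p = -\phi_{uu}^{-1}\phi_{up},\\
\Delta &= \phi\bigl(u^*(p+\delta p),p\bigr) - \phi\bigl(u^*(p),p\bigr)\\
       &= \underbrace{\phi_u\bigl(u^*(p),p\bigr)}_{=\,0}\,
          \bigl(u^*(p+\delta p)-u^*(p)\bigr)
          + \tfrac{1}{2}\,\delta p^\top {u^*_p}^{\top}\phi_{uu}\,u^*_p\,\delta p
          + O(\|\delta p\|^3)\\
       &= \tfrac{1}{2}\,\delta p^\top \phi_{pu}\,\phi_{uu}^{-1}\,\phi_{up}\,\delta p
          + O(\|\delta p\|^3),\\
\mathbb{E}[\Delta] &\approx \tfrac{1}{2}\,
          \mathrm{tr}\!\left(\phi_{pu}\,\phi_{uu}^{-1}\,\phi_{up}\,S\right).
\end{align*}
```

The first-order term vanishes because the nominal input is stationary for the cost, which is why the leading contribution of the estimation error to the loss is quadratic and can be weighted by the estimation covariance.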
Using (28), we observe that: E_e[V(u_0, Σ, p̂_0, e)] = F(u_0, Σ, p̂_0 It follows that the expected value of the optimality loss function reads as: It is useful to observe that even though a modified optimality loss function has been selected here, its approximation (37) is nonetheless very similar to the one proposed in [22]. Hence, the real difference lies in its interpretation as an approximation of the modified function (23) rather than (16). Here, it is useful to introduce the following lemma:

Lemma 2. If the following conditions hold:
1. the noise $e$ has a multivariate normal and centered distribution
2. for all $p \in P$, $u^*(p)$ exists, is smooth, unique and satisfies the Second-Order Sufficient Condition (SOSC) of optimality
3. the parameter estimation problem (20) has a unique solution $\hat{p}_1(u_0, \Sigma, \hat{p}_0, e)$ satisfying SOSC for any $e$ and is smooth and polynomially bounded in $e$
4.
Then, the inequality: holds locally for some constant $c > 0$, where $\|\cdot\|$ is the matrix two-norm.
Proof. Because all functions are smooth and bounded by polynomials, the function $\Delta$ is also smooth and bounded by polynomials. It follows that: is also smooth and polynomially bounded. Additionally, the bound: holds locally for some $c > 0$ as a result of Taylor's theorem. Then, Inequality (38) follows directly from Lemma 3.
For the sake of clarity, the deployment of Problem (41) in a run-to-run algorithm is detailed in Algorithm 1.
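As a rough illustration of how such a run-to-run scheme can be organized, the sketch below applies the idea to a scalar toy problem. Everything specific in it is an assumption made for illustration and is not taken from Algorithm 1: the cost φ(u, p) = ½(u − p)², the measurement model y = u p, the use of ½ Σ_1(u) as the second-order estimate of the next-run optimality loss, the grid search, and all function names.

```python
import random


def design_cost(u, p_hat, var_p, var_meas):
    """Nominal cost plus a second-order estimate of the next-run optimality loss.

    For the ASSUMED cost phi = 0.5 * (u - p)**2, phi_uu = 1 and phi_up = -1,
    so the expected loss is approximated by 0.5 * (variance of the next estimate).
    """
    next_var = var_p * var_meas / (var_meas + var_p * u ** 2)
    return 0.5 * (u - p_hat) ** 2 + 0.5 * next_var


def choose_input(p_hat, var_p, var_meas, half_width=3.0, n=6001):
    """Grid search centered on the nominal input u*(p_hat) = p_hat."""
    grid = [p_hat - half_width + 2.0 * half_width * i / (n - 1) for i in range(n)]
    return min(grid, key=lambda u: design_cost(u, p_hat, var_p, var_meas))


def update_estimate(p_hat, var_p, u, y_meas, var_meas):
    """Weighted least-squares update for the ASSUMED model y = u * p."""
    information = 1.0 / var_p + u ** 2 / var_meas
    new_p_hat = (p_hat / var_p + u * y_meas / var_meas) / information
    return new_p_hat, 1.0 / information


def run_to_run(p_real, p_hat, var_p, var_meas, n_runs, seed=0):
    """Alternate input selection, a simulated plant run and re-estimation."""
    rng = random.Random(seed)
    history = []
    for _ in range(n_runs):
        u = choose_input(p_hat, var_p, var_meas)
        y_meas = u * p_real + rng.gauss(0.0, var_meas ** 0.5)  # simulated plant
        p_hat, var_p = update_estimate(p_hat, var_p, u, y_meas, var_meas)
        history.append((u, p_hat, var_p))
    return history


if __name__ == "__main__":
    for u, p_hat, var_p in run_to_run(1.0, 0.0, 1.0, 0.01, n_runs=5):
        print(u, p_hat, var_p)
```

Starting from the estimate 0, the nominal input would be 0 and would yield an uninformative measurement under the assumed model; the variance term pushes the first input away from zero, after which the estimate variance shrinks from run to run.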

Illustrative Example: Observability Problem
We consider again the example (13), i.e.: where we consider $\hat{p}_0 = p_{\text{real}} + e_0$ as known a priori with $\mathbb{E}[e_0] = 0$, and $\hat{p}_1$ is provided by the estimation problem: and takes the explicit solution: The optimality loss for the second run then reads as: and its expected value takes the form: Ignoring the constant terms and since $\mathbb{E}[e_0] = 0$, the two-stage optimal experimental design then picks the input $u_0$ according to: We observe that in this simple case, the proposed approximation (41) is identical to the original problem (24) and to Problem (19). This equivalence does not hold in general. The behavior of Problem (41) in this simple case is reported in Figures 3 and 4. In particular, we observe that the expected performance of Problem (41) on this example is consistently better than the one of the nominal approach. It is important to understand here that in this specific example, the difference between Figures 2 and 4 lies in the cost function that evaluates the performance of the nominal and proposed approach. Indeed, because of the approximation $p_{\text{real}} = \hat{p}_0$, the original approach (19) appears potentially counterproductive under its targeted performance metric (10). Instead, the proposed performance metric (24) is the one that can be minimized via exploiting measurements for subsequent optimizations. In general, however, the inputs selected by (10) and (24) are different.

Link to the Modifier Approach and the DMA Approximation
In this section, we draw a connection between the proposed developments and the well-proven modifier approach tackled via the recent Directional-Modifier Adaptation (DMA) algorithm [23,24]. In particular, we show that the DMA approach can be construed as an approximation of Problem (41).

The Modifier Approach
In the context of RTO, instead of considering uncertain model parameters, the Modifier Approach (MA) tackles the difficulty of working with uncertain process models by introducing a modification of the gradient of the cost function in the optimization problem. The MA then considers a model of the cost function in the form: where $p$ is a set of parameters that modifies the gradient of the process model. Hence, instead of refining the process model, the MA focuses on adjusting the cost gradient at the solution in order to reach optimality for the real process. At each run, measurements of the cost function can be used to improve the estimation of the process gradients via numerical differences. The measurements obtained at each run can be written as: while the measurement model reads as: Here, we consider the inputs prior to $u_0$ as fixed, since they are already realized, and we consider that a parameter estimation $\hat{p}_0$ is available from these previous measurements, with associated covariance $\Sigma_0$.

It can be verified that: for the positive semi-definite, rank-one weighting matrix $Q = c\,\delta u\,\delta u^T$. The close resemblance of the DMA problem (61) to Problem (58) offers a deeper understanding of the procedure at play in the DMA algorithm. More specifically, Problem (58) is identical to the DMA problem (61) if: We observe here that $\nabla\phi = \nabla\phi_0 + \hat{p}_0$, such that $\Sigma_{\nabla\phi} \equiv \Sigma_{\hat{p}_0}$ mathematically holds. Since $\delta u$ is the dominant unit-norm eigenvector of $\Sigma_{\nabla\phi}$ and is therefore also the dominant unit-norm eigenvector of $\Sigma^2_{\nabla\phi}$, it follows that matrix $Q$ is given by: Observing (62) and (63), it follows that the classical DMA method picks an input using: According to these observations, a reasonable choice for the scaling constant $c$ can be: It is useful to remark here that dismissing the information provided by $\nabla^2\phi_0$ may be advantageous when $\phi_0$ does not adequately reflect the curvature of the cost function of the real process. In such a case, the weighting provided by $\nabla^2\phi_0$ in (58) can arguably be misleading. Including estimations of the second-order sensitivities in the MA approach has been investigated in [19].
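To make the gradient-modification idea concrete, the following toy sketch applies a first-order modifier to a scalar problem. The plant cost, the nominal model φ_0(u) = u², the finite-difference step, the filter gain and all function names are hypothetical choices made for this illustration. The plant gradient is estimated by a forward difference on noise-free cost evaluations, in the spirit of the numerical differences mentioned above, and the modifier is updated through a first-order filter, as is common in modifier adaptation schemes.

```python
def plant_cost(u):
    """Hypothetical 'real' process cost, unknown to the optimizer; minimum at u = 1."""
    return 3.0 * (u - 1.0) ** 2


def model_gradient(u):
    """Gradient of the hypothetical nominal model phi_0(u) = u**2."""
    return 2.0 * u


def solve_modified_model(p):
    """Minimizer of the modified model phi_0(u) + p * u, i.e., the root of 2u + p = 0."""
    return -p / 2.0


def modifier_adaptation(gain=0.3, step=1e-6, n_runs=30):
    """Run-to-run adaptation of the gradient modifier p."""
    p = 0.0
    for _ in range(n_runs):
        u = solve_modified_model(p)
        # Forward-difference estimate of the plant cost gradient at the current input.
        plant_gradient = (plant_cost(u + step) - plant_cost(u)) / step
        # The modifier is the mismatch between plant and model gradients, filtered.
        p = (1.0 - gain) * p + gain * (plant_gradient - model_gradient(u))
    return solve_modified_model(p), p


if __name__ == "__main__":
    print(modifier_adaptation())
```

In this toy setting the model curvature is wrong (2 instead of 6), yet the iteration settles at the plant optimum because only the gradient at the converged point has to match. With gain = 1 the same iteration would diverge here, which illustrates why the modifier update is usually filtered.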

Illustrative Example
We illustrate here the developments proposed above via a simple quadratic example, which nonetheless captures a number of observations that ought to be made. Consider the cost model: such that the nominal optimal input is trivially given by: Problem (53) then reads as: while the approximate problem (58) reads as: Note that Problem (68) is unbounded for $\Sigma_{\nabla\phi} I < R^{-1}\Sigma_{\hat{p}_0}^2$, while (67) can have a well-defined solution; see Figures 5 and 6 for an illustration. This situation occurs here when the measurement noise is small while the current parameter estimation is highly uncertain, and it disappears when the parameter estimation becomes reliable, such that $\Sigma_{\hat{p}_0}$ becomes small. Note that this can be addressed in practice via an ad hoc regularization or by, e.g., bounding the input correction $u_0 - u^*(\hat{p}_0)$ in Problem (58).
Adopting a matrix $Q$ delivering a full-rank approximation of $\Sigma^2_{\hat{p}_0}$ does not help the DMA algorithm adopt directions (see the light-grey dashed line) that point in the direction of the solution to (67); hence, ignoring $\nabla^2\phi_0$ is problematic here.
The DMA-based problem (60) reads as: The behaviors of the DMA problem (69) and its proposed counterpart (68) are reported in Figures 5-8. In Figures 5 and 6, the two problems are compared for the setup: resulting in both problems being unbounded. In this case, the DMA approach (69) with a reduced choice of $c$ would ensure a bounded problem, while a regularization or trust-region technique for Problem (68) would deliver a solution. We observe in Figures 5 and 6 that ignoring the term $\nabla^2\phi_0$ in the DMA problem can lead the algorithm to favor a solution that departs significantly from the ones proposed by (53). In Figure 7, the two problems are compared for the setup: In this case, ignoring the term $\nabla^2\phi_0 = Q = I$ does not yield any difficulty. However, because all parameters $\hat{p}_0$ have a very similar covariance, the rank-one approximation of $\Sigma_{\hat{p}_0}$ misleads the DMA algorithm into selecting a solution that departs significantly from the one of (53). Finally, in Figure 8, the two problems are compared for the setup (72). In this last case, both the DMA problem (69) and (68) deliver solutions that are very close to the one of Problem (53), i.e., in this scenario, ignoring the term $\nabla^2\phi_0$ and forming a rank-one approximation do not affect the solution significantly.
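The rank-one construction discussed in this section can be illustrated with a few lines of code. The sketch below computes the dominant unit-norm eigenvector δu of a covariance matrix by power iteration and forms Q = c δu δuᵀ. The 2×2 covariance used in the usage example and the default scaling c = λ_max² (which makes Q the dominant rank-one part of Σ²) are assumptions made for illustration and are not taken from (63) or (65); the function names are likewise hypothetical.

```python
import math


def mat_vec(matrix, vector):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, vector)) for row in matrix]


def dominant_eigenpair(sigma, iterations=500):
    """Power iteration for a symmetric positive semi-definite, non-zero matrix.

    The start vector must not be orthogonal to the dominant eigenvector.
    """
    v = [1.0] + [0.5] * (len(sigma) - 1)
    for _ in range(iterations):
        w = mat_vec(sigma, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(x * y for x, y in zip(v, mat_vec(sigma, v)))
    return eigenvalue, v


def rank_one_weight(sigma, c=None):
    """Return Q = c * du * du^T together with du and the dominant eigenvalue."""
    eigenvalue, du = dominant_eigenpair(sigma)
    if c is None:
        c = eigenvalue ** 2  # ASSUMED scaling: dominant eigenvalue of sigma squared
    q = [[c * du_i * du_j for du_j in du] for du_i in du]
    return q, du, eigenvalue


if __name__ == "__main__":
    q, du, eigenvalue = rank_one_weight([[2.0, 1.0], [1.0, 2.0]])
    print(eigenvalue, du, q)
```

When the eigenvalues of the covariance are close to each other, the discarded part of Σ² is almost as large as the retained rank-one part, which is the situation in which the text reports that the rank-one approximation misleads the algorithm.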

Conclusions
In this paper, we have proposed a novel view of real-time optimization and of the modifier approach from an experimental design perspective. While some methods are available to handle the trade-off between process optimality and the gathering of information for the performance of future runs, this paper proposes a formal framework to construe this trade-off as an optimization problem and develops a tractable approximation of this problem. The paper then shows that the recent directional-modifier adaptation algorithm is a special formulation of this approximation. This observation allows one to further justify the directional-modifier adaptation algorithm from a theoretical standpoint and to consider a refined tuning of the algorithm. The theory presented in the paper is illustrated via simple examples.

holds for some sequence $\beta_k \ge 0$, with $\beta_k = 0$ for $k$ odd. This is a direct consequence of the generalized Isserlis theorem [27,28], which states that the expected value of any even-order moment of a multivariate normal centered distribution is a sum of products between $k/2$ entries of the covariance matrix $\Sigma$, while odd-order moments are null. It then also holds that: and the inequality: holds locally.
We observe that this Lemma appears to be a simple special case of the Theorem proposed by [26] on the delta method, restricted to the normal distribution.
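As a quick sanity check of the scalar case of the moment property used above: for x ~ N(0, σ²), the even moments are E[x^k] = σ^k (k − 1)!! and the odd moments vanish. The sketch below compares this closed form with a simple numerical quadrature; the function names and the integration settings are illustrative choices.

```python
import math


def double_factorial(n):
    """n!! for n >= -1, with (-1)!! = 0!! = 1."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


def gaussian_moment_closed_form(k, sigma):
    """E[x**k] for x ~ N(0, sigma**2)."""
    if k % 2 == 1:
        return 0.0
    return sigma ** k * double_factorial(k - 1)


def gaussian_moment_quadrature(k, sigma, n=20001, width=12.0):
    """Trapezoidal rule on [-width * sigma, width * sigma]."""
    lower = -width * sigma
    h = 2.0 * width * sigma / (n - 1)
    total = 0.0
    for i in range(n):
        x = lower + i * h
        weight = 0.5 if i in (0, n - 1) else 1.0
        density = math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
        total += weight * x ** k * density
    return total * h


if __name__ == "__main__":
    for k in range(7):
        print(k, gaussian_moment_closed_form(k, 1.5), gaussian_moment_quadrature(k, 1.5))
```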

Figure 1. Illustration for Problem (8). The level curves report the cost of (8) as a function of $u_0$ and $\Sigma_{\text{meas}}$, with $\Sigma_0 = 1$. The dashed lines report the optimal input $u_0$ for various values of $\Sigma_{\text{meas}}$. For $\Sigma_{\text{meas}}$ low enough, the problem has two non-zero solutions.

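Figure 2. Comparison of the performance resulting from using the nominal input $u_0 = u^*(\hat{p}_0)$ and the one resulting from (18) or (19) on the proposed example. The displayed cost is calculated according to (10). The left graph displays the cost resulting from using (18), which delivers a better expected performance than using the nominal input. The right graph displays the cost resulting from using (19), where $p_{\text{real}} \approx \hat{p}_0$ is used. In this example, this approximation is detrimental to the performance of Problem (10), resulting in a worse performance than the nominal one.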

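Figure 3. Comparison of the nominal and optimal experimental design on the proposed example for $\hat{p}_0 = 0$. The displayed cost is calculated according to the cost proposed in (24). It can be observed that the optimal experimental design approach has two solutions, due to the non-convexity of the problem.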

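Figure 4. The left graph illustrates the nominal and optimal experimental design performance on the proposed example. The displayed cost is calculated according to the cost proposed in (24). The right graph displays the corresponding inputs. Observe that the right-hand graph ought to be compared to the right-hand graph of Figure 2.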

Figure 5. Example of the problem where the quadratic approximation (58) is unbounded, while (53) has a solution. The black lines report the level curves of the cost of Problem (67); the grey lines report the level curves of the cost of Problem (68); and the light grey lines report the level curves of the cost of Problem (69) with $Q$ given by (63). In this example, ignoring the contribution of $\nabla^2\phi_0$ in the Directional-Modifier Adaptation (DMA) algorithm leads it to favor directions (light grey dashed line) that are significantly different from the ones favored by (67) (grey dashed line). The latter point to the solution of the original Problem (53).

Figure 6.

Figure 7. Illustration for Section 3.3, setup (71). The black lines report the level curves of the cost of Problem (67); the grey lines report the level curves of the cost of Problem (68); and the light grey lines report the level curves of the cost of Problem (69) with $Q$ given by (63). In this example, the rank-one approximation for $Q$ leads the DMA algorithm to propose a solution that is far from the one of (67).

Figure 8. Illustration for Section 3.3, setup (72). The black lines report the level curves of the cost of Problem (67); the grey lines report the level curves of the cost of Problem (68); and the light grey lines report the level curves of the cost of Problem (69) with $Q$ given by (63). In this example, all problems deliver very similar solutions.