Polyhedral DC Decomposition and DCA Optimization of Piecewise Linear Functions

For piecewise linear functions $f : \mathbb{R}^n \to \mathbb{R}$ we show how their abs-linear representation can be extended to yield simultaneously their decomposition into a convex part $\check f$ and a concave part $\hat f$, including a pair of generalized gradients $\check g, \hat g \in \mathbb{R}^n$. The latter satisfy strict chain rules and can be computed in the reverse mode of algorithmic differentiation, at a small multiple of the cost of evaluating $f$ itself. It is shown how $\check f$ and $\hat f$ can be expressed as a single maximum and a single minimum of affine functions, respectively. The two subgradients $\check g$ and $-\hat g$ are then used to drive DCA algorithms, where the (convex) inner problem can be solved in finitely many steps, e.g., by a Simplex variant or the true steepest descent method. Using a reflection technique to update the gradients of the concave part, one can ensure finite convergence to a local minimizer of $f$, provided the Linear Independence Kink Qualification holds. For piecewise smooth objectives the approach can be used as an inner method for successive piecewise linearization.


Introduction and Notation
There is a large class of functions $f : \mathbb{R}^n \to \mathbb{R}$ that are called DC because they can be represented as the difference of two convex functions; see for example [1,2]. This property can be exploited in various ways, especially for (hopefully global) optimization. We find it notationally and conceptually more convenient to express these functions as averages of a convex and a concave function, such that

$$f(x) = \tfrac12\big(\check f(x) + \hat f(x)\big)$$

with $\check f(x)$ convex and $\hat f(x)$ concave.
Throughout we will annotate the convex part by the superscript $\check{\ }$ and the concave part by the superscript $\hat{\ }$, which seems rather intuitive since they remind us of the absolute value function and its negative. Since we are mainly interested in piecewise linear functions, we assume without much loss of generality that the function $f$ and its convex and concave components are well defined and finite on all of the Euclidean space $\mathbb{R}^n$. Allowing both components to be infinite outside their proper domain would obviously generate serious indeterminacies, i.e., NaNs in the numerical sense. As we will see later, we can in fact ensure in our setting that pointwise

$$\hat f(x) \;\le\; f(x) \;\le\; \check f(x) \quad\text{for all } x \in \mathbb{R}^n, \qquad (1)$$

which means that we actually obtain an inclusion in the sense of interval mathematics [3]. This is one of the attractions of the averaging notation. We will therefore also refer to $\hat f$ and $\check f$ as the concave and convex bounds of $f$. Note that the decomposition is by no means unique; for instance, rewriting $f = (2f) - f$ and applying the difference rule yields

$$f \;=\; (\check f + \hat f) - \tfrac12(\check f + \hat f) \;=\; \big(\check f - \tfrac12\hat f\big) + \big(\hat f - \tfrac12\check f\big) \;=\; \tfrac12\big[(2\check f - \hat f) + (2\hat f - \check f)\big]. \qquad (3)$$
If in Equation (3) by chance we had originally $-\hat f = \tfrac12\check f > 0$, so that $f = \tfrac14\check f$ with the condition number $\kappa(\check f, -\tfrac12\check f) = 3$, we would get after the doubling and subtraction the condition number $\kappa(\tfrac52\check f, -2\check f) = 9$. Here $\kappa(\check u, \hat u) = (|\check u| + |\hat u|)/|\check u + \hat u|$ measures the conditioning of a decomposition pair. So it is obviously important that the original algorithm avoids as much as possible calculations that are ill-conditioned in that they even just partly compensate each other.
Throughout the paper we assume that the functions in question are evaluated by a computational procedure that generates a sequence of intermediate scalars, which we denote generically by $u$, $v$ and $w$. The last of these scalar variables is the dependent, which is usually denoted by $f$. All of them are continuous functions $u = u(x)$ of the vector $x \in \mathbb{R}^n$ of independent variables. As is customary in mathematics, we will often use the same symbol to identify a function and its dependent variable. For the overall objective we will sometimes distinguish them and write $y = f(x)$. For most of the paper we assume that the intermediates are obtained from each other by affine operations or the absolute value function, so that the resulting $u(x)$ are all piecewise linear functions.
The paper is organized as follows. In the following Section 2 we develop rules for propagating the convex/concave decomposition through a sequence of abs-linear operations applied to intermediate quantities $u$. This can be done either directly on the pair of bounds $(\check u, \hat u)$ or on their average $u$ and their halved distance $\delta u = \tfrac12(\check u - \hat u)$. In Section 3 we organize such sequences into an abs-linear form for $f$ and then extend it to simultaneously yield the convex/concave decomposition. As a consequence of this analysis we get a strengthened version of the classical max−min representation of piecewise linear functions, which reduces to the difference of two polyhedral parts in max- and min-form. In Section 4 we develop strict rules for propagating certain generalized gradient pairs $(\check g, \hat g)$ of $(\check u, \hat u)$, exploiting convexity and the cheap gradient principle [5]. In Section 5 we discuss the consequences for the DCA when using limiting gradients $(\check g, \hat g)$, solving the inner linear optimization problem (LOP) exactly, and ensuring optimality via polyhedral reflection. In Section 6 we demonstrate the new results on the nonconvex and piecewise linear chained Rosenbrock version of Nesterov [6]. Section 7 contains a summary and preliminary conclusion with outlook. In Appendix A we give the details of the necessary and sufficient optimality test from [7] in the present DC context.

Propagating Bounds and/or Radii
In Equation (3) we already assumed that doubling is done componentwise and that for a difference $v = w - u$ of DC functions $w$ and $u$ one defines the convex and concave parts by

$$\widecheck{(w-u)} = \check w - \hat u \quad\text{and}\quad \widehat{(w-u)} = \hat w - \check u, \qquad (4)$$

respectively. This yields in particular for the negation

$$\widecheck{(-u)} = -\hat u \quad\text{and}\quad \widehat{(-u)} = -\check u.$$
For piecewise linear functions we need neither the square formula Equation (2) nor the more general decompositions for products. Therefore we will not insist on the sign conditions, even though they would also be maintained automatically by Equation (4) as well as by the natural linear rules for the convex and concave parts of the sum and the scalar multiple of a DC function, namely

$$\widecheck{(w+u)} = \check w + \check u \quad\text{and}\quad \widehat{(w+u)} = \hat w + \hat u,$$
$$\widecheck{(c\,u)} = c\,\check u \quad\text{and}\quad \widehat{(c\,u)} = c\,\hat u \quad\text{if } c \ge 0,$$
$$\widecheck{(c\,u)} = c\,\hat u \quad\text{and}\quad \widehat{(c\,u)} = c\,\check u \quad\text{if } c \le 0. \qquad (5)$$
However, the sign conditions would force one to decompose even a simple affine function $u(x) = a^\top x + \beta$ as

$$u(x) = \max(0, a^\top x + \beta) + \min(0, a^\top x + \beta) = \tfrac12\big(\check u(x) + \hat u(x)\big),$$

which does not seem such a good idea from a computational point of view.
The key observation for this paper is that, as is well known (see e.g., [8]), one can propagate the absolute value operation according to the identity

$$|u| = \max(u, -u) = \tfrac12\max(\check u + \hat u,\, -\check u - \hat u) = \max(\check u, -\hat u) + \tfrac12(\hat u - \check u)$$
$$\Longleftrightarrow\qquad \widecheck{|u|} = 2\max(\check u, -\hat u) \quad\text{and}\quad \widehat{|u|} = \hat u - \check u. \qquad (6)$$
Here the equality in the second line can be verified by shifting the difference $\tfrac12(\hat u - \check u)$ into the two arguments of the max. Again we see that when applying the absolute value operation to an already positive convex function $u = \tfrac12\check u \ge 0$ we get $\widecheck{|u|} = 2\check u$ and $\widehat{|u|} = -\check u$, so that the condition number grows from $\kappa(\check u, 0) = 1$ to $\kappa(2\check u, -\check u) = 3$. In other words, we observe once more the danger that the two component functions drift apart. This looks a bit like the simultaneous growth of numerator and denominator in rational arithmetic, which can sometimes be limited through cancelations by common integer factors. It is currently not clear when and how a similar compactification of a given convex/concave decomposition can be achieved. The corresponding rule for the maximum is derived similarly easily, namely

$$\max(u, w) = \tfrac12\max(\check u + \hat u,\, \check w + \hat w) = \tfrac12\big(\max(\check u - \hat w,\, \check w - \hat u) + (\hat u + \hat w)\big).$$
When $u$ and $w$ as well as their decompositions are identical, we arrive at the new decomposition $u = \max(u, u) = \tfrac12\big((\check u - \hat u) + 2\hat u\big)$, which obviously represents again some deterioration in the conditioning. While it was pointed out in [4] that the DC functions $u = \tfrac12(\check u + \hat u)$ themselves form an algebra, their decomposition pairs $(\check u, \hat u)$ are not even an additive group, as only the zero $(0, 0)$ has a negative partner, i.e., an additive inverse. Naturally, the pairs $(\check u, \hat u)$ form the Cartesian product of the convex cone of convex functions and its negative, i.e., the cone of concave functions. The DC functions are then the linear envelope of the two cones in some suitable space of locally Lipschitz continuous functions. It is not clear whether this interpretation helps in some way, and in any case we are here mainly concerned with piecewise linear functions.
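The pointwise effect of the propagation rules (4)-(6) can be sketched in a few lines of Python. The class and function names are our own, and the pairs are propagated as numbers at a fixed evaluation point rather than as functions; this is a minimal sketch, not the paper's implementation:

```python
from dataclasses import dataclass

@dataclass
class DCPair:
    """Values of the convex and concave bounds at one point, u = (cv + cc)/2."""
    cv: float  # value of the convex part  u_check(x)
    cc: float  # value of the concave part u_hat(x)

    @property
    def val(self):                    # central value u(x)
        return 0.5 * (self.cv + self.cc)

    @staticmethod
    def affine(value):                # affine quantities: u_check = u = u_hat
        return DCPair(value, value)

    def __add__(self, other):         # sum rule: parts add separately
        return DCPair(self.cv + other.cv, self.cc + other.cc)

    def __neg__(self):                # negation swaps and negates the parts
        return DCPair(-self.cc, -self.cv)

    def scale(self, c):               # multiplication by a constant c
        return DCPair(c * self.cv, c * self.cc) if c >= 0 else DCPair(c * self.cc, c * self.cv)

def dc_abs(u):                        # rule (6): 2*max(u_check, -u_hat) and u_hat - u_check
    return DCPair(2.0 * max(u.cv, -u.cc), u.cc - u.cv)

def dc_max(u, w):                     # max rule: max(u_check - w_hat, w_check - u_hat), u_hat + w_hat
    return DCPair(max(u.cv - w.cc, w.cv - u.cc), u.cc + w.cc)
```

In every operation the invariant $u = \tfrac12(\check u + \hat u)$ is preserved, while the two bounds may drift apart as discussed above.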

Propagating the Center and Radius
Rather than propagating the pairs $(\check u, \hat u)$ through an evaluation procedure as defined in [5] to calculate the function value $f(x)$ at a given point $x$, it might be simpler and better for numerical stability to propagate the pair

$$(u, \delta u) \quad\text{with}\quad u = \tfrac12(\check u + \hat u) \ \text{ and }\ \delta u = \tfrac12(\check u - \hat u) \ge 0, \quad\text{so that}\quad \check u = u + \delta u \ \text{ and }\ \hat u = u - \delta u. \qquad (7)$$

This representation resembles the so-called central form in interval arithmetic [9], and we will therefore call $u$ the central value and $\delta u$ the radius. In other words, $u$ is just the normal piecewise affine intermediate function, and $\delta u$ is a convex distance function to the hopefully close convex and concave parts. Should the potential blow-up discussed above actually occur, this will only affect $\delta u$ but not the central value $u$ itself. Moreover, at least theoretically one might decide to reduce $\delta u$ from time to time, making sure of course that the corresponding $\check u$ and $\hat u$ as defined in Equation (7) stay convex and concave, respectively. The condition number now satisfies the bound $\kappa(\check u, \hat u) \le 1 + \rho u(x)$ with the relative gap $\rho u(x)$ defined below. Recall here that all intermediate quantities $u = u(x)$ are functions of the independent variable vector $x \in \mathbb{R}^n$. Naturally, we will normally only evaluate the intermediate pairs $u$ and $\delta u$ at a few iterates of whatever numerical calculation one performs involving $f$, so that we can only sample the ratio $\rho u(x) = |\delta u(x)/u(x)|$ pointwise, where the denominator is hopefully nonzero. We will also refer to this ratio as the relative gap of the convex/concave decomposition at a certain evaluation point $x$. The arithmetic rules for propagating radii of the central form in central convex/concave arithmetic are quite simple.

Lemma 1 (Propagation rules for central form). With $c, d \in \mathbb{R}$ two constants and $x$ an independent variable,

$$\delta x = 0, \qquad \delta(c\,u + d) = |c|\,\delta u, \qquad \delta(u \pm w) = \delta u + \delta w, \qquad \delta(|u|) = |u| + 2\,\delta u. \qquad (8)$$

Proof. The last rule follows from Equation (6) by

$$\delta(|u|) = \tfrac12\big(\widecheck{|u|} - \widehat{|u|}\big) = \max(\check u, -\hat u) - \tfrac12(\hat u - \check u) = \max(\check u - \delta u,\, -\hat u - \delta u) + 2\,\delta u = \max(u, -u) + 2\,\delta u = |u| + 2\,\delta u.$$
The first equation in Equation (8) means that for all quantities $u$ that are affine functions of the independent variables $x$ the corresponding radius $\delta u$ is zero, so that $\check u = u = \hat u$ until we reach the first absolute value. Notice that $\delta v$ does indeed grow additively for the subtraction, just like for the addition. By induction it follows from the rules above for an inner product that

$$\delta\Big(\sum_j c_j\, u_j\Big) = \sum_j |c_j|\,\delta u_j, \qquad (9)$$

where the $c_j \in \mathbb{R}$ are assumed to be constants. As we can see from the bounds in Lemma 1, the relative gap can grow substantially whenever one performs an addition of values with opposite signs or applies the absolute value operation. In contrast to interval arithmetic on smooth functions, one sees that the relative gap, though it may be zero or small initially, immediately jumps above 1 when one hits the first absolute value operation. This is not really surprising, since the best concave lower bound on $u(x) = |x|$ itself is $\hat u(x) = 0$, so that $\delta u = |x|$, $\check u(x) = 2|x|$ and thus $\rho u(x) = 1$ constantly. On the positive side one should notice that throughout we do not lose sight of the actual central values $u(x)$, which can be evaluated with full arithmetic precision. In any case we can think of neither $\rho$ nor $\kappa \le 1 + \rho$ as small numbers, but we must be content if they do not actually explode too rapidly. Therefore they will be monitored throughout our numerical experiments.
Again we see that the computational effort is almost exactly doubled. The radii can be treated as additional variables that occur only in linear operations and stay nonnegative throughout. Notice that in contrast to the (nonlinear) interval case we do not lose any accuracy by propagating the central form. It follows immediately by induction from Lemma 1 that any function evaluated by an evaluation procedure comprising a finite sequence of

• initializations to independent variables
• multiplications by constants
• additions or subtractions
• absolute value applications

is piecewise affine and continuous. We will call these operations and the resulting evaluation procedure abs-linear. It is also easy to see that the absolute value $|\cdot|$ can be replaced by the maximum $\max(\cdot,\cdot)$, the minimum $\min(\cdot,\cdot)$, the positive part function $\max(0,\cdot)$, or any combination of them, since they can all be mutually expressed in terms of each other and some affine operations. Conversely, it follows from the min-max representation established in [10] (Proposition 2.2.2) that any piecewise affine function $f$ can be evaluated by such an evaluation procedure. Consequently, by applying the formulas in Equations (4)-(6), one can propagate at the same time the convex and concave components of all intermediate quantities. Alternatively, one can propagate the centered form according to the rules given in Lemma 1. These rules are also piecewise affine, so that we have a finite procedure for simultaneously evaluating $\check u$ and $\hat u$ or $u$ and $\delta u$ as piecewise linear functions. The combined computation requires about 2-3 times as many arithmetic operations and twice as many memory accesses. Of course, due to the interdependence of the two components, it is not possible to evaluate just one of them without the other. As we will see, the same is true for the generalized gradients to be discussed later in Section 4.
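The central-form rules of Lemma 1 admit an equally small sketch. Again the names are our own and only values at a single evaluation point are propagated:

```python
from dataclasses import dataclass

@dataclass
class Central:
    """Central value u(x) and radius delta_u(x) >= 0 of the decomposition,
    so that u_check = val + rad and u_hat = val - rad."""
    val: float
    rad: float

    @staticmethod
    def affine(value):                # affine quantities carry radius zero
        return Central(value, 0.0)

    def __add__(self, other):         # radii add for sums ...
        return Central(self.val + other.val, self.rad + other.rad)

    def __sub__(self, other):         # ... and also for differences
        return Central(self.val - other.val, self.rad + other.rad)

    def scale(self, c):               # delta(c u) = |c| delta(u)
        return Central(c * self.val, abs(c) * self.rad)

def cabs(u):                          # delta(|u|) = |u| + 2 delta(u)
    return Central(abs(u.val), abs(u.val) + 2.0 * u.rad)
```

Note that the central value is exactly the ordinary function evaluation; the radius rides along through purely linear and nonnegative updates.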

Forming and Extending the Abs-Linear Form
In practice all piecewise linear objectives can be evaluated by a sequence of abs-linear operations, possibly after min and max have been rewritten as

$$\min(u, w) = \tfrac12\big(u + w - |u - w|\big) \quad\text{and}\quad \max(u, w) = \tfrac12\big(u + w + |u - w|\big).$$
Our only restriction is that the number $s$ of intermediate scalar quantities, say $z_i$, is fixed, which is true for example in the max−min representation. Then we can immediately cast the procedure in matrix-vector notation as follows:

Lemma 2 (Abs-Linear Form). Any continuous piecewise affine function $f : x \in \mathbb{R}^n \mapsto y \in \mathbb{R}$ can be represented by

$$z = c + Zx + Mz + L|z|, \qquad y = d + a^\top x + b^\top z,$$

where $z \in \mathbb{R}^s$, $c \in \mathbb{R}^s$, $Z \in \mathbb{R}^{s\times n}$, $M, L \in \mathbb{R}^{s\times s}$ are strictly lower triangular, $d \in \mathbb{R}$, $a \in \mathbb{R}^n$, $b \in \mathbb{R}^s$, and $|z|$ denotes the componentwise modulus of the vector $z$.
It should be noted that the construction of this general abs-linear form requires no analysis or computation whatsoever. However, especially for our purpose of generating a reasonably tight DC decomposition, it is advantageous to reduce the size of the abs-linear form by eliminating all intermediates $z_j$ with $j < s$ for which $|z_j|$ never occurs on the right-hand side. To this end we may simply substitute the expression for $z_j$ given in the $j$-th row in all places where $z_j$ itself occurs on the right-hand side. The result is what we will call a reduced abs-linear form, where after renumbering, all remaining $z_j$ with $j < s$ are switching variables, in that $|z_j|$ occurs somewhere on the right-hand side. In other words, all but the last column of the reduced, strictly lower triangular matrix $L$ are nontrivial. Again, this reduction process is completely mechanical and does not require any nontrivial analysis, other than looking up which columns of the original $L$ were zero. The resulting reduced system is smaller and probably denser, which might increase the computational effort for evaluating $f$ itself. However, in view of Equation (9) we must expect that for the reduced form the radii will grow more slowly if we first accumulate linear coefficients and then take their absolute values. Hence we will assume in the remainder of this paper that the abs-linear form for our objective $f$ of interest is reduced.
Based on the concept of abs-linearization introduced in [11], a slightly different version of a (reduced) abs-normal form was already proposed in [12]. In the present paper, $z$ depends directly on $z$ via the matrix $M$ and $y$ depends on $z$ via the vector $b$, but $y$ no longer depends directly on $|z|$. All forms can easily be transformed into each other by elementary modifications. The intermediate variables $z_i$ can be calculated successively for $1 \le i \le s$ by

$$z_i = c_i + Z_i x + M_i z + L_i |z|,$$

where $Z_i$, $M_i$ and $L_i$ denote the $i$th rows of the corresponding matrices. By induction on $i$ one sees immediately that they are piecewise affine functions $z_i = z_i(x)$, and we may define for each $x$ the signature vector

$$\sigma(x) = \big(\operatorname{sgn}(z_i(x))\big)_{i=1\ldots s} \in \{-1, 0, 1\}^s.$$
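The row-wise recursion above can be sketched directly, assuming NumPy and our own naming; the function returns the value, the switching vector and the signature. In the test below we encode the toy instance $\max(x_1, x_2) = \tfrac12(x_1 + x_2 + |x_1 - x_2|)$ with switching variables $z_1 = x_1 - x_2$ and $z_2 = |z_1|$; this tiny instance is our own construction:

```python
import numpy as np

def eval_abs_linear(c, Z, M, L, a, b, d, x):
    """Forward evaluation of the abs-linear form
         z = c + Z x + M z + L|z|,   y = d + a^T x + b^T z,
    exploiting that M and L are strictly lower triangular."""
    s = len(c)
    z = np.zeros(s)
    for i in range(s):
        z[i] = c[i] + Z[i] @ x + M[i, :i] @ z[:i] + L[i, :i] @ np.abs(z[:i])
    y = d + a @ x + b @ z
    return y, z, np.sign(z).astype(int)   # value, switching vector, signature
```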
Consequently we get the inverse images

$$P_\sigma = \{x \in \mathbb{R}^n : \operatorname{sgn}(z(x)) = \sigma\} \quad\text{for}\quad \sigma \in \{-1, 0, 1\}^s, \qquad (13)$$

which are relatively open polyhedra that collectively form a disjoint decomposition of $\mathbb{R}^n$. The situation for the second example of Nesterov is depicted in Figure 3 in the penultimate section. There are six polyhedra of full dimension, seven polyhedra of codimension 1 drawn in blue, and two points, which are polyhedra of dimension 0. The point $(0, -1)$ with signature $(0, -1, 0)$ is stationary, and the point $(1, 1)$ with signature $(1, 0, 0)$ is the minimizer, as shown in [7]. The arrows indicate the path of our reflection version of the DCA method as described in Section 5. When $\sigma$ is definite, i.e., has no zero components, which we will denote by $0 \notin \sigma$, it follows from the continuity of $z(x)$ that $P_\sigma$ has full dimension $n$ unless it is empty. In degenerate situations this may also be true for indefinite $\sigma$, but then the closure of $P_\sigma$ is contained in the extended closure

$$\bar P_{\tilde\sigma} = \{x \in \mathbb{R}^n : \sigma(x) \prec \tilde\sigma\} \supset \operatorname{close}(P_{\tilde\sigma}) \qquad (14)$$

for some definite $0 \notin \tilde\sigma \succ \sigma$. Here the (reflexive) partial ordering $\prec$ between the signature vectors satisfies the equivalence

$$\sigma \prec \tilde\sigma \iff \sigma_i \tilde\sigma_i \ge \sigma_i^2 \ \text{ for } \ i = 1 \ldots s, \qquad (15)$$

as shown in [13]. One can easily check that for any $\sigma \succ \tilde\sigma$ there exists a unique signature $\bar\sigma \succ \tilde\sigma$ with $\bar\sigma_i = -\sigma_i$ whenever $\tilde\sigma_i = 0$. We call $\bar\sigma = \sigma \triangleright \tilde\sigma$ the reflection of $\sigma$ at $\tilde\sigma$, which thus also satisfies $\bar\sigma \succ \tilde\sigma$, and we have in fact $\sigma = \bar\sigma \triangleright \tilde\sigma$. Hence the relation between $\sigma$ and $\bar\sigma$ is symmetric, and we will therefore call $(\sigma, \bar\sigma)$ a complementary pair with respect to $\tilde\sigma$. In the very special case $z_i = x_i$ for $i = 1 \ldots n = s$, the $\bar P_\sigma$ are orthants, and their reflections at the origin $\{0\} = \bar P_0 \subset \mathbb{R}^n$ are their geometric opposites $\bar P_{\bar\sigma}$ with $\bar\sigma = -\sigma$. Here one can see immediately that all edges, i.e., one-dimensional polyhedra, have Cartesian signatures $\pm e_i$ for $i = 1 \ldots n$ and belong to $\bar P_\sigma$ or $\bar P_{\bar\sigma}$ for any given $\sigma$. Notice that $\bar x$ is a local minimizer of a piecewise linear function if and only if it is a local minimizer along all edges of nonsmoothness emanating from it.
Consequently, optimality of $f$ restricted to a complementary pair is equivalent to local optimality on $\mathbb{R}^n$, not only in this special case, but whenever the Linear Independence Kink Qualification (LIKQ) holds, as introduced in [13] and defined in Appendix A. This observation is the basis of the implicit optimality condition verified by our DCA variant Algorithm 1 through the use of reflections. The situation is depicted in Figure 3, where the signatures $(-1, -1, -1)$ and $(1, -1, 1)$ as well as $(1, -1, 1)$ and $(1, 1, -1)$ form complementary pairs at $(0, -1)$ and $(1, 1)$, respectively. At both reflection points there are four emanating edges, which all belong to one of the three polyhedra mentioned.
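Under the componentwise reading of the reflection given above (entries fixed by $\tilde\sigma$ are kept, the rest are flipped), a two-line sketch with a hypothetical name suffices; the test replays the complementary pairs from Figure 3:

```python
def reflect(sigma, sigma_tilde):
    """Reflection of sigma at sigma_tilde: keep the entries where
    sigma_tilde is nonzero and flip the sign of sigma on its zero set."""
    return tuple(st if st != 0 else -s for s, st in zip(sigma, sigma_tilde))
```

Applying the reflection twice returns the original signature, which is exactly the symmetry of the complementary pair.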
Applying the propagation rules from Lemma 1, one obtains with $\delta x = 0 \in \mathbb{R}^n$ the recursion

$$\delta z_i = |M_i|\,\delta z + |L_i|\big(|z| + 2\,\delta z\big), \qquad (16)$$

where the modulus is once more applied componentwise to vectors and matrices. Hence, we have again in matrix-vector notation

$$\big(I - |M| - 2|L|\big)\,\delta z = |L|\,|z|, \qquad (17)$$

which yields for $\delta z$ the explicit expression

$$\delta z = \big(I - |M| - 2|L|\big)^{-1} |L|\,|z| = \sum_{i=0}^{\nu-1} \big(|M| + 2|L|\big)^i\, |L|\,|z|.$$

Here, $\nu$ is the so-called switching depth of the abs-linear form of $f$, namely the largest $\nu \in \mathbb{N}$ such that $(|M| + |L|)^\nu \ne 0$, which is always less than $s$ due to the strict lower triangularity of $M$ and $L$. The unit lower triangular $(I - |M| - 2|L|)$ is an M-matrix [14], and interestingly enough it does not even depend on $x$ but directly maps $|z| = |z(x)|$ to $\delta z = \delta z(x)$. For the radius of the function itself, the propagation rules from Lemma 1 then yield

$$\delta f(x) = \delta y = |b|^\top \delta z \;\ge\; 0. \qquad (18)$$

This nonnegativity implies the inclusion Equation (1) already mentioned in Section 1, i.e.:

Theorem 1 (Inclusion by convex/concave decomposition). For any piecewise affine function $f$ in abs-linear form, the construction defined in Section 2 yields a convex/concave inclusion

$$\hat f(x) = f(x) - \delta f(x) \;\le\; f(x) \;\le\; f(x) + \delta f(x) = \check f(x).$$

Moreover, the convex and the concave parts $\check f(x)$ and $\hat f(x)$ have exactly the same switching structure as $f(x)$, in that they are affine on the same polyhedra $P_\sigma$ defined in (13).

Proof. Equations (16) and (17) ensure that $\delta f(x)$ is nonnegative at all $x \in \mathbb{R}^n$, so that the stated inclusion holds. It follows from Equation (17) that the radii $\delta z_i(x)$ are, like the $|z_i(x)|$, piecewise linear, with the only nonsmoothness arising through the switching variables $z(x)$ themselves. Obviously this property is inherited by $\delta f(x)$ and the linear combinations $\check f(x) = f(x) + \delta f(x)$ and $\hat f(x) = f(x) - \delta f(x)$, which completes the proof.
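The radius formula $\delta z = (I - |M| - 2|L|)^{-1}|L|\,|z|$ and $\delta f = |b|^\top \delta z$ can be checked numerically; the following sketch (our own names, and in the test the tiny $\max(x_1, x_2)$ instance used earlier) assumes NumPy:

```python
import numpy as np

def radii(M, L, b, z):
    """delta_z = (I - |M| - 2|L|)^{-1} |L| |z|  and  delta_f = |b|^T delta_z."""
    s = len(z)
    A = np.eye(s) - np.abs(M) - 2.0 * np.abs(L)   # unit lower triangular M-matrix
    dz = np.linalg.solve(A, np.abs(L) @ np.abs(z))
    return dz, float(np.abs(b) @ dz)
```

Since the system matrix does not depend on $x$, it can be factorized once and reused for every evaluation point.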
Combining Equations (16) and (18) with the abs-linear form of the piecewise affine function $f$ and defining $\tilde z = (z, \delta z) \in \mathbb{R}^{2s}$, one obtains for the calculation of $\tilde f(x) = \tilde y = (y, \delta y)$ the abs-linear form

$$\tilde z = \tilde c + \tilde Z x + \tilde M \tilde z + \tilde L |\tilde z|, \qquad \tilde y = \tilde d + \tilde A x + \tilde B \tilde z, \qquad (19)$$

with the vectors and matrices defined by

$$\tilde c = \begin{pmatrix} c \\ 0 \end{pmatrix}, \quad \tilde Z = \begin{pmatrix} Z \\ 0 \end{pmatrix}, \quad \tilde M = \begin{pmatrix} M & 0 \\ 0 & |M| + 2|L| \end{pmatrix}, \quad \tilde L = \begin{pmatrix} L & 0 \\ |L| & 0 \end{pmatrix}, \quad \tilde d = \begin{pmatrix} d \\ 0 \end{pmatrix}, \quad \tilde A = \begin{pmatrix} a^\top \\ 0 \end{pmatrix}, \quad \tilde B = \begin{pmatrix} b^\top & 0 \\ 0 & |b|^\top \end{pmatrix}. \qquad (20)$$

Then, Equations (19) and (20) yield

$$\delta z = \big(|M| + 2|L|\big)\,\delta z + |L|\,|z| \quad\text{and}\quad \delta y = |b|^\top \delta z,$$

i.e., Equations (16) and (18). As can be seen, the matrices $\tilde M$ and $\tilde L$ have the required strictly lower triangular form. Furthermore, it is easy to check that the switching depth of the abs-linear form of $f$ carries over to the abs-linear form of $\tilde f$, in that also $(|\tilde M| + |\tilde L|)^\nu \ne 0 = (|\tilde M| + |\tilde L|)^{\nu+1}$. However, notice that this system is not reduced, since the $s$ radii are not switching variables, but they are globally nonnegative anyhow. We can now obtain explicit expressions for the central values, radii, and bounds for a given signature $\sigma$.
Corollary 1 (Explicit representation of the centered form). For any definite signature $\sigma \not\ni 0$ and all $x \in P_\sigma$ we have, with $\Sigma = \operatorname{diag}(\sigma)$,

$$z_\sigma(x) = (I - M - L\Sigma)^{-1}(c + Zx) \quad\text{and}\quad |z_\sigma(x)| = \Sigma\, z_\sigma(x) \ge 0, \qquad (21)$$

so that in particular

$$f_\sigma(x) = d + a^\top x + b^\top z_\sigma(x) \quad\text{with}\quad \nabla f_\sigma = a + Z^\top (I - M^\top - \Sigma L^\top)^{-1} b,$$

and correspondingly $\delta f_\sigma(x) = |b|^\top (I - |M| - 2|L|)^{-1}|L|\,\Sigma\, z_\sigma(x)$, $\check f_\sigma = f_\sigma + \delta f_\sigma$ and $\hat f_\sigma = f_\sigma - \delta f_\sigma$, where the restrictions of the functions and their gradients to $P_\sigma$ are denoted by the subscript $\sigma$. Notice that the gradients are constant on these open polyhedra.
As one can see, the computation of the gradient $\nabla f_\sigma$ requires the solution of one unit upper triangular linear system, and that of both $\nabla \check f_\sigma$ and $\nabla \hat f_\sigma$ one more. Naturally, upper triangular systems are solved by back substitution, which corresponds to the reverse mode of algorithmic differentiation as described in the following section. Hence, the complexity of calculating the gradients is exactly the same as that of calculating the functions, which can be obtained by one forward substitution for $f_\sigma$ and an extra one for $\delta f_\sigma$ and thus $\check f_\sigma$ and $\hat f_\sigma$. The given $\nabla f_\sigma$, $\nabla \check f_\sigma$ and $\nabla \hat f_\sigma$ are proper gradients in the interior of the full-dimensional domains $P_\sigma$. For some or even many $\sigma$ the inverse image $P_\sigma$ of the map $x \mapsto \operatorname{sgn}(z(x))$ may be empty, in which case the formulas in the corollary do not apply. Checking the nonemptiness of $P_\sigma$ for a given signature $\sigma$ amounts to checking the consistency of a set of linear inequalities, which costs about the same as solving an LOP and is thus nontrivial. Expressions for the generalized gradients at points in lower-dimensional polyhedra are given in the following Section 4. There it is also not required that the abs-linear form has been reduced; one may consider any given sequence of abs-linear operations.
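A sketch of the closed-form gradient on $P_\sigma$, with our own naming and NumPy's generic solver in place of a dedicated triangular back substitution; the test reuses the $\max(x_1, x_2)$ toy instance, whose gradient is $e_1$ on $x_1 > x_2$ and $e_2$ on $x_1 < x_2$:

```python
import numpy as np

def gradient_on_polyhedron(Z, M, L, a, b, sigma):
    """Constant gradient of f on the open polyhedron P_sigma:
       grad f_sigma = a + Z^T (I - M^T - Sigma L^T)^{-1} b,  Sigma = diag(sigma)."""
    s = len(b)
    S = np.diag(np.asarray(sigma, dtype=float))
    w = np.linalg.solve(np.eye(s) - M.T - S @ L.T, b)   # unit upper triangular system
    return a + Z.T @ w
```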

The Two-Term Polyhedral Decomposition
It is well known ([15], Theorem 2.49) that all piecewise linear and globally convex or concave functions can be represented as the maximum or the minimum of a finite collection of affine functions, respectively. Hence, from the convex/concave decomposition we get the following drastic simplification of the classical min-max representation given, e.g., in [10].
From Theorem 1 we thus obtain a representation

$$f(x) = \tfrac12\big(\check f(x) + \hat f(x)\big) = \tfrac12\Big(\max_{1\le i\le k}\big(\alpha_i + a_i^\top x\big) + \min_{1\le j\le l}\big(\beta_j + b_j^\top x\big)\Big), \qquad (26)$$

where furthermore $\hat f(x) \le f(x) \le \check f(x)$.
The max-part of this representation is what is called a polyhedral function in the literature [15]. Since the min-part is correspondingly the negative of a polyhedral function, we may also refer to Equation (26) as a DP decomposition, i.e., a difference of two polyhedral functions.
We are not aware of a publication that gives a practical procedure for computing such a collection of affine functions $\alpha_i + a_i^\top x$, $i = 1 \ldots k$, and $\beta_j + b_j^\top x$, $j = 1 \ldots l$, for a given piecewise linear function $f$. Of course, the critical question is in which form the function $f$ is specified. Here, as throughout our work, we assume that it is given by a sequence of abs-linear operations. Then we can quite easily propagate a grouped representation

$$f(x) = \tfrac12\Big(\sum_{i=1}^{\bar m} \max_{j \in I_i}\big(\alpha_{ij} + a_{ij}^\top x\big) + \sum_{i=1}^{\bar n} \min_{j \in J_i}\big(\beta_{ij} + b_{ij}^\top x\big)\Big) \qquad (27)$$

with index sets $I_i = \{1, \ldots, k_i\}$, $1 \le i \le \bar m$, and $J_i = \{1, \ldots, l_i\}$, $1 \le i \le \bar n$. This can be condensed into the single max and min of (26), since one has to consider all possibilities of selecting one affine function each from the $\bar m$ max groups and the $\bar n$ min groups, respectively. Obviously, the resulting representation (28) involves $\prod_{i=1}^{\bar m} k_i$ and $\prod_{i=1}^{\bar n} l_i$ affine function terms, in contrast to the $\sum_i k_i$ and $\sum_i l_i$ terms of the first representation (27).

Proof. We will consider the representations (27), from which (26) can be directly obtained in the form (28). Firstly, the independent variables $x_j$ are affine functions of themselves with gradient $a = e_j$ and inhomogeneity $\alpha = 0$. Then, for multiplications by a constant $c > 0$, we have to scale all affine functions by $c$. Secondly, addition requires appending the expansions of the two summands to each other without any computation. Taking the negative requires switching the sign of all affine functions and interchanging the max and min groups. Finally, to propagate through the absolute value we have to apply the rule (6), which means switching the signs in the min group, expressing it in terms of max, and merging it with the existing max group. Here merging means pairwise joining each polyhedral term of the old max group with each term of the switched min group. Then the new min group is the old one plus the old max group with its sign switched.
We see that taking the absolute value or, alternatively, maxima or minima generates the strongest growth in the number of polyhedral terms and their size. It seems clear that this representation is generally not very useful, because the number of terms will likely blow up exponentially. This is not surprising, because we need one affine function for each element of the polyhedral decomposition of the domain induced by the max and min terms. Typically, many of the affine terms will be redundant, i.e., they could be removed without changing the values of the polyhedral terms. Unfortunately, identifying those already requires solving primal or dual linear programming problems; see, e.g., [16]. It seems highly doubtful that this would ever be worthwhile. Therefore, we will continue to advocate dealing with piecewise linear functions in a convenient procedural abs-linear representation.
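For univariate pieces, the merging step of the proof above can be sketched as follows; the list-based representation and all names are our own. Each polyhedral part is a list of affine pieces, the convex bound being their max and the concave bound their min:

```python
from itertools import product

# A polyhedral part over R^1 is a list of affine pieces (alpha, slope).
def pmax(pieces, x): return max(al + sl * x for al, sl in pieces)
def pmin(pieces, x): return min(al + sl * x for al, sl in pieces)

def neg(pieces):                  # negate every affine piece
    return [(-al, -sl) for al, sl in pieces]

def merge_sum(P, Q):              # pieces of a sum: one choice from each group
    return [(p0 + q0, p1 + q1) for (p0, p1), (q0, q1) in product(P, Q)]

def abs_pieces(cv, cc):
    """Rule (6) on the list level: the new max group is 2*max(u_check, -u_hat),
    the new min group represents u_hat - u_check."""
    return ([(2 * al, 2 * sl) for al, sl in cv + neg(cc)],   # new max group
            merge_sum(cc, neg(cv)))                          # new min group
```

The pairwise merge in `merge_sum` is exactly where the product growth $\prod_i k_i$ originates.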

Computation of Generalized Gradients and Constructive Oracle Paradigm
For optimization by variants of the DCA algorithm [17] one needs generalized gradients of the convex and the concave component. Normally, there are no strict rules for propagating generalized gradients through nonsmooth evaluation procedures. However, exactly this is simply assumed in the frequently invoked oracle paradigm, which states that at any point $x \in \mathbb{R}^n$ the function value $f(x)$ and an element $g \in \partial f(x)$ can be evaluated. We have argued in [18] that this is not at all a reasonable assumption.
On the other hand, it is well understood that for the convex operations, namely positive scaling, addition, and taking the maximum, the rules are strict and simple. Moreover, the generalized gradient in the sense of Clarke, $\partial \check f(x) \subset \mathbb{R}^n$, is then actually a subdifferential, in that all its elements define supporting hyperplanes. Similarly, $\partial \hat f(x)$ might be called a superdifferential, in that the tangent planes bound the concave part from above.
In other words, we have at all $x \in \mathbb{R}^n$ and for all increments $\Delta x$

$$\check f(x + \Delta x) \ge \check f(x) + \check g^\top \Delta x \quad\text{and}\quad \hat f(x + \Delta x) \le \hat f(x) + \hat g^\top \Delta x,$$

which imply for $\check g \in \partial\check f(x)$ and $\hat g \in \partial\hat f(x)$ that

$$\tfrac12\big(\hat f(x+\Delta x) + \check f(x) + \check g^\top \Delta x\big) \;\le\; f(x + \Delta x) \;\le\; \tfrac12\big(\check f(x+\Delta x) + \hat f(x) + \hat g^\top \Delta x\big),$$

where the lower bound on the left is a concave function and the upper bound is convex, both with respect to $\Delta x$. Notice that the generalized superdifferential $\partial\hat f$, being the negative of the subdifferential of $-\hat f$, is also a convex set. Now the key question is how we can calculate a suitable pair of generalized gradients $(\check g, \hat g) \in \partial\check f(x) \times \partial\hat f(x)$. As we noted above, the convex part and the negative of the concave part only undergo convex operations, so that for $v = c\,u$ and $v = u + w$

$$\partial\check v = c\,\partial\check u,\ \partial\hat v = c\,\partial\hat u \ \text{ if } c \ge 0, \qquad \partial\check v = c\,\partial\hat u,\ \partial\hat v = c\,\partial\check u \ \text{ if } c \le 0, \qquad \partial\widecheck{(u+w)} = \partial\check u + \partial\check w,\ \partial\widehat{(u+w)} = \partial\hat u + \partial\hat w.$$

Finally, for $v = |u|$ we find by Equation (6) that $\partial\hat v = \partial\hat u - \partial\check u$, as well as

$$\tfrac12\,\partial\check v \;=\; \partial\max(\check u, -\hat u) \;=\; \begin{cases} \partial\check u & \text{if } u > 0, \\ -\partial\hat u & \text{if } u < 0, \\ \operatorname{conv}\{\partial\check u \cup -\partial\hat u\} & \text{if } u = 0, \end{cases} \qquad (32)$$

where we have used that $u = \tfrac12(\check u + \hat u)$ in Equation (32). The sign of the argument $u$ of the absolute value function is of great importance, because it determines the switching structure. For this reason, we formulated the cases in terms of $u$ rather than in terms of the convex/concave components. The operator $\operatorname{conv}\{\cdot\}$ denotes taking the convex hull or envelope of a given, usually closed, set. It is important to state that within an abs-linear representation the multipliers $c$ stay constant independent of the argument $x$, even if they were originally computed as partial derivatives by an abs-linearization process and are thus subject to round-off error. In particular, their sign will remain fixed throughout whatever algorithmic calculation we perform involving the piecewise linear function $f$. So actually the case $c = 0$ could be eliminated by dropping this term completely and just initializing the left-hand side $v$ to zero.
Because we have set identities, we can propagate generalized gradient pairs $(\nabla\check u, \nabla\hat u) \in \partial\check u \times \partial\hat u$ and perform the indicated algebraic operations on them, starting with the Cartesian basis vectors

$$\nabla\check x_j = \nabla\hat x_j = \nabla x_j = e_j \quad\text{since}\quad \check x_j = \hat x_j = x_j \ \text{ for } \ j = 1 \ldots n.$$
The result of this propagation is guaranteed to be an element of $\partial\check f \times \partial\hat f$. Recall that in the merely Lipschitz continuous case generalized gradients cannot be propagated with certainty, since for example the difference $v = w - u$ only generates a proper inclusion $\partial v \subset \partial w - \partial u$. In that vein we must emphasize that the average $\tfrac12(\nabla\check f + \nabla\hat f)$ need not be a generalized gradient of $f = \tfrac12(\check f + \hat f)$, as demonstrated by the possibility that $\hat f = -\check f$ algebraically while we happen to calculate different generalized gradients of $\check f$ and $-\hat f$ at a particular point $x$. In fact, if one could show that $\partial f = \tfrac12(\partial\check f + \partial\hat f)$, one would have verified the oracle paradigm, whose use we consider unjustified in practice. Instead, we can formulate another corollary for sufficiently piecewise smooth functions.

Definition 1. For any $d \in \mathbb{N}$, the set of functions $f : \mathbb{R}^n \to \mathbb{R}$, $y = f(x)$, defined by an abs-normal form

$$z = F(x, z, |z|), \qquad y = \varphi(x, z),$$

with $F \in C^d(\mathbb{R}^{n+s+s})$ and $\varphi \in C^d(\mathbb{R}^{n+s})$, is denoted by $C^d_{\mathrm{abs}}(\mathbb{R}^n)$.
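Propagating the gradient pairs by the scaling, addition and absolute value rules above can be sketched as follows; the names and the scalar kink weight handling are our own choices. The test evaluates $f(x) = |x_1| - |x_2|$ at a smooth point, where the propagated pair happens to average to the true gradient:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class DCGrad:
    """Central value plus one generalized gradient of each part."""
    val: float
    gcv: np.ndarray      # element of the subdifferential of u_check
    gcc: np.ndarray      # element of the superdifferential of u_hat

    @staticmethod
    def var(x, j, n):    # independent variable x_j: both gradients are e_j
        e = np.zeros(n); e[j] = 1.0
        return DCGrad(x[j], e, e.copy())

    def __add__(self, other):
        return DCGrad(self.val + other.val, self.gcv + other.gcv, self.gcc + other.gcc)

    def __neg__(self):   # negation swaps the roles of the two parts
        return DCGrad(-self.val, -self.gcc, -self.gcv)

    def __sub__(self, other):
        return self + (-other)

def dcg_abs(u, mu=0.5):
    """v = |u|: the hat gradient is gcc - gcv; the check gradient follows
    the sign of u, with an arbitrary convex weight mu at a kink (u == 0)."""
    if u.val > 0:
        gcv = 2.0 * u.gcv
    elif u.val < 0:
        gcv = -2.0 * u.gcc
    else:
        gcv = mu * 2.0 * u.gcv - (1.0 - mu) * 2.0 * u.gcc
    return DCGrad(abs(u.val), gcv, u.gcc - u.gcv)
```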
Once more, this definition differs slightly from the one given in [7], in that $y$ depends only on $z$ and not on $|z|$, in order to match the abs-linear form used here. Then one can show the following result:

Corollary 4 (Constructive Oracle Paradigm).
For any function $f \in C^2_{\mathrm{abs}}(\mathbb{R}^n)$ and a given point $x$ there exist a convex polyhedral function $\widecheck{\Delta f}(x; \Delta x)$ and a concave polyhedral function $\widehat{\Delta f}(x; \Delta x)$ such that

$$f(x + \Delta x) - f(x) = \tfrac12\Big(\widecheck{\Delta f}(x; \Delta x) + \widehat{\Delta f}(x; \Delta x)\Big) + O(\|\Delta x\|^2).$$

Moreover, both terms and their generalized gradients at $\Delta x = 0$ or anywhere else can be computed with the same order of complexity as $f$ itself.

Proof.
In [11], we show that

$$f(x + \Delta x) - f(x) = \Delta f(x; \Delta x) + O(\|\Delta x\|^2),$$

where $\Delta f(x; \Delta x)$ is a piecewise linearization of $f$ developed at $x$ and evaluated at $\Delta x$. Applying the convex/concave decomposition of Theorem 1, one obtains immediately the assertion with a convex polyhedral function $\widecheck{\Delta f}(x; \Delta x)$ and a concave polyhedral function $\widehat{\Delta f}(x; \Delta x)$ evaluated at $\Delta x$. The complexity results follow from the propagation rules derived so far.
We had hoped that it would be possible to use this approximate decomposition into polyhedral parts to construct, at least locally, an exact decomposition of a general function $f \in C^d_{\mathrm{abs}}(\mathbb{R}^n)$ into a convex and a concave part. The natural idea seems to be to add a sufficiently large quadratic term $\beta\|\Delta x\|^2$ to the convex part $\widecheck{\Delta f}(x; \Delta x)$, so that it absorbs the $O(\|\Delta x\|^2)$ error term while remaining convex. Then the same term could be subtracted from $\widehat{\Delta f}(x; \Delta x)$, maintaining its concavity. Unfortunately, the following simple example shows that this is not possible.

Example 1 (Half pipe). The function $f : \mathbb{R}^2 \to \mathbb{R}$,

$$f(x_1, x_2) = \max\big(x_2^2 - \max(x_1, 0),\, 0\big), \qquad (33)$$

in the class $C^\infty_{\mathrm{abs}}(\mathbb{R}^2)$ is certainly nonconvex, as shown in Figure 1. As already observed in [19], this generally nonsmooth function is actually Fréchet differentiable at the origin $x = 0$ with a vanishing gradient $\nabla f(0) = 0$. Hence, we have $f(\Delta x) = O(\|\Delta x\|^2)$ and may simply choose constantly $\widecheck{\Delta f}(0; \Delta x) = 0 = \widehat{\Delta f}(0; \Delta x)$. However, neither by adding $\beta\|\Delta x\|^2$ nor any other smooth function to $f(\Delta x)$ can we eliminate the downward-facing kink along the vertical axis $\Delta x_1 = 0$. In fact, it is not clear whether this example has any DC decomposition at all.
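The half-pipe function itself is a one-liner, and a quick numerical check (the sampling is our own) is consistent with the quadratic bound $f(\Delta x) = O(\|\Delta x\|^2)$ at the origin:

```python
def halfpipe(x1, x2):
    """f(x1, x2) = max(x2^2 - max(x1, 0), 0), Equation (33)."""
    return max(x2 * x2 - max(x1, 0.0), 0.0)
```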

Applying the Reverse Mode for Accumulating Generalized Gradients
Whenever gradients are propagated forward through a smooth evaluation procedure, i.e., for functions in $C^2(\mathbb{R}^n)$, they are uniquely defined as affine combinations of each other, starting from the Cartesian basis vectors for the components of $x$. Given only the coefficients of these affine combinations, one can propagate corresponding adjoint values, or impact factors, backwards to obtain the gradient of a single dependent with respect to all independents at a small multiple of the operations needed to evaluate the dependent variable itself. This cheap gradient result is a fundamental principle of computational mathematics and is widely applied under various names, for example discrete adjoints, back propagation, and reverse mode differentiation. For a historical review see [20], and for a detailed description using notation similar to the current paper see our book [5]. For good reasons, there has been little attention to the reverse mode in the context of nonsmooth analysis, where one can at best obtain subgradients. The main obstacle is again that the forward propagation rules are only sharp when all elementary operations maintain convexity, which is, by the way, the only constructive way of verifying convexity for a given evaluation procedure. While general affine combinations and the absolute value are themselves convex functions, they do not maintain convexity when applied to a convex argument.
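The cheap gradient principle itself can be illustrated by a minimal tape-based reverse sweep for smooth scalar operations. This is only a generic sketch of the reverse mode, not the abs-linear pair propagation of this paper; the class and function names are our own:

```python
# Minimal tape-based reverse mode AD for smooth scalar operations,
# illustrating the cheap gradient principle: one forward evaluation
# plus one backward pass yields the full gradient.

class Var:
    def __init__(self, val, parents=()):
        self.val, self.parents, self.adj = val, parents, 0.0

    def __add__(self, other):
        return Var(self.val + other.val, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.val * other.val, [(self, other.val), (other, self.val)])

def grad(y, xs):
    # Topological order: parents before children, y last.
    order, seen = [], set()
    def visit(v):
        if id(v) not in seen:
            seen.add(id(v))
            for p, _ in v.parents:
                visit(p)
            order.append(v)
    visit(y)
    # Reverse sweep: each node's adjoint is complete before it is
    # pushed to its parents, weighted by the local partials.
    y.adj = 1.0
    for v in reversed(order):
        for p, w in v.parents:
            p.adj += w * v.adj
    return [x.adj for x in xs]

x1, x2 = Var(3.0), Var(2.0)
y = x1 * x2 + x1 * x1        # y = x1*x2 + x1^2 = 15
g = grad(y, [x1, x2])        # gradient (x2 + 2*x1, x1) = (8, 3)
```

The cost of the backward pass is proportional to the number of recorded operations and independent of the number of inputs, which is exactly the property exploited in the text.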
The last equation of Lemma 1 shows that one cannot directly propagate a subgradient of the convex radius function $\delta u$, because there is a reference to $v = |u|$ itself, which does not maintain convexity except when it is redundant due to its argument having a constant sign. However, it follows from the identity $\delta u = \tfrac{1}{2}(\check u - \hat u)$ that for all intermediates $u$
$$ \nabla\check u \in \partial\check u \ \wedge\ \nabla\hat u \in \partial\hat u \ \Longrightarrow\ \tfrac{1}{2}\big(\nabla\check u - \nabla\hat u\big) \in \partial\,\delta u\,. $$
Hence one can get affine lower bounds on the radii, although one would probably prefer upper bounds in order to limit the discrepancy between the convex and concave parts. When $v = |u|$ and $u = 0$ we may choose, according to Equation (32), any convex combination of the gradient pairs corresponding to the two cases $u > 0$ and $u < 0$, weighted by $\mu$ and $1-\mu$. It is tempting, but not necessarily a good idea, to always choose the weight $\mu$ equal to $\tfrac{1}{2}$ for simplicity. Before discussing the reasons for this at the end of this subsection, let us note that from the values of the constants $c$, the intermediate values $u$, and the chosen weights $\mu$ it is clear how the next generalized gradient pair $(\nabla\check v, \nabla\hat v)$ is computed as a linear combination of the generalized gradients of the inputs of each operation, possibly with a switch in their roles. That means that after only evaluating the function $f$ itself, not even the bounds $\check f$ and $\hat f$, we can compute a pair of generalized gradients in $\partial\check f \times \partial\hat f$ using the reverse mode of algorithmic differentiation, which goes back to at least [21], though not under that name. The complexity of this computation is independent of the number of variables and a small multiple of the complexity of evaluating $f$ itself. All the operations involved are relatively benign, namely scaling by constants, interchanges, additions, and subtractions. After all, the reverse mode is just a reorganization of the linear algebra in the forward propagation of gradients. Hence we can be comparatively optimistic regarding the numerical stability of this process.
To be specific, we will denote the (scalar) adjoint values of all intermediates $\check u$ and $\hat u$, as usual, by $\bar{\check u} \in \mathbb{R}$ and $\bar{\hat u} \in \mathbb{R}$. They are all initialized to zero, except for either $\bar{\check y} = 1$ or $\bar{\hat y} = 1$. Then, at the end of the reverse sweep, the vector $(\bar x_j)_{j=1}^{n}$ represents either $\nabla\check y$ or $\nabla\hat y$, respectively. For computational efficiency one may propagate both adjoint components simultaneously, computing with sextuplets consisting of $\check u$, $\hat u$ and their adjoints with respect to $\check y$ and $\hat y$. In any case, we have the following adjoint operations; for $v = u + w$,
$$ (\bar{\check w}, \bar{\hat w}) \mathrel{+}= (\bar{\check v}, \bar{\hat v}) \quad\text{and}\quad (\bar{\check u}, \bar{\hat u}) \mathrel{+}= (\bar{\check v}, \bar{\hat v})\,. $$
Of course, the update for the critical case $u = 0$ of the absolute value is just the convex combination of the updates for the two cases $u > 0$ and $u < 0$, weighted by $\mu$ and $1-\mu$. Due to round-off errors it is very unlikely that the critical case $u = 0$ ever occurs in floating point arithmetic. Once more, the signs of the arguments $u$ of the absolute value function are of great importance, because they determine on which faces of the polyhedral functions $\check f$ and $\hat f$ the current argument $x$ is located. In some situations one prefers a gradient that is limiting, in that it actually occurs as a proper gradient on one of the adjacent smooth pieces. For example, if we had simply $f(x) = v = |x|$ for $x \in \mathbb{R}$ and chose $\mu = \tfrac{1}{2}$, we would get $\check v = 2|x|$, $\hat v = 0$ and find by Equation (34) that $\nabla\check v = 2(\tfrac{1}{2} - \tfrac{1}{2}) = 0$ at $x = \check x = \hat x = 0$. This is not a limiting gradient of $\check v$, since $\partial\check v = [-2, 2]$, whose interior contains the particular generalized gradient $0$.
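The role of the weight $\mu$ in the critical case can be made concrete for the example $f(x) = |x|$ just discussed, where the convex part is $\check v = 2|x|$ and the concave part vanishes. The sketch below implements only this one critical-case rule (the function name is ours):

```python
# Subgradient choice at the kink of the convex part fcheck(x) = 2|x|
# in the decomposition of f(x) = |x| (per the text, fhat = 0).
# At x = 0 the rule blends the one-sided gradients +2 and -2 of
# fcheck with a weight mu; this sketch covers only that rule.

def grad_fcheck_abs(x, mu=0.5):
    if x > 0:
        return 2.0
    if x < 0:
        return -2.0
    # critical case u = 0: convex combination of the branches u>0, u<0
    return mu * 2.0 + (1.0 - mu) * (-2.0)

assert grad_fcheck_abs(0.0, mu=0.5) == 0.0   # interior of [-2, 2]
assert grad_fcheck_abs(0.0, mu=1.0) == 2.0   # limiting gradient
assert grad_fcheck_abs(0.0, mu=0.0) == -2.0  # limiting gradient
```

Choosing $\mu \in \{0, 1\}$ thus produces a limiting gradient attained on an adjacent smooth piece, whereas $\mu = \tfrac{1}{2}$ yields the non-limiting value $0$ from the interior of $\partial\check v$.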

Exploiting the Convex/Concave Decomposition for the DC Algorithm
In order to minimize the decomposed objective function $f$ we may use the DCA algorithm [17], which is given in its basic form, in our notation, by
$$ \hat g_k \in \partial\hat f(x_k), \qquad x_{k+1} \in \partial\big(\tfrac{1}{2}\check f\,\big)^{*}\big(-\tfrac{1}{2}\hat g_k\big)\,, $$
where $\big(\tfrac{1}{2}\check f\,\big)^{*}$ denotes the Fenchel conjugate of $\tfrac{1}{2}\check f$. For a convex function $h : \mathbb{R}^n \to \mathbb{R}$ one has $w \in \partial h^*(y) \Leftrightarrow w \in \operatorname{argmin}_{x\in\mathbb{R}^n}\{h(x) - y^\top x\}$, see [15], Chapter 11. Hence the classic DCA reduces in our Euclidean scenario to the simple recurrence
$$ x_{k+1} \in \operatorname{argmin}_{x\in\mathbb{R}^n}\big\{\check f(x) + \hat g_k^\top x\big\}\,. \qquad (36) $$
The objective function on the left hand side is a constantly shifted convex polyhedral upper bound on $2f(x)$, since by concavity $\hat f(x) \le \hat f(x_k) + \hat g_k^\top(x - x_k)$ and thus $\check f(x) + \hat g_k^\top x \ge 2f(x) + \hat g_k^\top x_k - \hat f(x_k)$. It follows from Equation (29) and $x_{k+1}$ being a minimizer that $f(x_{k+1}) \le f(x_k)$. Now, since (36) is an LOP, an exact solution $x_{k+1}$ can be found in finitely many steps, for example by a variant of the Simplex method. Moreover, we can then assume that $x_{k+1}$ is one of the finitely many vertex points of the epigraph of $\check f$. At these vertex points, $f$ itself attains a finite number of bounded values. Provided $f$ itself is bounded below, we can conclude that for any choice of the $\hat g_k \in \partial\hat f_{\sigma^{(k)}}$ the resulting function values $f(x_k)$ can only be reduced finitely often, so that eventually $f(x_k) = f(x_{k-1})$ and w.l.o.g. $x_k = x_{k-1}$. We then choose the next $\hat g_k = \nabla\hat f_{\sigma^{(k)}}$ with $\sigma^{(k)} = \sigma^{(k-1)} \rhd \sigma(x_k)$, the reflection of $\sigma^{(k-1)}$ at $\sigma(x_k)$ as defined in (15). If then again $f(x_{k+1}) = f(x_k)$, it follows from Corollary A2 that $x_k$ is a local minimizer of $f$ and we may terminate the optimization run. Hence we obtain the DCA variant listed in Algorithm 1, which is guaranteed to reach local optimality under LIKQ. It is well defined even without this property, and we conjecture that the final iterate is then still a stationary point of $f$. The path of the algorithm on the example discussed in Section 5 is sketched in Figure 3. It reaches the stationary point $(0,-1)$, where $\sigma = (0,-1,0)$, from within the polyhedron with the signature $(-1,-1,-1)$ and then continues after the reflection $(1,-1,1) = (-1,-1,-1) \rhd (0,-1,0)$.
From within that polyhedron the inner loop reaches the point $(1,1)$ with signature $(1,0,0)$, whose minimality is established after a search in the polyhedron $P_{(1,1,-1)}$. If the function $f(x)$ is unbounded below, so will be one of the inner convex problems, and the convex minimizer should then produce a ray of infinite descent instead of the next iterate $x_{k+1}$. This exceptional scenario will not be explicitly considered in the remainder of the paper. The reflection operation is designed to facilitate further descent or to establish local optimality. It is discussed in the context of general optimality conditions in the following subsection.
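The basic DCA recurrence can be sketched in one dimension on a toy decomposition of our own choosing (not an example from the paper): $\check f(x) = 2|x|$ and $\hat f(x) = -|x-1|$, so that $f(x) = \tfrac12(\check f(x) + \hat f(x)) = |x| - \tfrac12|x-1|$ with global minimizer $x^* = 0$:

```python
# One-dimensional sketch of the basic DCA recurrence
#   x_{k+1} in argmin_x { fcheck(x) + ghat_k * x },  ghat_k in d fhat(x_k),
# on the illustrative decomposition fcheck(x) = 2|x|, fhat(x) = -|x - 1|,
# i.e. f(x) = |x| - |x - 1|/2 with minimizer x* = 0.

def ghat(x):
    # a generalized gradient of the concave part fhat(x) = -|x - 1|
    return -1.0 if x >= 1.0 else 1.0

def argmin_inner(g):
    # Inner convex problem: minimize 2|x| + g*x.  For |g| < 2 the
    # unique minimizer is the kink x = 0 (slopes g-2 < 0 < g+2);
    # otherwise the model is unbounded below.
    assert abs(g) < 2.0
    return 0.0

x = 2.0
for _ in range(10):
    x_new = argmin_inner(ghat(x))
    if x_new == x:          # fixed point of the recurrence
        break
    x = x_new

assert x == 0.0  # the global minimizer of f
```

In higher dimensions the inner problem (36) is an LOP solved, e.g., by a Simplex variant; the one-line `argmin_inner` above just solves this particular one-dimensional instance analytically.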

Proximal Rather Than Global
Some authors have credited the DCA algorithm with being able to reach global minimizers with a higher probability than other algorithms. There is really no justification for this optimism in the light of the following observation. Suppose the objective $f(x) = \tfrac{1}{2}(\check f(x) + \hat f(x))$ has an isolated local minimizer $x^*$. Then there exists an $\varepsilon > 0$ such that the level set $\{x \in \mathbb{R}^n : f(x) \le f(x^*) + \varepsilon\}$ has a bounded connected component containing $x^*$, say $L_\varepsilon$. Now suppose DCA is started from any point $x_0 \in L_\varepsilon$. Since $f_0(x) = \tfrac{1}{2}\big(\check f(x) + \hat f(x_0) + \hat g(x_0)^\top(x - x_0)\big)$ is by Equation (37) a convex upper bound on $f(x)$, its level set $\{f_0(x) \le f(x_0)\}$ is contained in $L_\varepsilon$. Hence any step from $x_0$ that reduces the upper bound $f_0(x)$ must stay in the same component, so there is absolutely no chance to move away from the catchment $L_\varepsilon$ of $x_0$ towards another local minimizer of $f$, whether global or not. In fact, by adding the convex term
$$ \tfrac{1}{2}\big(\hat f(x_0) + \hat g(x_0)^\top(x - x_0) - \hat f(x)\big) \;\ge\; 0\,, $$
which vanishes at $x_0$, to the actual objective $f(x)$, one performs a kind of regularization, as in the proximal point method. This means the step is actually held back compared to a larger step that might be taken by a method that only requires a reduction of $f(x)$ itself.
Hence we may interpret DCA as a proximal point method where the proximal term is defined as an affinely shifted negative of the concave part. Since, in general, the norm and the coefficient defining a proximal term may be quite hard to select, this way of defining it may make a lot of sense. However, it is certainly not global optimization. Notice that in this argument we have used neither polyhedrality nor the inclusion property, so it applies to any DC decomposition on Euclidean space. Another conclusion from the "holding back" observation is that it is probably not worthwhile to minimize the upper bound very carefully. One might rather readjust the shift $\hat g^\top x$ after a few or even just one iteration.

Nesterov's Piecewise Linear Example
According to [6], Nesterov suggested three Rosenbrock-like test functions for nonsmooth optimization. One of them, given by
$$ f(x) \;=\; \tfrac{1}{4}\,|x_1 - 1| \;+\; \sum_{i=1}^{n-1}\big|\,x_{i+1} - 2|x_i| + 1\,\big|\,, \qquad (40) $$
is nonconvex and piecewise linear. It is shown in [6] that this function has $2^{n-1}$ Clarke stationary points, only one of which is a local and thus the global minimizer. Numerical studies showed that optimization algorithms tend to be trapped at one of the stationary points, making it an interesting test problem. We have demonstrated in [23] that using an active signature strategy one can guarantee convergence to the unique minimizer from any starting point, albeit using in the worst case $2^n$ iterations, as all stationary points are visited. Let us first write the problem in the new abs-linear form. Defining the $s = 2n$ switching variables
$$ z_i = F_i(x,|z|) = x_i \ \text{ for } 1 \le i < n\,, \qquad z_n = F_n(x,|z|) = x_1 - 1\,, $$
$$ z_{n+i} = F_{n+i}(x,|z|) = x_{i+1} - 2|z_i| + 1 \ \text{ for } 1 \le i < n\,, \qquad z_s = \tfrac{1}{4}|z_n| + \sum_{i=1}^{n-1}|z_{n+i}|\,, $$
the resulting objective function is then simply identical to $y = f(x) = z_s$. With the vector $c^\top = (0, -1, e_{n-1}^\top, 0) \in \mathbb{R}^{(n-1)+1+(n-1)+1}$ and corresponding matrices $Z$ and $L$, which have different row partitions, one obtains an abs-linear form (11) of $f$. Here $I_k$ denotes the identity matrix of dimension $k$, $e^\top = (1,\cdots,1) \in \mathbb{R}^k$ the vector containing only ones, and the symbol $0$ pads with zeros to achieve the specified dimensions. One can easily check that $|L|^2 \ne 0 = |L|^3$; hence this example has switching depth $\nu = 2$. The geometry of the situation is depicted in Figure 3, which was already briefly discussed in Sections 3 and 5.
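Assembling the switching variables above back into a closed form, the test function can be evaluated directly; the short sketch below checks the global minimizer $x^* = (1,\ldots,1)$ and illustrates how a sign flip in an early component still yields a small, trapping function value:

```python
# Nesterov's piecewise linear Rosenbrock variant, assembled from the
# switching variables z above:
#   f(x) = |x_1 - 1|/4 + sum_{i=1}^{n-1} |x_{i+1} - 2|x_i| + 1|.

def f(x):
    n = len(x)
    val = abs(x[0] - 1.0) / 4.0
    for i in range(n - 1):
        val += abs(x[i + 1] - 2.0 * abs(x[i]) + 1.0)
    return val

n = 6
assert f([1.0] * n) == 0.0  # unique global minimizer x* = (1,...,1)

# Flipping the sign of x_1 leaves all coupling terms |x_{i+1}-2|x_i|+1|
# at zero and only pays the small |x_1 - 1|/4 penalty -- the kind of
# point where descent methods stall at a Clarke stationary point.
assert f([-1.0] + [1.0] * (n - 1)) == 0.5
```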
Since the corresponding extended abs-linear form for $\tilde f = (y, \delta y)$ does not provide any new insight, we do not state it here. Directly in terms of the original equations one obtains the radii $\delta z_i$ and $\delta f = \delta z_s$, and thus from Equation (7) the convex and concave parts $\check f$ and $\hat f$ explicitly. Clearly $\hat f$ is a concave function, and to check the convexity of $\check f$ we note that the critical expression is the sum of an affine function and the positive part of the sum of an absolute value and an affine function, which must therefore also be convex. The corresponding term in Equation (42) is the same with the convex function $2|x_i|$ added, so that $\delta f$ is also convex, in agreement with the general theory. Finally, one verifies easily that $f = \tfrac{1}{2}(\check f + \hat f)$, which is the whole idea of the decomposition. It would seem that the automatic decomposition by propagation through the abs-linear procedure yields a rather tight result. The function $f$ as well as the lower and upper bounds given by the convex/concave decomposition are illustrated on the left hand side of Figure 2. Notice that the switching structure is indeed identical for all three, as stated in Theorem 1. On the right hand side of Figure 2, the difference $2\delta f$ between the upper, convex and the lower, concave bound is shown, which is indeed convex. It is also worthwhile to look at the condition number $\kappa(\check f(x), \hat f(x))$ of the decomposition, for which we only obtain a trivial, disappointing bound on all of $\mathbb{R}^n$: at the well known unique global optimizer $x^* = (1, 1, \ldots, 1) \in \mathbb{R}^n$ the numerator is zero and the denominator positive. Elsewhere, however, we can bound the conditioning as follows.
Lemma 3. In case of the example (40), there is a constant $c \in \mathbb{R}$ bounding the condition number $\kappa(\check f(x), \hat f(x))$ for all $x \ne x^*$.

Proof. Since the denominator is piecewise linear and vanishes only at the minimizer $x^*$, there must be a constant $c_0 > 0$ bounding the quotient for $\|x - x^*\|_\infty \le 3$, where it takes the value $32(n-1)/(3c_0)$ on the boundary. On the other hand, we obtain a bound for $\|x\|_\infty \ge 2$, and thus in particular for $\|x - x^*\|_\infty \ge 3$. Assuming without loss of generality that $c_0 \le 4/3$, we can combine the two bounds to obtain the assertion with $c = 32(n-1)/c_0$.
Hence we see that the condition number $\kappa(\check f(x), \hat f(x))$ is nicely bounded, and the decomposition should work as long as our optimization algorithm has not yet reached its goal $x^*$. It is verified in the companion article [24] that the DCA exploiting the observations made in this paper reaches the global minimizer in finitely many steps. It was already shown in [7] that the LIKQ condition is satisfied everywhere and that the optimality test singles out the unique minimizer correctly. In Figure 3, the arrows indicate the path of our reflection version of the DCA method as described in Section 5.

Summary, Conclusions and Outlook
In this paper the following new results were achieved:

• For every piecewise linear function $f$ given as an abs-linear evaluation procedure, rules for simultaneously evaluating its representation as the average of a concave lower bound $\hat f$ and a convex upper bound $\check f$ are derived.

• The two bounds can be constructively expressed as a single maximum and a single minimum of affine functions, which drastically simplifies the classical min-max representation. Due to its likely combinatorial complexity we do not recommend this form for practical calculations.

• For the two bounds $\check f$ and $\hat f$, generalized gradients $\check g$ and $\hat g$ can be propagated forward or in reverse through the convex or concave operations that define them. The gradients are not unique but guaranteed to yield supporting hyperplanes and thus provide a verified version of the oracle paradigm.

• The DCA algorithm can be implemented such that a local minimizer is reached in finitely many iterations, provided the Linear Independence Kink Qualification (LIKQ) is satisfied. It is conjectured that without this assumption the algorithm still converges in finitely many steps to a Clarke stationary point. Details on this can be found in the companion paper [24].
These results are illustrated on the piecewise linear Rosenbrock variant of Nesterov. On a theoretical level it would be gratifying, and possibly provide additional insight, to prove the result of Corollary A3 directly, using the explicit representations of the generalized differentials of the convex and concave parts given in Corollary 1. Moreover, it remains to be explored what happens when LIKQ is not satisfied. We have conjectured in [25] that just verifying the weaker Mangasarian-Fromovitz Kink Qualification (MFKQ) represents an NP-hard task. Possibly there are other, weaker conditions that can be cheaply verified and facilitate the testing for at least local optimality.
Global optimality can be characterized theoretically in terms of $\varepsilon$-subgradients, albeit with $\varepsilon$ arbitrarily large [26]. There is the possibility that the alternative definition of $\varepsilon$-gradients given in [18] might allow one to constructively check for global optimality. It does not really seem clear, however, how these global optimality conditions can be used to derive corresponding algorithms.
The implementation of the DCA algorithm can be optimized in various ways. Notice that for applying the Simplex method in standard form, one could assemble the constraint matrix using the max-part of the more economical representation in Equation (27), which introduces only $m$ additional variables, rather than the potentially combinatorial Equation (28). In any case it seems doubtful that solving each subproblem to completion is a good idea, especially as the resulting step in the outer iteration is probably much too small anyhow. Therefore the generalized gradient of the concave part, which defines the inner problem, should probably be updated much more frequently. Moreover, the inner solver might be an SQOP type active signature method or a matrix-free gradient method with momentum term, as used in machine learning, notwithstanding the nonsmoothness of the objective. Various options in that range will be discussed and tested in the companion article [24].
Finally, one should always keep in mind that the task of minimizing a piecewise linear function will most likely occur as an inner problem in the optimization of a piecewise smooth and nonlinear function. As we have shown in [27], the local piecewise linear model problem can be obtained easily by a slight generalization of automatic or algorithmic differentiation, e.g., ADOL-C [28] and Tapenade [29].

This condition is denoted as LIKQ in [7] and obviously requires that no more than $n$ switches are active at $\bar x$. As discussed in [7], for the point $\bar x$ to be a local minimizer of $f$ it is necessary that it solves the trunk problem
$$ \min\ a^\top x + b^\top z \quad\text{s.t.}\quad |\Sigma|z - c - Zx = 0\,. $$
Here $|\Sigma| \in \mathbb{R}^{\tilde s \times \tilde s}$ is the projection onto the $\tilde s - m$ vector components whose indices do not belong to $\alpha$, so the equality constraint combines (A1) and the constraint $P_\alpha z = 0$. Now we get from KKT theory, or equivalently LOP duality, that $\bar x$ is a minimizer on $P_\alpha$ if and only if for some Lagrange multiplier vector $\lambda \in \mathbb{R}^{\tilde s}$
$$ a^\top = -\lambda^\top \tilde Z \quad\text{and}\quad b^\top = \lambda^\top |\Sigma|\,. $$
Since I " |Σ|`P J α P α we derive that where λ α " P α λ. This is a generally overdetermined system of n equations in the m components of λ α . If it is solvable the full multiplier vector λ " P J α λ α`|Σ |b is immediately available. Because of the assumed full rank of the Jacobian P αZ we have m ď n, and ifx is a vertex in that m " n the tangential stationarity condition (A3) is automatically satisfied. Now it is necessary and sufficient for local minimality thatx is also a minimizer of f on all polyhedra s P σ with definite σ ąσ. Any such σ ąσ can be written as σ "σ`γ with γ P t´1, 0, 1us structurally orthogonal toσ such that for Γ " diagpγq we have the matrix equations Then we can express the zpxq " z σ pxq for x P P σ as z σ pxq " zσ`γpxq " pI´M´LΣ´LΓq´1pc`Zxq " pI´LΓq´1pc`Zxq , withL " pI´M´LΣq´1L . Nowx must be the minimizer of f on s P σ , i.e., solve the problem Notice that the inequalities are only imposed on the sign constraints that are active atx since the strict inequalities are maintained in a neighborhood ofx due to the continuity of zpxq. Then we get again from KKT theory or equivalently LOP duality that still a J "´λ JZ and for a second multiplier vector 0 ď µ P R m the equalities Multiplying from the right by the projection |Σ| we find that the conditions (A2) and (A3) must still hold so that λ remains exactly the same. Moreover, multiplying from the right by ΓP J α we get with P α P J α " I m and ΓΓ " P J α P α after some rearrangement the inequality Now the key observation is that this condition is linear in Γ and is strongest for the choice γ i " sgnpλ i´bi q for i P α yielding the inequalities In other words,x is a solution of the branch problems (A4) if and only if it is for the worst case where γ i " sgnpλ i´bi q for i P α. When coincidentally λ i " b i we can define γ i arbitrarily. 
Note that the complementarity condition $\mu^\top P_\alpha z(\bar x) = 0$ associated with Equation (A4) is automatically satisfied at $\bar x$ for any $\mu$, since $P_\alpha\bar z = 0$ by definition of the active index set $\alpha$. These observations immediately yield:

Proposition A1 (Necessary and sufficient minimality condition). Assume LIKQ holds, in that $P_\alpha\tilde Z$ has full row rank $m = |\alpha|$. Then the point $\bar x$ is a local minimizer of $f$ if and only if tangential stationarity holds, in that $a + \tilde Z^\top b$ belongs to the range of $\tilde Z^\top P_\alpha^\top$, and normal growth holds, in that $|P_\alpha(\lambda - b)| \le P_\alpha\tilde L^\top\lambda$.
The verification that LIKQ holds, and subsequently the test whether tangential stationarity is satisfied, can be based on a QR decomposition of the active Jacobian $P_\alpha\tilde Z \in \mathbb{R}^{m\times n}$. The main expense here is the calculation of $\tilde Z$ itself, which requires one forward substitution on $(I - M - L\Sigma)$ for each of the $n$ columns of $Z$, and hence at most $n s^2/2$ fused multiply-adds. Very likely this effort will already be made by any kind of active set method for reaching the candidate point $\bar x$. Once the multiplier vector $\lambda$ is obtained, the remaining test (A7) for normal growth is almost for free, so that we have a polynomial minimality criterion, provided LIKQ holds. Otherwise one may assume a weaker generalization of the Mangasarian-Fromovitz constraint qualification called MFKQ in [25]. However, we have conjectured in [19] that verifying MFKQ is probably already NP-hard.
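The core linear algebra of the tangential stationarity test is a range check: decide whether $a + \tilde Z^\top b$ lies in the range of $\tilde Z^\top P_\alpha^\top$. As a hedged sketch, this can be done via a least-squares solve and its residual; the matrix `J` and vector `rhs` below are illustrative stand-ins, not the paper's actual data:

```python
# Generic range test underlying the tangential stationarity check:
# decide whether rhs lies in range(J) by examining the residual of a
# least-squares solve (in practice one would reuse the QR factors of
# the active Jacobian mentioned in the text).
import numpy as np

def in_range(J, rhs, tol=1e-10):
    coef, *_ = np.linalg.lstsq(J, rhs, rcond=None)
    return bool(np.linalg.norm(J @ coef - rhs) <= tol)

J = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])                    # full column rank, R^{3x2}
assert in_range(J, np.array([1.0, 2.0, 3.0]))  # equals J @ (1, 2)
assert not in_range(J, np.array([1.0, 2.0, 0.0]))
```

When the test succeeds, the least-squares coefficients play the role of the multiplier components, from which the full vector $\lambda$ is assembled as described above.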
Corollary A1 (Descent direction in the nonoptimal case). Suppose that LIKQ holds. If tangential stationarity is violated, there exists some direction $d \in \mathbb{R}^n$ such that $P_\alpha\tilde Z d = 0$ but $(a^\top + b^\top\tilde Z)d < 0$, which implies descent in that $f(\bar x + \tau d) < f(\bar x)$ for $\tau \gtrsim 0$. If tangential stationarity holds but normal growth fails, there exists at least one $i \in \alpha$ with $|\lambda_i - b_i| > e_i^\top\tilde L^\top\lambda$. Defining $\gamma = \operatorname{sgn}(\lambda_i - b_i)\,e_i \in \mathbb{R}^{\tilde s}$, any $d$ satisfying $P_\alpha(I - \tilde L\Gamma)^{-1}\tilde Z d = P_\alpha\gamma$ is a descent direction.
Proof. In the first case it is clear that $\bar x + \tau d \in P_{\bar\sigma}$ for $\tau \gtrsim 0$, since the components of $z(\bar x + \tau d)$ with indices in $\alpha$ stay zero and the others vary only slightly. Then the directional derivative of $f(\cdot)$ at $\bar x$ in direction $\tau d$ is given by
$$ \tau a^\top d + \tau b^\top\tilde Z d = \tau\big(a^\top d + b^\top\tilde Z d\big) < 0\,, $$
which proves the first assertion. Otherwise, $\lambda$ is well defined and we can choose $i \in \alpha$ with $|\lambda_i - b_i| > e_i^\top\tilde L^\top\lambda$. Setting $\gamma = \gamma_i e_i$ with $\gamma_i = \operatorname{sgn}(\lambda_i - b_i)$, one obtains for $d$ with $P_\alpha(I - \tilde L\Gamma)^{-1}\tilde Z d = P_\alpha\gamma$ that $\bar x + \tau d \in P_{\bar\sigma + \gamma}$ for $\tau \gtrsim 0$. On that polyhedron the Lagrange multiplier vector $\mu$ is also well defined by Equation (A6), but violates the sign condition $\mu_i \ge 0$. Computing the directional derivative of $f(\cdot)$ at $\bar x$ in direction $\tau d$ using identity (A5) then again yields descent, which completes the proof.
Corollary A2 (Optimality via Reflection). Suppose an $\bar x$ where LIKQ holds has been reached by minimizing $\check f(x) + \hat g^\top x$ with $\hat g = \nabla\hat f_\sigma$ for $0 \notin \sigma > \bar\sigma$. Then $\bar x$ is a local minimizer of $f$ on $\mathbb{R}^n$ if and only if it is also a minimizer of $\check f(x) + \nabla\hat f_{\tilde\sigma}^\top x$ with $\tilde\sigma = \sigma \rhd \bar\sigma$ as defined in (15).
Proof. By assumption $\bar x$ solves one of the branch problems of $f$ itself. Hence we must have tangential stationarity (A5) with the corresponding $\Gamma = \operatorname{diag}(\gamma)$ for $\gamma = \sigma - \bar\sigma$. Since $\tilde\sigma - \bar\sigma = -\gamma$, we conclude from (A6) that
$$ (\lambda - b)^\top\Gamma P_\alpha^\top \;\le\; \lambda^\top\tilde L P_\alpha^\top \;\ge\; (\lambda - b)^\top(-\Gamma)P_\alpha^\top = -(\lambda - b)^\top\Gamma P_\alpha^\top\,, $$
which implies that
$$ \big|(\lambda - b)^\top P_\alpha^\top\big| = \big|(\lambda - b)^\top\Gamma P_\alpha^\top\big| \;\le\; \lambda^\top\tilde L P_\alpha^\top\,. \qquad (\mathrm{A8}) $$
Hence both tangential stationarity and normal growth are satisfied, which completes the proof by Proposition A1, as the converse implication is trivial.
The key conclusion is that if an $\bar x$ is the solution of two complementary convex problems, it must be locally optimal in the full dimensional space $\mathbb{R}^n$. Hence one can establish local optimality using just the preferred convex solver. If this test fails, one naturally obtains descent to function values below $f(\bar x)$, until eventually a local minimizer is found.

Appendix A.1. Equivalence to DC Optimality Condition
Using the explicit expressions given in Lemma 1, we find (see [18]) a representation of the limiting differential of $f$ at $\bar x$, where $\gamma$ ranges over all complements of $\bar\sigma$ such that $\bar\sigma + \gamma \in \{-1,1\}^s$ is definite. Similarly we obtain, with $\tilde b^\top = |b|^\top(I - |M| - 2|L|)^{-1}|L| \ge 0 \in \mathbb{R}^s$, the limiting differentials of the convex and the concave part in (A11). Hence we have explicit representations for the limiting gradients of $f$ as well as of its convex and concave parts $\check f$ and $\hat f$ at $\bar x$. It is easy to see that the minimality condition (A5) requires $a$ to be in the range of $\tilde Z^\top$, so that we have again $a^\top = -\lambda^\top\tilde Z$. We had hoped to be able to derive directly from these expressions that normal growth implies the condition (39), but we have so far not been able to do so. However, we can indirectly derive the following equivalence.
Corollary A3 (First order minimality condition). Under LIKQ, the limiting differential $\partial_L\hat f(\bar x)$ is contained in the convex hull of $-\partial_L\check f(\bar x)$ if and only if the tangential stationarity and normal growth conditions hold according to Proposition A1.