Abstract
Computing an element of the Clarke subdifferential of a function represented by a program is an important problem in modern non-smooth optimization. Existing algorithms are either computationally inefficient, in the sense that their cost depends on the input dimension, or applicable only to simple programs such as polynomial functions with branches. In this work, we show that a generalization of the latter type of algorithm can efficiently compute an element of the Clarke subdifferential for programs consisting of analytic functions and linear branches. Such programs can represent various non-smooth functions, including max, the absolute value, and piecewise analytic functions with linear boundaries, as well as any program composed of these functions, e.g., neural networks with non-smooth activation functions. Our algorithm first finds a sequence of branches used for computing the function value at a random perturbation of the input; it then returns an element of the Clarke subdifferential by running the backward pass of reverse-mode automatic differentiation along those branches. The computational cost of our algorithm is at most that of the function evaluation multiplied by some constant independent of the input dimension n, provided the program consists of piecewise analytic functions defined by linear branches whose arities and maximum branch depths are independent of n.
MSC:
68Q25; 68W40
1. Introduction
Automatic differentiation refers to various techniques to compute the derivatives of a function represented by a program, based on the well-known chain rule of calculus. It has been widely used across various domains, and diverse practical automatic differentiation systems have been developed [1,2,3]. In particular, reverse-mode automatic differentiation [4] has been a driving force of the rapid advances in numerical optimization [5,6,7].
There are two important properties of reverse-mode automatic differentiation: correctness and efficiency. For programs consisting of smooth functions, it is well known that reverse-mode automatic differentiation always computes the correct derivatives [8,9]. Furthermore, for programs returning a scalar value, it efficiently computes their derivatives in the sense that its computational cost is at most proportional to that of the function evaluation, where the additional multiplicative factor is bounded by five for rational functions [10,11] and by some constant that depends on the underlying implementation of smooth functions; if the arities of the functions are independent of the input dimension n, then this constant is also independent of n [12]. Such correctness and efficiency of reverse-mode automatic differentiation, referred to as the Cheap Gradient Principle, have been of central importance for modern nonlinear optimization algorithms [13].
In practical problems, a program often involves branches (e.g., max and an absolute value), and the corresponding target function can be non-smooth. In other words, the derivative of the program may not exist at some inputs. In this work, we investigated a Cheap Subgradient Principle: an efficient algorithm that correctly computes an element of the Clarke subdifferential, a generalized notion of the derivative, for scalar programs. One naïve approach is to directly apply reverse-mode automatic differentiation to the Clarke subdifferential. Such a method is computationally cheap as in the smooth case; however, due to the absence of the sharp chain rule for the Clarke subdifferential, it is incorrect in general even if the target function is differentiable [8,14,15].
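For concreteness, the following minimal sketch (our own illustration; the function and the relu'(0) = 0 convention are assumptions, not taken from the paper) shows the failure on f(x) = relu(x) − relu(−x), which equals x everywhere and hence has derivative 1 at 0, while naive branch-wise AD reports 0 there.

```python
# Minimal illustration (our own example): naive AD with the common convention
# relu'(0) = 0 mis-differentiates f(x) = relu(x) - relu(-x), which equals x
# everywhere and hence has derivative 1 at x = 0.

def relu(x):
    return x if x > 0.0 else 0.0

def relu_ad_derivative(x):
    # Conventional AD rule: pick the derivative of one branch at the kink.
    return 1.0 if x > 0.0 else 0.0

def f(x):
    return relu(x) - relu(-x)            # f(x) == x for all x

def f_naive_ad_derivative(x):
    # Chain rule applied branch-wise, as a naive AD system would do.
    return relu_ad_derivative(x) - relu_ad_derivative(-x) * (-1.0)

print(f(0.0))                      # 0.0
print(f_naive_ad_derivative(0.0))  # 0.0, although the true derivative is 1.0
```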
There have been extensive research efforts to correctly compute an element of the Clarke subdifferential. A notable line of work is based on the lexicographic subdifferential [16], which is a subset of the Clarke subdifferential for which a sharp chain rule holds under some structural assumptions. Based on this, a series of works [17,18,19,20] has shown that an element of the lexicographic subdifferential can be computed by evaluating n directional derivatives, where n denotes the input dimension. Because this approach requires n directional-derivative evaluations, however, it incurs a multiplicative factor of n in its computational cost compared to that of function evaluation in the worst case.
To avoid such an input-dimension-dependent factor, a two-step randomized algorithm for programs with branches has been proposed [14]. In the first step, the algorithm chooses a random direction and finds a sequence of branch selections based on the directional derivatives along that direction, under some qualification condition [14,21]. The second step then computes the derivative corresponding to the branches returned in the first step, which is shown to be an element of the Clarke subdifferential. Here, the second step can also be efficiently implemented via reverse-mode automatic differentiation. As a result, this two-step algorithm correctly computes an element of the Clarke subdifferential under the qualification condition, with a computational cost independent of the input dimension. However, this result only covers piecewise polynomial programs, defined by branches and finite compositions of monomials and affine functions.
In this work, we propose an efficient automatic subdifferentiation algorithm by generalizing the algorithm in [14] described above. Our algorithm correctly computes an element of the Clarke subdifferential for programs consisting of any analytic functions (including polynomials) and linear branches. As in the prior efficient automatic (sub)differentiation works, the computational cost of our algorithm is that of the function evaluation multiplied by some constant independent of the input dimension n, if a program consists of piecewise analytic functions (defined by linear branches) whose arities and maximum depths of branches are independent of n (e.g., max and the absolute value).
Related Works
Non-smooth optimization: Although smooth functions are easy to formulate and optimize, they have limited applicability, as non-smoothness appears in various science and engineering problems. For example, real-world problems in thermodynamics often involve discrete switching between thermodynamic phases, which can be modeled by non-smooth functions, and dynamic simulation and optimization under such models require careful treatment of this non-smoothness [22,23]. In machine learning applications, the hinge loss, ReLU, and maxpool operations are often used, which makes the optimization objective non-smooth [6,24,25]. For optimizing convex but non-smooth functions, subgradient methods are widely used for approximating a local minimum [26,27]. However, for non-convex functions, a subgradient (in the convex sense) does not exist in general, and researchers have investigated generalized notions of derivatives (e.g., the Clarke subdifferential).
Optimization algorithms using generalized derivatives: Recently, the convergence properties of optimization algorithms based on generalized derivatives for non-convex and non-smooth functions have received much attention. Ref. [28] proved that, for locally Lipschitz functions, the stochastic gradient method, where the gradient is chosen from the Clarke subdifferential, converges to a stationary point. However, as we introduced in the previous section, computing an element of the Clarke subdifferential is computationally expensive or can be efficient only for a specific class of programs. Ref. [29] proposed a new notion of gradient called the conservative gradient, which can be efficiently computed; nevertheless, its convergence property is not well understood, especially under practical setups.
2. Problem Setup
2.1. Notations
For $n \in \mathbb{N}$, we denote $[n] \triangleq \{1, \ldots, n\}$, where $\mathbb{N}$ denotes the set of positive integers. For any set $\mathcal{X}$, we use $\mathcal{X}^* \triangleq \bigcup_{n \geq 0} \mathcal{X}^n$, where $\mathcal{X}^0$ denotes the set consisting of the zero-dimensional vector. For any vector $x$ and $i \in [\operatorname{len}(x)]$, we write $x_i$ for the $i$-th coordinate of $x$; similarly, for any vector $x$ and index set $I = \{i_1, \ldots, i_k\}$ with $i_1 < \cdots < i_k$, we use $x_I \triangleq (x_{i_1}, \ldots, x_{i_k})$. For any $u, v \in \mathbb{R}^n$, we write $\langle u, v \rangle$ to denote the standard inner product between $u$ and $v$. For any $x \in \mathbb{R}$, $\operatorname{sign}(x) \triangleq 1$ if $x \geq 0$ and $\operatorname{sign}(x) \triangleq -1$ otherwise. For any real-valued vector $x$, $\operatorname{len}(x)$ denotes the length of $x$, i.e., $\operatorname{len}(x) = n$ if $x \in \mathbb{R}^n$. For any set $\mathcal{X} \subseteq \mathbb{R}^n$, we use $\operatorname{cl}(\mathcal{X})$, $\operatorname{int}(\mathcal{X})$, and $\operatorname{conv}(\mathcal{X})$ to denote the closure, interior, and convex hull of $\mathcal{X}$, respectively. We lastly define the Clarke subdifferential. Given a function $f : \mathbb{R}^n \to \mathbb{R}$ and the set $\mathcal{D}_f \subseteq \mathbb{R}^n$ of all points at which $f$ is differentiable, the Clarke subdifferential of $f$ at $w \in \mathbb{R}^n$ is the set defined as
$$\partial^C f(w) \triangleq \operatorname{conv}\Big\{ \lim_{k \to \infty} Df(w_k) \,:\, w_k \to w,\ w_k \in \mathcal{D}_f \Big\}.$$
2.2. Programs with Branches
We considered a program P defined in Figure 1 (left). P applies a series of primitive functions to compute intermediate variables and then returns the last result. Each primitive function is continuous and defined in an inductive way as in Figure 1 (right): either applies a function or branches via (possibly nested) if–else statements. Namely, is a continuous, piecewise function. If branches with an input , then it first evaluates for some and checks whether or not for some threshold value . If , it executes a code that either applies a function or executes another code or depending on whether or not. The case is handled in a similar way. Here, we assumed that each has finitely many branches. In short, each primitive function first finds a proper piece labeled by and then returns . We illustrate a flow chart for a primitive function in Figure 2.
Figure 1.
Definitions of a program P (left) and primitive functions (right).
Figure 2.
A flow chart illustrating a primitive function .
We present an example code for a primitive function that returns in Figure 3. In this example, branches at most twice: its first branch is determined by , which is stored in y in the second line in Figure 3. If (the third line with ), then it executes , which corresponds to lines 4–6. Otherwise, it moves to , which corresponds to the lines 8–10. Suppose is executed (i.e., in line 3). Then, it computes and stores it in y as in line 4. If , then is executed, which returns the value of (i.e., ). Otherwise, is executed, which returns the value of (i.e., ).
Figure 3.
Example code for the max function .
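As a concrete rendering of such a primitive function, the sketch below assumes the maximum of three inputs computed via two nested if–else statements with linear conditions; the variable names and the tie-handling convention are our assumptions, so it may differ in detail from Figure 3.

```python
# A sketch (assumed reading of Figure 3): a primitive function computing
# max(x1, x2, x3) using nested if-else statements with linear branch
# conditions.  Each leaf of the branching tree is one "piece".

def g_max3(x1, x2, x3):
    y = x1 - x2                 # first (linear) branch condition
    if y >= 0.0:
        y = x1 - x3             # second branch condition on this side
        if y >= 0.0:
            return x1           # piece: g(x) = x1
        else:
            return x3           # piece: g(x) = x3
    else:
        y = x2 - x3             # second branch condition on the other side
        if y >= 0.0:
            return x2           # piece: g(x) = x2
        else:
            return x3           # piece: g(x) = x3
```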
We considered each and as a function of an input . Specifically, for all , we use ; and for all , we use and , where with . Under this notation, denotes the target function represented by the program P. We often omit w and write and if it is clear from the context. We denote the gradient (or Jacobian) of functions with respect to an input w by the operator D: e.g., D and D.
Throughout this paper, we focus on programs with linear branches; that is, each is linear in . Primitive functions with linear branches can express fundamental non-smooth functions such as max, the absolute value, bilinear interpolation, and any piecewise analytic functions with finite linear boundaries. They have been widely used in various fields including machine learning, electrical engineering, and non-smooth analysis. For example, max–min representation (or the abs–normal form) has been extensively studied in non-smooth analysis [30,31]. Furthermore, neural networks using the ReLU activation function and maxpool operations are widely used in machine learning, computer vision, load forecasting, etc. [6,24,25]. The assumption on linear branches will be formally introduced in Assumption 1 in Section 3.1.
2.3. Pieces of Programs
We introduce useful notations here. For each , we define the set of the pieces of as
We also define for the pieces of the overall program. Here, we include an auxiliary piece for the first n indices so that for any and . For each , we define the set of the inputs to that corresponds to the piece , as
Then, forms a partition of . Likewise, we also define the set of the inputs to the overall program P that corresponds to , as
Then, forms a partition of . Lastly, for each , we inductively define the function that corresponds to , but is obtained by using the piece of each , as
where is defined as in . Then, coincides with on for all .
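To make the notion of pieces concrete, the following sketch (our own, with hypothetical names and a hypothetical two-primitive program) enumerates the pieces of P(w) = max(w1·w2, w1): each overall piece yields a branch-free function, and exactly one piece's region contains a given input, where that piece function agrees with P.

```python
import itertools

# Hypothetical illustration of "pieces": program P(w) = max(w1 * w2, w1).
# Primitive 1 (multiplication) has one piece; primitive 2 (max) has two.

def piece_function(s, w):
    """Branch-free function obtained by fixing the piece s = (s1, s2)."""
    x3 = w[0] * w[1]                      # primitive 1: single analytic piece
    x4 = x3 if s[1] == 1 else w[0]        # primitive 2: piece 1 -> x3, piece 2 -> w1
    return x4

def region_contains(s, w):
    """Does input w select piece s when running the program?"""
    x3 = w[0] * w[1]
    chosen = 1 if x3 >= w[0] else 2
    return chosen == s[1]

w = (2.0, 3.0)
for s in itertools.product([1], [1, 2]):
    print(s, piece_function(s, w), region_contains(s, w))
# The piece with region_contains(...) == True agrees with P(w) = max(6.0, 2.0) = 6.0.
```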
2.4. Reverse-Mode Automatic Differentiation
Reverse-mode automatic differentiation is an algorithm for computing the gradient of the target function (if it exists) by sequentially running one forward pass (Algorithm 1) and one backward pass (Algorithm 2). Given , its forward pass computes and corresponding pieces such that for . Namely, we have for all .
Algorithm 1 Forward pass of reverse-mode automatic differentiation
Algorithm 2 Backward pass of reverse-mode automatic differentiation
Given and , the backward pass computes by applying the chain rule to the composition of differentiable functions . In particular, it iteratively updates and returns . It is well known that reverse-mode automatic differentiation computes the correct gradient, i.e., coincides with for all , if primitive functions do not have any branches [8,9]. However, if some uses branches, it may return arbitrary values even if the target function is differentiable at w [14,32,33]. In the rest of the paper, we use AD to denote reverse-mode automatic differentiation.
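The following self-contained sketch (our own simplification, not a verbatim rendering of Algorithms 1 and 2) illustrates the two passes for a branch-free program: the forward pass stores intermediate values, and the backward pass accumulates adjoints through the chain rule in reverse order.

```python
import math

# A minimal reverse-mode AD sketch (our own simplification of Algorithms 1-2)
# for a branch-free program.  Each primitive is given by its value function,
# its gradient function, and the indices of the variables it reads.

def forward(program, w):
    vals = list(w)                          # x_1, ..., x_n are the inputs
    for func, _, arg_idx in program:
        vals.append(func(*[vals[j] for j in arg_idx]))
    return vals

def backward(program, vals, n):
    adj = [0.0] * len(vals)
    adj[-1] = 1.0                           # seed: d(output)/d(output) = 1
    for k in reversed(range(len(program))):
        func, grad, arg_idx = program[k]
        i = n + k                           # index of this primitive's output
        g = grad(*[vals[j] for j in arg_idx])
        for j, gj in zip(arg_idx, g):
            adj[j] += adj[i] * gj           # chain rule, accumulated in reverse
    return adj[:n]                          # gradient w.r.t. the inputs

# Example: f(w1, w2) = sin(w1 * w2) + w1
program = [
    (lambda a, b: a * b,    lambda a, b: (b, a),          (0, 1)),  # x3 = w1 * w2
    (lambda a: math.sin(a), lambda a: (math.cos(a),),     (2,)),    # x4 = sin(x3)
    (lambda a, b: a + b,    lambda a, b: (1.0, 1.0),      (3, 0)),  # x5 = x4 + w1
]
w = [0.5, 2.0]
vals = forward(program, w)
print(vals[-1], backward(program, vals, len(w)))
# Gradient equals (w2 * cos(w1*w2) + 1, w1 * cos(w1*w2)).
```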
Our algorithm for computing an element of the Clarke subdifferential is similar to AD: it first finds some pieces and then applies the backward pass of AD (Algorithm 2) to compute its output. Here, we chose the pieces that are used for computing the forward pass with some perturbed input, not the original one. Hence, our pieces and those of AD are different in general, which enables our algorithm to correctly compute an element of the Clarke subdifferential. We provide more details, including the intuition behind our algorithm, in Section 3.2 and Section 3.3.
3. Efficient Automatic Subdifferentiation
In this section, we present our algorithm for efficiently computing an element of the Clarke subdifferential. To this end, we first introduce a class of primitive functions, which we consider in the rest of this paper. Then, we describe our algorithm after illustrating its underlying intuition via an example. Lastly, we analyze the computational complexity of our algorithm.
3.1. Assumptions on Primitive Functions
We considered primitive functions that satisfy the following assumptions.
Assumption 1.
For any , , and , the following hold:
- .
- is linear, i.e., there exists such that .
- is analytic on .
The first assumption states that, for any and , there exists such that selects at x. In other words, there is no non-reachable piece , i.e., all pieces of are necessary to express . The second assumption requires that all if–else statements of have linear in their conditions. Lastly, we considered , which is analytic on its domain (e.g., polynomials, exp, log, and sin), as stated in the third assumption. From this, is well-defined and analytic on some open set containing cl for all and .
Assumption 1 admits any primitive function that is analytic or piecewise analytic with linear boundaries such as max and bilinear interpolation. Hence, it allows many interesting programs such as nearly all neural networks considered in modern deep learning, e.g., [6,34].
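For instance, the following sketch (our own example, not from the paper) is a primitive function covered by Assumption 1 but not by the polynomial setting of [14]: its two pieces are analytic but not polynomial, the branch condition is linear, and the pieces agree at the boundary, so the function is continuous yet non-smooth.

```python
import math

# A primitive function covered by Assumption 1 (our own example): two analytic,
# non-polynomial pieces glued along the linear boundary x = 0.  Both pieces
# vanish at 0, so g is continuous, but the one-sided slopes differ
# (1 vs. 1/2), so g is non-smooth at 0.

def g(x):
    if x >= 0.0:                  # linear branch condition (Assumption 1, item 2)
        return math.exp(x) - 1.0  # analytic piece (item 3)
    else:
        return math.sin(x) / 2.0  # analytic piece (item 3)
```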
3.2. Intuition Behind Efficient Automatic Subdifferentiation
As in AD, our algorithm first performs one forward pass (Algorithm 3) to compute the intermediate values and to find proper pieces for the given input w. Then, it runs the original backward pass of AD (Algorithm 2) to compute an element of the Clarke subdifferential at w using the intermediate values and the pieces generated by the forward pass. Here, the key component of our algorithm is how to choose proper pieces in the forward pass so that the backward pass can correctly compute an element of the Clarke subdifferential.
Before describing our algorithm, we explain its underlying intuition. Let be a random vector drawn from a Gaussian distribution (see the initialization of Algorithm 3). Then, there exists unique and some almost surely such that
i.e., a given program takes the same piece for all inputs close to w along the direction of ; see Lemma 7 in Section 4 for the details. Since on and is differentiable, Equation (1) implies that is differentiable at for all . Therefore, the quantity:
is an element of the Clarke subdifferential , and our algorithm computes this very quantity via the backward pass of AD.
Algorithm 3 Forward pass of our algorithm
We now illustrate the main idea behind our forward pass, which enables the backward pass to compute in Equation (2). As an example, consider a program with the following primitive functions: are all analytic, and branches only once with and . For notational simplicity, we use .
If , then it is easy to observe that from the continuity of and u. Likewise, if , then . In the case that , we use the following directional derivatives to determine :
for , which can be easily computed using the chain rule. From the definition of , the linearity of , and the chain rule, it holds that
where denotes the vector of all with . Then, by Taylor’s theorem, (or ) implies (or ). In summary, if or , then the exact can be found, and hence, the backward pass (Algorithm 2) can correctly compute using .
Now, we considered the only remaining case: and . Unlike the previous cases, it is non-trivial here to find the correct because the first-order Taylor series approximation does not provide any information about whether a small perturbation of w toward increases or not. An important point, however, is that we do not need the exact to compute an element of the Clarke subdifferential; instead, it suffices to compute . Surprisingly, this can be performed by choosing an arbitrary piece of , as shown below.
For simplicity, suppose that , i.e., for some ; the below argument can be easily extended to an arbitrary linear . Let for , i.e., . Then, for any , we have
almost surely, by the chain rule and the following result: implies almost surely (Lemma 5 in Section 4). Here, from the continuity and the definition of , we must have on the hyperplane , and thus, for any and . From this and , we then obtain
for all . By combining Equations (4) and (5), we can finally conclude that
almost surely, where the last equality is from the fact that is either or . To summarize, if and , we can compute the target element of the Clarke subdifferential (i.e., ) by choosing an arbitrary piece of .
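The case analysis above can be condensed into a small selection rule. The sketch below (our own paraphrase; the function names and the branch orientation are assumptions) applies it to the absolute value at its kink and checks that the derivative of the selected piece is a valid element of the Clarke subdifferential [−1, 1]; for this example, the directional derivative is nonzero almost surely, so the arbitrary-choice case is never exercised.

```python
import random

def select_branch(value, dir_deriv):
    """Branch selection rule from Section 3.2 (paraphrased): use the sign of
    the branch value if it is nonzero, otherwise the sign of its directional
    derivative; if both vanish, any branch is admissible."""
    if value > 0.0 or (value == 0.0 and dir_deriv > 0.0):
        return "then"
    if value < 0.0 or (value == 0.0 and dir_deriv < 0.0):
        return "else"
    return random.choice(["then", "else"])    # arbitrary choice is still correct

# |w| at the kink w = 0: branch condition is w >= 0, pieces are +w and -w.
w, v = 0.0, random.gauss(0.0, 1.0)            # v is the random direction
branch = select_branch(w, v)                  # value = 0, directional derivative = v
subgrad = 1.0 if branch == "then" else -1.0   # derivative of the selected piece
assert -1.0 <= subgrad <= 1.0                 # Clarke subdifferential of |.| at 0 is [-1, 1]
print(branch, subgrad)
```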
3.3. Forward Pass for Efficient Automatic Subdifferentiation
Our algorithm for computing an element of the Clarke subdifferential is based on the observation made in the previous section: it runs one forward pass (Algorithm 3) for computing and some such that and one backward pass of AD (Algorithm 2) for computing .
We now describe our forward pass procedure (Algorithm 3). First, it randomly samples a vector from a Gaussian distribution and initializes for all (line 2). Then, it iterates for as follows. Given and their directional derivatives with respect to , lines 5–13 in Algorithm 3 find a proper piece of by exploring its branches. If the condition in line 6 is satisfied, then it moves to the branch corresponding to (line 7). It moves in a similar way if the condition in line 8 is satisfied. As in our example in Section 3.2, if and (line 10), then our algorithm moves to an arbitrary branch (line 11). Once Algorithm 3 finds a proper piece of , it updates and via the chain rule (line 14). Here, can be correctly computed due to the continuity of , while can also be correctly computed almost surely; see Lemma 8 in Section 4 for details. We remark that our algorithm is a generalization of the algorithm in [14]. The difference occurs in lines 10–11, where the existing algorithm deterministically chooses s based on some qualification condition [14].
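The sketch below is our own rendering of this forward pass for a restricted program format in which every branching primitive is a binary max with the linear condition a − b ≥ 0 (the data structures and names are assumptions): it propagates each intermediate value together with its directional derivative along the sampled direction and records the selected piece of every branching primitive.

```python
import random

# A sketch of the forward pass (our rendering of Algorithm 3) for programs whose
# branching primitives are binary max's with the linear condition a - b >= 0.
# Each step is ("analytic", func, grad, args) or ("max", i, j) over variable indices.

def forward_pass(program, w):
    v = [random.gauss(0.0, 1.0) for _ in w]        # random direction
    vals, dirs, pieces = list(w), list(v), []
    for step in program:
        if step[0] == "analytic":
            _, func, grad, args = step
            xs = [vals[j] for j in args]
            g = grad(*xs)
            vals.append(func(*xs))
            dirs.append(sum(gj * dirs[j] for gj, j in zip(g, args)))
            pieces.append(None)                    # no branch to record
        else:                                      # "max": branch on vals[i] - vals[j] >= 0
            _, i, j = step
            y, dy = vals[i] - vals[j], dirs[i] - dirs[j]
            if y > 0.0 or (y == 0.0 and dy > 0.0):
                k = i
            elif y < 0.0 or (y == 0.0 and dy < 0.0):
                k = j
            else:
                k = random.choice([i, j])          # both zero: arbitrary piece
            vals.append(vals[k])
            dirs.append(dirs[k])
            pieces.append(k)                       # record the selected piece
    return vals, pieces
```

The recorded pieces are exactly what the backward pass of AD (Algorithm 2) then differentiates; for a max step, this amounts to routing the adjoint to the selected argument.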
As illustrated in Section 3.2, the piece computed by our forward pass satisfies almost surely, and hence, the backward pass using this correctly computes almost surely, which is an element of the Clarke subdifferential. We formally state the correctness of our algorithm in the following theorem; its proof is given in Section 4.
Theorem 1.
Suppose that Assumption 1 holds. Then, for any , running Algorithm 3 and then Algorithm 2 returns an element of almost surely.
3.4. Computational Cost
In this section, we analyze the computational cost of our algorithm (both forward and backward passes) on a program P, compared to the cost of running P. Here, we only counted the cost of arithmetic operations and function evaluations and ignored the cost of memory reads and writes. We assumed that elementary operations (), the comparison between two scalar values (), and sampling a value from the standard normal distribution have a unit cost (e.g., cost), while the cost for evaluating an analytic function f is represented by cost. To denote the cost of evaluating a program P with an input w, we use cost. Likewise, for the cost of running our algorithm (i.e., Algorithms 2 and 3) on P and w, we use cost. Memory read/write costs are thus not included in our cost functions. Under this setup, we bound the computational cost of our algorithm in Theorem 2.
Theorem 2.
Suppose that for all . Then, for any program P and its input , where
The assumption in Theorem 2 is mild since it is satisfied if at least one distinct operation is applied to each input when evaluating P. The proof of Theorem 2 is presented in Section 5, where we use program representations of Algorithms 2 and 3 (see Figure 4 and Figure 5 below for the details).
Figure 4.
A program implementing the backward pass of AD (Algorithm 2).
Figure 5.
A program implementing Algorithm 3. Here, .
Suppose that, for each , and (i.e., the arity and the maximum branch depth of ) are independent of n. This condition holds in many practical cases: e.g., the absolute value function has and ; has and . Under this mild condition, , , and are independent of n, and thus, does so because the numerator in the definition of is independent of n and the denominator is at least one (as ). This implies that is independent of the input dimension n under the above condition.
In practical setups with large n, the computational cost of our algorithm can be much smaller than that of existing algorithms based on the lexicographic subdifferential [17,18,19,20]. For example, modern neural networks have more than a million parameters (i.e., n), where the cost for computing the gradient of each piece in the activation functions (i.e., ) is typically bounded by . Further, the depth of branches in these activation functions is often bounded by a constant (e.g., the depth is one for ReLU). Hence, for those networks, , and our algorithm does not incur much computational overhead. On the other hand, lexicographic-subdifferential-based approaches require at least n computations of [17,18,19,20], which may not be practical when n is large.
4. Proof of Theorem 1
In this section, we prove Theorem 3 under the setup where the direction vector in Algorithm 3 is given instead of randomly sampled. This theorem directly implies Theorem 1: the statement of Theorem 3 holds for almost every direction, and the proof of Theorem 1 requires showing the same statement almost surely, where the randomness comes from the direction following an isotropic Gaussian distribution. Namely, proving Theorem 3 suffices for proving Theorem 1. We note that all results in this section are under Assumption 1.
Theorem 3.
Given , Algorithms 3 and 2 compute an element of for almost every .
4.1. Additional Notations
We frequently use the following shorthand notations: the set of indices of branches:
and an auxiliary index set:
For , , and , we use
Note that and are analytic (and, therefore, differentiable) for all , , and . We next define the set of pieces reachable by our algorithm (Algorithm 3) with inputs as :
4.2. Technical Claims
Lemma 1.
For any open , for any analytic, but non-constant , and for any , there exists such that
Furthermore, f is strictly monotone on and strictly monotone on .
Proof of Lemma 1.
Without loss of generality, suppose that . Since f is analytic, f is infinitely differentiable and can be represented by the Taylor series on for some :
where denotes the i-th derivative of f. Since f is non-constant, there exists such that . Let be the minimum such i. Then, by Taylor’s theorem, .
Consider the case that and is odd. Then, we can choose so that
i.e., f is strictly increasing on (e.g., by the mean value theorem), and hence, . One can apply a similar argument to the cases that and is odd, and is even, and and is even. This completes the proof of Lemma 1. □
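As a quick numerical illustration (our own, under the reading that Lemma 1 asserts constant sign and strict monotonicity of a non-constant analytic f on a small one-sided neighborhood of a point where it vanishes):

```python
import math

# Numerical illustration of Lemma 1 (assumed reading): an analytic, non-constant
# f with f(0) = 0 has constant sign and is strictly monotone on (0, delta)
# for a small enough delta > 0.  Checked here on a fine grid for two examples.

def check(f, delta=1e-2, steps=1000):
    ts = [delta * (k + 1) / steps for k in range(steps)]
    ys = [f(t) for t in ts]
    same_sign = all(y > 0 for y in ys) or all(y < 0 for y in ys)
    monotone = all(a < b for a, b in zip(ys, ys[1:])) or \
               all(a > b for a, b in zip(ys, ys[1:]))
    return same_sign and monotone

print(check(lambda t: t**3 - t**5))        # True: f > 0 and increasing near 0+
print(check(lambda t: math.sin(t) - t))    # True: f < 0 and decreasing near 0+
```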
Lemma 2
(Proposition 0 in [35]). For any , for any open connected , and for any real analytic , if , then for all .
Lemma 3.
For any , for any open connected , and for any real analytic , if , then for all .
Proof of Lemma 3.
The proof directly follows from Lemma 2. □
4.3. Technical Assumptions
Assumption 2.
Given , satisfies the following: for any , , and , if is not a constant function, then
Assumption 3.
Given , satisfies the following: for any , , and ,
4.4. Technical Lemmas
Lemma 4.
Given , almost every satisfies Assumption 2.
Proof of Lemma 4.
Since , if the set of that does not satisfy Assumption 2 has a non-zero measure, then there exist , , and such that is not a constant function and
Without loss of generality, suppose that . Then, from the definition of , is contained in the zero set:
of an analytic function defined as
Namely, . However, from Lemma 2, must be a constant function, which contradicts our assumption that is not a constant function. This completes the proof of Lemma 4. □
Lemma 5.
Given , almost every satisfies Assumption 3.
Proof of Lemma 5.
Since implies for all , we prove the converse. Suppose that . Since the set has zero measure under ,
also has zero measure. This completes the proof of Lemma 5. □
Lemma 6.
For and , suppose that satisfies one of the following for all :
- ;
- .
Then,.
Proof of Lemma 6.
Without loss of generality, assume that . Since we assumed by Assumption 1, there exists , i.e., for all . Define
Since is linear, for and for any , we have
This implies that, for any and , it holds that
Since and for all by the definition of , there exists such that
for all and . Combining Equations (7) and (8) implies that for all , i.e., x is a limit point of . This completes the proof of Lemma 6. □
4.5. Key Lemmas
Lemma 7.
For any and for any satisfying Assumption 2, there exist and such that
Proof of Lemma 7.
We first define some notations: for an analytic function and ,
Under Assumption 2 and by Lemma 1, one can observe that, for any , , and , if and only if is a constant function. In addition, from Lemma 1, if is not a constant function, then . Using Algorithm 4, we iteratively construct and update for each so that
Under our construction of , one can observe that . From our choice of , for any , , and for , the following statements hold:
- If , then is open since is strictly monotone on ;
- If , then are constant functions (i.e., is a constant) due to Assumption 2.
For any , we have , where
Here, is open since each term for the intersection in the above equation is open; it is if , and it is an inverse image of a continuous function of an open set otherwise. This completes the proof of Lemma 7. □
Algorithm 4 Construction of and
Corollary 1.
For any and for any satisfying Assumption 2, there exist and such that
Proof of Corollary 1.
This corollary directly follows from Lemma 7. □
Lemma 8.
For any and for any satisfying Assumption 3, it holds that
Proof of Lemma 8.
We use the mathematical induction on i to show that and for all . The base case is trivial: and for all . Hence, suppose that since the case that is also trivial. Then, by the induction hypothesis, we have and for all . For notational simplicity, we denote and for all .
Let and . First, by Lemma 6, the definition of , and the induction hypothesis, we have
Due to the continuity of , this implies that
Now, it remains to show . To this end, we define the following:
From the definition of and , . Furthermore, by Assumption 3, for any and , we have if and only if , i.e., for all . Therefore, since , it holds that
In addition, due to the identities
showing the following stronger statement suffices for proving Lemma 8:
By Lemma 1 and the induction hypothesis (), there exists such that, for any and and for ,
Since each is linear (i.e., continuous) and by the definition of , for each , there exists an open neighborhood (open in ) of such that, for any and ,
Here, we claim that, for any ,
First, from the definition of , we have . In addition, by Equation (10) and the definition of , either or for all and ; the same argument also holds for . Hence, by Lemma 6, the LHS of Equation (11) holds. Likewise, we have .
Due to the continuity of , Equation (11) implies that, for any ,
i.e., for all and . Here, and are differentiable at by Equation (11) and Assumption 1. Due to the analyticity of and , this implies that, for any , we have
where and are differentiable at by Equation (11) and Assumption 1. This proves Equation (9) and completes the proof of Lemma 8. □
4.6. Proof of Theorem 3
Under Assumptions 2 and 3, combining Lemmas 4 and 5, Corollary 1, and Lemma 8 completes the proof of Theorem 3.
5. Proof of Theorem 2
Here, we analyze the computational costs based on the program representations in Figure 4 and Figure 5. For simplicity, we use A2 and A3 as shorthand notations for Algorithms 2 and 3, respectively. Under this setup, our cost analysis for and is as follows: for such that , where and are the outputs of and :
This implies the following:
where the first inequality is from the above bound and the second inequality is from the definition of and the assumption . This completes the proof.
6. Conclusions
In this work, we proposed an efficient subdifferentiation algorithm for computing an element of the Clarke subdifferential of programs with linear branches. In particular, we generalized the existing algorithm in [14] and extended its application from polynomials to analytic functions. The computational cost of our algorithm is at most that of the function evaluation multiplied by an input-dimension-independent factor, for primitive functions whose arities and maximum depths of branches are independent of the input dimension. We believe that extending our algorithm to general functions (e.g., continuously differentiable functions), general branches (e.g., nonlinear branches), and general programs (e.g., programs with loops) will be an important future research direction.
Funding
This research was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00079, Artificial Intelligence Graduate School Program, Korea University) and Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2022R1F1A1076180).
Data Availability Statement
No new data were created or analyzed in this study. Data sharing is not applicable to this article.
Conflicts of Interest
The author declares no conflict of interest.
References
- Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. TensorFlow: A System for Large-Scale Machine Learning. In Proceedings of the Symposium on Operating Systems Design and Implementation (OSDI), Savannah, GA, USA, 2–4 November 2016; pp. 265–283. [Google Scholar]
- Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic differentiation in PyTorch. In Proceedings of the NIPS Autodiff Workshop. 2017. Available online: https://openreview.net/forum?id=BJJsrmfCZ (accessed on 1 November 2023).
- Frostig, R.; Johnson, M.; Leary, C. Compiling machine learning programs via high-level tracing. In Proceedings of the SysML Conference, Stanford, CA, USA, 15–16 February 2018; Volume 4. [Google Scholar]
- Speelpenning, B. Compiling Fast Partial Derivatives of Functions Given by Algorithms; University of Illinois at Urbana-Champaign: Champaign, IL, USA, 1980. [Google Scholar]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), Lake Tahoe, NV, USA, 3–6 December 2012; pp. 84–90. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
- Griewank, A.; Walther, A. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, 2nd ed.; SIAM: Philadelphia, PA, USA, 2008. [Google Scholar]
- Pearlmutter, B.A.; Siskind, J.M. Reverse-mode AD in a functional framework: Lambda the ultimate backpropagator. ACM Trans. Program. Lang. Syst. 2008, 30, 1–36. [Google Scholar] [CrossRef]
- Baur, W.; Strassen, V. The complexity of partial derivatives. Theor. Comput. Sci. 1983, 22, 317–330. [Google Scholar] [CrossRef]
- Griewank, A. On automatic differentiation. Math. Program. Recent Dev. Appl. 1989, 6, 83–107. [Google Scholar]
- Bolte, J.; Boustany, R.; Pauwels, E.; Pesquet-Popescu, B. Nonsmooth automatic differentiation: A cheap gradient principle and other complexity results. arXiv 2022, arXiv:2206.01730. [Google Scholar]
- Griewank, A. Who invented the reverse mode of differentiation. Doc. Math. Extra Vol. ISMP 2012, 389400, 389–400. [Google Scholar]
- Kakade, S.M.; Lee, J.D. Provably Correct Automatic Sub-Differentiation for Qualified Programs. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), Montréal, QC, Canada, 3–8 December 2018; pp. 7125–7135. [Google Scholar]
- Lee, W.; Yu, H.; Rival, X.; Yang, H. On Correctness of Automatic Differentiation for Non-Differentiable Functions. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), Virtual, 6–12 December 2020; pp. 6719–6730. [Google Scholar]
- Nesterov, Y. Lexicographic differentiation of nonsmooth functions. Math. Program. 2005, 104, 669–700. [Google Scholar] [CrossRef]
- Khan, K.A.; Barton, P.I. Evaluating an element of the Clarke generalized Jacobian of a composite piecewise differentiable function. ACM Trans. Math. Softw. 2013, 39, 1–28. [Google Scholar]
- Khan, K.A.; Barton, P.I. A vector forward mode of automatic differentiation for generalized derivative evaluation. Optim. Methods Softw. 2015, 30, 1185–1212. [Google Scholar] [CrossRef]
- Barton, P.I.; Khan, K.A.; Stechlinski, P.; Watson, H.A.J. Computationally relevant generalized derivatives: Theory, evaluation and applications. Optim. Methods Softw. 2018, 33, 1030–1072. [Google Scholar] [CrossRef]
- Khan, K.A. Branch-locking AD techniques for nonsmooth composite functions and nonsmooth implicit functions. Optim. Methods Softw. 2018, 33, 1127–1155. [Google Scholar] [CrossRef]
- Griewank, A. Automatic directional differentiation of nonsmooth composite functions. In Proceedings of the Recent Developments in Optimization: Seventh French-German Conference on Optimization, Dijon, France, 27 June–2 July 1995; Springer: Berlin/Heidelberg, Germany, 1995; pp. 155–169. [Google Scholar]
- Sahlodin, A.M.; Barton, P.I. Optimal campaign continuous manufacturing. Ind. Eng. Chem. Res. 2015, 54, 11344–11359. [Google Scholar] [CrossRef]
- Sahlodin, A.M.; Watson, H.A.; Barton, P.I. Nonsmooth model for dynamic simulation of phase changes. AIChE J. 2016, 62, 3334–3351. [Google Scholar] [CrossRef]
- Hanin, B. Universal function approximation by deep neural nets with bounded width and relu activations. Mathematics 2019, 7, 992. [Google Scholar] [CrossRef]
- Alghamdi, H.; Hafeez, G.; Ali, S.; Ullah, S.; Khan, M.I.; Murawwat, S.; Hua, L.G. An Integrated Model of Deep Learning and Heuristic Algorithm for Load Forecasting in Smart Grid. Mathematics 2023, 11, 4561. [Google Scholar] [CrossRef]
- Boyd, S.P.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
- Rehman, H.U.; Kumam, P.; Argyros, I.K.; Shutaywi, M.; Shah, Z. Optimization based methods for solving the equilibrium problems with applications in variational inequality problems and solution of Nash equilibrium models. Mathematics 2020, 8, 822. [Google Scholar] [CrossRef]
- Davis, D.; Drusvyatskiy, D.; Kakade, S.; Lee, J.D. Stochastic subgradient method converges on tame functions. Found. Comput. Math. 2020, 20, 119–154. [Google Scholar] [CrossRef]
- Bolte, J.; Pauwels, E. Conservative set valued fields, automatic differentiation, stochastic gradient methods and deep learning. Math. Program. 2021, 188, 19–51. [Google Scholar] [CrossRef]
- Scholtes, S. Introduction to Piecewise Differentiable Equations; Springer: Berlin/Heidelberg, Germany, 2012. [Google Scholar]
- Griewank, A.; Bernt, J.U.; Radons, M.; Streubel, T. Solving piecewise linear systems in abs-normal form. Linear Algebra Its Appl. 2015, 471, 500–530. [Google Scholar] [CrossRef]
- Bolte, J.; Pauwels, E. A mathematical model for automatic differentiation in machine learning. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), Online, 6–12 December 2020; pp. 10809–10819. [Google Scholar]
- Lee, W.; Park, S.; Aiken, A. On the Correctness of Automatic Differentiation for Neural Networks with Machine-Representable Parameters. arXiv 2023, arXiv:2301.13370. [Google Scholar]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations (ICLR), Banff, AB, Canada, 14–16 April 2014. [Google Scholar]
- Mityagin, B. The zero set of a real analytic function. arXiv 2015, arXiv:1512.07276. [Google Scholar] [CrossRef]