Article

Efficient Automatic Subdifferentiation for Programs with Linear Branches

Department of Artificial Intelligence, Korea University, Seoul 02841, Republic of Korea
Mathematics 2023, 11(23), 4858; https://doi.org/10.3390/math11234858
Submission received: 2 November 2023 / Revised: 25 November 2023 / Accepted: 29 November 2023 / Published: 3 December 2023
(This article belongs to the Special Issue High-Speed Computing and Parallel Algorithms)

Abstract
Computing an element of the Clarke subdifferential of a function represented by a program is an important problem in modern non-smooth optimization. Existing algorithms either are computationally inefficient in the sense that the computational cost depends on the input dimension or can only cover simple programs such as polynomial functions with branches. In this work, we show that a generalization of the latter algorithm can efficiently compute an element of the Clarke subdifferential for programs consisting of analytic functions and linear branches, which can represent various non-smooth functions such as max, absolute values, and piecewise analytic functions with linear boundaries, as well as any program consisting of these functions such as neural networks with non-smooth activation functions. Our algorithm first finds a sequence of branches used for computing the function value at a random perturbation of the input; then, it returns an element of the Clarke subdifferential by running the backward pass of the reverse-mode automatic differentiation following those branches. The computational cost of our algorithm is at most that of the function evaluation multiplied by some constant independent of the input dimension n, if a program consists of piecewise analytic functions defined by linear branches, whose arities and maximum depths of branches are independent of n.

1. Introduction

Automatic differentiation refers to various techniques to compute the derivatives of a function represented by a program, based on the well-known chain rule of calculus. It has been widely used across various domains, and diverse practical automatic differentiation systems have been developed [1,2,3]. In particular, reverse-mode automatic differentiation [4] has been a driving force of the rapid advances in numerical optimization [5,6,7].
There are two important properties of reverse-mode automatic differentiation: correctness and efficiency. For programs consisting of smooth functions, it is well known that reverse-mode automatic differentiation always computes the correct derivatives [8,9]. Furthermore, for programs returning a scalar value, it efficiently computes their derivatives in the sense that its computational cost is at most proportional to that of the function evaluation, where the additional multiplicative factor is bounded by five for rational functions [10,11] and by some constant that depends on the underlying implementation of smooth functions; if the arities of the functions are independent of the input dimension n, then this constant is also independent of n [12]. Such correctness and efficiency of reverse-mode automatic differentiation, referred to as the Cheap Gradient Principle, have been of central importance for modern nonlinear optimization algorithms [13].
In practical problems, a program often involves branches (e.g., max and an absolute value), and the corresponding target function can be non-smooth. In other words, the derivative of the program may not exist at some inputs. In this work, we investigated a Cheap Subgradient Principle: an efficient algorithm that correctly computes an element of the Clarke subdifferential, a generalized notion of the derivative, for scalar programs. One naïve approach is to directly apply reverse-mode automatic differentiation to the Clarke subdifferential. Such a method is computationally cheap as in the smooth case; however, due to the absence of the sharp chain rule for the Clarke subdifferential, it is incorrect in general even if the target function is differentiable [8,14,15].
There have been extensive research efforts to correctly compute an element of the Clarke subdifferential. A notable line of work is based on the lexicographic subdifferential [16], a subset of the Clarke subdifferential for which a sharp chain rule holds under some structural assumptions. Based on this, a series of works [17,18,19,20] has shown that an element of the lexicographic subdifferential can be computed by evaluating n directional derivatives, where n denotes the input dimension. Consequently, this approach incurs a multiplicative factor of n in its computational cost compared to that of the function evaluation in the worst case.
To avoid such an input-dimension-dependent factor, a two-step randomized algorithm for programs with branches has been proposed [14]. In the first step, the algorithm chooses a random direction δ and finds a sequence of branch selections based on the directional derivative with respect to δ under some qualification condition [14,21]. Then, the second step computes the derivative corresponding to the branches returned in the first step, which is shown to be an element of the Clarke subdifferential. Here, the second step can also be efficiently implemented via reverse-mode automatic differentiation. As a result, this two-step algorithm correctly computes an element of the Clarke subdifferential under the qualification condition, with the computational cost independent of the input dimension. However, this result is only for piecewise polynomial programs, defined by branches and finite compositions of monomials and affine functions.
In this work, we propose an efficient automatic subdifferentiation algorithm by generalizing the algorithm in [14] described above. Our algorithm correctly computes an element of the Clarke subdifferential for programs consisting of any analytic functions (including polynomials) and linear branches. As in the prior efficient automatic (sub)differentiation works, the computational cost of our algorithm is that of the function evaluation multiplied by some constant independent of the input dimension n, if a program consists of piecewise analytic functions (defined by linear branches) whose arities and maximum depths of branches are independent of n (e.g., max and the absolute value).

Related Works

Non-smooth optimization: Although smooth functions are easy to formulate and optimize, they have limited applicability, as non-smoothness appears in various science and engineering problems. For example, real-world problems in thermodynamics often involve discrete switching between thermodynamic phases, which can be modeled by non-smooth functions; dynamic simulation and optimization of such systems require a careful treatment of these non-smooth models [22,23]. In machine learning applications, the hinge loss, $\mathrm{ReLU}(x) = \max\{x, 0\}$, and maxpool operations are often used, which makes the optimization objective non-smooth [6,24,25]. For optimizing convex, but non-smooth functions, subgradient methods are widely used for approximating a local minimum [26,27]. However, for non-convex functions, the subgradient does not exist in general, and researchers have investigated generalized notions of derivatives (e.g., the Clarke subdifferential).
Optimization algorithms using generalized derivatives: Recently, the convergence properties of optimization algorithms based on generalized derivatives for non-convex and non-smooth functions have received much attention. Ref. [28] proved that, for locally Lipschitz functions, the stochastic gradient method, where the gradient is chosen from the Clarke subdifferential, converges to a stationary point. However, as we introduced in the previous section, computing an element of the Clarke subdifferential is computationally expensive or can be efficient only for a specific class of programs. Ref. [29] proposed a new notion of gradient called the conservative gradient, which can be efficiently computed; nevertheless, its convergence property is not well understood, especially under practical setups.

2. Problem Setup

2.1. Notations

For $n \in \mathbb{N} \cup \{0\}$, we denote $[n] \coloneqq \{1, \ldots, n\}$, where $[0] = \emptyset$. For any set $S$, we use $S^0 \coloneqq \{()\}$, where $()$ denotes the zero-dimensional vector. For any vector $x \in S^n$ and $i \in [n]$, we write $x_i$ for the $i$-th coordinate of $x$ and $x_{:i} \coloneqq (x_1, \ldots, x_{i-1})$, where $x_{:1} = ()$; similarly, for any $x \in S^n$ and index set $I = \{i_1, \ldots, i_k\} \subseteq [n]$ with $i_1 < \cdots < i_k$, we use $x_I \coloneqq (x_{i_1}, \ldots, x_{i_k})$. For any $u = (u_1, \ldots, u_n) \in \mathbb{R}^n$ and $v = (v_1, \ldots, v_m) \in \mathbb{R}^m$, we use $u \oplus v \coloneqq (u_1, \ldots, u_n, v_1, \ldots, v_m)$; when $n = m$, we write $\langle u, v \rangle$ to denote the standard inner product of $u$ and $v$. For any $x \in \mathbb{R}$, $\mathrm{sign}(x) = 1$ if $x > 0$ and $\mathrm{sign}(x) = -1$ otherwise. For any real-valued vector $x$, $\mathrm{len}(x)$ denotes the length of $x$, i.e., $\mathrm{len}(x) = n$ if $x \in \mathbb{R}^n$. For any set $S \subseteq \mathbb{R}^n$, we use $\mathrm{cl}(S)$, $\mathrm{int}(S)$, and $\mathrm{conv}(S)$ to denote the closure, interior, and convex hull of $S$, respectively. We lastly define the Clarke subdifferential. Given a function $f : \mathbb{R}^n \to \mathbb{R}$ and the set $D \subseteq \mathbb{R}^n$ of all points at which $f$ is differentiable, the Clarke subdifferential of $f$ at $x \in \mathbb{R}^n$ is the set
$$\partial^c f(x) \coloneqq \mathrm{conv}\big\{ s \in \mathbb{R}^n : \exists \{u_t\}_{t \in \mathbb{N}} \subseteq D \text{ such that } u_t \to x \text{ and } \nabla f(u_t) \to s \big\}.$$
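To make this definition concrete, the following self-contained Python snippet (our illustration, not from the paper) traces the limiting gradients of $f(x) = |x|$ along sequences approaching 0 and recovers the Clarke subdifferential $\partial^c f(0) = [-1, 1]$.

```python
# Illustration (not from the paper): the Clarke subdifferential of f(x) = |x|
# at x = 0. f is differentiable everywhere except 0, with gradient sign(x).
# Limits of gradients along sequences u_t -> 0 are -1 and +1, so the
# Clarke subdifferential at 0 is their convex hull, the interval [-1, 1].

def grad_abs(x):
    """Gradient of |x| at a point where it is differentiable (x != 0)."""
    return 1.0 if x > 0 else -1.0

# Gradients along two sequences approaching 0 from either side.
from_right = [grad_abs(1.0 / 10**t) for t in range(1, 6)]
from_left = [grad_abs(-1.0 / 10**t) for t in range(1, 6)]

limit_points = {from_right[-1], from_left[-1]}          # {1.0, -1.0}
clarke_at_zero = (min(limit_points), max(limit_points))  # convex hull endpoints
print(clarke_at_zero)  # (-1.0, 1.0)
```

The two limit points $\pm 1$ are exactly the gradients of the two pieces of $|x|$; the convex-hull step is what distinguishes the Clarke subdifferential from the bare set of limiting gradients.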

2.2. Programs with Branches

We considered a program $P$ defined in Figure 1 (left). $P$ applies a series of primitive functions $F_{n+1}, \ldots, F_{n+m}$ to compute intermediate variables $v_{n+1}, \ldots, v_{n+m}$ and, then, returns the last result $v_{n+m}$. Each primitive function $F_i : \mathbb{R}^{d_i} \to \mathbb{R}$ is continuous and defined in an inductive way as in Figure 1 (right): $F_i$ either applies a function $f_{i,()}$ or branches via (possibly nested) if–else statements. Namely, $F_i$ is a continuous, piecewise-defined function. If $F_i$ branches with an input $(x_1, \ldots, x_{d_i})$, then it first evaluates $y = \phi_{i,()}(x_1, \ldots, x_{d_i})$ for some $\phi_{i,()} : \mathbb{R}^{d_i} \to \mathbb{R}$ and checks whether $y > c_{i,()}$ or not for some threshold value $c_{i,()} \in \mathbb{R}$. If $y > c_{i,()}$, it executes a code $E_{i,(1)}$ that either applies a function $f_{i,(1)}$ or executes another code $E_{i,(1,1)}$ or $E_{i,(1,-1)}$ depending on whether $\phi_{i,(1)}(x_1, \ldots, x_{d_i}) > c_{i,(1)}$ or not. The case $y \le c_{i,()}$ is handled in a similar way. Here, we assumed that each $F_i$ has finitely many branches. In short, each primitive function $F_i$ first finds a proper piece labeled by $p \in \bigcup_{j \ge 0} \{-1, 1\}^j$ and, then, returns $f_{i,p}(x_1, \ldots, x_{d_i})$. We illustrate a flow chart for a primitive function $F_i$ in Figure 2.
We present an example code for a primitive function $F_i$ that returns $\max\{x_1, x_2, x_3\}$ in Figure 3. In this example, $F_i$ branches at most twice: its first branch is determined by $\phi_{i,()}(x_1, x_2, x_3) = x_1 - x_2$, which is stored in $y$ in the second line in Figure 3. If $y > 0$ (the third line with $c_{i,()} = 0$), then it executes $E_{i,(1)}$, which corresponds to lines 4–6. Otherwise, it moves to $E_{i,(-1)}$, which corresponds to lines 8–10. Suppose $E_{i,(1)}$ is executed (i.e., $y > 0$ in line 3). Then, it computes $\phi_{i,(1)}(x_1, x_2, x_3) = x_1 - x_3$ and stores it in $y$ as in line 4. If $y > 0$, then $E_{i,(1,1)}$ is executed, which returns the value of $x_1$ (i.e., $f_{i,(1,1)}(x_1, x_2, x_3) = x_1$). Otherwise, $E_{i,(1,-1)}$ is executed, which returns the value of $x_3$ (i.e., $f_{i,(1,-1)}(x_1, x_2, x_3) = x_3$).
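The nested branching just described can be sketched in Python as follows. The piece labels and the linear tests $\phi$ for the branch with $y > 0$ follow the text; the other side (comparing $x_2$ and $x_3$) is our assumption, since Figure 3 is not reproduced here.

```python
# A sketch of the primitive function F_i from Figure 3 that returns
# max{x1, x2, x3} via nested linear branches (all thresholds c = 0).
# Each branch tests a linear function phi of the inputs; the selected
# piece p is a tuple over {-1, 1} recording the path taken.

def F_max3(x1, x2, x3):
    """Returns (value, piece): the maximum and the piece label selected."""
    y = x1 - x2          # phi_{i,()}(x) = x1 - x2, threshold c_{i,()} = 0
    if y > 0:            # code E_{i,(1)}
        y = x1 - x3      # phi_{i,(1)}(x) = x1 - x3
        if y > 0:
            return x1, (1, 1)     # f_{i,(1,1)}(x) = x1
        return x3, (1, -1)        # f_{i,(1,-1)}(x) = x3
    else:                # code E_{i,(-1)}: compare x2 and x3
        y = x2 - x3      # phi_{i,(-1)}(x) = x2 - x3 (assumed; not shown in the text)
        if y > 0:
            return x2, (-1, 1)    # f_{i,(-1,1)}(x) = x2
        return x3, (-1, -1)       # f_{i,(-1,-1)}(x) = x3

print(F_max3(3.0, 1.0, 2.0))  # (3.0, (1, 1))
print(F_max3(1.0, 2.0, 5.0))  # (5.0, (-1, -1))
```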
We considered each $v_i$ and $v_{pa(i)}$ as a function of an input $w \in \mathbb{R}^n$. Specifically, for all $i \in [n]$, we use $v_i(w) \coloneqq w_i$; and for all $i \in [n+m] \setminus [n]$, we use $v_i(w) \coloneqq F_i(v_{pa(i)}(w))$ and $v_{pa(i)}(w) \coloneqq (v_{j_1}(w), \ldots, v_{j_{d_i}}(w))$, where $pa(i) = \{j_1, \ldots, j_{d_i}\}$ with $0 < j_1 < \cdots < j_{d_i} < i$. Under this notation, $w \mapsto v_{n+m}(w)$ denotes the target function represented by the program $P$. We often omit $w$ and write $v_i$ and $v_{pa(i)}$ if it is clear from the context. We denote the gradient (or Jacobian) of functions with respect to an input $w$ by the operator $D$: e.g., $D v_i(w) \coloneqq \partial v_i(w) / \partial w$ and $D \phi_{i,p}(v_{pa(i)}(w)) \coloneqq \partial \phi_{i,p}(v_{pa(i)}(w)) / \partial w$.
Throughout this paper, we focus on programs with linear branches; that is, each $\phi_{i,p}(x_1, \ldots, x_d)$ is linear in $x_1, \ldots, x_d$. Primitive functions with linear branches can express fundamental non-smooth functions such as max, the absolute value, bilinear interpolation, and any piecewise analytic function with finitely many linear boundaries. They have been widely used in various fields including machine learning, electrical engineering, and non-smooth analysis. For example, the max–min representation (or the abs–normal form) has been extensively studied in non-smooth analysis [30,31]. Furthermore, neural networks using the $\mathrm{ReLU}(x) = \max\{x, 0\}$ activation function and maxpool operations are widely used in machine learning, computer vision, load forecasting, etc. [6,24,25]. The assumption on linear branches will be formally introduced in Assumption 1 in Section 3.1.

2.3. Pieces of Programs

We introduce useful notations here. For each $i \in [n+m] \setminus [n]$, we define the set of the pieces of $F_i$ as
$$\Gamma_i \coloneqq \Big\{ p \in \bigcup_{j \ge 0} \{-1, 1\}^j : E_{i,p} = \text{``return } f_{i,p}(x_1, \ldots, x_{d_i})\text{''} \Big\}.$$
We also define $\Gamma \coloneqq \{()\}^n \times \prod_{i=n+1}^{n+m} \Gamma_i$ for the pieces of the overall program. Here, we include an auxiliary piece $()$ for the first $n$ indices so that $\gamma_i \in \Gamma_i$ for any $\gamma = (\gamma_1, \ldots, \gamma_{n+m}) \in \Gamma$ and $i \in [n+m] \setminus [n]$. For each $i \in [n+m] \setminus [n]$, we define the set of the inputs to $F_i$ that correspond to the piece $p \in \Gamma_i$ as
$$X_{i,p} \coloneqq \{ x \in \mathbb{R}^{d_i} : F_i \text{ selects } f_{i,p} \text{ at the input } x \}.$$
Then, $\{X_{i,p} : p \in \Gamma_i\}$ forms a partition of $\mathbb{R}^{d_i}$. Likewise, we also define the set of the inputs to the overall program $P$ that correspond to $\gamma \in \Gamma$ as
$$W_\gamma \coloneqq \{ w \in \mathbb{R}^n : v_{pa(i)}(w) \in X_{i,\gamma_i} \text{ for all } i \in [n+m] \setminus [n] \}.$$
Then, $\{W_\gamma : \gamma \in \Gamma\}$ forms a partition of $\mathbb{R}^n$. Lastly, for each $\gamma \in \Gamma$, we inductively define the function $v_{i,\gamma} : \mathbb{R}^n \to \mathbb{R}$ that corresponds to $v_i(\cdot)$, but is obtained by using the $\gamma_j$ piece of each $F_j$, as
$$v_{i,\gamma}(w) \coloneqq \begin{cases} w_i & \text{if } i \in [n] \\ f_{i,\gamma_i}(v_{pa(i),\gamma}(w)) & \text{if } i \in [n+m] \setminus [n], \end{cases}$$
where $v_{pa(i),\gamma}(\cdot)$ is defined as in $v_{pa(i)}(\cdot)$. Then, $v_{i,\gamma}(\cdot)$ coincides with $v_i(\cdot)$ on $W_\gamma$ for all $i \in [n+m]$.
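As a small illustration (our construction, not from the paper), the objects of this subsection for a single primitive $F(x_1, x_2) = \max\{x_1, x_2\}$ look as follows: the piece set is $\Gamma_i = \{(1), (-1)\}$, and the regions $X_{i,p}$ partition $\mathbb{R}^2$.

```python
# A sketch of Section 2.3's objects for the single-primitive program
# F(x1, x2) = max(x1, x2), with phi(x) = x1 - x2 and threshold c = 0.

def piece_of(x1, x2):
    """Which piece of max(x1, x2) a given input selects."""
    return (1,) if x1 - x2 > 0 else (-1,)

Gamma_i = [(1,), (-1,)]
# X_{i,(1)} = {x : x1 > x2} and X_{i,(-1)} = {x : x1 <= x2}: every input
# falls into exactly one region, so {X_{i,p} : p in Gamma_i} partitions R^2.
samples = [(0.3, -1.0), (1.0, 1.0), (-2.0, 5.0)]
print([piece_of(*s) for s in samples])  # [(1,), (-1,), (-1,)]
```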

2.4. Reverse-Mode Automatic Differentiation

Reverse-mode automatic differentiation is an algorithm for computing the gradient $\nabla v_{n+m}(w)$ of the target function (if it exists) by sequentially running one forward pass (Algorithm 1) and one backward pass (Algorithm 2). Given $w \in \mathbb{R}^n$, its forward pass computes $v_{n+1}(w), \ldots, v_{n+m}(w)$ and corresponding pieces $\gamma_i \in \Gamma_i$ such that $w \in W_\gamma$ for $\gamma = ((), \ldots, (), \gamma_{n+1}, \ldots, \gamma_{n+m})$. Namely, we have $v_i(w) = v_{i,\gamma}(w)$ for all $i \in [n+m]$.
Algorithm 1 Forward pass of reverse-mode automatic differentiation
1: Input: $P$, $(w_1, \ldots, w_n)$
2: Initialize: $(v_1, \ldots, v_n) = (w_1, \ldots, w_n)$
3: for $i = n+1, \ldots, n+m$ do
4:   Let $p = ()$
5:   while $p \notin \Gamma_i$ do
6:     $p = p \oplus (\mathrm{sign}(\phi_{i,p}(v_{pa(i)}) - c_{i,p}))$
7:   end while
8:   Set $\gamma_i = p$ and $v_i = f_{i,\gamma_i}(v_{pa(i)})$
9: end for
10: return $(v_1, \ldots, v_{n+m})$, $(\gamma_{n+1}, \ldots, \gamma_{n+m})$
Algorithm 2 Backward pass of reverse-mode automatic differentiation
1: Input: $P$, $(v_1, \ldots, v_{n+m})$, $(\gamma_{n+1}, \ldots, \gamma_{n+m})$
2: Initialize: $(gv_1, \ldots, gv_{n+m}) = (0, \ldots, 0, 1)$
3: for $i = n+m, \ldots, n+1$ do
4:   for $j \in pa(i)$ do
5:     $gv_j = gv_j + gv_i \cdot \frac{\partial f_{i,\gamma_i}}{\partial v_j}(v_{pa(i)})$
6:   end for
7: end for
8: return $(gv_1, \ldots, gv_n)$
Given $v_1(w), \ldots, v_{n+m}(w)$ and $\gamma$, the backward pass computes $D v_{n+m,\gamma}(w)$ by applying the chain rule to the composition of differentiable functions $f_{n+1,\gamma_{n+1}}, \ldots, f_{n+m,\gamma_{n+m}}$. In particular, it iteratively updates $gv_i$ and returns $(gv_1, \ldots, gv_n) = D v_{n+m,\gamma}(w)$. It is well known that reverse-mode automatic differentiation computes the correct gradient, i.e., $gv_i$ coincides with $\partial v_{n+m}(w) / \partial w_i$ for all $i \in [n]$, if the primitive functions $F_{n+1}, \ldots, F_{n+m}$ do not have any branches [8,9]. However, if some $F_i$ uses branches, it may return arbitrary values even if the target function $v_{n+m}(\cdot)$ is differentiable at $w$ [14,32,33]. In the rest of the paper, we use AD to denote reverse-mode automatic differentiation.
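A standard illustration of this failure (reconstructed by us in Python; cf. [14,32,33]): the program $f(w) = \max\{w, 0\} - \max\{-w, 0\}$ computes the identity function, so $f'(0) = 1$, yet branch-following AD at $w = 0$ differentiates the locally selected constant pieces and returns 0.

```python
# Naive AD through branches can be wrong even where the target function is
# differentiable. f(w) = max(w, 0) - max(-w, 0) equals w everywhere, so
# f'(0) = 1, but branch-following AD at w = 0 takes the "else" piece of
# both max's (each the constant 0) and reports derivative 0.

def relu_with_grad(x):
    """max(x, 0) and the derivative of the selected piece."""
    if x > 0:
        return x, 1.0    # piece f(x) = x
    return 0.0, 0.0      # piece f(x) = 0  (selected at x = 0)

def f_and_naive_ad(w):
    a, da = relu_with_grad(w)      # a = max(w, 0)
    b, db = relu_with_grad(-w)     # b = max(-w, 0); inner derivative of -w is -1
    value = a - b                  # equals w for every input w
    grad = da * 1.0 - db * (-1.0)  # chain rule along the selected pieces
    return value, grad

print(f_and_naive_ad(0.0))  # (0.0, 0.0): AD reports 0, the true derivative is 1
```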
Our algorithm for computing an element of the Clarke subdifferential is similar to AD: it first finds some pieces $\gamma_i^* \in \Gamma_i$ and applies the backward pass of AD (Algorithm 2) to compute its output. Here, we chose the pieces $\gamma_i^*$ that are used for computing the forward pass at some perturbed input, not the original one. Hence, our pieces and those of AD are different in general, which enables our algorithm to correctly compute an element of the Clarke subdifferential. We provide more details, including the intuition behind our algorithm, in Section 3.2 and Section 3.3.

3. Efficient Automatic Subdifferentiation

In this section, we present our algorithm for efficiently computing an element of the Clarke subdifferential. To this end, we first introduce a class of primitive functions, which we consider in the rest of this paper. Then, we describe our algorithm after illustrating its underlying intuition via an example. Lastly, we analyze the computational complexity of our algorithm.

3.1. Assumptions on Primitive Functions

We considered primitive functions that satisfy the following assumptions.
Assumption 1. 
For any $i \in [n+m] \setminus [n]$, $p \in \Gamma_i$, and $j \in [\mathrm{len}(p)]$, the following hold:
  • $X_{i,p} \neq \emptyset$.
  • $\phi_{i,p_{:j}}$ is linear, i.e., there exists $z \in \mathbb{R}^{d_i}$ such that $\phi_{i,p_{:j}}(x) = \langle z, x \rangle$.
  • $f_{i,p}$ is analytic on $\mathbb{R}^{d_i}$.
The first assumption states that, for any $i \in [n+m] \setminus [n]$ and $p \in \Gamma_i$, there exists $x \in \mathbb{R}^{d_i}$ such that $F_i$ selects $f_{i,p}$ at $x$. In other words, there is no non-reachable piece $p \in \Gamma_i$, i.e., all pieces of $F_i$ are necessary to express $F_i$. The second assumption requires that all if–else statements of $F_i$ have linear $\phi_{i,p}$ in their conditions. Lastly, we considered $f_{i,p}$ that are analytic on their domain (e.g., polynomials, exp, log, and sin), as stated in the third assumption. From this, $v_{i,\gamma}(\cdot)$ is well-defined and analytic on some open set containing $\mathrm{cl}(W_\gamma)$ for all $i \in [n+m]$ and $\gamma \in \Gamma$.
Assumption 1 admits any primitive function that is analytic or piecewise analytic with linear boundaries such as max and bilinear interpolation. Hence, it allows many interesting programs such as nearly all neural networks considered in modern deep learning, e.g., [6,34].

3.2. Intuition Behind Efficient Automatic Subdifferentiation

As in AD, our algorithm first performs one forward pass (Algorithm 3) to compute the intermediate values v n + 1 ( w ) , , v n + m ( w ) and to find proper pieces γ n + 1 , , γ n + m for the given input w. Then, it runs the original backward pass of AD (Algorithm 2) to compute an element of the Clarke subdifferential at w using the intermediate values and the pieces generated by the forward pass. Here, the key component of our algorithm is about how to choose proper pieces in the forward pass so that the backward pass can correctly compute an element of the Clarke subdifferential.
Before describing our algorithm, we explain its underlying intuition. Let $\delta = (\delta_1, \ldots, \delta_n) \in \mathbb{R}^n$ be a random vector drawn from a Gaussian distribution (see the initialization of Algorithm 3). Then, almost surely, there exist a unique $\gamma^* \in \Gamma$ and some $s^* > 0$ such that
$$w + t \cdot \delta \in \mathrm{int}(W_{\gamma^*}) \quad \text{for all } t \in (0, s^*),$$
i.e., a given program takes the same piece $\gamma^*$ for all inputs close to $w$ along the direction of $\delta$; see Lemma 7 in Section 4 for the details. Since $v_{n+m}(\cdot) = v_{n+m,\gamma^*}(\cdot)$ on $W_{\gamma^*}$ and $v_{n+m,\gamma^*}(\cdot)$ is differentiable, Equation (1) implies that $v_{n+m}(\cdot)$ is differentiable at $w + t \cdot \delta$ for all $t \in (0, s^*)$. Therefore, the quantity
$$D v_{n+m,\gamma^*}(w) = \lim_{t \to 0^+} D v_{n+m,\gamma^*}(w + t \cdot \delta) = \lim_{t \to 0^+} D v_{n+m}(w + t \cdot \delta)$$
is an element of the Clarke subdifferential $\partial^c v_{n+m}(w)$, and our algorithm computes this very quantity via the backward pass of AD.
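Numerically, Equation (2) can be sketched as follows (our example, not from the paper): for $f(w) = \max\{w_1, w_2\}$ at $w = (1, 1)$, stepping along a random Gaussian direction $\delta$ lands almost surely in a region where $f$ is differentiable, and the gradient there, $e_1$ or $e_2$ depending on whether $\delta_1 > \delta_2$, is an element of $\partial^c f(w) = \mathrm{conv}\{e_1, e_2\}$.

```python
# A numeric sketch of Equation (2): perturb w = (1, 1) along a random
# Gaussian direction delta and read off the gradient of max(w1, w2) at the
# perturbed point, where the function is differentiable almost surely.
import random

def grad_max2(w1, w2):
    """Gradient of max(w1, w2) where it is differentiable (w1 != w2)."""
    return (1.0, 0.0) if w1 > w2 else (0.0, 1.0)

random.seed(0)
w = (1.0, 1.0)                                  # a kink of max(w1, w2)
delta = (random.gauss(0, 1), random.gauss(0, 1))
t = 1e-8                                        # small step along delta
g = grad_max2(w[0] + t * delta[0], w[1] + t * delta[1])
# Both e1 and e2 lie in the Clarke subdifferential conv{e1, e2} at w.
assert g in [(1.0, 0.0), (0.0, 1.0)]
print(g)
```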
Algorithm 3 Forward pass of our algorithm
1: Input: $P$, $(w_1, \ldots, w_n)$
2: Initialize: Sample $\delta_i \sim \mathrm{Normal}(0, 1)$, and set $(v_i, dv_i) = (w_i, \delta_i)$ for all $i \in [n]$
3: for $i = n+1, \ldots, n+m$ do
4:   Let $p = ()$
5:   while $p \notin \Gamma_i$ do
6:     if $(\phi_{i,p}(v_{pa(i)}) > c_{i,p}) \vee (\phi_{i,p}(v_{pa(i)}) = c_{i,p} \wedge \phi_{i,p}(dv_{pa(i)}) > 0)$ then
7:       $p = p \oplus (1)$
8:     else if $(\phi_{i,p}(v_{pa(i)}) < c_{i,p}) \vee (\phi_{i,p}(v_{pa(i)}) = c_{i,p} \wedge \phi_{i,p}(dv_{pa(i)}) < 0)$ then
9:       $p = p \oplus (-1)$
10:    else if $\phi_{i,p}(v_{pa(i)}) = c_{i,p} \wedge \phi_{i,p}(dv_{pa(i)}) = 0$ then
11:      $p = p \oplus (s)$ for any $s \in \{-1, 1\}$
12:    end if
13:  end while
14:  Set $\gamma_i = p$, $v_i = f_{i,p}(v_{pa(i)})$, and $dv_i = \langle \nabla f_{i,p}(v_{pa(i)}), dv_{pa(i)} \rangle$
15: end for
16: return $(v_1, \ldots, v_{n+m})$, $(\gamma_{n+1}, \ldots, \gamma_{n+m})$
We now illustrate the main idea behind our forward pass, which enables the backward pass to compute $D v_{n+m,\gamma^*}(w)$ in Equation (2). As an example, consider a program with the following primitive functions: $F_{n+1}, \ldots, F_{n+m-1}$ are all analytic, and $F_{n+m}$ branches only once with $\phi_{n+m,()}(\cdot) = \phi(\cdot)$ and $c_{n+m,()} = 0$. For notational simplicity, we use $u(w) = v_{pa(n+m)}(w)$.
If $\phi(u(w)) > 0$, then it is easy to observe that $\gamma^*_{n+m} = (1)$ from the continuity of $\phi$ and $u$. Likewise, if $\phi(u(w)) < 0$, then $\gamma^*_{n+m} = (-1)$. In the case that $\phi(u(w)) = 0$, we use the following directional derivatives to determine $\gamma^*_{n+m}$:
$$dv_i(w; \delta) \coloneqq \lim_{t \to 0^+} \frac{v_i(w + t \cdot \delta) - v_i(w)}{t}$$
for $i \in [n+m-1]$, which can be easily computed using the chain rule. From the definition of $dv_i$, the linearity of $\phi$, and the chain rule, it holds that
$$\lim_{t \to 0^+} \frac{\phi(u(w + t \cdot \delta)) - \phi(u(w))}{t} = \sum_{j \in pa(n+m)} \frac{\partial \phi(u(w))}{\partial v_j(w)} \cdot dv_j(w; \delta) = \phi(du(w; \delta)),$$
where $du(w; \delta)$ denotes the vector of all $dv_j(w; \delta)$ with $j \in pa(n+m)$. Then, by Taylor's theorem, $\phi(du(w; \delta)) > 0$ (or $\phi(du(w; \delta)) < 0$) implies $\gamma^*_{n+m} = (1)$ (or $\gamma^*_{n+m} = (-1)$). In summary, if $\phi(u(w)) \neq 0$ or $\phi(du(w; \delta)) \neq 0$, then the exact $\gamma^*_{n+m}$ can be found, and hence, the backward pass (Algorithm 2) can correctly compute $D v_{n+m,\gamma^*}(w)$ using $\gamma^* = ((), \ldots, (), \gamma^*_{n+m})$.
Now, we considered the only remaining case: $\phi(u(w)) = 0$ and $\phi(du(w; \delta)) = 0$. Unlike the previous cases, it is non-trivial here to find the correct $\gamma^*_{n+m}$ because the first-order Taylor series approximation does not provide any information about whether a small perturbation of $w$ toward $\delta$ increases $\phi(u(w))$ or not. An important point, however, is that we do not need the exact $\gamma^*_{n+m}$ to compute an element of the Clarke subdifferential; instead, it suffices to compute $D v_{n+m,\gamma^*}(w)$. Surprisingly, this can be performed by choosing an arbitrary piece of $F_{n+m}$, as shown below.
For simplicity, suppose that $\phi(x) = x_1$, i.e., $\phi(u(w)) = v_{i^*}(w)$ for some $i^* \in pa(n+m)$; the argument below can be easily extended to an arbitrary linear $\phi$. Let $\gamma^\alpha = ((), \ldots, (), \alpha)$ for $\alpha \in \{(-1), (1)\}$, i.e., $\gamma^\alpha_{n+m} = \alpha$. Then, for any $\alpha \in \{(-1), (1)\}$, we have
$$D v_{n+m,\gamma^\alpha}(w) = \sum_{j \in pa(n+m) \setminus \{i^*\}} \frac{\partial f_{n+m,\alpha}(u(w))}{\partial v_j(w)} \cdot D v_j(w)$$
almost surely, by the chain rule and the following result: $dv_{i^*}(w; \delta) = \phi(du(w; \delta)) = 0$ implies $D v_{i^*}(w) = 0$ almost surely (Lemma 5 in Section 4). Here, from the continuity and the definition of $F_{n+m}$, we must have $f_{n+m,(1)} = f_{n+m,(-1)}$ on the hyperplane $\{x \in \mathbb{R}^{d_{n+m}} : x_1 = 0\}$, and thus, $\partial f_{n+m,(1)}(x) / \partial x_j = \partial f_{n+m,(-1)}(x) / \partial x_j$ for any $x \in \{x \in \mathbb{R}^{d_{n+m}} : x_1 = 0\}$ and $j \in [d_{n+m}] \setminus \{1\}$. From this and $v_{i^*}(w) = \phi(u(w)) = 0$, we then obtain
$$\frac{\partial f_{n+m,(1)}(u(w))}{\partial v_j(w)} = \frac{\partial f_{n+m,(-1)}(u(w))}{\partial v_j(w)}$$
for all $j \in pa(n+m) \setminus \{i^*\}$. By combining Equations (4) and (5), we can finally conclude that
$$D v_{n+m,\gamma^{(1)}}(w) = D v_{n+m,\gamma^{(-1)}}(w) = D v_{n+m,\gamma^*}(w)$$
almost surely, where the last equality is from the fact that $\gamma^*_{n+m}$ is either $(1)$ or $(-1)$. To summarize, if $\phi(u(w)) = 0$ and $\phi(du(w; \delta)) = 0$, we can compute the target element of the Clarke subdifferential (i.e., $D v_{n+m,\gamma^*}(w)$) by choosing an arbitrary piece of $F_{n+m}$.

3.3. Forward Pass for Efficient Automatic Subdifferentiation

Our algorithm for computing an element of the Clarke subdifferential is based on the observation made in the previous section: it runs one forward pass (Algorithm 3) for computing $v_{n+1}(w), \ldots, v_{n+m}(w)$ and some $\gamma \in \Gamma$ such that $D v_{n+m,\gamma}(w) = D v_{n+m,\gamma^*}(w)$, and one backward pass of AD (Algorithm 2) for computing $D v_{n+m,\gamma}(w)$.
We now describe our forward pass procedure (Algorithm 3). First, it randomly samples a vector $\delta \in \mathbb{R}^n$ from a Gaussian distribution and initializes $dv_i = \delta_i$ for all $i \in [n]$ (line 2). Then, it iterates for $i = n+1, \ldots, n+m$ as follows. Given $v_1(w), \ldots, v_{i-1}(w)$ and their directional derivatives $dv_1(w; \delta), \ldots, dv_{i-1}(w; \delta)$ with respect to $\delta$, lines 5–13 in Algorithm 3 find a proper piece $\gamma_i \in \Gamma_i$ of $F_i$ by exploring its branches. If the condition in line 6 is satisfied, then it moves to the branch corresponding to $\phi_{i,p}(v_{pa(i)}(w)) > c_{i,p}$ (line 7). It moves in a similar way if the condition in line 8 is satisfied. As in our example in Section 3.2, if $\phi_{i,p}(v_{pa(i)}(w)) = c_{i,p}$ and $\phi_{i,p}(dv_{pa(i)}(w; \delta)) = 0$ (line 10), then our algorithm moves to an arbitrary branch (line 11). Once Algorithm 3 finds a proper piece $\gamma_i$ of $F_i$, it updates $v_i(w)$ and $dv_i(w; \delta)$ via the chain rule (line 14). Here, $v_i(w)$ can be correctly computed due to the continuity of $F_i$, while $dv_i(w; \delta)$ can also be correctly computed almost surely; see Lemma 8 in Section 4 for details. We remark that our algorithm is a generalization of the algorithm in [14]. The difference occurs in lines 10–11, where the existing algorithm deterministically chooses $s$ based on some qualification condition [14].
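For a single $\max$ primitive, the piece-selection logic of lines 5–13 can be sketched in Python as follows (our own sketch, not the authors' implementation): each value carries its directional derivative $dv$ along $\delta$, and $dv$ breaks the tie when the linear test $\phi$ hits its threshold.

```python
# A minimal sketch of Algorithm 3's forward pass for a single-primitive
# program F(x1, x2) = max(x1, x2): propagate (v, dv) pairs, where dv is the
# directional derivative along a random Gaussian direction delta, and use
# dv to break ties when phi(v) hits its threshold c = 0.
import random

def forward_max2(v1, dv1, v2, dv2):
    """Piece selection for max(x1, x2) with phi(x) = x1 - x2, c = 0."""
    phi_v, phi_dv = v1 - v2, dv1 - dv2  # phi is linear, so it acts on dv too
    if phi_v > 0 or (phi_v == 0 and phi_dv > 0):     # line 6 of Algorithm 3
        piece, v, dv = (1,), v1, dv1                 # f(x) = x1 on this piece
    elif phi_v < 0 or (phi_v == 0 and phi_dv < 0):   # line 8
        piece, v, dv = (-1,), v2, dv2                # f(x) = x2 on this piece
    else:                                            # line 10: both are zero,
        piece, v, dv = (1,), v1, dv1                 # so pick either piece
    return v, dv, piece

random.seed(1)
w, delta = (2.0, 2.0), (random.gauss(0, 1), random.gauss(0, 1))
v, dv, piece = forward_max2(w[0], delta[0], w[1], delta[1])
print(piece)  # (1,) if delta[0] > delta[1], else (-1,)
```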
As illustrated in Section 3.2, the piece $\gamma \in \Gamma$ computed by our forward pass satisfies $D v_{n+m,\gamma}(w) = D v_{n+m,\gamma^*}(w)$ almost surely, and hence, the backward pass using this $\gamma$ correctly computes $D v_{n+m,\gamma^*}(w)$ almost surely, which is an element of the Clarke subdifferential. We formally state the correctness of our algorithm in the following theorem; its proof is given in Section 4.
Theorem 1. 
Suppose that Assumption 1 holds. Then, for any $w \in \mathbb{R}^n$, running Algorithm 3 and then Algorithm 2 returns an element of $\partial^c v_{n+m}(w)$ almost surely.

3.4. Computational Cost

In this section, we analyze the computational cost of our algorithm (both forward and backward passes) on a program $P$, compared to the cost of running $P$. Here, we only counted the cost of arithmetic operations and function evaluations and ignored the cost of memory reads and writes. We assumed that elementary operations $(+, \times, -, \div)$, comparisons between two scalar values $(>, <, =)$, and sampling a value from the standard normal distribution have a unit cost (e.g., $\mathrm{cost}(+) = 1$), while the cost of evaluating an analytic function $f$ is denoted by $\mathrm{cost}(f)$. To denote the cost of evaluating a program $P$ with an input $w$, we use $\mathrm{cost}(P(w))$. Likewise, for the cost of running our algorithm (i.e., Algorithms 3 and 2) on $P$ and $w$, we use $\mathrm{cost}(\mathrm{ours}(P, w))$. Under this setup, we bound the computational cost of our algorithm in Theorem 2.
Theorem 2. 
Suppose that $\mathrm{cost}(P(w)) \ge n$ for all $w \in \mathbb{R}^n$. Then, for any program $P$ and its input $w \in \mathbb{R}^n$, $\mathrm{cost}(\mathrm{ours}(P, w)) \le \kappa \cdot \mathrm{cost}(P(w))$, where
$$\kappa \coloneqq 1 + \max_{i \in [n+m] \setminus [n]} \kappa_i, \qquad \kappa_i \coloneqq \frac{\max_{p \in \Gamma_i} \Big( 2\,\mathrm{cost}(f_{i,p}) + \mathrm{cost}(\nabla f_{i,p}) + 4 d_i + 4\,\mathrm{len}(p) + 2 \sum_{j=1}^{\mathrm{len}(p)} \mathrm{cost}(\phi_{i,p_{:j}}) \Big)}{\min_{q \in \Gamma_i} \Big( \mathrm{cost}(f_{i,q}) + \mathrm{len}(q) + \sum_{j=1}^{\mathrm{len}(q)} \mathrm{cost}(\phi_{i,q_{:j}}) \Big)}.$$
The assumption in Theorem 2 is mild since it is satisfied if at least one distinct operation is applied to each input for evaluating P. The proof of Theorem 2 is presented in Section 5, where we use program representations of Algorithms 2 and 3 (see Figure 4 and Figure 5 in Section 3.4 for the details).
Suppose that, for each $i \in [n+m] \setminus [n]$, $d_i$ and $\max_{p \in \Gamma_i} \mathrm{len}(p)$ (i.e., the arity and the maximum branch depth of $F_i$) are independent of $n$. This condition holds in many practical cases: e.g., the absolute value function has $d_i = 1$ and $\max_{p \in \Gamma_i} \mathrm{len}(p) = 1$; $\max\{\cdot, \cdot\}$ has $d_i = 2$ and $\max_{p \in \Gamma_i} \mathrm{len}(p) = 1$. Under this mild condition, $\mathrm{cost}(f_{i,p})$, $\mathrm{cost}(\nabla f_{i,p})$, and $\mathrm{cost}(\phi_{i,p_{:j}})$ are independent of $n$, and thus, so is $\kappa_i$, because the numerator in the definition of $\kappa_i$ is independent of $n$ and the denominator is at least one (as $\mathrm{cost}(f_{i,q}) \ge 1$). This implies that $\kappa$ is independent of the input dimension $n$ under the above condition.
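As a worked instance of Theorem 2 (our arithmetic, under the illustrative assumption that $\mathrm{cost}(f_{i,p})$, $\mathrm{cost}(\nabla f_{i,p})$, and $\mathrm{cost}(\phi_{i,p_{:j}})$ all equal one), the overhead factor for the absolute-value primitive with $d_i = 1$ and branch depth one evaluates to a small constant independent of $n$:

```python
# Plugging the absolute-value primitive into the bound of Theorem 2,
# with all function-evaluation costs assumed to be 1 (purely illustrative).

cost_f, cost_grad_f, cost_phi = 1, 1, 1  # assumed unit costs
d_i, len_p = 1, 1                        # arity and branch depth of abs

numerator = 2 * cost_f + cost_grad_f + 4 * d_i + 4 * len_p + 2 * len_p * cost_phi
denominator = cost_f + len_p + len_p * cost_phi
kappa_i = numerator / denominator        # 13 / 3
kappa = 1 + kappa_i                      # overhead factor, independent of n
print(kappa)  # 5.333...
```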
In practical setups with large $n$, the computational cost of our algorithm can be much smaller than that of existing algorithms based on the lexicographic subdifferential [17,18,19,20]. For example, modern neural networks have more than a million parameters (i.e., $n$), where the cost of computing the gradient of each piece in the activation functions (i.e., $\mathrm{cost}(\nabla f_{i,p})$) is typically bounded by $O(\mathrm{cost}(f_{i,p}))$. Further, the depth of branches in these activation functions is often bounded by a constant (e.g., the depth is one for ReLU). Hence, for those networks, $\kappa = O(1)$, and our algorithm does not incur much computational overhead. On the other hand, lexicographic-subdifferential-based approaches require at least $n$ computations of $P(w)$ [17,18,19,20], which may not be practical when $n$ is large.

4. Proof of Theorem 1

In this section, we prove Theorem 3, which is stated under the setup that $\delta$ in Algorithm 3 is given instead of randomly sampled. This theorem directly implies Theorem 1: the statement of Theorem 3 holds for almost every $\delta$, while the proof of Theorem 1 requires showing the same statement almost surely, where the randomness comes from $\delta$ following an isotropic Gaussian distribution. Namely, proving Theorem 3 suffices for proving Theorem 1. We note that all results in this section are under Assumption 1.
Theorem 3. 
Given $w \in \mathbb{R}^n$, Algorithms 3 and 2 compute an element of $\partial^c v_{n+m}(w)$ for almost every $\delta \in \mathbb{R}^n$.

4.1. Additional Notations

We frequently use the following shorthand notations: the set of indices of branches
$$I_{br} \coloneqq \{ i \in [n+m] : |\Gamma_i| > 1 \}$$
and an auxiliary index set
$$\mathrm{Idx}_i \coloneqq \{ (j, p) : p \in \Gamma_i, \, j \in [\mathrm{len}(p)] \}.$$
For $\gamma \in \Gamma$, $i \in [n+m] \setminus [n]$, and $(j, p) \in \mathrm{Idx}_i$, we use
$$\phi^\gamma_{i,p_{:j}}(w) \coloneqq \phi_{i,p_{:j}}(v_{pa(i),\gamma}(w)).$$
Note that $v_{i,\gamma}$ and $\phi^\gamma_{i,p_{:j}}$ are analytic (and, therefore, differentiable) for all $\gamma \in \Gamma$, $i \in [n+m] \setminus [n]$, and $(j, p) \in \mathrm{Idx}_i$. We next define the set of pieces reachable by our algorithm (Algorithm 3) with inputs $w = (w_1, \ldots, w_n), \delta = (\delta_1, \ldots, \delta_n) \in \mathbb{R}^n$ as $\Gamma(w, \delta)$:
$$\Gamma(w, \delta) \coloneqq \{ \gamma \in \Gamma : \gamma_{i,j} \in C^\gamma_{i,j}(w; \delta) \ \text{for all } i \in I_{br}, \, j \in [\mathrm{len}(\gamma_i)] \},$$
where
$$C^\gamma_{i,j}(w; \delta) \coloneqq \begin{cases} \{1\} & \text{if } \big( \phi_{i,\gamma_{i,:j}}(v_{pa(i),\gamma}(w)) = c_{i,\gamma_{i,:j}} \wedge \phi_{i,\gamma_{i,:j}}(dv_{pa(i),\gamma}(w; \delta)) > 0 \big) \vee \phi_{i,\gamma_{i,:j}}(v_{pa(i),\gamma}(w)) > c_{i,\gamma_{i,:j}} \\ \{-1\} & \text{if } \big( \phi_{i,\gamma_{i,:j}}(v_{pa(i),\gamma}(w)) = c_{i,\gamma_{i,:j}} \wedge \phi_{i,\gamma_{i,:j}}(dv_{pa(i),\gamma}(w; \delta)) < 0 \big) \vee \phi_{i,\gamma_{i,:j}}(v_{pa(i),\gamma}(w)) < c_{i,\gamma_{i,:j}} \\ \{-1, 1\} & \text{if } \phi_{i,\gamma_{i,:j}}(v_{pa(i),\gamma}(w)) = c_{i,\gamma_{i,:j}} \wedge \phi_{i,\gamma_{i,:j}}(dv_{pa(i),\gamma}(w; \delta)) = 0. \end{cases}$$

4.2. Technical Claims

Lemma 1. 
For any open O ⊆ R, for any analytic but non-constant f : O → R, and for any x ∈ O, there exists ε > 0 such that
f(x) ∉ f([x − ε, x + ε] ∖ {x}).
Furthermore, f is strictly monotone on [x, x + ε] and strictly monotone on [x − ε, x].
Proof of Lemma 1. 
Without loss of generality, suppose that f(x) = 0. Since f is analytic, f is infinitely differentiable and can be represented by its Taylor series on (x − δ, x + δ) for some δ > 0:
f(z) = Σ_{i=0}^∞ f^{(i)}(x)/i! · (z − x)^i,
where f^{(i)} denotes the i-th derivative of f. Since f is non-constant, there exists i ∈ N such that f^{(i)}(x) ≠ 0. Let i* be the minimum such i. Then, by Taylor’s theorem, f(z) = f^{(i*)}(x)/i*! · (z − x)^{i*} + o(|z − x|^{i*}).
Consider the case that f^{(i*)}(x) > 0 and i* is odd. Then, since f^{(1)}(z) = f^{(i*)}(x)/(i* − 1)! · (z − x)^{i*−1} + o(|z − x|^{i*−1}) with i* − 1 even, we can choose ε ∈ (0, δ) so that
f^{(1)}(z) > 0 on [x − ε, x) and f^{(1)}(z) > 0 on (x, x + ε],
i.e., f is strictly increasing on [x − ε, x + ε] (e.g., by the mean value theorem), and hence, f(x) ∉ f([x − ε, x + ε] ∖ {x}). One can apply a similar argument to the cases that f^{(i*)}(x) < 0 and i* is odd, f^{(i*)}(x) > 0 and i* is even, and f^{(i*)}(x) < 0 and i* is even. This completes the proof of Lemma 1.    □
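The statement can be checked numerically on a simple analytic function; the following sketch uses f(x) = x³ at x = 0 as an assumed toy instance (here i* = 3, so f is strictly increasing on the whole interval):

```python
# Numeric sanity check of Lemma 1 for the analytic, non-constant
# f(x) = x**3 at x = 0: f(0) is not attained on [-eps, eps] \ {0},
# and f is strictly monotone on each half-interval.
f = lambda x: x ** 3
xs = [(k - 500) / 1000 for k in range(1001)]  # grid on [-0.5, 0.5]
assert all(f(x) != f(0.0) for x in xs if x != 0.0)
assert all(f(a) < f(b) for a, b in zip(xs, xs[1:]))  # strictly increasing
```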
Lemma 2 
(Proposition 0 in [35]). For any n ∈ N, for any open connected O ⊆ R^n, and for any real analytic f : O → R, if μ_n(zero(f)) > 0, then f(x) = 0 for all x ∈ O.
Lemma 3. 
For any n ∈ N, for any open connected O ⊆ R^n, and for any real analytic f, g : O → R, if μ_n(zero(f − g)) > 0, then f(x) = g(x) for all x ∈ O.
Proof of Lemma 3. 
The proof directly follows from applying Lemma 2 to f − g.    □

4.3. Technical Assumptions

Assumption 2. 
Given w ∈ R^n, δ ∈ R^n satisfies the following: for any γ ∈ Γ, i ∈ I_br, and (j, p) ∈ Idx_i, if ϕ^γ_{i,p:j} is not a constant function, then
ϕ^γ_{i,p:j}(w + t · δ) is not a constant function in t ∈ R.
Assumption 3. 
Given w ∈ R^n, δ ∈ R^n satisfies the following: for any γ ∈ Γ, i ∈ I_br, and (j, p) ∈ Idx_i,
⟨ δ, Dϕ^γ_{i,p:j}(w) ⟩ = 0 if and only if Dϕ^γ_{i,p:j}(w) = 0.

4.4. Technical Lemmas

Lemma 4. 
Given w R n , almost every δ R n satisfies Assumption 2.
Proof of Lemma 4. 
Since |Γ| < ∞, if the set of δ that does not satisfy Assumption 2 has non-zero measure, then there exist γ ∈ Γ, i ∈ I_br, and (j, p) ∈ Idx_i such that ϕ^γ_{i,p:j} is not a constant function and
μ_n( S ) > 0, where S ≜ { δ ∈ R^n : ϕ^γ_{i,p:j}(w + t · δ) is a constant function in t ∈ R }.
Without loss of generality, suppose that ϕ^γ_{i,p:j}(w) = 0. Then, from the definition of S, S is contained in the zero set
zero(ϕ̃) ≜ { u ∈ R^n : ϕ̃(u) = 0 }
of the analytic function ϕ̃ : R^n → R defined as
ϕ̃(u) ≜ ϕ^γ_{i,p:j}(w + u).
Namely, μ_n(zero(ϕ̃)) ≥ μ_n( S ) > 0. However, from Lemma 2, ϕ̃ must then be the constant zero function, which contradicts our assumption that ϕ^γ_{i,p:j} is not a constant function. This completes the proof of Lemma 4.    □
Lemma 5. 
Given w R n , almost every δ R n satisfies Assumption 3.
Proof of Lemma 5. 
Since Dϕ^γ_{i,p:j}(w) = 0 implies ⟨ δ, Dϕ^γ_{i,p:j}(w) ⟩ = 0 for all δ ∈ R^n, we prove the converse. Suppose that Dϕ^γ_{i,p:j}(w) ≠ 0. Since the set { δ ∈ R^n : ⟨ δ, Dϕ^γ_{i,p:j}(w) ⟩ = 0 } has zero measure whenever Dϕ^γ_{i,p:j}(w) ≠ 0, the finite union
⋃_{γ ∈ Γ, i ∈ I_br, (j, p) ∈ Idx_i : Dϕ^γ_{i,p:j}(w) ≠ 0} { δ ∈ R^n : ⟨ δ, Dϕ^γ_{i,p:j}(w) ⟩ = 0 }
also has zero measure. This completes the proof of Lemma 5.    □
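As a quick numerical illustration of Lemma 5 (with assumed toy values for the fixed gradient), a Gaussian direction is essentially never orthogonal to a fixed nonzero vector, since the orthogonal complement of that vector is a measure-zero hyperplane:

```python
import random

random.seed(0)
g = [3.0, -1.0, 0.0, 2.0]  # hypothetical nonzero gradient D(phi)(w)
for _ in range(1000):
    # delta ~ isotropic Gaussian; <delta, g> = 0 happens with probability zero
    delta = [random.gauss(0.0, 1.0) for _ in g]
    assert sum(d * gi for d, gi in zip(delta, g)) != 0.0
```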
Lemma 6. 
For i ∈ I_br and p ∈ Γ_i, suppose that x ∈ R^{d_i} satisfies one of the following for all j ∈ [len(p)]:
  • sign( ϕ_{i,p:j}(x) − c_{i,p:j} ) = p_j;
  • ϕ_{i,p:j}(x) = c_{i,p:j}.
Then, x ∈ cl( X_{i,p} ).
Proof of Lemma 6. 
Without loss of generality, assume that x ∉ X_{i,p}. Since X_{i,p} ≠ ∅ by Assumption 1, there exists y ∈ X_{i,p}, i.e., sign( ϕ_{i,p:j}(y) − c_{i,p:j} ) = p_j for all j ∈ [len(p)]. Define
I ≜ { j ∈ [len(p)] : ϕ_{i,p:j}(x) = c_{i,p:j} }.
Since ϕ_{i,p:j} is linear, for z = y − x and for any j ∈ I, we have
sign( ϕ_{i,p:j}(z) ) = p_j.
This implies that, for any t > 0 and j ∈ I, it holds that
sign( ϕ_{i,p:j}(x + t · z) − c_{i,p:j} ) = p_j.
Since |ϕ_{i,p:j}(x) − c_{i,p:j}| > 0 and sign( ϕ_{i,p:j}(x) − c_{i,p:j} ) = p_j for all j ∈ [len(p)] ∖ I by the definition of I, there exists s > 0 such that
|ϕ_{i,p:j}(x + t · z) − c_{i,p:j}| > 0 and sign( ϕ_{i,p:j}(x + t · z) − c_{i,p:j} ) = p_j
for all t ∈ [0, s] and j ∈ [len(p)] ∖ I. Combining Equations (7) and (8) implies that x + t · z ∈ X_{i,p} for all t ∈ (0, s], i.e., x is a limit point of X_{i,p}. This completes the proof of Lemma 6.    □

4.5. Key Lemmas

Lemma 7. 
For any w ∈ R^n and for any δ ∈ R^n satisfying Assumption 2, there exist s* > 0 and γ* ∈ Γ(w, δ) such that
w + t · δ ∈ int( W_{γ*} ) for all t ∈ (0, s*).
Proof of Lemma 7. 
We first define some notations: for an analytic function f : R → R and t ∈ R,
ϕ^{γ,w,δ}_{i,p:j}(t) ≜ ϕ^γ_{i,p:j}(w + t · δ),
dir(f) ≜
  s if s = sup { s′ ≥ 0 : f is strictly increasing on [0, s′] } > 0,
  −s if s = sup { s′ ≥ 0 : f is strictly decreasing on [0, s′] } > 0,
  ⊥ otherwise.
Under Assumption 2 and by Lemma 1, one can observe that, for any i ∈ I_br, p = γ_i, and j ∈ [len(p)], dir( ϕ^{γ,w,δ}_{i,p:j} ) = ⊥ if and only if ϕ^γ_{i,p:j} is a constant function. In addition, from Lemma 1, if ϕ^γ_{i,p:j} is not a constant function, then |dir( ϕ^{γ,w,δ}_{i,p:j} )| > 0. Using Algorithm 4, we iteratively construct γ* ∈ Γ(w, δ) and update s* > 0 for each i ∈ I_br so that
w + t · δ ∈ int( W_{γ*} ) for all t ∈ (0, s*).
Under our construction of γ*, one can observe that γ* ∈ Γ(w, δ). From our choice of s*, for any i ∈ I_br, j ∈ [len(γ*_i)], and for s_{i,j} = dir( ϕ^{γ*,w,δ}_{i,γ*_{i,:j}} ), the following statements hold:
  • If s_{i,j} ≠ ⊥, then ϕ^{γ*,w,δ}_{i,γ*_{i,:j}}((0, s*)) is open since ϕ^{γ*,w,δ}_{i,γ*_{i,:j}} is strictly monotone on (0, s*);
  • If s_{i,j} = ⊥, then ϕ^{γ*,w,δ}_{i,γ*_{i,:j}} and ϕ^{γ*}_{i,γ*_{i,:j}} are constant functions (i.e., ϕ^{γ*,w,δ}_{i,γ*_{i,:j}}((0, s*)) is a single point) due to Assumption 2.
For any t ∈ (0, s*), we have w + t · δ ∈ O ⊆ W_{γ*}, where
O ≜ ⋂_{i,j : s_{i,j} = ⊥} ( ϕ^{γ*}_{i,γ*_{i,:j}} )^{−1}( ϕ^{γ*,w,δ}_{i,γ*_{i,:j}}((0, s*)) ) ∩ ⋂_{i,j : s_{i,j} ≠ ⊥} ( ϕ^{γ*}_{i,γ*_{i,:j}} )^{−1}( ϕ^{γ*,w,δ}_{i,γ*_{i,:j}}((0, s*)) ).
Here, O is open since each term of the intersection above is open: it is R^n if s_{i,j} = ⊥, and it is the inverse image of an open set under a continuous function otherwise. This completes the proof of Lemma 7.    □
Algorithm 4 Construction of γ* and s*
  • Input: P, ( w_1, …, w_n ), ( δ_1, …, δ_n )
  • Initialize: ( v_1, …, v_n ) = ( w_1, …, w_n ), dv_i = δ_i for all i ∈ [n], s* = ∞, γ* = ( (), …, () )
  • for i = n + 1, …, n + m do
  •    Let x = v_{pa(i)}, dx = dv_{pa(i)}, and p = ()
  •     while p ∉ Γ_i do
  •      Set y = ϕ_{i,p}(x) and s = dir( ϕ^{γ*,w,δ}_{i,p} )
  •       if s = ⊥ ∧ y > c_{i,p} then
  •          p = p ⌢ (1)
  •       else if s = ⊥ ∧ y ≤ c_{i,p} then
  •          p = p ⌢ (−1)
  •       else if s ≠ ⊥ ∧ y > c_{i,p} then
  •          p = p ⌢ (1) and ε = min{ |ϕ^{γ*,w,δ}_{i,p}(|s|) − y|, y − c_{i,p} }
  •          { s′ } = ( ϕ^{γ*,w,δ}_{i,p} )^{−1}( y + sign(s) · ε ) ∩ [0, |s|] and s* = min{ s*, s′ }
  •       else if s ≠ ⊥ ∧ y < c_{i,p} then
  •          p = p ⌢ (−1) and ε = min{ |ϕ^{γ*,w,δ}_{i,p}(|s|) − y|, c_{i,p} − y }
  •          { s′ } = ( ϕ^{γ*,w,δ}_{i,p} )^{−1}( y + sign(s) · ε ) ∩ [0, |s|] and s* = min{ s*, s′ }
  •       else if s ≠ ⊥ ∧ y = c_{i,p} then
  •          p = p ⌢ ( sign(s) ) and s* = min{ s*, |s| }
  •       end if
  •      Set γ*_i = p and v_i = f_{i,γ*_i}(x)
  •     end while
  • end for
  • return γ*, s*
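The tie-breaking rule at the heart of this construction can be sketched for a single absolute-value primitive; the helper names below are hypothetical and not part of Algorithm 4:

```python
def select_piece(y, dy, c=0.0):
    # Piece selection mirroring the construction of gamma*: off the
    # boundary, the value y decides; at the tie y == c, the directional
    # derivative dy decides (when dy == 0, either piece is admissible,
    # matching the set C_{i,j} defined in the text).
    return 1 if (y > c or (y == c and dy > 0)) else -1

def abs_forward(x, dx):
    # Forward pass of a toy |x| primitive (linear branch phi(x) = x,
    # threshold c = 0): returns the value, the directional derivative,
    # and the piece taken along x + t*dx for small t > 0.
    p = select_piece(x, dx)
    return p * x, p * dx, p

v, dv, p = abs_forward(0.0, -2.0)  # at the kink, moving left
assert (v, dv, p) == (0.0, 2.0, -1)
```

The selected piece stays valid for all sufficiently small t > 0, which is exactly what the choice of s* guarantees.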
Corollary 1. 
For any w ∈ R^n and for any δ ∈ R^n satisfying Assumption 2, there exist s* > 0 and γ* ∈ Γ(w, δ) such that
D v_{n+m}(w + t · δ) = D v_{n+m,γ*}(w + t · δ) for all t ∈ (0, s*).
Proof of Corollary 1. 
This corollary directly follows from Lemma 7.    □
Lemma 8. 
For any w ∈ R^n and for any δ satisfying Assumption 3, it holds that
D v_{n+m,γ}(w) = D v_{n+m,γ′}(w) for all γ, γ′ ∈ Γ(w, δ).
Proof of Lemma 8. 
We use mathematical induction on i to show that v_{i,γ}(w) = v_{i,γ′}(w) and D v_{i,γ}(w) = D v_{i,γ′}(w) for all i ∈ [n+m] and γ, γ′ ∈ Γ(w, δ). The base case is trivial: v_{i,γ}(w) = v_{i,γ′}(w) and D v_{i,γ}(w) = D v_{i,γ′}(w) for all i ∈ [n]. Hence, suppose that i ∈ I_br, since the case that i ∈ [n+m] ∖ ([n] ∪ I_br) is also trivial. Then, by the induction hypothesis, we have v_{j,γ}(w) = v_{j,γ′}(w) and D v_{j,γ}(w) = D v_{j,γ′}(w) for all j ∈ [i−1]. For notational simplicity, we denote x_j ≜ v_{j,γ}(w) = v_{j,γ′}(w) and dx_j ≜ dv_{j,γ}(w; δ) = dv_{j,γ′}(w; δ) for all j ∈ [i−1].
Let p = γ_i and p′ = γ′_i. First, by Lemma 6, the definition of Γ(w, δ), and the induction hypothesis, we have
x_{pa(i)} ∈ cl( X_{i,p} ) ∩ cl( X_{i,p′} ).
Due to the continuity of F_i, this implies that
v_{i,γ}(w) = f_{i,p}( x_{pa(i)} ) = f_{i,p′}( x_{pa(i)} ) = v_{i,γ′}(w).
Now, it remains to show D v_{i,γ}(w) = D v_{i,γ′}(w). To this end, we define the following:
I ≜ { (j, p) ∈ Idx_i : ϕ_{i,p:j}( dx_{pa(i)} ) = 0 },  S ≜ { x ∈ R^{d_i} : ϕ_{i,p:j}(x) = 0 for all (j, p) ∈ I }.
From the definitions of S and I, dx_{pa(i)} ∈ S. Furthermore, by Assumption 3, for any (j, p) ∈ Idx_i and γ″ ∈ {γ, γ′}, we have ϕ_{i,p:j}( dv_{pa(i),γ″}(w; δ) ) = 0 if and only if Dϕ_{i,p:j}( v_{pa(i),γ″}(w) ) = 0, i.e., ∂ϕ_{i,p:j}( v_{pa(i),γ″}(w) )/∂w_ℓ = ϕ_{i,p:j}( ∂v_{pa(i),γ″}(w)/∂w_ℓ ) = 0 for all ℓ ∈ [n]. Therefore, since dx_{pa(i)} ∈ S, it holds that
∂v_{pa(i),γ}(w)/∂w_ℓ, ∂v_{pa(i),γ′}(w)/∂w_ℓ ∈ S for all ℓ ∈ [n].
In addition, due to the identities
D v_{i,γ}(w) = D f_{i,p}( v_{pa(i),γ}(w) ),  D v_{i,γ′}(w) = D f_{i,p′}( v_{pa(i),γ′}(w) ),
∂f_{i,p}( v_{pa(i),γ}(w) )/∂w_ℓ = ⟨ ∇f_{i,p}( x_{pa(i)} ), ∂v_{pa(i),γ}(w)/∂w_ℓ ⟩,
∂f_{i,p′}( v_{pa(i),γ′}(w) )/∂w_ℓ = ⟨ ∇f_{i,p′}( x_{pa(i)} ), ∂v_{pa(i),γ′}(w)/∂w_ℓ ⟩,
showing the following stronger statement suffices for proving Lemma 8:
⟨ ∇f_{i,p}( x_{pa(i)} ), z ⟩ = ⟨ ∇f_{i,p′}( x_{pa(i)} ), z ⟩ for all z ∈ S.
By Lemma 1 and the induction hypothesis ( dv_{pa(i),γ} = dv_{pa(i),γ′} ), there exists s > 0 such that, for any t ∈ (0, s) and (j, p) ∈ Idx_i and for z_t = x_{pa(i)} + t · dx_{pa(i)},
ϕ_{i,p:j}(z_t) > c_{i,p:j} if ( ϕ_{i,p:j}(x_{pa(i)}) = c_{i,p:j} ∧ ϕ_{i,p:j}(dx_{pa(i)}) > 0 ) ∨ ( ϕ_{i,p:j}(x_{pa(i)}) > c_{i,p:j} ),
ϕ_{i,p:j}(z_t) < c_{i,p:j} if ( ϕ_{i,p:j}(x_{pa(i)}) = c_{i,p:j} ∧ ϕ_{i,p:j}(dx_{pa(i)}) < 0 ) ∨ ( ϕ_{i,p:j}(x_{pa(i)}) < c_{i,p:j} ),
ϕ_{i,p:j}(z_t) = c_{i,p:j} if ( ϕ_{i,p:j}(x_{pa(i)}) = c_{i,p:j} ∧ ϕ_{i,p:j}(dx_{pa(i)}) = 0 ).
Since each ϕ_{i,p:j} is linear (i.e., continuous) and by the definition of S, for each t ∈ (0, s), there exists an open neighborhood O_t ⊆ S (open in S) of z_t such that, for any z ∈ O_t and (j, p) ∈ Idx_i,
ϕ_{i,p:j}(z) > c_{i,p:j} if ( ϕ_{i,p:j}(x_{pa(i)}) = c_{i,p:j} ∧ ϕ_{i,p:j}(dx_{pa(i)}) > 0 ) ∨ ( ϕ_{i,p:j}(x_{pa(i)}) > c_{i,p:j} ),
ϕ_{i,p:j}(z) < c_{i,p:j} if ( ϕ_{i,p:j}(x_{pa(i)}) = c_{i,p:j} ∧ ϕ_{i,p:j}(dx_{pa(i)}) < 0 ) ∨ ( ϕ_{i,p:j}(x_{pa(i)}) < c_{i,p:j} ),
ϕ_{i,p:j}(z) = c_{i,p:j} if ( ϕ_{i,p:j}(x_{pa(i)}) = c_{i,p:j} ∧ ϕ_{i,p:j}(dx_{pa(i)}) = 0 ).
Here, we claim that, for any t ∈ (0, s),
O_t ⊆ cl( X_{i,p} ) ∩ cl( X_{i,p′} ) ∩ S ≜ B and x_{pa(i)} ∈ B.
First, from the definition of O_t, we have O_t ⊆ S. In addition, by Equation (10) and the definition of Γ(w, δ), either sign( ϕ_{i,p:j}(z) − c_{i,p:j} ) = p_j or ϕ_{i,p:j}(z) = c_{i,p:j} for all j ∈ [len(p)] and z ∈ O_t; the same argument also holds for p′. Hence, by Lemma 6, the LHS of Equation (11) holds. Likewise, we have x_{pa(i)} ∈ B.
Due to the continuity of F_i, Equation (11) implies that, for any t ∈ (0, s),
f_{i,p} = f_{i,p′} on O_t,
i.e., ⟨ ∇f_{i,p}(z_t), z ⟩ = ⟨ ∇f_{i,p′}(z_t), z ⟩ for all z ∈ S and t ∈ (0, s). Here, f_{i,p} and f_{i,p′} are differentiable at z_t by Equation (11) and Assumption 1. Due to the analyticity of f_{i,p} and f_{i,p′}, this implies that, for any z ∈ S, we have
⟨ ∇f_{i,p}( x_{pa(i)} ), z ⟩ = lim_{t→0} ⟨ ∇f_{i,p}(z_t), z ⟩ = lim_{t→0} ⟨ ∇f_{i,p′}(z_t), z ⟩ = ⟨ ∇f_{i,p′}( x_{pa(i)} ), z ⟩,
where f_{i,p} and f_{i,p′} are differentiable at x_{pa(i)} by Equation (11) and Assumption 1. This proves Equation (9) and completes the proof of Lemma 8.    □
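The conclusion of the proof can be illustrated on an assumed toy instance with two affine pieces meeting along a linear boundary:

```python
# A toy instance of Equation (9) in the proof of Lemma 8: the pieces
# f_p(x) = x1 + x2 and f_p'(x) = x2 agree on the linear boundary
# phi(x) = x1 = 0 (hypothetical example). Their gradients (1, 1) and
# (0, 1) differ, yet they produce identical directional derivatives
# against every direction z in S = { z : z1 = 0 }.
grad_p = (1.0, 1.0)   # gradient of f_p
grad_pp = (0.0, 1.0)  # gradient of f_p'
for z2 in (-3.0, 0.0, 0.5, 7.0):
    z = (0.0, z2)  # a direction in S
    lhs = sum(g * zi for g, zi in zip(grad_p, z))
    rhs = sum(g * zi for g, zi in zip(grad_pp, z))
    assert lhs == rhs
```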

4.6. Proof of Theorem 3

By Lemmas 4 and 5, almost every δ ∈ R^n satisfies Assumptions 2 and 3. For any such δ, combining Corollary 1 and Lemma 8 completes the proof of Theorem 3.

5. Proof of Theorem 2

Here, we analyze the computational costs based on the program representations in Figure 4 and Figure 5. For brevity, we write A2 and A3 as shorthand for Algorithm 2 and Algorithm 3, respectively. Under these setups, our cost analysis for P(w) and A2(P, A3(P, w)) = ours(P, w) is as follows, for γ ∈ Γ such that γ = ( (), …, (), γ_{n+1}, …, γ_{n+m} ), where ( γ_{n+1}, …, γ_{n+m} ) and v(w) = ( v_1(w), …, v_{n+m}(w) ) are the outputs of A3(P, w) and w ∈ W_γ:
cost( P(w) ) = Σ_{i=n+1}^{n+m} [ cost( f_{i,γ_i} ) + len(γ_i) + Σ_{j ∈ [len(γ_i)]} cost( ϕ_{i,γ_i,:j} ) ],
cost( A2(P, v(w), γ) ) = Σ_{i=n+1}^{n+m} [ d_i · ( cost(+) + cost(×) ) + cost( ∇f_{i,γ_i} ) ] ≤ Σ_{i=n+1}^{n+m} [ 2 d_i + cost( ∇f_{i,γ_i} ) ],
cost( A3(P, w) ) ≤ Σ_{i=n+1}^{n+m} [ max{d_i − 1, 0} · cost(+) + d_i · cost(×) + cost( f_{i,γ_i} ) + cost( ∇f_{i,γ_i} ) + Σ_{j ∈ [len(γ_i)]} ( 2 cost( ϕ_{i,γ_i,:j} ) + 4 cost(>) ) ]
  ≤ Σ_{i=n+1}^{n+m} [ 2 d_i + 4 len(γ_i) + cost( f_{i,γ_i} ) + cost( ∇f_{i,γ_i} ) + 2 Σ_{j ∈ [len(γ_i)]} cost( ϕ_{i,γ_i,:j} ) ],
cost( ours(P, w) ) ≤ Σ_{i=1}^{n} cost( sample δ_i ) + cost( A2(P, v(w), γ) ) + cost( A3(P, w) )
  ≤ n + Σ_{i=n+1}^{n+m} [ 4 d_i + 4 len(γ_i) + cost( f_{i,γ_i} ) + 2 cost( ∇f_{i,γ_i} ) + 2 Σ_{j ∈ [len(γ_i)]} cost( ϕ_{i,γ_i,:j} ) ].
This implies the following:
cost( ours(P, w) ) / cost( P(w) ) ≤ ( n + Σ_{i=n+1}^{n+m} [ 4 d_i + 4 len(γ_i) + cost( f_{i,γ_i} ) + 2 cost( ∇f_{i,γ_i} ) + 2 Σ_{j ∈ [len(γ_i)]} cost( ϕ_{i,γ_i,:j} ) ] ) / cost( P(w) ) ≤ 1 + max_{i ∈ [n+m] ∖ [n]} κ_i = κ,
where the first inequality is from the bound above and the second inequality follows from the definition of κ_i and the assumption that cost( P(w) ) ≥ n. This completes the proof.
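Under the (assumed) convention that +, ×, >, and sampling each have unit cost, and reading κ_i as the per-primitive ratio suggested by the bound above (its exact definition is given earlier in the paper, so treat this as an illustrative reconstruction), the overhead for a ReLU-like primitive is a small dimension-independent constant:

```python
def kappa_i(d, length, cost_f, cost_grad_f, cost_phis):
    # Per-primitive overhead ratio suggested by the displayed bound,
    # with unit costs assumed for +, x, >, and sampling (hypothetical
    # reconstruction of kappa_i, not the paper's exact definition).
    num = 4 * d + 4 * length + cost_f + 2 * cost_grad_f + 2 * sum(cost_phis)
    den = cost_f + length + sum(cost_phis)
    return num / den

# A ReLU-like primitive: arity d = 1, one linear branch, unit-cost pieces.
k = kappa_i(d=1, length=1, cost_f=1, cost_grad_f=1, cost_phis=[1])
assert k == 13 / 3  # a constant independent of the input dimension n
```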

6. Conclusions

In this work, we proposed an efficient subdifferentiation algorithm for computing an element of the Clarke subdifferential of programs with linear branches. In particular, we generalized the existing algorithm in [14] and extended its application from polynomials to analytic functions. The computational cost of our algorithm is at most that of the function evaluation multiplied by an input-dimension-independent factor, for primitive functions whose arities and maximum depths of branches are independent of the input dimension. We believe that extending our algorithm to general functions (e.g., continuously differentiable functions), general branches (e.g., nonlinear branches), and general programs (e.g., programs with loops) will be an important future research direction.

Funding

This research was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00079, Artificial Intelligence Graduate School Program, Korea University) and Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2022R1F1A1076180).

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. TensorFlow: A System for Large-Scale Machine Learning. In Proceedings of the Symposium on Operating Systems Design and Implementation (OSDI), Savannah, GA, USA, 2–4 November 2016; pp. 265–283. [Google Scholar]
  2. Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic differentiation in PyTorch. In Proceedings of the NIPS Autodiff Workshop. 2017. Available online: https://openreview.net/forum?id=BJJsrmfCZ (accessed on 1 November 2023).
  3. Frostig, R.; Johnson, M.; Leary, C. Compiling machine learning programs via high-level tracing. In Proceedings of the SysML Conference, Stanford, CA, USA, 15–16 February 2018; Volume 4. [Google Scholar]
  4. Speelpenning, B. Compiling Fast Partial Derivatives of Functions Given by Algorithms; University of Illinois at Urbana-Champaign: Champaign, IL, USA, 1980. [Google Scholar]
  5. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), Lake Tahoe, NV, USA, 3–6 December 2012; pp. 84–90. [Google Scholar]
  6. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  7. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
  8. Griewank, A.; Walther, A. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, 2nd ed.; SIAM: Philadelphia, PA, USA, 2008. [Google Scholar]
  9. Pearlmutter, B.A.; Siskind, J.M. Reverse-mode AD in a functional framework: Lambda the ultimate backpropagator. ACM Trans. Program. Lang. Syst. 2008, 30, 1–36. [Google Scholar] [CrossRef]
  10. Baur, W.; Strassen, V. The complexity of partial derivatives. Theor. Comput. Sci. 1983, 22, 317–330. [Google Scholar] [CrossRef]
  11. Griewank, A. On automatic differentiation. Math. Program. Recent Dev. Appl. 1989, 6, 83–107. [Google Scholar]
  12. Bolte, J.; Boustany, R.; Pauwels, E.; Pesquet-Popescu, B. Nonsmooth automatic differentiation: A cheap gradient principle and other complexity results. arXiv 2022, arXiv:2206.01730. [Google Scholar]
  13. Griewank, A. Who invented the reverse mode of differentiation? Doc. Math. Extra Vol. ISMP 2012, 389–400. [Google Scholar]
  14. Kakade, S.M.; Lee, J.D. Provably Correct Automatic Sub-Differentiation for Qualified Programs. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), Montréal, QC, Canada, 3–8 December 2018; pp. 7125–7135. [Google Scholar]
  15. Lee, W.; Yu, H.; Rival, X.; Yang, H. On Correctness of Automatic Differentiation for Non-Differentiable Functions. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), Virtual, 6–12 December 2020; pp. 6719–6730. [Google Scholar]
  16. Nesterov, Y. Lexicographic differentiation of nonsmooth functions. Math. Program. 2005, 104, 669–700. [Google Scholar] [CrossRef]
  17. Khan, K.A.; Barton, P.I. Evaluating an element of the Clarke generalized Jacobian of a composite piecewise differentiable function. ACM Trans. Math. Softw. 2013, 39, 1–28. [Google Scholar]
  18. Khan, K.A.; Barton, P.I. A vector forward mode of automatic differentiation for generalized derivative evaluation. Optim. Methods Softw. 2015, 30, 1185–1212. [Google Scholar] [CrossRef]
  19. Barton, P.I.; Khan, K.A.; Stechlinski, P.; Watson, H.A.J. Computationally relevant generalized derivatives: Theory, evaluation and applications. Optim. Methods Softw. 2018, 33, 1030–1072. [Google Scholar] [CrossRef]
  20. Khan, K.A. Branch-locking AD techniques for nonsmooth composite functions and nonsmooth implicit functions. Optim. Methods Softw. 2018, 33, 1127–1155. [Google Scholar] [CrossRef]
  21. Griewank, A. Automatic directional differentiation of nonsmooth composite functions. In Proceedings of the Recent Developments in Optimization: Seventh French-German Conference on Optimization, Dijon, France, 27 June–2 July 1995; Springer: Berlin/Heidelberg, Germany, 1995; pp. 155–169. [Google Scholar]
  22. Sahlodin, A.M.; Barton, P.I. Optimal campaign continuous manufacturing. Ind. Eng. Chem. Res. 2015, 54, 11344–11359. [Google Scholar] [CrossRef]
  23. Sahlodin, A.M.; Watson, H.A.; Barton, P.I. Nonsmooth model for dynamic simulation of phase changes. AIChE J. 2016, 62, 3334–3351. [Google Scholar] [CrossRef]
  24. Hanin, B. Universal function approximation by deep neural nets with bounded width and relu activations. Mathematics 2019, 7, 992. [Google Scholar] [CrossRef]
  25. Alghamdi, H.; Hafeez, G.; Ali, S.; Ullah, S.; Khan, M.I.; Murawwat, S.; Hua, L.G. An Integrated Model of Deep Learning and Heuristic Algorithm for Load Forecasting in Smart Grid. Mathematics 2023, 11, 4561. [Google Scholar] [CrossRef]
  26. Boyd, S.P.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
  27. Rehman, H.U.; Kumam, P.; Argyros, I.K.; Shutaywi, M.; Shah, Z. Optimization based methods for solving the equilibrium problems with applications in variational inequality problems and solution of Nash equilibrium models. Mathematics 2020, 8, 822. [Google Scholar] [CrossRef]
  28. Davis, D.; Drusvyatskiy, D.; Kakade, S.; Lee, J.D. Stochastic subgradient method converges on tame functions. Found. Comput. Math. 2020, 20, 119–154. [Google Scholar] [CrossRef]
  29. Bolte, J.; Pauwels, E. Conservative set valued fields, automatic differentiation, stochastic gradient methods and deep learning. Math. Program. 2021, 188, 19–51. [Google Scholar] [CrossRef]
  30. Scholtes, S. Introduction to Piecewise Differentiable Equations; Springer: Berlin/Heidelberg, Germany, 2012. [Google Scholar]
  31. Griewank, A.; Bernt, J.U.; Radons, M.; Streubel, T. Solving piecewise linear systems in abs-normal form. Linear Algebra Its Appl. 2015, 471, 500–530. [Google Scholar] [CrossRef]
  32. Bolte, J.; Pauwels, E. A mathematical model for automatic differentiation in machine learning. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), Online, 6–12 December 2020; pp. 10809–10819. [Google Scholar]
  33. Lee, W.; Park, S.; Aiken, A. On the Correctness of Automatic Differentiation for Neural Networks with Machine-Representable Parameters. arXiv 2023, arXiv:2301.13370. [Google Scholar]
  34. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations (ICLR), Banff, AB, Canada, 14–16 April 2014. [Google Scholar]
  35. Mityagin, B. The zero set of a real analytic function. arXiv 2015, arXiv:1512.07276. [Google Scholar] [CrossRef]
Figure 1. Definitions of a program P (left) and primitive functions F n + 1 , , F n + m (right).
Figure 2. A flow chart illustrating a primitive function F i .
Figure 3. Example code for the max function F i : ( x 1 , x 2 , x 3 ) max { x 1 , x 2 , x 3 } .
Figure 4. A program P ADB implementing the backward pass of AD (Algorithm 2).
Figure 5. A program P ours implementing Algorithm 3. Here, x 1 : d i = ( x 1 , , x d i ) .

Park, S. Efficient Automatic Subdifferentiation for Programs with Linear Branches. Mathematics 2023, 11, 4858. https://doi.org/10.3390/math11234858
