Robust learning with implicit residual networks

In this effort we propose a new deep architecture utilizing residual blocks inspired by implicit discretization schemes. As opposed to the standard feed-forward networks, the outputs of the proposed implicit residual blocks are defined as the fixed points of the appropriately chosen nonlinear transformations. We show that this choice leads to improved stability of both forward and backward propagations, has a favorable impact on the generalization power of the network, and allows for higher learning rates. In addition, we consider a reformulation of ResNet which does not introduce new parameters and can potentially lead to a reduction in the number of required layers due to improved forward stability and robustness. Finally, we derive a memory-efficient reversible training algorithm and provide numerical results in support of our findings.


Introduction and related works
A large volume of empirical results has been collected in recent years illustrating the striking success of deep neural networks (DNNs) in approximating complicated maps by a mere composition of relatively simple functions [15]. The universal approximation property of DNNs with a relatively small number of parameters has also been shown for a large class of functions [11,16]. The training of deep networks nevertheless remains a notoriously difficult task due to the issues of exploding and vanishing gradients, which become more apparent with increasing depth [1]. These issues accelerated the efforts of the research community to explain this behavior and to gain new insights into the design of better architectures and faster algorithms. A promising approach in this direction was obtained by casting the evolution of the hidden states x_t ∈ X_t of a DNN as a dynamical system [4], i.e.,

x_t = Φ_t(γ_t, x_{t−1}),   t = 1, ..., T,

where, for each layer t, Φ_t : Γ_t × X_{t−1} → X_t is a nonlinear transformation parameterized by the weights γ_t ∈ Γ_t, and X_t, Γ_t are appropriately chosen spaces. In the case of a very deep network, when T → ∞, it is convenient to consider the continuous-time limit of the above expression,

ẋ(t) = Φ(γ(t), x(t)),

where the parametric evolution function Φ : Γ × X → X defines a continuous flow through the input data x(0) = x_0 ∈ X. Parameter estimation for such a continuous evolution can be viewed as an optimal control problem [5], given by

min_γ E_{μ_0} [ L(x(T), y_0) + ∫_0^T R(γ(t), x(t)) dt ],   (1)

subject to

ẋ(t) = Φ(γ(t), x(t)),   x(0) = x_0,   (2)

where L(x(T), y_0) is a terminal loss function, R(γ(t), x(t)) is a regularizer, and μ_0 is a probability distribution of the input–target data pairs (x(0), y_0). More general models additionally consider spatially continuous networks by using differential [17] or integral formulations [18].

Preprint. Under review. arXiv:1905.10479v1 [cs.LG] 24 May 2019
A continuous-time formulation based on ordinary differential equations (ODEs) was proposed in [3] with the state equation (2) of the form

ẋ(t) = F(γ(t), x(t)).   (3)

In [3], the authors relied on black-box ODE solvers and used adjoint sensitivity analysis (see, e.g., [19] for an introduction to adjoint methods) to derive equations for the backpropagation of errors through the continuous system.
The authors of [8] concentrated on the well-posedness of the learning problem for ODE-constrained control and emphasized the importance of stability in the design of deep architectures. For instance, the solution of a homogeneous linear ODE with constant coefficients,

ẋ(t) = A x(t),

is given by

x(t) = Q e^{Λt} Q^T x(0),

where A = QΛQ^T is the eigen-decomposition of the matrix A, and Λ is the diagonal matrix of the corresponding eigenvalues. A similar equation holds for the backpropagation of gradients. To guarantee the efficient propagation of information through the network, one must ensure that the elements of e^{Λt} have magnitudes close to one. This condition is satisfied when all eigenvalues of the matrix A have real parts close to zero. In order to preserve this property, the authors of [8] proposed several time-continuous architectures of the form

ẏ(t) = f(γ(t); y(t), z(t)),   ż(t) = g(γ(t); y(t), z(t)).   (4)
When f(y, z) = ∇_z H(y, z) and g(y, z) = −∇_y H(y, z), the equations above provide an example of a conservative Hamiltonian system with total energy H.
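The eigenvalue condition above can be illustrated numerically. The sketch below uses an illustrative skew-symmetric matrix A with purely imaginary spectrum (an assumption for the demo, not a matrix from the text) and checks that the flow of ẋ = Ax neither amplifies nor damps the state:

```python
import numpy as np

# Illustrative skew-symmetric matrix: eigenvalues are purely imaginary (±2i),
# so the elements of e^{Λt} have magnitude exactly one.
A = np.array([[0.0, 2.0],
              [-2.0, 0.0]])
lam, Q = np.linalg.eig(A)            # A = Q diag(lam) Q^{-1}

def flow(x0, t):
    # x(t) = Q e^{Λt} Q^{-1} x0; the result is real up to roundoff
    return (Q @ np.diag(np.exp(lam * t)) @ np.linalg.inv(Q) @ x0).real

x0 = np.array([1.0, 0.0])
for t in (1.0, 10.0, 100.0):
    # the norm of x(t) stays equal to the norm of x0 for all t
    print(t, np.linalg.norm(flow(x0, t)))
```

For a matrix with eigenvalues off the imaginary axis, the same computation would show exponential growth or decay of the state norm.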
In the discrete setting of ordinary feed-forward networks, the necessary conditions for the optimal solution of (1)–(2) recover the well-known equations for the forward propagation (state equation (2)), backward gradient propagation (co-state equation), and the optimality condition to compute the weights (gradient descent algorithm); see, e.g., [14]. The continuous setting offers additional flexibility in the construction of discrete networks with the desired properties and efficient learning algorithms. Classical feed-forward networks (Figure 1, left) are just the particular and simplest example of such a discretization, which is prone to all the issues of deep learning. In order to facilitate the training process, a skip-connection is often added to the network (Figure 1, middle), yielding

x_t = x_{t−1} + h F(γ_t, x_{t−1}),   (5)

where h is a positive hyperparameter. Equation (5) can be viewed as a forward Euler scheme to solve the ODE in (3) numerically on a time grid with step size h. While it was shown that such residual layers help to mitigate the problem of vanishing gradients and speed up the training process [12], the scheme has very restrictive stability properties [10]. This can result in the uncontrolled accumulation of errors at the inference stage, reducing the generalization ability of the trained network. Moreover, the Euler scheme is not capable of preserving the geometric structure of conservative flows and is thus a bad choice for the long-time integration of such ODEs [9]. In other words, the residual blocks in (5) are not well suited for very deep networks.
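To make the connection concrete, a stack of residual blocks with skip-connections is literally a sequence of forward Euler steps. A minimal sketch with a hypothetical layer function F (a fixed tanh map chosen for illustration only):

```python
import numpy as np

# Hypothetical layer map F(x) = tanh(W x) with fixed weights (illustration only).
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4)) / 4.0
F = lambda x: np.tanh(W @ x)

def resnet_forward(x, h=0.1, depth=20):
    # Composing residual blocks x <- x + h*F(x) is forward Euler integration
    # of x' = F(x) with step size h over `depth` steps.
    for _ in range(depth):
        x = x + h * F(x)
    return x

x0 = rng.normal(size=4)
print(resnet_forward(x0))
```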
Memory-efficient explicit reversible architectures can be obtained by considering time discretizations of the partitioned system of ODEs in (4). The reversibility property allows one to recover the internal states of the system by propagating through the network in both directions and thus does not require one to cache these values for the evaluation of the gradients. Such an architecture (RevNet) was first proposed in [6], without using a connection to discrete solutions of ODEs; it has the form

y_n = y_{n−1} + h f(γ_n, z_{n−1}),   z_n = z_{n−1} + h g(γ_n, y_n).

It was later recognized as the Verlet method applied to a particular form of the system in (4); see [8,2]. The leapfrog and midpoint networks are two other examples of reversible architectures proposed in [2].
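The reversibility of this coupled update is easy to demonstrate: each half of the state is updated using only the other half, so a step can be undone exactly without caching intermediate activations. A sketch with hypothetical maps f and g (illustration only):

```python
import numpy as np

# Hypothetical coupling maps for the two half-states (illustration only).
f = lambda z: np.sin(z)
g = lambda y: np.tanh(y)
h = 0.1

def step(y, z):
    # forward pass of one reversible block: z is updated with the *new* y
    y_new = y + h * f(z)
    z_new = z + h * g(y_new)
    return y_new, z_new

def inverse_step(y_new, z_new):
    # exact inversion: undo the updates in reverse order
    z = z_new - h * g(y_new)
    y = y_new - h * f(z)
    return y, z

y0, z0 = np.array([0.3, -0.7]), np.array([1.1, 0.4])
y1, z1 = step(y0, z0)
yr, zr = inverse_step(y1, z1)
print(np.allclose(yr, y0), np.allclose(zr, z0))
```

During backpropagation, the hidden states can therefore be regenerated on the fly from the outputs, which is what makes the O(1) activation memory of such architectures possible.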
Other residual architectures can also be found in the literature, including ResNet in ResNet (RiR) [20], Dense Convolutional Network (DenseNet) [13], and the linearly implicit network (IMEXNet) [7]. For some problems, all of these networks show a substantial improvement over the classical ResNet, but they still have an explicit structure, which has limited robustness to perturbations of the input data and parameters of the network. Instead, in this effort we propose a new fully implicit residual architecture which, unlike the above-mentioned examples, is unconditionally stable and robust.
As opposed to standard feed-forward networks, the outputs of the proposed implicit residual blocks are defined as the fixed points of appropriately chosen nonlinear transformations as follows:

y = x + θ F(γ, y) + (1 − θ) F(γ, x).

The right part of Figure 1 provides a graphical illustration of the proposed layer. The choice of the nonlinear transformation F and the design of the learning algorithm are discussed in the next section.

Description of the method
We first motivate the necessity for our new method by letting the continuous model of a network be given by the ordinary differential equations in (4), that is,

ẏ(t) = f(γ(t); y(t), z(t)),   ż(t) = g(γ(t); y(t), z(t)).

An s-stage Runge–Kutta method for the approximate solution of the above equations is given by

Y_i = y_{n−1} + h Σ_{j=1}^{s} a_{ij} f(γ(t_{n−1} + c_j h); Y_j, Z_j),
Z_i = z_{n−1} + h Σ_{j=1}^{s} â_{ij} g(γ(t_{n−1} + ĉ_j h); Y_j, Z_j),   i = 1, ..., s,

y_n = y_{n−1} + h Σ_{i=1}^{s} b_i f(γ(t_{n−1} + c_i h); Y_i, Z_i),
z_n = z_{n−1} + h Σ_{i=1}^{s} b̂_i g(γ(t_{n−1} + ĉ_i h); Y_i, Z_i).

The order conditions for the coefficients a_{ij}, b_i, c_i, â_{ij}, b̂_i, and ĉ_i, which guarantee convergence of the numerical solution, are well known and can be found in any topical text; see, e.g., [10]. Note that when a_{ij} ≠ 0 or â_{ij} ≠ 0 for at least some j ≥ i, the scheme is implicit, and a system of nonlinear equations has to be solved at each iteration, which obviously increases the complexity of the solver. Nevertheless, the following example illustrates the benefits of using implicit approximations.
Linear stability analysis. Consider the linear differential system

ẏ(t) = −ω z(t),   ż(t) = ω y(t),   (6)

and four simple discretization schemes: the forward Euler, backward Euler, trapezoidal, and Verlet methods. Due to the linearity of the system in (6), we can write the generated numerical solutions as

(y_n, z_n)^T = A(hω)^n (y_0, z_0)^T.

The long-time behavior of the discrete dynamics is hence determined by the spectral radius of the matrix A(hω), which needs to be less than or equal to one for the sake of stability. For example, we have λ_{1,2} = 1 ± ihω for the forward Euler scheme, and the method is unconditionally unstable. The backward Euler scheme gives λ_{1,2} = (1 ± ihω)^{−1}, and the method is unconditionally stable. The corresponding eigenvalues of the trapezoidal scheme have magnitude equal to one for all ω and h. Finally, the characteristic polynomial of the matrix of the Verlet scheme is λ² − (2 − h²ω²)λ + 1, i.e., the method is only conditionally stable, when |hω| ≤ 2. Notice that the flows of the forward and backward Euler schemes are strictly expanding and contracting, respectively, which makes the training process inherently ill-posed as the dynamics are not easily invertible. In contrast, the implicit trapezoidal and explicit Verlet schemes reproduce the original flow very well, although the latter only conditionally on the size of the step h. Another nice property of the trapezoidal and Verlet schemes is their symmetry with respect to the exchange y_n ↔ y_{n−1} and z_n ↔ z_{n−1}. Such methods play a central role in the geometric integration of reversible differential flows and are handy in the construction of memory-efficient reversible network architectures. Conditions for the reversibility of general Runge–Kutta schemes can be found in [9].
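These stability claims can be verified directly by computing the spectral radius of the update matrix A(hω) for each scheme; a small illustrative check:

```python
import numpy as np

def update_matrices(hw):
    # Update matrices A(hω) for y' = -ω z, z' = ω y; S = h * (system matrix).
    S = np.array([[0.0, -hw], [hw, 0.0]])
    I = np.eye(2)
    fe = I + S                                   # forward Euler
    be = np.linalg.inv(I - S)                    # backward Euler
    tr = np.linalg.inv(I - S / 2) @ (I + S / 2)  # trapezoidal
    vl = np.array([[1.0, -hw],
                   [hw, 1.0 - hw**2]])           # Verlet
    return fe, be, tr, vl

rho = lambda A: max(abs(np.linalg.eigvals(A)))   # spectral radius

for name, A in zip(("forward Euler", "backward Euler", "trapezoidal", "Verlet"),
                   update_matrices(0.5)):
    print(name, rho(A))
```

For |hω| = 0.5 the forward Euler radius exceeds one (expansion), backward Euler is below one (contraction), and both the trapezoidal and Verlet radii equal one, in agreement with the analysis above.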

Implicit ResNet.
Motivated by the discussion above, we propose an implicit variant of the residual networks given by

y = x + θ F(γ, y) + (1 − θ) F(γ, x),   θ ∈ [0, 1],   (7)

where x, y, and γ are the input, output, and parameters of the layer, and F is a nonlinear function.
Forward propagation. To solve the nonlinear equation in (7), consider the equivalent minimization problem

min_y ½ ‖r(y)‖²,   r(y) = y − x − θ F(γ, y) − (1 − θ) F(γ, x).

One way to construct the required solution is by applying the gradient descent algorithm

y_{n+1} = y_n − λ_n (I − θ ∂F(γ, y_n)/∂y)^T r(y_n) = y_n − λ_n ∇_y [½ ‖r(y_n)‖²],   n = 0, 1, 2, ...
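A scalar sketch of this gradient-descent solve, with a hypothetical F chosen so that the iteration contracts (illustration only; λ_n is kept constant for simplicity):

```python
import numpy as np

# Hypothetical scalar layer map and its derivative (illustration only).
F = lambda u: 0.5 * np.tanh(u)
dF = lambda u: 0.5 * (1 - np.tanh(u) ** 2)
theta, lam = 0.5, 1.0
x = 0.8

# residual of the implicit equation y = x + θ F(y) + (1-θ) F(x)
r = lambda y: y - x - theta * F(y) - (1 - theta) * F(x)

y = x                     # warm start from the layer input
for _ in range(100):
    # gradient step: y <- y - λ (1 - θ F'(y)) r(y) = y - λ ∇_y ½ r(y)²
    y = y - lam * (1 - theta * dF(y)) * r(y)
print(abs(r(y)))
```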

Alternatively, the fixed-point iteration

y_{n+1} = x + θ F(γ, y_n) + (1 − θ) F(γ, x)
can also be used when the initial guess is sufficiently close to the minimizer.
Finally, by linearizing F(γ, y) around x, we obtain the closed-form estimate of the solution

y ≈ x + (I − θ ∂F(γ, x)/∂x)^{−1} F(γ, x),

which can be used as an initial guess for the aforementioned iterative algorithms.
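Putting these pieces together, the forward pass can be sketched as a fixed-point iteration seeded by the linearized closed-form guess; the scalar F below is hypothetical and chosen so that the iteration contracts:

```python
import numpy as np

# Hypothetical scalar layer map and its derivative (illustration only).
F = lambda u: 0.5 * np.tanh(u)
dF = lambda u: 0.5 * (1 - np.tanh(u) ** 2)
theta = 0.5

def implicit_layer(x, n_iter=50):
    # linearized closed-form initial guess: y ≈ x + (1 - θ F'(x))^{-1} F(x)
    y = x + F(x) / (1 - theta * dF(x))
    # fixed-point refinement of y = x + θ F(y) + (1-θ) F(x)
    for _ in range(n_iter):
        y = x + theta * F(y) + (1 - theta) * F(x)
    return y

x = 0.8
y = implicit_layer(x)
# the residual of the implicit equation vanishes at the fixed point
print(abs(y - x - theta * F(y) - (1 - theta) * F(x)))
```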
It is worth noting that, even though the nonlinearity in (7) adds to the complexity of the forward propagation, the backpropagation through the nonlinear solver is not required as is shown below.
Backpropagation. Using the chain rule, we can easily find the Jacobian matrices of the implicit residual layer as follows:

∂y/∂x = (I − θ ∂F(γ, y)/∂y)^{−1} (I + (1 − θ) ∂F(γ, x)/∂x),
∂y/∂γ = (I − θ ∂F(γ, y)/∂y)^{−1} (θ ∂F(γ, y)/∂γ + (1 − θ) ∂F(γ, x)/∂γ).
The backpropagation formulas then follow immediately:

∂L/∂x = (∂y/∂x)^T ∂L/∂y,   ∂L/∂γ = (∂y/∂γ)^T ∂L/∂y.

One can see that backpropagation is an essentially linear process, and only one linear solve, with the matrix (I − θ ∂F(γ, y)/∂y)^T, is required at the beginning of each layer. It is also clear that the case θ = 0 corresponds to the standard ResNet architecture with essentially no control over the propagation of perturbations through the network. At the other extreme, when θ = 1, the network has excellent forward stability but cannot be trained. The trapezoidal scheme with θ = 1/2 instead provides a proper balance between stability and controllability while also being a reversible second-order integrator.
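The Jacobian of the implicit layer can be sanity-checked against finite differences; the map F below is a hypothetical smooth function (not from the paper's experiments), and the single (I − θ ∂F/∂y) solve is the linear solve mentioned above:

```python
import numpy as np

# Hypothetical smooth layer map F(u) = tanh(W u) and its Jacobian.
rng = np.random.default_rng(1)
W = rng.normal(size=(3, 3)) / 3.0
F = lambda u: np.tanh(W @ u)
JF = lambda u: np.diag(1 - np.tanh(W @ u) ** 2) @ W
theta = 0.5

def solve_layer(x, n_iter=200):
    # fixed-point solve of y = x + θ F(y) + (1-θ) F(x)
    y = x.copy()
    for _ in range(n_iter):
        y = x + theta * F(y) + (1 - theta) * F(x)
    return y

x = rng.normal(size=3)
y = solve_layer(x)

# ∂y/∂x = (I - θ ∂F/∂y)^{-1} (I + (1-θ) ∂F/∂x): one linear solve per layer
dydx = np.linalg.solve(np.eye(3) - theta * JF(y),
                       np.eye(3) + (1 - theta) * JF(x))

# central finite-difference check of the same Jacobian
eps = 1e-6
fd = np.column_stack([(solve_layer(x + eps * e) - solve_layer(x - eps * e)) / (2 * eps)
                      for e in np.eye(3)])
print(np.max(np.abs(dydx - fd)))
```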
Implementation details. The proposed residual architecture can be easily implemented using any existing deep learning framework such as PyTorch or TensorFlow. The code snippet in Listing 1 gives an example of such an implementation in TensorFlow. First, we use tf.stop_gradient to avoid backpropagating through the nonlinear solver, and then we compose the output of the layer with the custom_backprop function, which is a tf.custom_gradient decorator of the identity map. This decorator is responsible for the linear solve in the backpropagation formulas above, while the remaining operations are handled by the automatic differentiation algorithm supplied with the framework.
Complexity. Let d be the depth of the network and denote by m the memory complexity of the standard ResNet layer. Then the memory effort of the proposed architecture is O(d · m). In fact, it can often be reduced due to the improved stability and hence a potentially smaller required depth. Moreover, the memory complexity can be made O(1) when using reversible methods such as the trapezoidal scheme in (7) with θ = 0.5. On the other hand, the computational cost of the implicit network is necessarily larger when compared to a ResNet of the same depth, since additional nonlinear and linear solves are required at each layer. The cost of the linear solver strongly depends on the structure of the linear operator. For general n × n dense matrices it is on the order of O(n^γ) for some γ ∈ (2, 3]. In practice, the dimension of the hidden states is often not very large, or the linear operator has special structure. For instance, sparse convolutional operators should not be cast into matrix form, and the corresponding linear systems can be solved by iterative methods which only require one to know how to apply a particular operator to a given tensor. The cost of the nonlinear solver is more difficult to estimate since the convergence is highly dependent on the initialization.

Example 1. For the first example, we use the following ordinary differential equation,

ẋ(t) = γ(t) x(t),

with a skew-symmetric coefficient matrix γ(t). Note that γ(t) has a purely imaginary spectrum, which guarantees stability of the continuous dynamics.
We compare the behaviour of two networks derived from (7), namely the standard ResNet (θ = 0) and the new implicit trapezoidal network (θ = 0.5). We used a training dataset of 100 randomly sampled points and the standard L² loss function, and trained the networks using a batch gradient descent optimizer on batches of size 4. The validation dataset was evaluated on 200 points. Both networks were initialized with the Glorot uniform initializer, and we applied weight regularization across the L layers of the network. We set L = 100 for the ResNet and L = 10 for the trapezoidal network, with correspondingly adjusted values of the hyperparameter h so that both networks approximate the same ODE. We chose these values based on the stability argument above; this is the reason why ResNet needs more layers.

Example 2. For the second example, we consider another small test problem from [8]. The dataset, illustrated in Figure 6, consists of 513 points organized in two differently labeled spirals. Every other point was removed to be used as the validation dataset. We used the same network architecture as in the previous example but with 6 hidden nodes at each of the 25 hidden layers and tanh activation instead of ReLU. The final classification layer has sigmoid activation. Figures 5 and 6 illustrate the convergence of the networks and the classification results. One can see that the proposed implicit scheme is more accurate and robust than the classical ResNet.