Efﬁcient Deep Learning for Gradient-Enhanced Stress Dependent Damage Model

: This manuscript introduces a computational approach to micro-damage problems using deep learning for the prediction of loading deﬂection curves. The location of applied forces, dimensions of the specimen and material parameters are used as inputs of the process. The micro-damage is modelled with a gradient-enhanced damage model which ensures the well-posedness of the boundary value and yields mesh-independent results in computational methods such as FEM. We employ the Adam optimizer and Rectiﬁed linear unit activation function for training processes and research into the deep neural network architecture. The performance of our approach is demonstrated through some numerical examples including the three-point bending specimen, shear bending on L-shaped specimen and different failure mechanisms.


Introduction
Neural networks (NN) have been used for numerous applications in different areas including computational mechanics. Initially, single-layer or shallow neural networks (SNN) consisted only of one input and one output layer. Later on, additional hidden layers were added to the network architecture resulting in so-called deep neural networks (DNN). Table 1 gives a brief summary of different network architectures. Neural networks or more specifically deep neural networks suffer from three major difficulties, i.e., (a) vanishing gradients, (b) over-fitting and (c) computational loading. However, significant advances such as deep belief networks (DBN) [1], rectified linear unit (ReLU) activation functions [2], drop-out algorithms [3] or back-propagation algorithms and associated tools have contributed to the popularity of DNN. The vanishing gradient problem for instance has been significantly alleviated thanks to the RELU activation function and cross entropy-driven learning techniques. Nonetheless, certain issues such as over-fitting still remains a challenge in deep neural networks. Common techniques to address such problems include regularization techniques.
We recall that the crucial idea behind Artificial Neural Network (ANN) is that many neurons can be joined together by connecting weights to conduct complex computations. The structure is often demonstrated as a graph or a map whose nodes are the neurons and each (directed) edge in the graph connects the output to the input of the associated neurons. Deep learning methods, representative of learning methods with multiple processing layers in the hidden layers, consist of linear and non-linear transformations [4,5]. Some simplifications of the problems stem from their need for a usually very strong CPU and a noticeably long time to detect and analyse slow convergence behaviour. For complex problems, the solution could be non-existent. A newer method or a faster algorithm has yet to be found in recent decades. While machine learning (ML) approaches have been extensively and successfully used in numerous areas, its application in "modeling and simulation" is still in its infancy. For instance, in medicine, ML has been employed in diagnostics where it outperforms the diagnosis of established physicians [6]. The authors of [7] have applied successfully deep learning to cellular imaging. Park et al. [8] focused on the problems of cellular imaging in regulatory genomics. Goh et al. [9] took advantage of deep learning in computational chemistry. Deep learning techniques were also applied in applications such as bio-informatics, or the public health sector [10]. On the other hand, most approaches in engineering, or more specifically computational mechanics, have been used in data-driven contexts though there are numerous other applications such as the direct solution of partial differential equations [11], which have the potential to drastically accelerate the design to analysis time and the way modeling and simulation is performed. In the data-driven context, neural networks have commonly been used in the context of constitutive models [12,13] as an alternative to traditional constitutive models. The models in [14][15][16] presented interesting approaches to capture the response of anisotropic materials. New training algorithms for specific constitutive laws have been presented in [17,18]. However, setting up the network architecture for such engineering problems still remains a major challenge and is often determined by trial and error. The authors of [19] exploited Deep Learning to optimize the fine-scale structure of composites. Multi-fields problems were tackled for instance in [20,21]. Recently, Lee et al. [22] has applied deep learning algorithms to structural analysis. In this manuscript, we present a novel methodology to predict the load-deflection curve by deep learning. Passing through the three-point bending as an illustrative example, we suggest some possible architectures of the deep neural networks based on the Adam optimizer. Such findings can open a new branch of research that may prove beneficial to the fourth industrial revolution, where deep learning algorithms play a major role in big data analysis of structural engineering.

Constitutive Equations of Isotropic Damage Models
Let us assume small strain theory in the context of a scalar/isotropic damage model. The relation between the Cauchy stress tensor σ and the linear strain tensor ε is given by where ω denotes a monotonically increasing scalar damage variable, C the fourth-order elasticity tensor and C e f f = (1 − ω)C refers to the so-called 'effective' elasticity tensor. A value of ω = 1 indicate a completely damaged material while the material is intact for a value o ω = 0. The evolution of the damage variable is commonly governed by a scalar state variable κ, i.e., ω = ω(κ). The Kuhn-Tucker conditions, which finally lead to a convex optimization problem, ensures that the product of the loading function f and the rate of the state variable is equal to zero: ε eq indicating the effective strain, which is a projection of a multi-axial strain state onto a single scalar value. We consider a strain-based formulation, where the loading function is expressed in terms of the effective strain (instead of effective stresses), which facilitates the implementation in the context of displacement based finite element analysis: We test two different approaches to compute ε eq . The first one has been presented by Mazars [23] for quasi-brittle materials and is given by where ⟨⋅⟩ indicates a Macaulay bracket while ε 1 and ε 2 -for two-dimensional problems-denote the principal strains. For this approach, we employ an exponential evolution law for the damage variable as suggested by Peerlings et al. [24]: where κ 0 stands for an initial damage threshold while the material parameters α and β needs to be determined through experiments. We also employ an expression for the equivalent strain as suggested by de Vree et al. [25] and Peerlings et al. [26]: where I 1 indicates the first invariant of the linear strain tensor and J 2 refers to the second invariant of the deviatoric part of the strain tensor.
The parameter k in Equation (6) determines the ratio of the compressive and tensile strength.

Gradient-Enhanced Damage Models
It is well known that local damage models lead to ill-posed boundary value problems and associated numerical difficulties such as mesh-dependent results in computational modeling. Non-local damage models as proposed [26][27][28][29][30] restore the well-posedness of the boundary value problem by introducing an intrinsic length scale which smear the crack over a certain width. In such models, the loading function is therefore expressed in terms of non-local equivalent strain ε eq (x) which in turn depend on the local effective strain and a weighting function g (ξ) governing the domain of non-locality: An alternative to such strongly nonlocal models are weakly nonlocal approaches [26,28,31] which are based on Taylor series expansions to approximate the effective strain. Substituting these into the Equation (8) yields a different expression for the non-local equivalent strain. However, it is well known that such formulations require C 1 -continuity for second-order gradient models (C 2 -continuity for fourth-order gradient enhancements), which complicates their implementation in Lagrange polynomial based finite element analysis. Implicit formulations overcome these difficulties. Differentiating twice and neglecting higher-order gradient terms of the local equivalent strain, the implicit second-order gradient model reads: where the parameter l c indicates the intrinsic length scale, which can be regarded either as purely numerical regularization parameter or material parameter which needs to be determined through experiments or other theoretical considerations. For concrete materials, Bažant et al. related l c to the maximum aggregate size [32]. Furthermore, the following von Neumann boundary conditions need to be satisfied [28,33]: It can be shown that the gradient enhanced damage model, Equation (9), can be expressed in terms of the principal directions and an anisotropic weight function: where c 1 , c 2 are the weighting factors. More details about the derivation of above described gradient-enhanced damage model and its implementation can be found for instance in [26].

Optimizers
In machine learning (ML) approaches, the weights and biases of the network (see Figure 1) are obtained through minimizing an objective function. Commonly, gradient descent methods are employed in ML. The update of the gradient descent has the form x k+1 = x k − η∇ x f (x k ), η being the step size, which is also referred to as learning rate. Gradient descent methods are based on so-called batches employed to calculate the gradient in a single iteration. Commonly, the batch is the entire data set. In addition, a batch can be enormous. A very large batch may cause even a single iteration to take a very long time in computation. By choosing examples from a data set at random, it could estimate (albeit noisily) a big average from a much smaller one. Stochastic gradient descent (SGD) takes this idea to the extreme-it uses solely one example (a batch size of 1) per iteration. Every one computation for all data points, it is called one epoch. The term "stochastic" implies that the one example comprising each batch is chosen arbitrarily. The mini-batch stochastic gradient descent method (mini-batch SGD) is a compromise between the full-batch iteration and SGD. A mini-batch is usually between 10 and 1000 examples, chosen at random. Mini-batch SGD reduces the amount of noise in SGD but is still more efficient than full-batch. It is well known that the steepest gradient descent faces difficulties in areas where surface curves exhibit different gradients in different dimensions, which frequently occurs around local optima. To reduce the risk of getting stuck in local optima, we take advantage of the Momentum Method, which shows analogies to the equations governing the movement of particles in a viscous medium: where the parameter γ ≈ 0.9 governs the updating of the iterations within the stochastic gradient descent method (SGDM). This approach is commonly employed with the back propagation algorithm, which will be explained later. However, the momentum method is based on the situation where a ball rolling downhill blindly follows the slope. A smarter ball would slow down before the slope goes up again, which is the essence of the Nesterov Accelerated Gradient (NAG) approach [34]. Therefore, the changing of momentum is computed, which is simply the sum of the momentum vector and the gradient vector at the current step. The changing of the Nesterov momentum is then the sum of the momentum vector and the gradient vector at the approximation of the next step: Gradients of complex functions as used in DNN tend to either vanish or explode. These vanishing/exploding problems become more pronounced with increasing complexity of the function. These issues can be alleviated by an adapted learning rate method, named RMSProp, which was suggested by Geoff Hinton in Lecture 6e of his Coursera Class. The idea is based on a moving average of the squared gradients, which normalizes the gradient. This approach proves effective to balance the step size by decreasing the step for large gradient to avoid explosion while increasing the step for small gradients, which evenutally alleviates the vanishment problem: where the so-called squared gradient decay factor β ranges from 0 to 1; we employ the suggested value of β ≈ 0.9. Another method, that calculates learning rates adaptively for each parameter is the Adaptive Moment Estimation (Adam) [35]. Adam keeps an exponentially decaying average of past gradients v k as well. However, in contrast to the momentum method, Adam can be seen as a heavy ball with friction, that prefers flat minima in the error surface. The past decaying averages and squared gradients v k and r k can be obtained by v k and r k being estimates of the first and second moment, which are the mean and the uncentered variance of the gradients, respectively. To avoid the bias towards zero vectors for the initialized vectors v k and r k , we employ corrected first and second moment estimates as suggested by the authors of Adam: Subsequently, we use the Adam update rule given by Hidden layers

Back-propagation algorithm
Though the back-propagation algorithm can be traced back to the 1970s, its popularity grew with the seminal paper of Rumelhart et al. [36]. It has become the 'backbone' of many efficient machine learning tools such as Pytorch or Tensorflow. The back-propagation algorithm basically minimizes the error function in weight space by the method of gradient descent. Mathematically, the input and output are in matrix form (X, Y) where X ∈ R d×n and Y ∈ R k×n , d designate the number of input parameters, k is number of output parameters and n is number of training data. After calculating the output from the input of a mini-batch X, the activation y l at each hidden layer are saved where l = 1, ⋯, m is the order of hidden layers. At the output layer, the derivative of the loss function with

Activation Functions
Let us consider a three-point bending beam as illustrated in Figure 2. We assume plane stress conditions and a beam thickness of h = 0.05 m. The material parameters are: Young's modulus E = 20 (GPa), Poisson's ratio ν = 0.2, softening parameters α = 0.99, β = 500, κ 0 = 10 −4 . We employ the formulation for the equivalent strain as in Equation (4). The input of the process are locations l f where the force f is applied which will be created by 120 different positions from  Table 2.

Name Equation Derivative
Identity

Back-Propagation Algorithm
Though the back-propagation algorithm can be traced back to the 1970s, its popularity grew with the seminal paper of Rumelhart et al. [36]. It has become the 'backbone' of many efficient machine learning tools such as Pytorch or Tensorflow. The back-propagation algorithm basically minimizes the error function in weight space by the method of gradient descent. Mathematically, the input and output are in matrix form (X, Y) where X ∈ R d×n and Y ∈ R k×n , d designate the number of input parameters, k is number of output parameters and n is number of training data. After calculating the output from the input of a mini-batch X, the activation y l at each hidden layer are saved where l = 1, ⋯, m is the order of hidden layers. At the output layer, the derivative of the loss function with respect to z are calculated e m = ∂J ∂z m where J = 1 n n i=1 y i − y i 2 2 . Hence, the gradient can be computed ∂J ∂w m = y m−1 e m . For l = m − 1, ⋯, 1, the derivatives e l = w l+1 e l+1 ⊗ f ′ (z l ) are determined where the operator ⊗ means element wise product. Finally, the gradient is updated by ∂J ∂w l = y l e l which apparently requires the continuity and differentiability of the error function. Applicatively in the three-point bending specimen, the training data will be collected by the gradient-enhanced damage models and the training process based on the (1 − 100 − 100 − 100 − 62) architecture in Figure 3. Because of the symmetric of the specimen, the loading deflection curve are quite similar in the left from L 2 − ∆ p to L 2 and in the right from L 2 to L 2 + ∆ p with regards to the middle point. Hence, it is a good idea to divide the data into two parts, the left part and the right part with respect to the middle point. Thus, the training is separated into two processes with 61 data for each case.

Scaled Layer
For the material problem which will be discussed in Section 5, the input data were built by different parameters. Due to the wide range of these, some data can be as large as Young's modulus or softening parameter β whereas other ones can be as small as the critical value of equivalent strain. This presents some limitations to the sensitive input and the training. To overcome these issues, a scaled layer based on a convex combination is introduced to reduce the overload input data and increase the tiny ones. Let s be a bijection from the range [min, max] of a parameter to the interval [0, 1] which is defined by s(x) = x − min l where min, max are the boundaries and l is the length of the parameter region. The role of the layer was applicable to both training and predicting process. The scale layer can be added at the input layer, the output layer or both of them.

Results and Testing
Consider back to the location force problem which was trained by Adam optimizer where Rectified Linear Unit (ReLU) activation function with back propagation algorithm was employed. The learning rate of the left data is η l = 3 × 10 −3 and of the right data is η r = 4 × 10 −3 . For every period of a number of epochs (which is called the learning rate drop period), the learning rate is dropped base on a factor called the learning rate drop factor to let the jump steps are smaller after the period of iteration. This technique assists the convergent optimization. In this computation, the learning rate drop period is 100 epochs and the learning rate drop factor is 95%. The squared gradient decay factor is 0.99. The gradient decay factor is 0.95. The size of the mini-batch is 100 and the total of epochs is 10, 000. The training process for both left and right data are illustrated in Figure 4. The neural network architecture for this problem is (1 − 100 − 100 − 100 − 62). Ten data were chosen randomly for testing. The loss values is computed by root mean square error (RMSE) and by mini-batch loss (Loss) for comparison in Table 3. The testing of loading deflection curves by deep learning after training and by data for both left and right is illustrated in Figure 5.

Dimensional Problem
We reconsider the three-point bending specimen that was subjected to an external force f whose various geometry and boundary condition of the specimen are shown in Figure 6. The thickness h of the beam is 0.05 m and plane stress conditions are assumed. The input of the process are dimensional vectors [H, L] where H ∈ [0.1, 0.4] is the height and L ∈ [1.0, 2.3636] is the length of the specimen. The data will be created by 156 different dimensional vectors where the heights H are linearly generated 12 times while the lengths L are created 13 times. The number of training data is 150 and six data for testing. In this example, to avoid the sensitivity of the data where the displacements u ∈ [0.1, 0.8] and the forces can exceed 2500, the training process will be separated into two parts. One for displacement and one for forces whose architectures are illustrated in Figure 7. For each input, the vector of output is represented by [u 1 , ⋯, u 31 ] T for displacements and [ f 1 , ⋯, f 31 ] T for forces which (u i , f i ) ∀i = 1, ⋯, 31 are points of loading deflection curve. Therefore, the matrix of output is a 31 × 150 matrix. The training data will be collected by the gradient-enhanced damage models and the training process based on (2 − 100 − ⋯ − 100 7 −31) architecture.    The problem was trained by Adam optimizer by the employment of Rectified Linear Unit (ReLU) activation function with back propagation algorithm. The learning rate of the displacement data is η u = 10 −3 and of the force data is η f = 2 ⋅ 10 −3 . The learning rate drop period is 100 epochs. The learning rate drop factor is 99% for displacement training process and is 99.9% for force training process. The squared gradient decay factor is 0.99. The gradient decay factor is 0.95. The size of mini-batch is 100 and the total of epochs is 50, 000 for force -training and is 5000 for displacement-training. The training processes are illustrated in Figure 8. Six data were chosen randomly for testing. The loss values are computed by root mean square error (RMSE) and by mini-batch loss (Loss) for comparison in Table 4. The testing of loading deflection curves by deep learning after train and by data are illustrated in Figure 9.

Material Parameter Problem
The three-point bending specimen is subjected to an external force f whose geometry and boundary condition of the specimen are shown in Figure 2. In this problem, the force is placed at the middle of the specimen. The thickness h of the beam is 0.05 m and plane stress conditions are assumed. The input of the process is material parameter vector [E, ν, κ, le, α, β] where E is the Young's modulus, ν is the Poisson's ratio, κ is the critical value of equivalent strain, le is the characteristic length and α, β are softening parameters. The data is created by 4096 different parameter vectors where the parameters are taken in intervals based on the experienced average values in Table 5. In this example, to avoid the sensitivity of the data where the displacements u ∈ [0.1, 0.8] and the forces can exceed 1400, and the training process is separated into two parts. One for displacements and one for forces whose architectures are similarly and illustrated in Figure 10.   The problem was trained by Adam optimizer by the employment of Rectified Linear Unit (ReLU) activation function with back propagation algorithm. The learning rate of the displacement data is η u = 10 −4 and of the force data is η f = 3 × 10 −4 . The learning rate drop period is 100 epochs. The learning rate drop factor is 99%. The squared gradient decay factor is 0.99. The gradient decay factor is 0.95. The size of the mini-batch is 100 and the total of epochs is 5000. The training processes are illustrated in Figure 11. Six data were chosen randomly for testing. The loss values will be computed by root mean square error (RMSE) and by mini-batch loss (Loss) for comparison in Table 6. The testing of loading deflection curves by deep learning after training and by data are illustrated in Figure 12.

L-Shape Specimen
The last example is an L-shaped specimen subjected to a distributed load as shown in Figure 13, which is also a classical benchmark problem used for instance to demonstrate the performance of Isogeometric Analysis (IGA) [37]. The thickness of the specimen is 20 cm and plane stress conditions are assumed. The material parameters are: Young's modulus E = 10 (GPa) and Poisson's ratio ν = 0.2.
The effective strain in Equation (6) and the damage law in Equation (5) are used with parameters κ 0 = 4 × 10 −4 , α = 0.98 and β = 80. The non-local length scale is taken as l c = 5 √ 2 ≈ 7.07 (mm). The input of the training process are locations l f on which the force f is applied. These locations are created by 123 different positions from 0 to ∆ p where ∆ p is the span of the various forces. In this example, to avoid the sensitivity of the data where the displacements u ∈ [0, 3] and the forces can be more than 15000, the training process will be separated into two parts. One for displacements and one for forces whose architectures are illustrated in Figure 14. For each input, the vector of output is formed by The problem was trained by Adam optimizer by the employment of Rectified Linear Unit (ReLU) activation function with back propagation algorithm. The learning rate of the displacement data is η u = 10 −4 and of the force data is η f = 2 × 10 −3 . The learning rate drop period is 100 epochs. The learning rate drop factor is 99%. The squared gradient decay factor is 0.99. The gradient decay factor is 0.95. The size of mini-batch is 100 and the total of epochs is 5000. The training processes are illustrated in Figure 15. Five randomly data were chosen for testing. The loss values are computed by root mean square error (RMSE) and by mini-batch loss (Loss) for comparison in Table 7. The testing of loading deflection curves by deep learning after training and by data are illustrated in Figure 16.  Table 7. Loss values for both displacement and forces training in L-shape problem.

Conclusions
This paper introduced a deep learning technique in the context of artificial neural networks for predicting loading deflection curves of structures under gradient-enhanced damage responses. The neural network has been trained based on finite element simulations. We conducted our approach in the form of numerical experiments, using various shapes of specimen as well as inputs. The predictions based on material changes are much more challenging and the number of input parameters will trigger big data in the training phase. The Adam optimizer and Rectified Linear Unit activation function produced the best result for training. The major contribution of this study is to integrate deep learning into computational mechanics. The key feature is how to choose training parameters and deep neural network architecture. Our research is potentially competent for application to a wide range of complex practical engineering problems. In the next step, we intend to develop and apply sufficient transfer learning algorithms, which is important for computational efficiency.