FE 2 Computations with Deep Neural Networks: Algorithmic Structure, Data Generation, and Implementation

: Multiscale FE 2 computations enable the consideration of the micro-mechanical material structure in macroscopical simulations. However, these computations are very time-consuming because of numerous evaluations of a representative volume element, which represents the microstructure. In contrast, neural networks as machine learning methods are very fast to evaluate once they are trained. Even the DNN-FE 2 approach is currently a known procedure, where deep neural networks (DNNs) are applied as a surrogate model of the representative volume element. In this contribution, however, a clear description of the algorithmic FE 2 structure and the particular integration of deep neural networks are explained in detail. This comprises a suitable training strategy, where particular knowledge of the material behavior is considered to reduce the required amount of training data, a study of the amount of training data required for reliable FE 2 simulations with special focus on the errors compared to conventional FE 2 simulations, and the implementation aspect to gain considerable speed-up. As it is known, the Sobolev training and automatic differentiation increase data efﬁciency, prediction accuracy and speed-up in comparison to using two different neural networks for stress and tangent matrix prediction. To gain a signiﬁcant speed-up of the FE 2 computations, an efﬁcient implementation of the trained neural network in a ﬁnite element code is provided. This is achieved by drawing on state-of-the-art high-performance computing libraries and just-in-time compilation yielding a maximum speed-up of a factor of more than 5000 compared to a reference FE 2 computation. Moreover, the deep neural network surrogate model is able to overcome load-step size limitations of the RVE computations in step-size controlled computations.


Introduction
Nearly all commonly applied engineering materials possess, depending on the detail of investigation, some heterogeneous microstructure, e.g., fiber-reinforced polymers or rolled steel alloys, where the grains can have preferential directions because of the manufacturing process.Since this heterogeneous microstructure can significantly influence the response of these materials to mechanical loading, it is of particular interest to consider the microstructure already in numerical simulations.The development of constitutive models for materials with heterogeneous microstructures is challenging in both aspects, phenomenological constitutive modeling and subsequent experimental calibration.Thus, the so-called FE 2 method has been developed by [1][2][3][4][5][6]-to mention only a few-for coupled numerical simulation of structures at macro-and microscale with finite elements.There, in contrast to common finite element computations, a constitutive model is not assigned to an integration point at macroscale.Instead, the stress and consistent tangent quantities are obtained by solving an initial boundary value problem with finite elements on a particular microstructure followed by a numerical homogenization technique.In this context, the microstructure is usually denoted as a representative volume element (RVE).In addition to the aforementioned works, in [7], a comprehensive description of the FE 2method for the numerical solution of these coupled boundary value problems on different scales is provided.In general, there exist further methods to obtain the response of heterogeneous microstructures, such as Discrete Fourier Transforms or Fast Fourier Transforms; see, for example, [8].Even the finite cell method is applicable for the homogenization of heterogeneous microstructures; see, for example, [9].However, in this work, deep neural networks are applied to replace the computationally costly solution of initial boundary value problems at microscale.
Currently, various applications of methods of artificial intelligence exist in the field of solid mechanics.A comprehensive overview of applications in continuum material mechanics is given in [10].Further reviews are provided in [11,12] for applications in experimental solid mechanics and [13] for material development in additive manufacturing employing machine learning methods.Ref. [14] provides a general introduction to the application of machine learning techniques in material modeling and design of materials.Additionally, in [15], a review and investigation of the ability to apply machine learning in constitutive modeling is provided; however, it is in the context of soils.Most applications of machine learning methods aim to obtain feasible information from huge amounts of data or to increase the speed of particular computations.The source of the data could either be simulations, as in the present work, or directly experimental data, as in the data-driven mechanics approach, which was introduced by the aithors of [16], where it is not required to learn the response of constitutive models from simulations.
The application of artificial neural networks for data-based constitutive modeling was originally introduced in [17] and is frequently used in representing the material behavior for finite element simulations since then; see, for example, [18,19].Recently, different approaches have been published to advance numerical simulations with machine learning methods.An investigation into deep learning surrogate models for accelerating numerical simulations is presented in [20].Ref. [21] contains a proposal of a combination of physics-based finite element analysis and convolutional neural networks for predicting the mechanical response of materials.In contrast, ref. [22] contains an application of deep learning techniques for extracting rules that are inherent in computational mechanics to propose a new numerical quadrature rule that shows results superior to the well-known Gauss-Legendre quadrature.
Learning material behavior from simulations is generally covered with versatile approaches.In this context, the authors of [23] describe a material modeling framework for hyperelasticity and plasticity, where different architectures of neural networks are employed.Model-free procedures that fit into the data-driven mechanics approach for representing material behavior are described in [24][25][26][27], among others.Model-free approaches are suitable, especially for the consideration of elastoplastic material behavior; see [28,29] as well.Artificial neural networks could also be applied for calibrating known constitutive models from experimental data (parameter identification).First attempts are presented in [30,31], whereas in [32], modern deep reinforcement learning techniques are applied for the calibration of history-dependent models.Since the measurement techniques to obtain experimental data have evolved in recent years, modern calibration techniques can consider full-field data, e.g., from digital image correlation; see [33].Instead of calibrating constitutive models from experimental data with neural networks, where an error is introduced from choosing the constitutive model, the experimental data can be directly employed for discovering the material models from data.This approach is introduced in [34] for hyperelasticity and later on extended to cover elastoplasticity [35] and generalized materials [36].Similar work with automated discovery of suitable hyperelastic materials is provided in [37], where constitutive artificial neural networks are applied, introduced in [38].
Many different machine learning methods are successfully used for multiscale applications in solid mechanics.There, the main objective is to obtain the homogenized response from heterogeneous microstructures.One of the first works in this context is provided in [39], wherein the authors applied neural networks for the homogenization of non-linear elastic materials.Ref. [40] contains a proposal of a data-driven two-scale approach to predict homogenized quantities even for inelastic deformations by drawing on clustering techniques.The ability to replace microscale evaluations with artificial neural networks requires a suitable accuracy of the network after the training process.Regarding this issue, the authors of ref. [41] make use of artificial neural networks as constitutive relation surrogates for nonlinear material behavior.However, based on the evaluation of quality indicators, a reduced-order model can be employed instead of the neural network within an adaptive switching method.The authors of ref. [42] describe the so-called Structural-Genome-Driven approach for FE 2 computations of composite structures, wherein even model reduction techniques are applied.
Advanced neural network architectures such as convolutional or recurrent neural networks are regularly applied to predict the homogenized response of microstructures.Here, atomistic data can also be used showing significant acceleration compared to molecular statics [43].Elastic material behavior is investigated in [44][45][46][47].In ref. [48], significant speed up in homogenization is reached when applying three-dimensional convolutional neural networks in broad ranges of different microstructures, phase fractions, and material parameters.The authors of ref. [49] provide the generalization of data to obtain three-dimensional nonlinear elastic material laws under arbitrary loading conditions.Considering anisotropy, teh authors of ref. [50] predict effective material properties of RVEs with randomly placed and shaped inclusions.The suitability of different machine learning methods for homogenizing polycrystalline materials is studied in [51].Besides the purely mechanical homogenization approaches, the authors of ref. [52] show that neural networks can be even applied to computational homogenization of electrostatic problems.Moreover, researchers in ref. [53] employ µCT data within a data-driven multiscale framework to study open-cell foam structures.
According to [54], replacing microscale computations in the FE 2 method by surrogate models can be denoted as a data-driven multiscale finite element method.The authors of ref. [55] perform multiscale computations with feedforward neural networks and recurrent neural networks for RVEs with inelastic material behavior and further investigate the ability to generalize for unknown loading paths.Researchers in ref. [56] present a hybrid methodology denoted as a model-data-driven one.Therein, the authors apply a combination of conventional constitutive models and a data-driven correction component as a multiscale methodology.Moreover, it is beneficial to incorporate physical knowledge into the development of neural network surrogates.This is achieved, for example, in [57].The authors propose thermodynamics-based artificial neural networks (TANNs) and apply them for multiscale simulations of inelastic lattice structures, while later extending the framework to evolution TANNs [58].The application of particular physical constraints by using problem-specific invariants as input quantities and the Helmholtz free energy density as output is provided in [59].The authors provide FE ANN as a data-driven multiscale framework and minimize the number of microscale simulations, which serve as training data, by following an autonomous data mining approach.
Further, probabilistic approaches can be employed while developing the surrogate models; in [60], more accurate results are achieved with Sobolev training [61] compared to regular training for hyperelasticity.In this context, for an extension to multiscale plasticity with geometric machine learning methods, we refer to [62,63].Elastoplastic solid materials are investigated in [64] using recurrent neural networks and in [65] where the authors employ two separated neural networks for the homogenized stress and tangent information.The authord of ref. [66] apply DeepONet as a surrogate model for the microscale level with two-dimensional elastoplasticity and hyperelasticity.Currently, the authors of ref. [67] demonstrate the applicability of the encoder/decoder approach for multiscale computations with path-dependent material behavior on the microscale.
Another research track are the so-called deep material networks (DMNs), which provide an efficient data-driven multiscale framework to reproduce the homogenized response of RVEs.The introduction of DMNs for two-phase heterogeneous materials is provided in [68] together with an extension to three-dimensional microstructures [69].The authors of ref. [70] further extend the technique to take into account diverse fiber orientations, applying DMNs for multiscale analysis of composites with full thermo-mechanical coupling [71].Researchers in ref. [72] employ DMNs with the computation of the tangent operator in a closed form as an output of the network.
The main objective of the present work is to provide a consistent approach for employing deep neural networks (DNN) as surrogate models in step-size controlled multiscale FE 2 computations.As mentioned afore, various publications already deal with embedding artificial neural networks into numerical simulations especially for accelerating computational costly multiscale simulations.A novelty of the present work is that we provide a clear description of the algorithmic structure, which is in general a Multilevel-Newton algorithm (MLNA) that simplifies to a Newton-Raphson algorithm when employing DNN surrogate models.Further, current publications leave out required information; for example, the ways in which the consistent tangent at macroscale integration points is obtained from the microscale information, meaning whether the computations are performed by automatic differentiation, neural network models, or by numerical differentiation.Concerning this objective, we start in Section 2 with an explanation of the underlying equations and the algorithmic structure in FE 2 computations, where we restrict ourselves to small strains and quasi-static problems.Afterwards, two different architectures of neural networks and specific considerations of physical knowledge during the training process are described.Since the amount of training data required to obtain sufficient accuracies in the neural network outputs is of particular interest, this is investigated as well while using regular training and Sobolev training.As another novel contribution, we develop a method for efficiently coupling the different programming codes of the trained neural network and the multiscale finite element code.There, the application of high-performance computing libraries and just-in-time compilation yields significantly higher speed up of the DNN-FE 2 approach in load-step size controlled computations compared to the results presented in the current literature.Furthermore, the DNN surrogate is even able to overcome load-step size limitations that are apparent in FE 2 computations.
The notation in use is defined in the following manner: geometrical vectors are symbolized by a and second-order tensors A by bold-faced Roman letters.In addition, we introduce column vectors and matrices at the global finite element level symbolized by bold-type italic letters A and column vectors and matrices on the local (element) level using bold-type Roman letters A. Further, to distinguish quantities on macroscale and microscale levels, we indicate microscale quantities by ˇ • .Calligraphic letters A denote deep neural network surrogate models.

Classical FE 2 Computations
In this work, finite elements are employed to perform multiscale computations; see, for example, [4,7].Hence, only the main equations are recapped, which are necessary to explain the algorithmic structure.In multiscale FE 2 computations, the macro-and microscale levels have to be distinguished regarding the spatial discretization.Here, we restrict ourselves to periodic displacement boundary conditions on the microscale and refer to [4] for other boundary conditions on the microscale.The computation of the system of non-linear equations resulting from the spatial discretization is explained for the Multilevel-Newton algorithm (MLNA), which is here a two-level Newton algorithm.Further, the connection is drawn to embedding deep neural network surrogate models as predictors for homogenized quantities from the microscale in the MLNA.

Spatial Discretization
In the present work, FE 2 analyses are performed in a quasi-static setting with the restriction to small strains.Thus, no configurations have to be distinguished and we have the symmetric stress tensor T( x, t) and strain tensor E( x, t) = 1/2 grad u( x, t) + grad T u( x, t) at positions x and time t.u( x, t) represents the displacement vector.The local balance of linear momentum has to be fulfilled.Here, the weak form is employed, which is also known as the principle of virtual displacements: where δ u( x) are virtual displacements that are arbitrary but vanish at positions where the displacements u( x, t) are prescribed.Similarly, δE( x) = 1/2 grad δ u( x) + grad T δ u( x) represent virtual strains.Moreover, k symbolizes the acceleration of gravity.V and A denote the volume and surface of the material body and t are surface tractions.To develop the arising equations of the spatial discretization for three-dimensional continua on both macro-and microscale levels, a consistent matrix notation is followed.

Macroscale
Due to the spatial discretization, volume V and surface A transist into approximations Ω and Γ.Further, x ∈ Ω denote coordinates.Following a Galerkin-based finite element formulation, the ansatz for the displacements and virtual displacements read = N e (ϕ e (x))u e (t), (2) = N e (ϕ e (x))δu e .
(3) u j ∈ R 3 and δu j ∈ R 3 denote the macroscopic nodal displacement and virtual nodal displacement vector at node j.The shape functions are denoted by N j (x), while n nodes corresponds to the number of nodes on a macroscale level.In Equations ( 2) and (3), it is tacitly assumed that the displacements and virtual displacements are partitioned into unknown and prescribed quantities, i.e., u ∈ R n u are unknown macroscale nodal displacements and u ∈ R n p are known (or prescribed) nodal displacements.Analogously, the arbitrary virtual displacements are denoted by δu ∈ R n u .For the prescribed virtual displacements on the macroscale δu = 0, δu ∈ R n p holds by definition.Thus, the number of degrees of freedom is n a = n u + n p .Further, the transition to a formulation of the displacements on element level in Equations ( 2) and (3) yields the matrix of shape functions N e ∈ R 3×n e u within an element, the element nodal displacements u e ∈ R n e u , and corresponding virtual element nodal displacements δu e ∈ R n e u .Here, n e u is the number of element nodal degrees of freedom.ξ = ϕ e (x) = χ e −1 (x) are the local coordinates in the element domain with the coordinate transformation x = χ e (ξ).The assignment between global and element quantities can be formulated as u e (t) = Z e u(t) + Z e u(t), δu e = Z e δu.
Here, Z e ∈ R n e u ×n u and Z e ∈ R n e u ×n p are formally introduced incidence matrices (Boolean matrices) which are not programmed but used here for the explanation of the assembling procedure of all element contributions.They assign the global unknown and prescribed nodal displacements to element e. Regarding an explanation of an implementation of these matrices, we refer to [73].
Moreover, the resulting strains and virtual strains read = B e (ϕ e (x))u e (t) = B e (ϕ e (x))(Z e u(t) + Z e u(t)), (5) = B e (ϕ e (x))δu e = B e (ϕ e (x))Z e δu.(6) Again, a decomposition into known and unknown nodal displacements is employed.Since the strain tensor is symmetric, E = E T , the strains can be written in the Voigt notation, i.e., E h ∈ R 6 and δE h ∈ R 6 .B ∈ R 6×n u and B ∈ R 6×n p denote the global straindisplacement matrices on the macroscale level for unknown and prescribed degrees of freedom, respectively, whose mathematical representation is extremely difficult to specify.Thus, the element strain-displacement matrix B e ∈ R 6×n e u is chosen.Inserting Equations ( 3) and ( 6) into the principle of virtual displacements (1) and performing a decomposition of the discretized domain into elements yields the non-linear equations on macroscale g ∈ R n u .n e G is the number of integration points in an element on the macroscale.Further, w j denotes the weighting factors of the spatial integration technique, where here the Gaussintegration is drawn on.Accordingly, ξ j symbolizes the local coordinates of the integration points.Notation • e(j) is used to abbreviate quantities of element e at the macroscale integration point j, e.g., B e(j) := B e (ξ j ) for the strain-displacement matrix at integration point ξ j .Since the coordinates are transformed into a reference domain, J e(j) = ∂χ e /∂ξ ξ=ξ j is the Jacobian of the coordinate transformation.Moreover, p(t), p ∈ R n u represents the equivalent nodal force vector comprising the volume and surface distributed loads: In Equation (7), T e(j) ∈ R 6 are the stresses at a specific integration point j on the macroscale written in the Voigt notation.Usually, these quantities are obtained from the evaluation of particular constitutive models.In contrast, in FE 2 computations, the stresses are computed by a particular homogenization scheme of the microstructure, which is explained in the following section.The macroscopical strains at each integration point read with Equation (4) 1 E e(j) (t, u(t)) = B e(j) u e = B e(j) Z e u(t) + Z e u(t) .
Unfortunately, the principle of virtual displacements does not allow the computation of reaction forces.However, since the macroscale reaction forces are of interest in FE 2 computations, we choose here the Lagrange multiplier method; see [74] and the literature cited therein.Thus, the geometric constraint equation The prescribed displacements u(t) should be identical to the degrees of freedom û(t), û ∈ R n p , representing the degrees of freedom that are initially assumed to be unknown as well.To satisfy constraint Equation (10), the Lagrange multipliers λ ∈ R n p are required, which can be interpreted as the negative nodal forces.
The combining of Equations ( 7) and (10) provides the full system of equations for the discretized weak form of the balance of linear momentum on the macroscale, g a ∈ R n a , Remark 1.It is worth mentioning that the consideration of reaction forces with the Lagrange multiplier method is performed to obtain a consistent variational formulation.The principle of virtual displacements does not allow the computation of reaction forces since they provide no virtual work (remember that δu = 0 holds).Thus, another variational principle is required, which is here the Lagrange multiplier method.It is important to state that the Lagrange multipliers do not extend the number of unknowns in the application here, since they can be computed in a post-processing step after the computation of the nodal displacements; see Equation (11) 2 .Then, the Lagrange multipliers can be interpreted as nodal reaction forces, while considering that, of course, the accuracy of the results depends on the chosen termination criteria for the displacements.As a result, the application of the Lagrange multiplier method bypasses the evaluation of nodal equilibrium to compute the reaction forces at prescribed displacement degrees of freedom.The interested reader is referred to [74] for a detailed description of the method, and further references.

Microscale
The arising equations from the spatial discretization need to be studied also for the microstructure, which represents the discretized microscale geometry in FE 2 computations and is usually denoted as representative volume element (RVE).In contrast to common finite element simulations, where a constitutive model is evaluated at each integration point, here, the microstructure has to be evaluated.
The discretized weak form of the local equilibrium equation on the microscale can be derived analogously to the macroscale and reads ǧ e(j) a (t, u a (t), ǔ e(j) Some remarks should be made regarding the above equation.First, ǔ e(j) a are all displacements in the RVE at integration point j of macroscale element e. ňe(j) a denotes the number of displacement degrees of freedom on the microscale.It is assumed that all displacements are initially unknown in the RVE.ňe(j) e defines the number of elements on the microscale, ň ě G symbolizes the number of microscale integration points per element, and wǰ are the weights of the spatial integration.Matrix Ž ěT a ∈ R a denotes formally the assembling procedure of all element contributions and comprises matrices Ž ě ∈ R for the unknown and prescribed displacement degrees of freedom in the RVE, ňe(j) u and ňe(j) p , respectively, Ž ěT a = Ž ě Ž ě .The strain-displacement matrix on the microscale is defined by Bě( ǰ) ∈ R 6× ň ě u .Similarly to the macroscale, ňe(j) a = ňe(j) u + ňe(j) p holds.Moreover, at the microscale level, there are no volume or surface distributed loads, i.e., p = 0.
In this work, we restrict ourselves to periodic displacement boundary conditions on so-called conform spatial discretizations.We refer to [4] for a detailed description of boundary conditions on the microscale and to [75] regarding periodic displacement boundary conditions on non-conform discretizations.The underlying idea of periodic displacement boundary conditions is that the displacements of nodes, which are positioned on different parts of the surface of the RVE, are coupled.This coupling can be treated as a linear multiple point constraint problem.Thus, we introduce the primary and secondary displacements, ǔ e(j) M ∈ R ňe(j) M and ǔ e(j) Here, ňe(j) holds for the number of pair-wise coupled displacement degrees of freedom ňe(j) K and ňe(j) p = ňe(j) S for the prescribed degrees of freedom.Since the periodic displacements are applied on the surface of the RVE, the internal (within the volume of the RVE) displacement degrees of freedom are defined as ǔ e(j) V ∈ R ňe(j) V , where ňe(j) u = ňe(j) V + ňe(j) M holds.Accordingly, the decomposition of the nodal displacement degrees of freedom and the discretized local equilibrium Equation ( 12) on the microscale ǔ e(j) is obtained.
The connection between the macro-and the microscale is achieved with macroscale displacements u a (t) and microscale displacements ǔ e(j) a (t) by specifying constraint with Č e(j) a , and Ǎe(j . For the case of periodic displacement boundary conditions, matrices Ǎe( Ǎe(j) Ȟ e(j) Ȟ e(j) and Ǎe(j) where matrices Ȟ e(j) M and Ȟ e(j) S are link-topology matrices that only contain the numbers 0, +1, −1.P e(j) ∈ R 6× ňe(j) K is a matrix that comprises the differences in the corresponding nodal positions.Constraint ( 14) can be reformulated with M e(j) = Ȟ e(j)−1 Constraint ( 16) is, again, enforced with the microscale Lagrange multipliers λe(j) ∈ R ňe(j) K .
With decomposition ( 13), the microscale strain vector Ěě( ǰ) ∈ R 6 of microscale element ě and integration point ǰ in dependence of the macroscale strains (9) reads where we abbreviate ǔ e(j) = ǔ e(j) V ǔ e(j) ǔ e(j) ∈ R ňe(j) u denotes the assembled displacement vector, while Ž ě ∈ R In contrast to the macroscale, on the microscale, constitutive models are applied to describe the mechanical behavior of the materials.Since in this contribution, only elastic materials are studied, the constitutive model has the general form of i.e., the evaluation of algebraic equations yields the stresses Ťě( ǰ) ∈ R 6 at the microscale integration points, which are already contained in Equation (12).What remains is the question of how to determine the required macroscale stress T e(j) from the microscale evaluation.Here, a homogenization procedure is applied that fits into the general form of In the case of periodic displacement boundary conditions, the secondary displacements ǔ e(j) S can be expressed by constraint ( 16) Further, enforcing constraint ( 16) yields, on the RVE level, This allows computation of microscale Lagrange multipliers λe(j) : As a result, the homogenized stresses (20) on macroscale element e and integration point j read for periodic displacement boundary conditions: P e(j) Ȟ e(j)−T S ǧ e(j) S (t, u a , ǔ e(j) V , ǔ e(j) M ).
2.1.3.General System of Non-Linear Equations The entire system of equations of FE 2 computations with non-linear elastic material at the microscale level is obtained by formally assembling all independent variables of the RVEs: K , as well as equations Čc (u a , ǔa ) = The entire system of non-linear equations is obtained by compiling macroscale Equations ( 10) and (11) and microscale Equations ( 26) and (27).The number of equations can be essentially reduced by assuming that constraint (10) is fulfilled after solving the entire system of non-linear equations and by employing Equation (23) on the microscale for periodic displacement boundary conditions.Further, Ǎ e(j) 1 , as given in Equation ( 15) 1 , can be applied.Then, the reduced system of non-linear equations, which is the result of the spatial discretization, has to be solved at each load-step (time-step) with the vector of unknowns In the equations mentioned above, ǔV and ǔM are the vectors of assembled internal microscale nodal displacements ǔ e(j) V and primary nodal displacements ǔ e(j) M , respectively.
Remark 2. Further, an important aspect is the time discretization.Since only non-linear elastic material is studied here, Equation (28) represents a purely algebraic system of equations.Nevertheless, time integration methods, such as the Backward-Euler method, can be applied when formally extending Equation (28) with ṫ = 1 to obtain a system of differential-algebraic equations (DAE), as it is common in finite element computations, where the inelastic material behavior is described by evolution equations for some internal variables [76].As a result, the application of time integration methods to elastic problems leads to an incremental application of the prescribed loads, which is therefore achieved in the numerical examples of this work.In the case of non-linear elastic material behavior, the load is often applied step-wise, where the previous solution of the nodal displacements is inserted into some Newton-like scheme as starting vector to be close to the solution.Otherwise, problems in the convergence of the iterative scheme are observed.In this sense, the step-wise increase in the load (displacement-or force-controlled) can be interpreted as time integration.

Multilevel-Newton Algorithm
What remains is the question of how the system of non-linear Equation ( 28) is solved in multiscale simulations.Further, to sufficiently embed DNN surrogates, it is important to make clear which parts of the overall computation scheme can be substituted with minimal changes in a finite element program, as it is discussed later on.
There are different approaches to solve the system of non-linear Equation (28).First, the entire system of equations could be solved with the Newton-Raphson method, but this would only be possible for smaller problems due to the extremely large number of equations.An alternative-see [76]-would be to use the Newton-Schur complement, which, on the one hand, requires some intervention in the coding and provision of the derivatives [77].In traditional approaches, on the other hand, one uses the MLNA considering periodic boundary conditions for the microstructures.Therefore, in order to see what advantage neural networks have here, the Multilevel-Newton algorithm (MLNA) approach is briefly explained.
Since the MLNA, which was originally introduced by [78,79], is frequently applied, we refer especially to [76] regarding the differences between the MLNA and the wellknown Newton-Raphson scheme.Thus, only the required equations are recapped here to show the general algorithmic structure of FE 2 computations and the incorporation of DNN surrogate models.
If we interpret the incremental load-control as time integration, the non-linear system (28) has to be evaluated at time t n+1 , t n+1 = t n + ∆t n .Thus, in each time-step, the unknown microscale displacements ǔ = { ǔV , ǔM } T , see Equation (18) 1 , and the unknown macroscale displacements u are sought.For further treatment of the equations, we also introduce decomposition Thus, the system of non-linear equations has to be solved.

Multilevel-Newton Algorithm for FE 2 Computations
In the traditional manner, the MLNA is applied to solve Equation (31).The scheme draws on the implicit function theorem, i.e., it is assumed that function ǔ = û(u) exists.In other words, G(u, û(u)) = 0 (32) has to be solved.The Newton-Raphson method applied to the non-linear system (32) requires in each iteration step the computation of linear system ∂G ∂u Here, the iteration index is omitted for brevity.Quantities d û/du and ǔ = û(u) have to be provided by two additional computational steps, since û(u) is assumed to exist, but its representation is unknown.First, is evaluated for a given u, and second, the chain-rule is applied to The entire procedure is shown in Algorithm 1.
In greater detail and with the problem at hand, we proceed as follows.On a microscale, the system of non-linear equations has to be be solved for the case of periodic displacement boundary conditions, with Here, Ž ě still has representation (18) 2 , which then leads to the second and third equations of Equation (28).For purely elastic problems, the solution of Equation ( 36) leads to a linear system on global microscale level within the Newton-iteration step to solve Equation (34), which reads, in detail, as with the vector of unknowns y according to Equation (29) 1 .We apply another relationship between microscale displacements ǔ e(j) and the assembled microscale displacements ǔ, ǔ e(j) = Z e(j) ǔ ǔ.The system of linear Equation ( 38) can be re-written employing the chain rule and applying the decomposition into the macroscale integration point contributions, i.e., contributions of each RVE, ∂ ǔ e(j) ∆ ǔ e(j) + Ǧ e(j) (u, ǔ e(j) ) = 0.
In other words, the solution of system ( 43) is repeated until a local convergence criterion is fulfilled.Then, the microscale displacements ǔ are obtained.
As the next step, the macroscale consistent linearization (35) implies These matrices read under consideration of Equations ( 9), ( 17) and ( 42): d Êe(j) du where dE e(j) = − Ǩ e(j) Ȟ e(j)−1 S P e(j)T (48) in order to compute d ǔ e(j) /dE e(j) and, finally, d û/du.Therewith, the local macroscale computations consisting of the two steps, the global microscale level and the macroscale consistent linearization, are finalized.As a last step in the MLNA, the system of linear equations ∂G ∂u see Equation ( 33), has to be solved at a global macroscale level to compute the increment ∆u of the macroscale displacements.Here, G is the discretized weak formulation of the equilibrium equation at the macroscale; see Equation (7), w j B e(j)T he(j) (t n+1 , u, ǔ e(j) ) T e(j) det J e(j)    − p(t n+1 ) = 0, (50) in dependence of the homogenized stress state T e(j) from Equation (20).Analogously to Equation ( 41), we define the global tangential stiffness matrix = K ∈ R n u ×n u , where Equation ( 20) is already employed in Equation (52).Again, the derivatives can be re-formulated applying the chain rule and the microscale element stiffness matrix ǩě from Equation (42): Ȟ e(j)−1 d û e(j) with Ǩ e(j) ∈ R ňe(j) S × ňe(j) S .Inserting Equations ( 53) and ( 54) into Equation ( 52), the global tangential stiffness matrix reads Here, C e(j) ∈ R 6×6 denotes the consistent tangent matrix at integration point j of the macroscale having the representation P e(j) Ȟ e(j)−T S Ǩ e(j) − Ǩ e(j)T Ǩ e(j)−1 Ǩ e(j) Ȟ e(j)−1 S P e(j)T (56) with the matrices in Equations ( 47) and ( 53) and the application of Equation (48).

Newton Algorithm for FE 2 Computations with DNN Surrogate Models
Obviously, the computations of the non-linear system (34) and the linear system with several right-hand sides (35) within an iterative scheme are very time consuming.Thus, an alternative approach is of particular interest.The basic idea is that the recurrent stress and tangent calculation is learned by a deep neural network and, thus, an efficient evaluation can be achieved.Since we embed DNN surrogate models to accelerate the FE 2 computation, it is important to make clear which quantities are applied as input and as output and what parts of the aforementioned MLNA are replaced by the surrogate.
As mentioned in the introduction, many different architectures of neural networks exist and are regularly applied in the field of computational mechanics.In this work, we draw on common feedforward neural networks to replace the entire local macroscale computations in Algorithm 1.The particular details for the neural networks are given later on; hence, we focus here on the algorithmic structure.During the solution of the boundary value problem, macroscale strains E e(j) (t n+1 , u(t n+1 )) at integration point j of macroscale element e are provided in dependence of the macroscale displacements u by Equation (9).As explained beforehand, in FE 2 computations, strains E e(j) serve as an input to compute the displacement boundary conditions (16) under the assumption of periodic displacement degrees of freedom.Thus, strains E e(j) are input quantities for the DNN surrogate models to predict the homogenized stresses T e(j) and the consistent tangent matrix C e(j) with the two surrogate models T and C, T e(j) ≈ T (E e(j) (u); θ T ) and C e(j) ≈ C(E e(j) (u); θ C ), (57) where θ T and θ C are the parameters concerned of the surrogate models.Here, the notation of the surrogate models T and C indicates that the models are evaluated for the strains E e(j) while parameters θ T and θ C are assumed to be given after sufficient training of the neural network.At first, we introduce two different surrogate models for the stress and consistent tangent prediction.Later on, different realizations of the surrogate models are discussed as well.
The predicted stresses T e(j) are employed to evaluate the local equilibrium Equation ( 50), here, of course, without being dependent on ǔ, Again, we omit the iteration indices and the load-step index n + 1 for brevity.The predicted consistent tangent matrices C e(j) are assembled into the global stiffness matrix K according to Equation ( 55), As a result, when following the DNN-FE 2 approach for multiscale FE 2 computations, only the solution of the linear system of equations is necessary on a global macroscale level in each iteration.The entire Newton algorithm for FE 2 computations with DNN surrogate models and non-linear elastic material on a microscale is provided in Algorithm 2.
Algorithm 2: Newton algorithm for FE 2 computations following the DNN-FE 2 approach.
Given: starting vector estimation u (0) ; surrogate parameters θ T and θ C Repeat α = 0, . . .local (macroscale) level; given: u (α) evaluate DNN surrogates for macroscale integration point j of element e T e(j) ≈ T (E e(j) (u (α) ); θ T ) C e(j) ≈ C(E e(j) (u (α) ); θ C ) global (macroscale) level solve linear system of equations It should be emphasized that the explained algorithm for DNN-FE 2 simulations in Algorithm 2 only holds for elastic problems.The mapping between macroscale strains and homogenized stress and consistent tangent changes essentially when applying viscous or path-dependent materials, such as plasticity or viscoplasticity, which is not discussed here.

Deep Neural Networks
In this section, a brief introduction is provided to deep neural networks and stateof-the-art frameworks for the implementation of these learning methodologies.First, a fully connected deep neural network-also known as a multilayer perceptron (MLP)-is considered.An MLP consists of a consecutive repetition of so-called layers.Each layer contains a set of nodes, so-called neurons, which are densely connected to the nodes of the preceding and succeeding layers.A deep neural network (DNN) is a neural network with multiple layers between the input and output layers which are the so-called hidden layers.Data sample x in space χ ⊂ R n and the corresponding target output y in space ψ ⊂ R m are considered.Then, the objective of a deep neural network is to learn the mapping, F : χ → ψ, from the data by minimizing a scalar-valued loss function L(F (x; θ), y) for all the samples in the training data set, where θ ∈ R n θ represents the trainable parameters of the network.To this end, the data are processed through each layer i as where are the input and output of the ith layer with the number of neurons p (i) in the previous layer and q (i) neurons in the current layer.Further, represents a weighting matrix and b (i) ∈ R q (i) is the bias vector.ϕ (i) : R q (i) → R q (i) symbolizes the element-wise applied activation function in layer i. Parameters θ of the network are determined by applying a gradient-descent optimization technique for minimizing the loss function on the training data set.The updates of the parameters are obtained as ∆θ = η ∂L/∂θ where η denotes the learning rate.The gradient of the loss function with respect to the trainable parameters can be obtained using automatic differentiation (AD) [80].All the neural networks discussed in this study were developed applying machine learning software frameworks developed by Google Research called TensorFlow [81] and JAX [82].
Automatic differentiation (AD), also known as algorithmic differentiation or "auto-diff" (automatic differentiation), is a family of methods for evaluating the derivatives of numeric functions expressed as computer programs efficiently and accurately through the accumulation of values during code execution.AD has an extensive application in machine learning and also well-established use cases in computational fluid dynamics [83], atmospheric sciences [84], and engineering design optimization [85].In the field of computational solid mechanics, see [86] and the literature cited therein.The idea behind AD is to break down the function into its elementary operations and compute the derivative of each operation using symbolic rules of differentiation.This means that instead of relying on numerical approximations or finite differences to compute the derivative, AD can provide exact derivatives with machine precision.To do this, AD keeps track of the derivative values at each stage of the computation applying a technique called forward or reverse mode differentiation.This allows AD computing the derivative of the overall composition of the function by combining the derivatives of the constituent operations through the chain rule.The benefit of AD is that it can be applied to a wide range of computer programs, allowing for the efficient and accurate computation of derivatives.This makes it a powerful tool for scientific computing, optimization, and machine learning, where derivatives are needed for tasks such as gradient descent, optimization, and training of neural networks.AD techniques include forward and reverse accumulation modes.Forward-mode AD is efficient for functions f : R → R m , while for cases f : R n → R m where n m, AD in its reverse accumulation mode is preferred [80].For state-of-the-art deep learning models, n can be as large as millions or billions.In this research work, we utilized reverse-mode AD for the training of the neural networks and also for obtaining the Jacobian of the outputs with respect to the inputs.This should be demonstrated for both applied frameworks, TensorFlow and JAX.If one considers a batch of input vectors x and the corresponding outputs y, then the Jacobian matrix J can be easily computed in batch mode using AD via TensorFlow and JAX according to Algorithm 3.

Algorithm 3:
Computing the Jacobian matrix J of function f via reverse mode AD in TensorFlow and JAX frameworks for a batch of samples x.

Deep Neural Networks as Surrogate Models for Local RVE Computations
In the MLNA described in Section 2.2.1, the computations on local macroscale level are very expensive to perform.Thus, the objective is to develop a data-driven surrogate model for substituting the local macroscale computations with deep neural networks.To this end, the macroscale strains E e(j) (t n+1 ) at each integration point ξ j and time (load-step) t n+1 are taken as the input and macroscale stresses T e(j) (t n+1 ) and the consistent tangent matrix C e(j) (t n+1 ) are provided as the output of the surrogate model.In the following, the FE 2 analysis is performed in a quasi-static setting with the restriction to small strains.For the sake of simplicity of the notation and for a two-dimensional set-up, we refer to the input of the surrogate model as E = {E 11 , E 22 , E 12 } T , E ∈ R 3 , and to the outputs as Here, it should be mentioned that we do not employ the symmetry of the consistent tangent matrix due to the application of AD, where we compute the Jacobian matrix of the neural network containing the partial derivatives of each element of T with respect to each element of the input E. Thus, we apply a soft symmetry constraint to the Jacobian matrix of the neural network by the data.
The inputs of the surrogate model are computed using an MPI (message passing interface) parallelized FORTRAN code.We employ FORPy [87], a library for FORTRAN-Python interoperability, to perform the data communications between FORTRAN and Python codes in an efficient and parallel manner.In particular, we load the required Python libraries and the DNN models only once and conduct the RVE computations in parallel, which leads to a considerable speed-up.The obtained outputs from the RVE surrogate model are passed to the FORTRAN code for further computations.

Training and Validation Datasets
Since the FE 2 framework in Section 2 is derived for the case of small strains, we consider a domain of application for our surrogate model with the upper and lower bounds of E i,min = −0.04 and E i,max = 0.04, respectively, where E i represents the ith component of the strain input E. A dataset is generated by imposing different strain inputs to an RVE and computing the corresponding stress components and consistent tangent matrices.We utilized Latin hypercube sampling (LHS) [88] to efficiently sample from the input space.To generate the data, we consider two global symmetries in the input space: It is important to mention that the assumed symmetries can be employed as long as the materials in the RVE show no tension-compression asymmetry, anisotropy, or rate-or path-dependent behavior, i.e., the reduction in the input space is not applicable for more complex behavior such as plasticity or anisotropic behavior.Thus, the data are generated using the numerical solver for one quarter of the input space according to the region marked by blue in Figure 1.62) and ( 63), are marked using pink and green colors, respectively.Blue dots show a subset of the sampled data points using LHS.
The dataset is augmented by transforming the generated data through the aforementioned global symmetries ( 62) and ( 63).This leads to a reduction in the computation time required for the preparation of the training data.After generating the dataset, it is decomposed into 80% for training and 20% for validation.Later on, in Section 4.2, the effect of the size of the training dataset on the accuracy of the final solution obtained from the DNN-based FE 2 simulation is investigated.It should be noted that the DNN models are tested by conducting DNN-FE 2 simulations and comparing the obtained solutions with those of the reference FE 2 simulations.

Architecture and Training Process
As it was mentioned in Sections 2.2 and 3.1, the surrogate model for the local RVE takes E ∈ R 3 as the input and predicts T ∈ R 3 and C ∈ R 9 as the outputs.The obtained stress components and the consistent tangent matrix are assembled at the global finite element level for computation of the next global iteration in the Newton-Raphson method.In this work, two model architectures for our DNN-based surrogate models are considered, both are developed based on MLPs.In the first architecture, two separate neural networks T and C are implemented that map E to T and C, where θ T and θ C represent the trainable parameters of deep neural network, T and C, respectively.The notation in use indicates that the deep neural networks are evaluated for strain inputs E with the given parameters after training the neural network.Throughout the article, this architecture is denoted as NN-2.However, this architecture does not consider that the consistent tangent matrix is the functional matrix of the stress components with respect to the strains, where T i and E j are the corresponding entries in T and E, respectively.Thus, this is taken into account in the second architecture by computing C as the output of the Jacobian function T as where T is obtained by applying reverse mode AD on the deep neural network surrogate T , which is parameterized with trainable parameters θ T .This architecture is denoted as NN-AD.Moreover, this approach is known as the so-called Sobolev training [61] in which both the target and its derivative with respect to the input are considered for supervised learning.Particular explanations regarding the application of Sobolev training in multiscale simulations are provided in [63], while the method is also employed in [60].By optimizing the parameters of neural networks to approximate not only the function's outputs but also the function's derivatives, the model can encode additional information about the target function within its parameters.Therefore, the quality of the predictions, the data efficiency, and generalization capabilities of the learned neural network can be improved.
In the following, we provide a detailed discussion on the data pre-processing, model training, model selection, and hyperparameter tuning.

Data Pre-Processing
We perform a standardization step on both input and outputs of the model to obtain efficient training of the networks using the statistics of the training dataset.A training dataset D train = [ E, T, C ] train is considered with its mean and standard deviation over the samples as vectors µ and σ, respectively.For training of the NN-2 model, the input and the outputs are standardized independently with their means and standard deviations, where V i represents the ith component of E or T with the mean and standard deviation µ i and σ i , respectively.In contrast, for the NN-AD model, the consistent tangent matrix C should be scaled consistently with the scaling of E and T so that their relationship is preserved.Therefore, scaling ( 67) is performed for the NN-AD model for the strains and stresses, while the components of C are scaled as

Training
In the following, a detailed discussion of the training process of all DNNs implemented in this research work is provided.We utilize an extended version of the stochastic gradient descent algorithm, known as Adam [89], for optimizing the parameters of the network during the training process.The weights and biases of the DNN are initialized using the Glorot uniform algorithm [90] and zero initialization, respectively.All the neural networks are trained for 4000 epochs with an exponential decay of learning rate of η = η initial γ (current step/decay step) , (69) where η represents the learning rate.η initial = 10 −3 is the initial learning rate and γ = 0.1 is the decay rate.Here, a decay step of 1000 is employed.The decay of the learning rate is applied according to Equation ( 69) every 1000 epochs to obtain a staircase behavior.
For different sizes of the training dataset, the batch size is set such that 100 batches are obtained in order to have the same number of training updates for different sizes of the training dataset.The mean squared error (MSE) is utilized as the loss function.For the NN-2 model, the loss for a sample can be obtained as where is the corresponding prediction.The loss for a data sample for the NN-AD architecture is computed as where α and β are the weighting coefficients for the two components of the loss.In all the cases, the loss for a batch of data is calculated by taking the average of the per-sample losses in the batch.

Model Selection
During the training process of each model, we track the validation loss and save the parameters of the model which lead to the lowest validation loss as the best model parameters.This helps to avoid overfitting of our deep neural networks.As mentioned earlier, the data are decomposed randomly into 80% for training and 20% for validation.

Hyperparameter Tuning
Hyperparameters in machine learning are the parameters that are defined by the user.Their values are set before starting the learning process of the model, such as number of neurons and hidden layers.The values of the hyperparameters remain unchanged during the training process and the following prediction.In machine learning applications, it is important to set the hyperparameters of a model such that the best performance is obtained regarding both prediction and generalization.Here, we perform hyperparameter tuning using a simple grid search algorithm to optimize the model performance on the validation dataset.For this experiment, a dataset with the size of N D = 10 5 is selected.We investigate three hyperparameters, i.e., the number of hidden layers N h , the number of neurons per each hidden layer N n , and the activation function ϕ, and carry out the hyperparameter tuning for the NN-2 architecture.Further, the same hyperparameters are employed for the NN-AD architecture for the sake of comparability.Moreover, for the NN-AD model, the weighting coefficients of the two components of the loss, α and β, are studied as well.
Results of the hyperparameter tuning for the number of hidden layers N h , the number of neurons per each hidden layer N n , and the activation function ϕ are reported in Appendix A. We observe that a model with eight hidden layers, 128 neurons per each hidden layer, and a swish activation function leads to L val T = 3.52 × 10 −8 and L val C = 2.84 × 10 −7 .Moreover, our results show that increasing the model complexity to more than the aforementioned values would not lead to a significant gain in the model accuracy.Thus, we select these model hyperparameters for further analysis.
Other hyperparameters for the NN-AD model are the weighting coefficients α and β of the components of the loss, i.e., L T and L T .Here, the effect of the weighting on the obtained validation losses is investigated.The results are reported in Table 1.We choose α = 1 for all the cases and change β from 0.01 to 100.It can be observed that having a small β may lead to an imbalanced training where a difference of almost two orders of magnitude between validation losses L T and L T exists.However, a β of one or larger results in a more balanced training leading to validation losses with nearly the same scale.According to the results of this study, we select a model with α = 1 and β = 100 for further analysis.

Numerical Experiments
In this section, the DNN-FE 2 approach for the simulation of two canonical test cases in computational solid mechanics is investigated, i.e., an L-profile and Cook's membrane, and compare the results with those of FE 2 reference simulation.The numerical experiments are performed using both architectures, NN-2 as well as NN-AD, to provide a detailed discussion regarding the accuracy and efficiency of the simulations.The results are reported for the accuracy of the simulations, required time for model development (e.g., computational time needed for training), time of numerical simulation, number of load steps, and the total number of global iterations required for reaching the convergence of the FE 2 simulation.To this end, the absolute percentage error is utilized, where V indicates any component of stress or strain tensors T and E, respectively.To avoid the division by zero, the absolute mean of the reference solution on the global grid is employed as the denominator, where • shows the ensemble average.All DNN models are trained on an NVIDIA RTX A2000 Laptop GPU with CUDA 12.1.The DNN-FE 2 simulations are performed on an 11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz CPU with 16 threads.In contrast, the FE 2 reference simulations are conducted on a second-Gen Intel(R) Xeon(R) Silver 4216 @ 2.10GHz CPU with 16 processes and one thread per process.The speed-up gain is calculated by dividing the total time of computation required by the FE 2 reference simulation by that of the DNN-FE 2 simulation.

Problem Setup
For the numerical experiments, we restrict ourselves to two-dimensional test cases, where a plane strain case is always assumed.The representative volume element (RVE) under consideration is chosen as a commonly applied geometry in the mechanical analysis of composite materials; see Figure 2. Here, the fiber volume fraction of the applied RVE is approximately 55%.The fibers are assumed to behave linearly, in an isotropic elastic manner with bulk modulus K f and shear modulus G f ; see Table 2.
Table 2. Material parameters for elastic fiber and non-linear elastic matrix material in the RVE.
In contrast, the matrix material is modeled with a non-linear elastic material behavior, which is extracted from an originally viscoplastic constitutive model where the shear modulus is deformation-depending; see [91].The particular stress-strain relation reads The material parameters for the non-linear elastic material are the bulk modulus K m and the parameters α 1 and α 2 ; see Table 2 The spatial discretization is achieved with eight-noded quadrilateral elements.As a result, n e G = 9 integration points are present in each macroscale element, which means that n e n e G calls of the RVE in Figure 2 are necessary in each global Newton iteration of a FE 2 computation.The L-profile is spatially discretized with n e = 200 elements and n nodes = 709 nodes.For the Cook's membrane, n e = 600 elements and n nodes = 1901 nodes are used.The L-profile has a prescribed displacement boundary condition on the top right edge with ū2 (t) = −3 mm s −1 t.In contrast, the Cook's membrane is fixed on the left edge and has an applied displacement boundary condition ū2 (t) = 2 mm s −1 t on the right edge.
The initial time-step size of both numerical examples is ∆t 0 = 1 × 10 −3 s, whereas the simulation is performed for t ∈ [0, 1].The time discretization is achieved with the Backward-Euler method; see also Remark 1 regarding the time discretization for purely elastic problems.Here, the time-step size ∆t is not fixed but chosen based on the number of Newton iterations N iter and the time-step size of the current step ∆t n , In this work, the quantities f max = 1.2 and f min = 0.3 are chosen.The termination criteria of the global Newton iteration are applied as It should be mentioned that the applied tolerance values are rarely reported in current literature to DNN-FE coupling in multiscale applications, which makes it difficult to draw comparisons.In this work, tolerances tol u = 1 × 10 −6 and tol G = 1 × 10 −3 are chosen.

Investigation on the Size of Dataset
Next, the effect of increasing the size of the dataset on the performance of the DNN models as well as the efficiency and accuracy of the DNN-FE 2 simulations are assessed.Different sizes of datasets, i.e., N D ∈ {10 3 , 10 4 , 10 5 , 10 6 , 4 × 10 6 }, are considered, while each dataset is generated according to the explanations in Section 3.2.Note that in all the cases, 80% of the samples in the dataset are used for training and 20% for validation.Results are reported for both NN-2 and NN-AD architectures for having a comprehensive comparison.It should also be noted that the efficient size of the dataset depends on the complexity of the model and the mapping that must be learned.Here, results are reported for NN architectures containing eight hidden layers with 128 neurons per each hidden layer and swish as the activation function, which are selected based on the results of hyperparameter tuning; see Section 3.3.4. Figure 4 illustrates the lowest L val T obtained during the training of the DNNs using different sizes of the dataset.It can be seen that increasing the size of the dataset from 10 3 to 10 5 leads to a significant reduction in L val T .However, improvements in the performance of the model are not significant when further increasing N D .We also report the required time of training in Figure 4.The training time t train is normalized by the computational time t FE 2 comp,Cook needed for FE 2 simulation of the Cook's membrane, to offer an insight into the cost of developing an NN-based surrogate model for the RVE in comparison with the FE 2 reference simulation.As it is expected, increasing N D results in an increase in the required training time.It can be observed that even for the largest dataset (N D = 4 × 10 6 ), the training time is much shorter than the computational time of the FE 2 simulation.For the dataset with N D of 10 5 , only 1.39% of the computational time of the FE 2 simulation is needed for the training of the NN-AD model.

Numerical Simulations
The numerical results obtained from DNN-FE 2 simulations are reported in this section for the L-profile and Cook's membrane test cases and compared with the FE 2 reference simulations regarding accuracy and efficiency.The accuracy of the simulations is estimated by computing the mean and standard deviation of the absolute percentage error, defined in Equation ( 74), mean and std , respectively, over all the components of E and T and over all the global grid points.The speed-up gain is computed by dividing the total time of computation for the FE 2 reference simulation by that of the DNN-FE 2 simulation.Results are reported for different sizes of the dataset N D and for both architectures NN-2 as well as NN-AD architectures.We refer to the models trained on datasets with different sizes as model- * where * × 10 3 shows the size of the dataset.

L-Profile
Results for the simulation of the L-profile test case are summarized in Table 3.It is evident that both DNN surrogate models, NN-2-1 and NN-AD-1, which were trained on a dataset with only 1000 samples, are accurate enough for achieving the convergence of the FE 2 simulation.This is also the case for the NN-2-10 model leading to mean and std of 3.20% and 4.87%, respectively.Employing the NN-AD-10 model as the DNN surrogate model leads to the convergence of the simulation and provides very accurate results with mean and std of 0.42% and 0.69%, respectively.All the DNN-based models trained on larger datasets lead to the convergence of the simulation.Our results show the superior performance of the NN-AD models in comparison with the NN-2 models in all the numerical experiments.For instance, the NN-2-100 model results in mean and std of 2.48% and 3.68%, respectively, while the NN-AD-100 model provides more accurate results with mean and std of 0.21% and 0.40%, respectively.Moreover, we observe that the NN-AD architecture is more efficient regarding the required size of the dataset where the NN-AD-10 model performs better than the NN-2-4000 model.
Apart from accuracy aspects, the speed-up gain obtained from the DNN-FE 2 simulation is of particular interest.In Table 3, it can be observed that a speed-up of 400× can be obtained from the NN-AD-10 model.This huge speed-up gain shows the excellent potential of the DNN-FE 2 approach for fast and accurate multiscale simulations of solid materials.Our results show that the NN-AD-4000 model leads to the best performance regarding the accuracy, speed-up, and the required number of iterations.In general, we can observe that the NN-AD architecture is more efficient than the NN-2 architecture regarding the speed-up gain where, for instance, the NN-AD-100 model obtains a speed-up of 462× against 254× of the NN-2-100 model.The models of both architectures require a quite similar number of time-steps N t , which are here load-steps.The lesser number of load-steps for the NN-AD architecture results from the fewer number of Newton iterations, which leads to slightly higher load-step sizes according to Equation (76).Moreover, it is evident that the speed-up of the NN-AD architecture is higher than for the NN-2 architecture.This is caused, on the one hand, by the lesser number of global Newton iterations N iter because of the higher prediction accuracy of the consistent tangent matrix.On the other hand, in our implementation, the NN-AD model consisting of one feedforward neural network and the backpropagation step for AD is faster to evaluate than the NN-2 model, which comprises two different feedforward neural networks.
Figures 5 and 6 depict the results obtained from the DNN-FE 2 simulation using the NN-AD-100 model as the DNN surrogate model in comparison with that of the FE 2 reference simulation for all the components of E and T, respectively.The reference solution is illustrated on the left panel, the DNN-FE 2 solution is in the middle, and the absolute percentage error is on the right.It can be observed in Figure 5 that for the normal components of the strain tensor, E 11 and E 22 , a maximum absolute percentage error of 1.03% is achieved which shows excellent performance of our DNN-FE 2 approach.For the shear strain E 12 , the error is slightly higher where a maximum absolute percentage error of 4.38% is obtained.The same conclusion can be drawn from Figure 6, where for the normal components of the stress tensor T 11 and T 22 , the maximum percentage errors are 2.12% and 1.02%, respectively, while for the shear stress, T 12 , the maximum percentage error is slightly higher and is equal to 4.26%.
Moreover, Figure 7 shows the distribution of the absolute percentage error in the solution for the components of T (up) and E (bottom) obtained from the DNN-FE 2 simulation using the NN-AD-100 model on all the integration points of the L-profile test case.The green area and the marked percentage indicate the samples with less than 1% of error and their population proportion.We observe that excellent simulation results are obtained where an absolute percentage error of less than 1% is acquired for most of the samples.For instance, it can be observed that 94.06% of the samples have an error of less than 1% in the solution for the shear stress T 12 .

Cook's Membrane
Furthermore, we apply the DNN-FE 2 approach for the simulation of Cook's membrane test case and compare the obtained solution with that of the reference FE 2 .Table 4 summarizes the results for models based on the NN-2 and NN-AD architectures trained on datasets with different sizes.The NN-2-1 and NN-AD-1 models provide a converged solution with less than 1% of error.This is also the case for the NN-2-10 and NN-2-100.Utilizing NN-AD-10 and NN-AD-100 models leads to the convergence of the simulation with excellent accuracy.We obtain mean and std of 0.02% using the NN-AD-100 model, which shows the excellent capability of the NN-AD architecture for surrogate modeling of the local macroscale computations.The results show that the NN-AD architecture outperforms the NN-2 architecture in all the tests where, similar to the results obtained for the L-profile, the errors mean and std are almost an order of magnitude lower.
The results are also reported regarding the computational efficiency of the proposed framework in Table 4 for the second numerical experiment of Cook's membrane; see Figure 3B.It can be observed that the NN-AD-10 model obtains a speed-up gain of 542× for this example.The results show that increasing the size of the dataset leads to a more efficient simulation with a lesser number of iterations and lower computational time.The NN-AD-4000 model provides a speed-up of 611× and leads to the convergence of the simulation in 30 time-steps.Compared to the L-profile test case, higher speed-up gain is achieved for the Cook's membrane.This is due to the fact that the number of elements, which require microscale computations, is three times higher than that of the L-profile.This suggests that utilizing the DNN-FE 2 approach for more expensive computations, e.g., three-dimensional problems, could even lead to a higher speed-up gain.
Figures 8 and 9 show the results obtained from the DNN-FE 2 simulation employing the NN-AD-100 model as the DNN surrogate model in comparison with that of the FE 2 reference simulation for all the components of E and T, respectively.It can be observed that for all the components of strain and stress tensors, very accurate results can be obtained.For normal strains E 11 and E 22 and the shear strain E 12 , the maximum absolute percentage errors are 0.14%, 0.16%, and 0.18%, respectively.Moreover, for the stress components T 11 , T 22 , and T 12 the maximum errors are equal to 0.23%, 0.38%, and 0.10%.We also report the distribution of the absolute percentage errors over all the integration points for Cook's membrane test case in Figure 10.
For this case, the green area and the marked percentage indicate the samples with less than 0.1% of error and their population proportion.It can be seen that for most of the sample points, an error of less than 0.1% has been obtained.Our results show the excellent capability of the NN-AD models for very accurate and efficient FE 2 simulations.

Load-Step Size Behavior
Since FE 2 computations usually require small initial time-step sizes ∆t 0 , the application of certain time-step control schemes is reasonable.In general, the determination of the time-step size ∆t new for the following time-step depending on the current time-step size ∆t n can be achieved via different methods.In this contribution, we consider the number of (global) Newton iterations N iter for the load step-size control; see Equation (76).
The step-size behavior, when using a step-size control based on global Newton iterations, is shown for our two numerical experiments in Figure 11.For the L-profile, the overall step-size behavior is quite similar for both the FE 2 reference simulation and with embedding the DNN surrogate model NN-AD-100; see Figure 11A.However, it is evident that the initial time-step size, which is chosen as ∆t 0 = 10 −3 s, is suitable for the surrogate model, but not for the FE 2 reference simulation as the time-step size is rejected multiple times and convergence is initially reached at ∆t = 9 × 10 −5 s.The rejections of the initial time-step size, which represents here the initially applied load-step, result from divergence at the global macroscale level.
A similar behavior is observed for the Cook's membrane and shown in Figure 11B.The first accepted time-step size is ∆t = 3 × 10 −4 s, whereas the DNN surrogate already shows convergence at ∆t 0 = 10 −3 s.In contrast to the L-profile, the step-size behavior shows significant differences.For the reference FE 2 simulation of the Cook's membrane, a certain limit exists, where the step size decreases because of failures in the local-level computations of the MLNA (RVE computations), i.e., the applied load leads to certain limitations in the step-size behavior of the RVEs.However, the DNN-FE 2 computation with the embedded DNN surrogate model is successfully converging even for higher step sizes.The different step-size behaviors for L-profile and Cook's membrane are obtained due to different magnitudes of the strains at each macroscale integration point that result from the loading conditions in Figure 3; see the results in Figures 5 and 8 as well.
As a result, the application of DNN surrogate models is not only possible for load-step size controlled FE 2 computations, but it also leads to certain advantages.On the one hand, higher initial load-step sizes are possible and, on the other hand, certain limitations in the load-step size can be overcome and thus larger step sizes can be applied compared to classical FE 2 computations.

Speed-Up with JAX and Just-in-Time Compilation
JAX [82] is a Python library developed by Google Research for high-performance numerical computing.It utilizes an updated version of Autograd [92] for automatic differentiation of native Python and NumPy functions.JAX supports reverse-mode differentiation as well as forward-mode differentiation, and the two can be composed arbitrarily to any order.Moreover, JAX uses XLA (accelerated linear algebra) [93] to compile and run NumPy programs on GPUs and TPUs (tensor processing units), which is performed by just-in-time (JIT) compilation and execution of the calls.JAX also allows just-in-time compilation of user-defined Python functions into XLA-optimized kernels using a one-function application programming interface, jit.Compilation and automatic differentiation can be composed arbitrarily, so one can express sophisticated algorithms and obtain maximal performance without leaving Python.These properties allow the implementation of our NN-AD architecture for RVE surrogate modeling efficiently using JAX.In the following, we discuss just-in-time compilation and its application in our DNN-FE 2 simulation framework in combination with FORPy [87].

Just-in-Time Compilation
Just-in-time (JIT) compilation is a technique used in modern programming languages to improve the performance of code execution at runtime.With JIT compilation, the code is compiled from a high-level language into machine code at the moment it is needed, rather than ahead of time.This allows a more efficient use of resources and can lead to significant performance improvements, especially for applications that require repeated execution of the same code.JIT compilers work by analyzing the code being executed and dynamically generating optimized machine code that is tailored to the specific execution context.In particular, when a program is executed, the JIT compiler analyzes the code being executed and identifies hot spots or sections of code that are frequently executed.These sections of code are then compiled into machine code and stored in memory for future use.The next time the same section of code is executed, the JIT compiler can use the pre-compiled machine code instead of interpreting the code again.This leads to significant performance improvements, as the program spends less time interpreting code and more time executing the machine code.
The JIT compiler in JAX is based on XLA, a domain-specific compiler that optimizes numerical computations for modern hardware architectures.With JAX, users can write a Python code that looks like a NumPy code but runs much faster on specialized hardware.This makes JAX an ideal library for scientific computing, machine learning, and other high-performance computing tasks.In addition to JIT compilation, JAX also provides tools for distributed computing and parallelization, making it a versatile library for a wide range of applications.

Speed-Up with JAX and JIT
JIT compilation is of interest in the DNN-FE 2 approach since a repeated execution of the surrogate model in every iteration and for every integration point occurs.Thus, the JIT compilation of the prediction function, which is called from the FORTRAN finite element code through FORPy, allows for more efficient use of resources and can lead to significant performance improvements.To this end, we developed the NN-AD-100 model using JAX and a neural network library and ecosystem for JAX called Flax [94].Further, the prediction function is compiled using jax.jittransformation.
The results are reported in Table 5, where we compare the computational efficiency of our TensorFlow and JAX implementations.It should be noted that we utilize the same set of hyperparameters and similar training processes for both implementations.It can be seen that the required time of training is shorter for the JAX implementation, and it is equal to 9.49 × 10 −3 of the computational time required for FE 2 simulation of the Cook's membrane.Moreover, we gain a significant speed-up from the JAX implementation in comparison with the TensorFlow implementation.The speed-up gain for the L-profile and Cook's membrane test cases are, respectively, equal to 4629× and 5853× for JAX and 462× and 554× for TensorFlow.To the best of the authors' knowledge, our JAX implementation provides the highest speed-up in the context of DNN-FE 2 simulations for non-linear elastic material behavior in the literature.

Conclusions
In the present work, a DNN-FE 2 approach is explained in detail to significantly accelerate multiscale FE 2 simulations.In general, the algorithmic structure of FE 2 computations is a Multilevel-Newton algorithm, even for the case of purely elastic material behavior without internal variables.The main source of computational costs are the local macroscale computations, which include the numerous computations of representative volume elements.Thus, in the DNN-FE 2 approach, we replace the local macroscale computations by drawing on a deep neural network surrogate model, which is very fast to evaluate after sufficient training.Here, it turns out that using automatic differentiation and Sobolev training to obtain the consistent tangent information is superior to an approach with two deep neural networks for the prediction of stresses and consistent tangent regarding data efficiency and prediction accuracy.Moreover, in step-size-controlled computations, the deep neural network surrogates are able to overcome certain step-size limitations of the FE 2 reference computations.For the Cook's membrane as a particular example in this contribution, we achieve a speed-up factor of more than 5000 compared to a FE 2 reference simulation when using just-in-time compilation techniques together with an efficient coupling between different programming codes using the FORPy library.The main advantage of the explained DNN-FE 2 approach is that it can be easily implemented to the existing finite element codes since just the evaluation of a surrogate model for each macroscale integration point has to be considered.

MPI Message passing interface LHS
Latin hypercube sampling XLA Accelerated linear algebra JIT Just-in-time

Appendix A. Hyperparameter Tuning
We conduct hyperparameter tuning using a grid search algorithm to optimize the performance of the model on the validation dataset.For this purpose, we employ a dataset with a size of N D = 10 5 .Our attention is directed towards three specific hyperparameters: the number of hidden layers N h , the number of neurons per hidden layer N n , and the choice of the activation function ϕ.We apply hyperparameter tuning to the NN-2 architecture.Results obtained from different sizes of the neural network (number of hidden layers N h × number of neurons per hidden layer N n ) with swish activation function are reported in Table A1.The training and the validation losses are reported for both mappings, T and C, which are selected during the training process as the best models based on the lowest validation loss.Also, the relative time of training t rel,train is outlined.It is evident that the lowest validation loss L val T = 4.96 × 10 −8 is obtained from a deep neural network with eight hidden layers and 128 neurons per hidden layer, where the corresponding validation loss L val C = 3.10 × 10 −7 is obtained.The results show that networks with a larger number of parameters lead to losses with the same order of magnitude and do not show a considerable improvement while requiring more time for training.Therefore, we select N h = 8 and N n = 128 for our analysis, see Table A1.
Moreover, Figure A1 illustrates the influence of the choice of activation function on the learning curves for both mappings, T and C. Results are reported for deep neural networks with N h = 8 and N n = 128.Similar influence in all the other cases with different sizes of the neural network can be observed.Further, it can be seen that the swish activation function performs better than sigmoid and tanh.It should be noted that rectified linear unit (ReLU) activation function is not used since the mapping requires to be continuously differentiable, especially for the NN-AD architecture.

Figure 1 .
Figure 1.Domain of application for the surrogate model.The two global symmetries in the input space, Equations (62) and (63), are marked using pink and green colors, respectively.Blue dots show a subset of the sampled data points using LHS.
T and L C indicate the loss for the mapping T and C, respectively.Here, T ref i denotes the reference stress value and T pred i is the prediction of the neural network.Accordingly, C ref i is the reference value in the consistent tangent and C pred i

1 1Figure 2 .
Figure 2. Geometry of the RVE (dimensions in mm) used as microstructure in the numerical experiments with fibers (grey) and matrix material (blue).

Figure 3 .
. The spatial discretization of the RVE is achieved with ňe(j) e = 3456 eight-noded quadrilateral elements and ňe(j) nodes = 10, 561 nodes.The application of DNN surrogate models in multiscale simulations is studied for two macroscale test cases-L-profile and Cook's membrane; see Figure 3. Spatial discretization and boundary conditions for macroscale test cases.(A) L-profile; (B) Cook's membrane.

Figure 4 .
Figure 4. Influence of the size of the dataset N D on training time t rel,train (78) and validation loss L val T (70) (dashed lines corresponds to NN-AD architecture and solid lines to NN-2 architecture).

Figure 5 .
Figure 5. Reference data (left) and results obtained from DNN-FE 2 simulation (middle) with NN-AD-100 model as well as error measure (74) (right) for the components of strain tensor E.

Figure 6 .
Figure 6.Reference data (left) and results obtained from DNN-FE 2 simulation (middle) with NN-AD-100 model as well as error measure (74) (right) for the components of stress tensor T.

Figure 7 .
Figure 7. Histograms of the error (74) for the L-profile when applying the NN-AD-100 model.The top and bottom panels illustrate the error for the components of the stress and strain tensors, T and E, respectively.

Figure 8 .
Figure 8. Reference data (left) and results obtained from DNN-FE 2 simulation (middle) with NN-AD-100 model as well as error measure (74) (right) for the components of strain tensor E.

Figure 9 .
Figure 9. Reference data (left) and results obtained from DNN-FE 2 simulation (middle) with NN-AD-100 model as well as error measure (74) (right) for the components of stress tensor T.

Figure 10 .
Figure 10.Histograms of the error (74) for the Cook's membrane when applying the NN-AD-100 model.The top and bottom panels illustrate the error for the components of the stress and strain tensors, T and E, respectively.

Figure 11 .
Step-size behavior of DNN-FE 2 simulations (red) with NN-AD-100 model and FE 2 reference simulation (blue), only accepted time-step sizes are shown.(A) L-profile; (B) Cook's membrane.

Figure A1 .
Figure A1.Influence of the choice of activation function on the learning process; L val T (left) and L val C (right).Results are reported for models containing 8 hidden layers and 128 neurons per each hidden layer.

Table 1 .
Effect of the weighting coefficients α and β for the components of the loss (72) on the performance of the NN-AD model.

Table 3 .
Results of the DNN-FE 2 simulation of the L-profile for different sizes of the training/validation dataset.

Table 4 .
Results of the DNN-FE 2 simulation of the Cook's membrane different for sizes of the training/validation dataset.

Table 5 .
Comparison of JAX and TensorFlow implementations of the surrogate model NN-AD-100 regarding the computational efficiency.

Table A1 .
Summary of the results obtained for training and validation losses and the required time of training for different sizes of the NN-2-100 model.Results are reported for models with swish activation function.