Parallel matrix-free higher-order finite element solvers for phase-field fracture problems

Phase-field fracture models lead to variational problems that can be written as a coupled variational equality and inequality system. Numerically, such problems can be treated with Galerkin finite elements and primal-dual active set methods. Specifically, low-order and high-order finite elements may be employed, where, for the latter, only few studies exist to date. The most time-consuming part in the discrete version of the primal-dual active set (semi-smooth Newton) algorithm consists in the solutions of changing linear systems arising at each semi-smooth Newton step. We propose a new parallel matrix-free monolithic multigrid preconditioner for these systems. We provide two numerical tests, and discuss the performance of the parallel solver proposed in the paper. Furthermore, we compare our new preconditioner with a block-AMG preconditioner available in the literature.


Introduction
Many applications require the solution of partial differential equations (PDEs). Frequently, these are solved by Finite Element Methods (FEM), or related approaches like Isogeometric Analysis, which discretize the continuous PDE. To obtain accurate solutions, we eventually need to solve huge linear systems of equations. This may become a challenging task since the computational effort increases rapidly. Hence, solvers that are able to handle large-scale linear systems are required. This gives rise to specialized solvers, that are adapted towards a specific PDE. In this work, we focus on efficient parallel solvers for problems in phase-field fracture (PFF) propagation.
The origin of crack modeling dates back to Griffith [23] in 1921, who proposed the first model for fractures in brittle materials. The phase-field approach we are using was developed by Francfort and Marigo [21], who put the fracture model in an energy minimization context. Therein, the originally lowdimension crack is approximated by means of a continuous phase-field function. The PFF model was extended in [45,6,3] by decomposing the stress terms into tensile and compressive parts. However, the physically correct way to do this splitting is disputed within the fracture community. Examples exist where either one or the other method is superior. There has been developed a multitude of further extensions of the original model, e.g., to pressure-driven fractures [46,56,58] in porous-media [44,47,26,18], multi-physics [43] and many more; see also the survey papers [13,57,52] and the monograph [55]. PFF problems have the advantage, that they reduce to a system of PDEs, which can be solved by adapting well-known strategies. On the other hand, a smooth approximation of an originally sharp fracture comes with some drawbacks, e.g., crack-length computation or interface boundary conditions along the fracture. There exist different approaches than PFF that allow treatment of sharp cracks, e.g., [8,41,1].
The numerical solution of PFF problems is particularly challenging, mainly due to the so-called noheal condition. This ensures that a crack does not regenerate itself. Mathematically, this leads to an additional variational inequality. The solution gets further complicated by the non-convexity of the underlying energy functionals. Many different approaches have been proposed in the literature to handle PFF; see, e.g., [14,53,42,27,35]. In this work, we use the primal-dual active-set method, a version of the semi-smooth Newton method [30], which was applied to PFF in [27].
The most time consuming part in the simulation of PDEs is the solution of the arising linear systems of equations. In the context of nonlinear PDEs, these appear after linearization within Newton's algorithm, or similar linearization approaches. Hence, multiple linear systems need to be solved per simulation step. For increasing problem sizes, this becomes a severe bottleneck, and well optimized solvers are required. In [33], we presented a solver based on the matrix-free framework of the C++ FEM library deal.II [2]. Further strategies treating the linear systems are presented in [20]. Matrix-free methods avoid storing the huge linear system, which can become a real issue in terms of memory consumption. The linear systems are solved using the Generalized Minimum Residual (GMRES) iterative solver using a geometric multigrid preconditioner. This setup does not need explicit knowledge about the matrix entries of the linear system, but only need to perform matrix-vector multiplications. These can be carried out without actually storing the matrix.
Matrix-free approaches are particularly favorable for high-polynomial shape-functions. In these cases, the number of non-zero entries within the matrix increases, rendering the storage of the matrix and sparsematrix-vector multiplication (SpMV) a lot more expensive. On the other hand, matrix-free methods do not suffer from an increase polynomial degree. These approaches originate from the field of spectral methods [48,19], where high-order ansatz functions are used.
One key ingredient to keep the computational time low is parallelization. In this work, we show the effects of different ways of parallelization. First, we are interested in the distributed solution of the whole problem. Therein, the computational work is split across multiple CPUs (ranks), each with its own independent memory. Necessary data exchange among CPUs is done using the Message Passing Interface (MPI). Using multiple CPUs gives us the possibility to fit larger problems into memory, as each core only stores small, almost independent parts of computational domain. All cores run in parallel, resulting in faster computations. Secondly, we explicitly utilize vector instruction sets provided by modern CPUs. These are special instructions for the CPU that are capable of performing multiple computations at once. This is more thoroughly described in Section 4.2.
In this paper, we study the performance of the AMG-based solver by [28] and the matrix-free geometric multigrid from our previous work [33]. Both approaches as well as the treatment of the nonlinearities are given in Section 3. In Section 6, we compare both approaches with respect to (wrt) computational time, memory consumption and parallel efficiency. Details on the parallelization are presented in Section 4. We focus on dependence on the polynomial degree of the finite elements. For studies regarding h-dependence, we refer to the respective papers. Section 5 describes the numerical examples and compares the results to those found in the literature to validate our implementations. Details on the phase-field model are briefly described in Section 2, for more information we refer to the corresponding literature. We also added a discussion on the derivative of the eigensystem in Section 3.5.

Preliminaries
In the following, we will repeat the basic notation used in [33] as well as the governing equations. Furthermore, we will introduce the primal-dual active-set method that can be interpreted as semi-smooth Newton method. Finally, we present the computation of the derivative of the eigensystem that plays a fundamental role in Miehe-type splittings.  Figure 1 shows the computational domain D ⊂ R d with an inscribed fracture C ⊂ R d−1 . The remaining undamaged parts of D are denoted by Ω. We use the phase-field approach from [14,4,5], where the lowerdimensional fracture is approximated by using an additional scalar function ϕ : D → [0, 1]. Values of 0 denote a completely fractured domain, whereas 1 represents undamaged parts. The size of the fracture approximation is controlled by the length-scale parameter ε.

Phase-Field Fracture Model
with additional displacement boundary conditions on ∂D.
The decay of the material stiffness is controlled by the degradation function g(ϕ). In this work, we use g(ϕ) := (1 − κ)ϕ 2 + κ, with a small regularization parameter κ 1 to avoid singularities in fractured domains. Different choices of g can be found in the literature; see, e.g., [40,14,49,22].
This leads to the extended energy functional Furthermore, fractures are only allowed to grow, i.e. ∂ t ϕ < 0. In our quasi-static formulation, this means that, for each loading step n, we need to enforce ϕ n ≤ ϕ n−1∀ x ∈ D. The first-order necessary condition (Euler-Lagrange system) of Section 2.1 is given by the variational inequality problem Problem 2.1 (Euler-Lagrange System). Find displacement u ∈ V and fracture ϕ ∈ W in such that with spaces V := H 1 0 (D) d , W := H 1 (D), W in := {w ∈ W : w ≤ ϕ old a.e.}; see e.g. [47]. Remark 2.2. In the literature, the phase-field approximation described here is also referred to as AT 2 model (named after Ambrosio/Tortorelli) introduced in [4] for the Mumford-Shah problem and the original variational fracture formulation [14]. Changing the last line in Problem 2.1 leads to another well-known model referred to as AT 1; see, e.g., [15]. However, in this work, we only consider the AT 2 model.

Treatment of the Variational Inequality Constraints
Solving such variational inequalities is a challenging task that admits many different solution techniques. Popular choices are penalization approaches, where one incorporates violations to the no-heal condition into the variational form, i.e. adding terms like γ(ϕ − ϕ n−1 ) + for large enough values of γ. Improvements as in [53,54] aim to determine γ adaptively, or provide lower bounds as in [22]. In [42], the inequality is replaced by a so-called history field H(x, t) := max{E + s (e(u))}, which again leads to an equality system to be solved.
In this work, we use the primal-dual active-set method for fractures as presented in [27]. In the next section, we continue with the description of the solution approaches used in our simulations, including details to the finite element version of the active-set algorithm mentioned above.

Discretization and Solution Algorithms
In this section we briefly discuss the discretization methods used. We start with a very short description of FEM. More details can be found in one of the standard textbooks. We continue with the key parts of our solution strategy: the primal-dual active-set method and the linear solvers used therein. In Section 3.5, we illustrate how the derivative of the eigenvector and eigenvalues is computed in the case of Miehe-type splitting.

Finite Element Discretization
In particular, we use the Finite Element Methods on quadrilateral/hexahedral meshes to obtain discrete spaces for displacement and phase-field variables. The shape functions are chosen to be globally continuous piecewise polynomials of degree p in every coordinate direction (Q p ). The code is implemented in C++ with the help of the FEM library deal.II [2]. We make particular use of the matrix-free geometric multigrid framework included, which is described in more detail in [34,32].

Primal-Dual Active-Set Algorithm
The idea of the active-set algorithm is to enforce the constraints on a set A, and perform a Newton step on the remaining dofs as outlined in Algorithm 1. The non-convexity of E leads to catastrophic convergence behavior of the active-set algorithm. To overcome this, we employ an extrapolation scheme as used in [27]. In the residual (1), this replaces the critical terms ϕ 2 in the degradation function g by a predictioñ ϕ 2 , computed via linear extrapolation from the solutions of the previous two time-steps.

Algorithm 1 Primal-dual active-set
Repeat for k = 0, . . . until A k = A k−1 and R k ≤ ε as : The nonlinear residual at the evaluation point (u, ϕ) ∈ V × W in is given by with test-functions (w, ψ) ∈ V × W . Its derivative at (u, ϕ) in directions δu and δϕ equals to The partial derivatives wrt ϕ of the first term in R vanish since we have replaced ϕ by an extrapolatioñ ϕ. The corresponding discrete versions at the k-th step of Algorithm 1 are denoted by R k and G k .

Linear Solvers
The crucial step in the solution of Problem 2.1 by the active-set method is the solution of the linear system G k · δU = R k on A c k . The system to be solved has the following block structure A class of very powerful methods to solve linear systems arising from discretized PDEs are the multigrid methods. These aim to reduce and solve the problem on a hierarchy of nested spaces. Two frequently used types of multigrid solvers are the Algebraic Multigrid (AMG) and the Geometric Multigrid (GMG). Within the GMG approach, nested spaces are constructed by a hierarchy of meshes, i.e., it requires information about the geometry. On the other hand, AMG solvers work solely on the given sparse-matrix. There, hierarchies are extracted from the connectivity graph of the matrix. Since AMG only needs the linear system to be solved as an input, they are often considered to be black-box solvers. This is, however, a huge oversimplification, as there are many different AMG methods and options tailored to specific classes of PDEs. For more details on AMG, GMG and multigrid in general, we refer to [25,16,50,24,51].
To solve these equations, we use two different approaches. First, we use the matrix-free multigrid framework from deal.II extended to handle nonlinear equations as described in our previous work [33]. This uses a monolithic geometric multigrid for the whole system. The special structure of the matrix is utilized inside the smoother, i.e., we use with Chebyshev-Jacobi smoothers S acting on the single blocks. These smoothers are frequently used in the context of matrix-free methods. Contrary to standard smoothers like Gauss-Seidel, Chebyshev-Jacobi smoothers only require matrix-vector multiplications and scaling with the diagonal. Both operations can be done in a matrix-free fashion, which we will describe later on. Chebyshev-type methods work on a given eigenvalue spectrum [a, b]. Depending on the choice of this interval, we can use the Chebyshev method as smoother but also as a solver. A solver should handle the whole spectrum of the operator. Hence, we want to have λ min ≤ a < b ≤ λ max with the extremal eigenvalues λ min and λ max of G uu resp. G ϕϕ . This is required on the coarsest grid of the multigrid algorithm, where we want to solve the system more accurately. The main intention of the Chebyshev-Jacobi method, however, is smoothing. Here, we are not interested in treating the whole spectrum, but only the high-frequency parts. A choice of [a, b] := [c λ max , C λ max ] with c ≤ 1 and C ≥ 1 is frequently used. In our applications, we use c = 0.24 and C = 1.2. This involves knowledge about λ max (and also λ min for the solving part). For symmetric positive definite matrices, CG can be used to obtain an estimate of both eigenvalues. Within PFF, symmetry of G uu and G ϕϕ follows from the second derivative of the energy functional. We could not rigorously show positive definiteness for these blocks, but numerical studies did not give us contradicting results. In a second approach, we compare our MF-GMG preconditioner to the block-diagonal AMG preconditioner presented in [28]. The linear system is then preconditioned by with Trilinos/MueLu AMG [29,10] solvers for the diagonal blocks. In both cases, we use GMRES as outer solver and use P −1 AM G resp. GM G(G) as preconditioners.

Matrix-Free Representation
The main idea of matrix-free approaches is to avoid assembling and storing sparse (but huge) matrices. This is reasonable since iterative solvers like Conjugate Gradients (CG) or Generalized Minimal Residual (GMRES) do not require explicit knowledge about the entries of the matrix. Rather, these solvers only need the result of the matrix-vector multiplication (MV). We combine the assembly and sparse matrixvector multiplication (SpMV) by with the number of elements n e , possible constraints C, element-wise global-to-local mapping P k and the local stiffness matrices G k corresponding to (2). This is already a valid matrix-free matrix-vector product (MFMV). However, without further optimizations, its performance cannot be expected to be anywhere near to the classical SpMV performance. These optimizations are described thoroughly in [37], and are included in the deal.II library. The main point is to utilize the transformation of the local stiffness matrices G k to the unit element and abuse the tensor-product structure there. In our previous work [33], we illustrate this in more detail for the nonlinear terms appearing in the PFF problem.

Derivatives of the Eigensystem
In the case of Miehe-type splittings of the stress tensor, we require the derivatives of σ + and σ − in direction δu, which depend on the eigensystem of the strain tensor. In the following, we illustrate how to compute these quantities for arbitrary (but small) dimensions. The following derivation holds for real symmetric matrices, and in particular for the solid strain tensor. We assume, that the matrix A and δA, as well as the eigenvector and eigenvalue of A, denoted by λ and v, are available such that holds. In our application, the direction δA is given in terms of the displacement test functions. The eigensystem of A itself can easily be computed explicitly in 2d and 3d, e.g., with the help of some computer algebra system. For higher dimensions, more sophisticated methods are available. Taking the derivative of (3), we get by the product rule Forming the 2 -scalar-product of (4) with u yields Using the symmetry of A and the normalization (v, v) = 1, we can reduce it to Since δA and v are given, we can now compute the derivative of λ by the formula above.
Computing the derivative of the eigenvector v is slightly more involved. We start by collecting the terms δv in (4): Unfortunately, we cannot simply invert (A − λI) =: L to obtain δv since λ is an eigenvalue of A. Hence, L is singular by definition. This is seen easily from (3), which can be rearranged to L · v = 0. Therefore, we need the concept of a pseudo-inverse (or left-inverse), see e.g. [9]. The pseudo-inverse L † has the properties (among others) that L † L = I and L † w = 0 ∀w ∈ N (L) = span{v}. Utilizing this, we can apply L † to (6), and get which finally reduces to δv = −L † δA · v.
With (5) and (7) we can now compute the linearized eigensystem of A.
If the eigensystem (λ i , v i ) of A is known, we can decompose A = V DV T with the matrix of eigenvectors V = (v 1 , . . . , v d ) and diagonal matrix D = diag(λ 1 , . . . , λ d ). Furthermore, we have L = V (D − λI)V T . Then, the pseudo-inverse can be computed as This machinery can be applied to compute e + and its derivative for arbitrary dimensions, see Algorithm 2 for the detailed steps. The derivative of σ + (e) follows by tr(e(du)) else I + 2µ∂ e (e + )(e(du)).
Furthermore, each of the steps can be implemented using SIMD parallelism. The (hidden) "if" statement in line (10) can be handled via masking as explained in Section 4.2.

Parallel Implementation
The finite element library deal.II as well as our implementation are parallelized on multiple levels. First of all, the problem is split across multiple CPUs such that each of them only stores and handles a small subset of the whole problem (distributed parallelization). Necessary communication between different CPUs is handled via the message passing interface (MPI). Furthermore, each CPU runs explicitly vectorized code, that is, multiple data elements are handled with the same CPU instruction (SIMD). The next subsections explain these concepts in more detail.

Distributed Parallelization
When dealing with large problems, it may no longer be possible for a single CPU to store all the required infrastructure in memory. Hence, the computational problem is split into smaller chunks which are distributed among multiple CPUs. In the context of FEM, this is typically done by assigning each CPU some parts of the mesh. Such a partitioning can be achieved by using libraries like p4est [17]. Each CPU then handles all the dofs associated to its part of the mesh (owned dofs). Furthermore, it may occasionally require knowledge about additional dofs which are owned by another CPU (ghost dofs), e.g. if we want to evaluate solutions close to the interface between different CPUs. These information needs to be shared among multiple CPUs, which is done via the message passing interface (MPI). This provides methods for communication in distributed networks. deal.II provides an easy to use infrastructure for these kind of tasks; see, e.g., [7].

Vectorization
Within the matrix-free framework, deal.II makes use of the CPUs vector instruction set. These operations perform single instructions on multiple data (SIMD There are several important points to consider here. First of all, vectorization requires the same work to be done on all entries. Hence, code with branching, e.g., if statements with a condition depending on the values cannot be efficiently vectorized. Workarounds usually involve executing both branches for each vector element and merging them via bit masks, see Algorithm 3. These type of instructions are called "blend". This is mainly used for simple if-then-else assignments, i.e., where the additional overhead of executing all branches and combining the results does not dominate the benefits of using vectorized code.
Algorithm 3 Vectorized if-then-else statement using bit masks. An additional drawback of vectorization is the difficulty of implementation and portability. For simple algorithms, up-to-date compilers can automatically vectorize the loop. Unfortunately, this autovectorization ceases to work for more complex scenarios. This leaves the developer to explicitly call SIMD instructions of the form _mm512_add_pd, which makes the implementation more tedious. Usually, wrapper libraries are used to improve code readability and portability. For instance, the VectorizedArray class in deal.II or Vc [36] among others. Recently, there has also been effort to include vectorization capabilities into the C++ standard.
Within the deal.II matrix-free implementation, vectorization is done over elements. That is, each CPU instruction acts on multiple elements at once. Several implementational details are described in [37,38,39].
We now investigate the influence of different vectorization levels and compiler settings on the performance of the MFMV. All of these test are carried out on a single core, using the single-edge notched shear test described in Section 5.2. We used Q 1 elements on a 6-times refined grid, but the observed speedup is valid also for different settings, as long as the problem size is not too small. Compilation was done using GCC-9.2.0 using the O3 optimization level.  Table 1: Effect of vectorization within the matrix-free operator evaluation. Baseline is given by a standard release build using the O3 optimization flag without vectorization.
GHz clock speed, which did not support the most recent instruction set AVX-512. All computations were carried out using double precision floating point numbers. The best run out of 100 is reported. Our findings are summarized in Table 1. There, "ftree-vectorize" and "fno-tree-vectorize" are the GCC flags to enable resp. disable automatic vectorization. n-bits denotes explicit vectorization using the given number of bits, i.e. 128 for SSE and 256 for the AVX instruction set.
We notice that auto-vectorization does only yield a speedup of 13%. Explicit vectorization over elements doubles the performance, hence, yielding the best possible speedup using the 128 bit instruction set. Additionally enabling auto-vectorization still increases the performance a little bit. This is mainly due to vectorization of simple parts like adding vectors, that are not explicitly vectorized but relying on auto-vectorization. Vectorization capabilities of compilers have improved throughout the years. However, efficient vectorization requires knowledge about the structure of the computation to be carried out that is beyond the compilers ability to identify, e.g., realizing that the work on most elements is identical.
Using 256 bit SIMD instructions only improved the results by another 27%. Possible explanations are the automatic throttling of the CPU speed [31] or memory bandwidth restrictions, but the actual reasons are unclear. Surprisingly, auto-vectorization interferes negatively with explicit vectorization in any case.

Numerical Examples
In the upcoming sections, we present our numerical examples. We focus on a pair of well-known test cases for fracture propagation: the single-edge-notched shear and tension test. These have been well studied in the literature, such that we can compare our findings to those reported by other groups. Results regarding the performance of our solvers and their behavior under p-refinement follows in Section 6.

Example 1: Single-Edge Notched Tension Test
We start with the single-edge notched tension test. This test case has been studied by other groups as well, see e.g. [45,42,3,11,27].

Discussion
First, we perform a refinement study for the single-edge notched tension test. To this end, we fix all parameters and perform global mesh refinement. In particular, the phase-field parameter is fixed to ε = 4 · 10 −3 as used in [3]. The resulting load-displacement curves are shown in Figure 3 for the anisotropic case (Miehe splitting). Whereas the solutions on the coarse meshes deviate a lot, the solutions on the fine meshes are almost indistinguishable showing the convergence of our implementation. In particular, the deviation to the solution on the finest mesh seems to decrease by approximately 0.5 in every refinement step. The splitting does not seem to have much impact on this simulation. Indeed, the isotropic results agree almost perfectly with the anisotropic findings.
In comparison to the results presented in [3], the resulting maximum load is slightly higher in our computations. Furthermore, the decay of the loading force seems to take longer, i.e. the amount of displacement when the load hits zero is roughly 7 · 10 −3 in our case, whereas it is slightly less than 6 · 10 −3 in [3]. However, we would like to point out that our computations used significantly more elements.

Example 2: Single-Edge Notched Shear Test
The following test scenario is closely related to the previous tension test. Both have the same geometry and material parameters, but the boundary conditions are flipped.

Description
The computational domain is again given by a unit square, with the same slit imposed as in the previous test case. In this scenario, however, the applied displacement conditions moves from the top boundary to the right, i.e., u = (t · du) on Γ top . Again, the body remains fixed at the bottom by u = (0, 0) on Γ bottom .

Discussion
In the isotropic case of the single-edge notched shear test in Figure 5, the load-displacement curves do not coincide with the observations in [3] but look more similar to their results with Amor-splitting. However, the observed crack pattern is similar, i.e., two symmetric branches appear as seen in Figure 4 (right). We like to point out that neither the Miehe-splitting nor the isotropic model lead to physically sound results, see e.g. [3] and also the early work [45]. Nonetheless, one clearly observes convergence as in the tension test.
Contrary to the tension test before, the resulting load-displacement curves show lots of small oscillations. This is partly also observed in [3], although the oscillations there are seemingly smaller. The results with Miehe splitting seem match better, but the maximum load and displacement are lower than those reported by [3]. Figure 6 shows the test results for different polynomial degrees p. The refinement level corresponds to = 7, all other parameters are the same as before. We immediately notice, that the small oscillations are gone for the high-order simulations. Furthermore, we also see convergence with respect to p. The number of dofs for degree p can be obtained from Table 4.      Figure 6: Different polynomial degrees for the anisotropic single-edge notched shear test. Refinement level is fixed at = 7 and ε = 4 · 10 −3 .

Performance Studies
In the upcoming sections, we compare our matrix-free version with the AMG based preconditioner described in [28] in terms of iteration counts, memory requirements, computational time and scalability. We would like to point out that most of the code is shared between the two implementations, i.e., only the parts related to handle the linear system differ. This allows for a fair comparison of the results.

Memory Requirements
We start by investigating the memory requirements of the two approaches, which should be an obvious advantage of the matrix-free framework. Due to the required multigrid dofs for the geometric multigrid, the dof structures require slightly more memory compared to the AMG version. This is more than compensated by larger storage requirements of the sparse matrix and AMG hierarchies compared to the multilevel matrix-free structures in the GMG implementation. In particular for high-order polynomial degrees p, the sparse matrix gets increasingly dense, leading to huge memory costs. Unlike that, the cost for a dof within the matrix-free approach is almost independent of p.
All these effects are shown in Figure 8 for varying polynomial degree p. The line "Matrix (AMG)" refers to the storage cost of the sparse-matrix. The entry "Matrix (GMG)" denotes memory requirements for all necessary matrix-free operators, and additionally required data, including those on the coarser levels needed for the geometric multigrid. The additional cost of handling the dofs on coarser levels as well is reflected in the difference between "DoFs (AMG)" and "DoFs (GMG)". For a degree of 4, we save approximately a factor of 20 in terms of memory. These results are shown for a 2d test case. In 3d, this gap would even be more prominent. Note that you must not get confused by the decrease of the "Matrix (GMG)" line, as we plot the required bytes per dof and not the total memory needed. Remark 6.1. Data structures of the deal.II library (matrix-free, GMG, dofs, etc.) are able to report their memory consumption very precisely. Unfortunately, this is not the case for the AMG structures of Trilinos/ML. Measuring the memory requirements for the AMG using /proc/self/status did not yield reliable results. Hence, we did not include the storage of AMG structures in Figure 8. Although this makes the comparison slightly unfair, it would only make the results even more favorable for the matrix-free implementation.

Iteration Counts
Next, we have a look at the number of iterations required to solve the arising linear systems of equations. We would like to point out that these counts are highly dependent on the actual settings used, i.e., which simulation is run, selected number of smoothing steps, coarse level and further parameters. For the AMG solver, we stuck to the default configuration by deal.II. The GMG uses = 2 as coarse grid, i.e. 16 elements. We always use 5 Chebyshev-Jacobi smoothing iteration sweeps. Increasing the number of smoothing steps would decrease the number of iterations. Finding a good balance between smoothing quality and iteration count to obtain the best performance is a very challenging task, which we did not consider in detail by now. In all our simulations, we use relative convergence criteria of 10 −6 for the linear solver and 10 −8 for the active-set strategy. A representative comparison of the iteration behavior is given in Figure 9, with AMG results on the left and GMG results on the right side. Again, we vary the polynomial degree for the shear test, as this scenario shows more interesting behavior. Both approaches show an expected growth in the number of iterations for increasing p. We also observe that both solvers require more work as the fracture grows. Degree p = 4 behaves somewhat surprisingly. It requires less steps in the middle of the simulation for both the AMG and GMG strategy. The iteration count jumps up towards the end of the simulation, when the domain is close to total failure. Mesh refinement corresponds to = 7, the length-scale parameter is fixed at ε = 4 · 10 −3 .

Performance Analysis
By avoiding storing the matrix explicitly, we are forced to compute (almost) everything on the fly. This makes matrix-vector operations more costly compared to the standard case, where the matrix is at our disposal. On the other hand, matrix-free implementations start to outperform standard methods for higher-order elements, in particular in 3d computations. We continue to investigate the computational performance of the matrix-free approach. To this end, we start by comparing the sparse-matrix vector multiplication (SpMV) time with the corresponding matrixfree evaluation (MFMV). For all tests, we use explicit vectorization based on AVX-256 in the matrix-free parts as described in Section 4.2. The most recent instruction set AVX-512 is not supported by any of our CPUs. We plot the best run out of 100.
In [33], we compared the behavior with respect to h-refinement. Here, we want to focus on the dependence on the polynomial degree p. In Figure 10, we observe that SpMV is faster for linear and quadratic elements. However, this only considers the raw matrix-vector multiplication ! In particular, it does not include the time that is required to assemble the matrix. We also notice, that starting from degree 3, MFMV starts to outperform SpMV even for single evaluations. Let us now look at the perfomance of a full simulation. This is visualized in Figure 11, where the total time spent in the linear solvers, i.e., GMRES with either AMG or GMG as preconditioners, is plotted. Here, only a single simulation of the single-edged notched shear test is performed. Possible fluctuations in the timings should be evened out over the length of the simulation.
The observed behavior is very similar to the vmult-times presented before. The faster SpMV for linear and quadratic elements is able to overcome the time of assembling the matrix and initializing the AMG solver, thus ending up faster than the matrix-free solver. The situation shifts in favor of the MF-GMG preconditioner for polynomial degrees 3 and higher. AMG GMG Figure 11: Total time spent solving the linear system for a full simulation of the single-edge notched shear test using 16 cores. This includes setup times for the sparse-matrix and AMG solver as well as the eigenvalue computation and setup routines required for the MF-GMG. Again, = 7 and ε = 4 · 10 −3 .

Parallel Scalability
Finally, we investigate the parallel performance of both solvers. We consider again the shear test, but similar results can be expected for other scenarios as well. For this test, we take the mesh at = 9, and consider p = 1, 2, 3 and 4 leading to problem sizes of 7.9 · 10 5 , 3.2 · 10 6 , 7.1 · 10 6 , and 1.3 · 10 7 dofs, respectively, see Table 4. We notice that, for low degrees, the AMG approach is faster than our MF-GMG solver, which we have already observed in Figure 11. However, the solution time for the GMG solver grows slower than that for AMG solver with increasing p.
The parallel scaling behavior is similar for both preconditioners. For large enough problem sizes, strong scaling is close to perfect. As soon as the local size gets too small for each CPU, the communication overhead starts to become noticeable. Both show good scaling down to approximately 25k dofs, although the AMG solver seems to perform slightly better. This is due to the bad scaling behavior on the coarse grids of the GMG hierarchy, which only contain very few cells.
Both approaches do not show perfect weak scaling behavior. However, due to the complexity of the PDE considered here, these are more than satisfying results.
We mentioned earlier, that memory consumption is a huge advantage of matrix-free methods. In fact, the matrix-based AMG simulation for p = 4 exceeded our available RAM (128 GB) on a single core.

Conclusions
We presented and compared two approaches to solve the linear systems arising in the PFF problem. First, we considered our matrix-free based geometric multigrid implementation from [33]. Compared to our previous work, we extended our solver to handle higher polynomial degrees, which is a huge improvement for the matrix-free approach. We compared our MF-GMG solver to the AMG-based approach from [28]. This method is comparably easy to implement since the AMG solvers by MueLu (multigrid library in Trilinos) [10] are almost black-box algorithms. These only require the sparse matrix, and compute the multigrid hierarchies by themselves. Implementing the MF-GMG approach is a lot more involved since we have to define the operators on each level. This can be particularly challenging in the presence of nonlinear terms and varying constraints (active-set), as it is the case in PFF.
The matrix-free approach really excels at high-order polynomial elements. It requires less storage per dof as we increase the polynomial order p, whereas the sparse-matrix becomes more and more costly. This is also reflected in the performance of the linear solver, i.e., the time spent for solving the equations using AMG increases faster than using GMG during p-refinement. Nonetheless, the AMG approach is slightly faster for degrees less than 3. Furthermore, the AMG solver requires a lot less implementational effort. The situation shifts more in favor of the matrix-free approach in 3 dimensions, as storage costs tend to be even more limiting.
So far, we only considered parallelization using SIMD instructions and distributed computing via MPI. Fine-grained parallelization within a node using threads (e.g., by means of OpenMP, TBB, std::thread) could improve the performance even more. An entirely different programming model is given by Graphic Processing Units (GPUs) using CUDA. GPUs are able to run several thousand threads in parallel, giving them a huge boost in performance for suitable application. Currently, an extension of the matrix-free framework in deal.II using GPUs is under development.

Acknowledgments
This work has been supported by the Austrian Science Fund (FWF) grant P29181 'Goal-Oriented Error Control for Phase-Field Fracture Coupled to Multiphysics Problems'.