epSFEM: A Julia-Based Software Package of Parallel Incremental Smoothed Finite Element Method (S-FEM) for Elastic-Plastic Problems

In this paper, a parallel Smoothed Finite Element Method (S-FEM) package epSFEM using incremental theory to solve elastoplastic problems is developed by employing the Julia language on a multicore CPU. The S-FEM, a new numerical method combining the Finite Element Method (FEM) and strain smoothing technique, was proposed by Liu G.R. in recent years. The S-FEM model is softer than the FEM model for identical grid structures, has lower sensitivity to mesh distortion, and usually produces more accurate solutions and a higher convergence speed. Julia, as an efficient, user-friendly and open-source programming language, balances computational performance, programming difficulty and code readability. We validate the performance of the epSFEM with two sets of benchmark tests. The benchmark results indicate that (1) the calculation accuracy of epSFEM is higher than that of the FEM when employing the same mesh model; (2) the commercial FEM software requires 10,619 s to calculate an elastoplastic model consisting of approximately 2.45 million triangular elements, while in comparison, epSFEM requires only 5876.3 s for the same computational model; and (3) epSFEM executed in parallel on a 24-core CPU is approximately 10.6 times faster than the corresponding serial version.


Introduction
Currently, numerical methods are the most important tools for solving various scientific and engineering problems [1].For example, the Finite Element Method (FEM), one of the most successful numerical methods, has been widely employed in different scientific and engineering fields because of its mathematically rigorous proof and satisfactory efficiency [2][3][4].However, the shortcomings and deficiencies of FEM are becoming increasingly significant [2,[5][6][7][8].(1) The FEM applies the problem domain of finite degrees of freedom to the problem domain of infinite degrees of freedom, which makes the system stiffness matrix "too rigid".(2) The conventional FEM has high requirements for mesh quality and cannot deal with distorted meshes.(3) When the conventional FEM uses simple and low-order elements to calculate large and complex structures, the calculation accuracy is often unsatisfactory, while when higher-order elements with higher accuracy are used, the computational cost is quite expensive.
To cope with the above deficiencies of FEM or decrease the computational cost of generating meshes, meshfree methods have emerged, such as Radial Point Interpolation Method (RPIM), Element Free Galerkin (EFG) and Meshless Local Petrov-Galerkin (MLPG) [2,5,9,10].The mesh-free methods can be used to analyze crack problems and large deformation problems because mesh-free methods employ a group of scattered nodes in the discrete problem domain, avoiding the requirement for continuity of the problem domain.However, the more complex computational process of mesh-free methods leads to the desire to achieve higher computational accuracy, which is not only computationally time-consuming but also inefficient [7].
In recent years, the Smoothed Finite Element Method (S-FEM), a new numerical method combining the FEM and strain smoothing technique was proposed by Liu G.R. et al. [7,11].The system stiffness matrix of S-FEM model is softer than the FEM model for identical grid structures, has lower sensitivity to mesh distortion, and usually produces more accurate solutions and a higher convergence speed [7].Due to the above characteristics, S-FEM is frequently used in the fields of material mechanics [12,13], dynamics [14,15], fracture mechanics [16], plate and shell mechanics [17], fluid structure interaction [18], acoustics [19], heat transfer [20] and biomechanics [21].
Typical S-FEM models include cell-based S-FEM (CS-FEM) [15,22], node-based S-FEM (NS-FEM) [23,24], edge-based S-FEM (ES-FEM) [14,16] for 2D and 3D problems and facebased S-FEM (FS-FEM) [25] for 3D problems.In addition, there are hybrid and modified types of S-FEM.For example, Chen et al. [26] proposed an edge-based smoothed extended finite element method, ESm-XFEM, for the analysis of linear elastic fracture mechanics.An improved ES-FEM method, bES-FEM, was proposed by Nguyen-Xuan et al. [27].bES-FEM can be applied to almost incompressible and incompressible problems.Xu et al. [28] proposed a hybrid smoothed finite element method (H-SFEM) for solving solid mechanics problems by combining FEM and NS-FEM based on triangular meshes.Zeng et al. [29] proposed a beta finite element method (βFEM) based on the smooth strain technique applied to the modeling of crystalline materials.
Compared with FEM, the calculation of the S-FEM has the following two differences.First, we need to construct the smoothing domains and modify or reconstruct the strain field in the S-FEM.Moreover, because the smoothing domain may involve a portion of adjacent elements, the memory requirements for S-FEM will be larger [7].The two differences mentioned above may lead to a higher computational cost for the S-FEM than the FEM for the same grid structure.However, given the calculation cost, the results calculated by the S-FEM model are more accurate than the FEM, and thus, achieve higher efficiency.To make S-FEM more applicable to large-scale engineering problems, parallel strategies of multicore CPUs and/or multicore GPUs are usually used to improve and optimize the computational power of S-FEM.
Currently, there are many software packages developed for utilizing FEM to solve various scientific and engineering problems, while the development of software and library packages for the S-FEM is still in progress [30].Current S-FEM software packages are mostly implemented in C++ and Fortran.However, static languages such as Fortran and C/C++ have more complex language structures, are more difficult to learn, and require high programming skills.Although high-level dynamic languages, such as MATLAB and Python, are easy to learn, highly visual and interactive, the computing speed of dynamic language is slow and there are expensive licensing fees associated with the use of commercial software such as MATLAB.Julia is an efficient, user-friendly, open-source programming language, developed by MIT in 2009 [31].Furthermore, it balances the problems of computing performance, programming difficulty and code legibility [32].
Many researchers have used the Julia language to develop software packages related to numerical computation.For example, Frondelius et al. [33] designed an FEM structure by using the Julia language, which enables large-scale FEM models to be processed by using distributed simple programming models across a cluster of computers.Sinaie et al. [34] implemented the Material Point Method (MPM) in the Julia language.In the large strain solid mechanics simulation, only Julia's built-in characteristics are used, which has better performance than the MPM code based on MATLAB.Zenan Huo et al. [35] implemented a package of S-FEM for linear elastic static problems by using Julia lan-guage.Pawar et al. [36] developed a one-dimensional solver for the Euler equation, and an arakawa spectral solver and pseudo-spectral solver for the two-dimensional incompressible Navier-Stokes equation for the analysis of computational fluid dynamics using the Julia language.Heitzinger et al. [37] used the Julia language to implement numerical stochastic homogeneity of elliptic problems and discussed the advantages of using Julia to solve multiscale problems involving partial differential equations.Kemmer et al. [38] designed a finite element and boundary element solver using Julia to calculate the electrostatic potential of proteins in structural solvents.Fairbrother et al. [39] developed a package for Gaussian processes, GaussianProcesses.jl, using the Julia language.GaussianProcesses.jl takes advantage of the inherent computational benefits of the Julia language, including multiple assignments and just-in-time compilation, to generate fast, flexible and user-friendly packages for Gaussian processes.
In this paper, a parallel incremental S-FEM software package epSFEM for elastic-plastic problems is designed and implemented by utilizing the Julia language on a multi-core CPU.Distributed parallelism and partitioned parallelism are used for the assembly of the stiffness matrix, allowing multiple cells to be assembled simultaneously, avoiding excessive for loops and saving computation time.The system of equations is solved using the PARDISO [40] parallel sparse matrix solver.epSFEM applies to more common and complex elastic-plastic mechanical problems in practical engineering.Moreover, epSFEM adopts an incremental theory suitable for most load cases to solve elastic-plastic problems, and the calculation results are more reliable and accurate.
The contributions of this paper can be summarized as follows: (1) A parallel S-FEM package epSFEM using incremental theory to solve elastic-plastic problems is developed by Julia language.
(2) The computational efficiency of epSFEM is improved by using distributed and partitioned parallel strategy on a multi-core CPU.
(3) epSFEM features a clear structure and legible code and can be easily extended.The rest of this paper is organized as follows.The theory related to S-FEM and Julia language are presented in Section 2. The detailed implementation steps of the software package epSFEM are described in Section 3. Two sets of numerical examples are used to assess the correctness of the epSFEM and to evaluate its efficiency in Section 4. The performance, strengths and weaknesses of the epSFEM and the future direction of work are discussed in Section 5. Section 6 presents the main conclusions.

Background
In this section, the theoretical basis of the S-FEM and parallelization strategy of the Julia language on a multicore CPU are introduced.

Overview of the S-FEM
The S-FEM is the implementation of the FEM by employing the strain smoothing technique to modify or reconstruct the strain field such that more accurate or special performance solutions can be obtained.NS-FEM, for example, has an upper bound solution to the model because of its weak super-convergence, insensitivity to mesh deformation and an overly soft system stiffness matrix.In the ES-FEM and FS-FEM models, there are no unphysical modes, so both methods give good results for dynamic and static problems.In the S-FEM, the most important goal is the modification of the compatible strain field or reconstruction of the strain field only from the displacement field [11].To guarantee the stable and convergent properties of the established S-FEM model, this strain modification or reconstruction needs to be conducted in an appropriate way to obtain special characteristics.Strain modification or reconstruction can be implemented within the element, but it is generally conducted across the element to obtain more information from adjacent elements.Different modification or reconstruction methods correspond to separate S-FEMs, that is, CS-FEM, NS-FEM, ES-FEM and FS-FEM.
For two-dimensional static problems, ultra accurate numerical solutions can be obtained using ES-FEM, and the calculation results of ES-FEM based on T3 elements are even more accurate than traditional FEM with Q4 elements (same number of nodes) [11,14].Therefore, the ES-FEM is employed to solve the two-dimensional elastic-plastic problem in this paper, and the implementation steps are introduced as follows.

Workflow of the ES-FEM
The ES-FEM calculation process is similar to that of the FEM, except that the ES-FEM needs to construct a smoothing domain on the basis of the FEM model and modify or reconstruct the strain field.As shown in Figure 1, many techniques designed for FEM can be adapted for ES-FEM.In short, the difference between the ES-FEM and FEM is that all calculations of the FEM are based on elements, while all calculations of the ES-FEM are conducted on smoothing domains.The two-dimensional solid mechanics problem with problem domain Ω and boundary Γ = Γ u ∪ Γ t are considered, where Γ u is the essential boundary where displacement conditions are prescribed and Γ t is the natural or force boundary.

FEM ES-FEM
The calculation procedure of ES-FEM is as follows [7,11,14]: (1) Discretization of the problem domains and construction of the smoothing domains In the ES-FEM, general polygonal elements are used to divide the problem domain, mainly T3 elements suitable for solving two-dimensional problems.When the T3 element is used, the meshing can be the same as the standard FEM, such as the widely used Delaunay triangulation method.
As shown in Figure 2, on the basis of the polygonal element mesh, the smoothing domain is constructed.The problem domain is divided into N e polygonal elements, including N eg edges.The edge-based smoothing domain is composed of two nodes connecting one edge and the centroid of its adjacent elements.The two nodes A and B connecting edge AB and centroid D of the triangle element form the smoothing domain (ABD), see Figure 2. The construction of the smoothing domain, such as the discrete problem domain, must follow the principle of no gap and no overlap, that is, , Ω s i ∩ Ω s j = ∅, and i = j.(2) Creation of the displacement field The generalized displacement field ũ at any point in the triangular element is approximated as: where N n is the number of smoothing domain nodes, di is the nodal displacement at node i, and N i (x) is the shape function: where n is the degree of freedom of the smoothing domain nodes.The Gauss integration point interpolation distribution of the ES-FEM shape function is illustrated in Figure 3.As shown in Figure 3, the commonly used linear triangular elements are employed to divide the mesh.Here, the shape function values at the Gauss integral point are calculated in two cases: boundary edge and internal edge.The results are shown in Tables 1 and 2.
(3) Construction of the smoothed strain field For triangular, quadrilateral and polygonal elements, strain smoothing techniques can be used to construct the strain field directly from the boundary integrals of the shape function without the need for coordinate mapping.In the ES-FEM, the smoothed strain ε is computed as follows: where ε(x) = L d u is the strain that satisfies the compatibility condition in the traditional FEM, Φ(x) is the smoothing function, and Ω s k is the smoothing domain, which can be defined as follows: where A s k is the area of the smoothing domain.Combining the Gaussian divergence theorem, the domain integral is transformed into the edge integral to obtain the following smoothed strain calculation equation: where L d is the partial differential matrix operator, L n is the outward unit normal vector and Γ s k is the boundary of the edge-based smoothing domain.
where n x and n y are the x-axis and y-axis components of the normal vector outside the unit, respectively.Similar to the FEM, the smoothed strain field is divided into: where B I is the smoothed strain matrix: where bIx (x k ) and bIy (x k ) is defined as shown in Equation ( 9).The boundary integral method is used to solve the smoothed strain matrix.This method is applicable to any polygonal geometry in the smoothing domain.
where N I is the number of segments of Γ s k , n i,x and n i,y are the outer normal vectors of the Ith integration segment, x G i is the midpoint of each segment of the boundary, that is, the Gauss integration point, and N I (x G i ) is the shape function value at the Gauss integration point.(4) Establishment system of equations The smoothed Galerkin weak form is utilized to establish the system equation in the ES-FEM.During this process, only a simple summation calculation of the relevant parameters of the smoothing domain is required.
The linear system of equations of ES-FEM is: where d is the displacement vector of all nodes in the S-FEM and f is the vector of all loads.KES−FEM is the system stiffness matrix of the ES-FEM and defined as Equation (11): where c is the matrix of elasticity coefficients.
(5) Imposition of the boundary conditions In the ES-FEM, the application process of displacement boundary conditions is similar to that of the FEM because the shape function used in the S-FEM has the same delta property as the FEM.The main methods include the direct method, set "1", multiple large numbers, Lagrange multiplier and penalty function.The force boundary conditions are added directly to the corresponding nodes.
(6) Postprocessing The weighted average rule is used to obtain the equivalent nodal stress in the smoothing domain, and the shape function interpolation technique is used to obtain the continuous stress field in the problem domain.The process is similar to the FEM.Finally, the accuracy of the results is assessed in relation to the actual problem.

The Julia Language
Julia, officially released in 2012, is a flexible dynamic language for scientific and numerical computation [41].To solve large-scale numerical computation problems, parallel computing is considered essential.There are useful built-in features in Julia that make it easier for developers to design efficient parallel code.Three of the parallel strategies, that is, coroutine, multithreaded and distributed computing, are dependent on a multicore CPU.Developers can select the appropriate parallelism method for their needs.Parallel computing on many-core GPUs can be conducted by using specific packages or utilizing the built-in function of Julia and parallel arrays [1].
In this paper, parallelism on a multicore CPU is applied to effectively improve the calculation and assembly efficiency of the global stiffness matrix.In the Julia language, distributed computing based on a multicore CPU first redistributes tasks according to the number of CPU cores of the computer and then dynamically allocates computing tasks to each process so that multiple processes can be calculated at the same time, thus improving the computing efficiency.In the parallel computing of Julia language, "SharedArray" is used to reduce memory usage and improve computational efficiency.Moreover, when a "SharedArray" is employed, multiple processes are allowed to operate on the same array in the meantime [42,43].

The Implementation of Package epSFEM 3.1. Overview
A parallel S-FEM package using incremental theory to solve elastic-plastic problems is developed on a multicore CPU.This package contains the following three components: Preprocessing: The preprocessing includes mesh generation and the construction of smoothing domains based on the mesh.After the preprocessing is completed, the model details of constructing the smoothing domain can be obtained, and stored in sparate five files: nodes, elements, internal edges, external edges and the centroids of mesh elements.
Solver: The solver uses incremental S-FEM to solve the elastoplastic problems, which is the main part of the whole software package.It is mainly categorized into: (1) assembly of the elastic stiffness matrix and (2) incremental loading and semismooth Newton method iterations to solve the system of equations.The calculation procedure of the incremental loading and semismooth Newton method iterations of the solver is illustrated in Figure 4.
Postprocessing: ParaView [44] is utilized to visualize the numerical calculation results.The WriteVTK.jlpackage in Julia is used to write the "vtu" format file needed for ParaView visualization.

Mesh Generation
Mesh generation is the first step of the numerical analysis, which also affects the accuracy and efficiency of the numerical analysis.Currently, there are many mature mesh generation algorithms and software.The focus of this paper is on the solver section, so a simple direct generation method is used to generate the mesh and then divide the smoothing domain on this basis.Because T3 elements have good adaptability and are most used in science and engineering practice, we choose T3 elements to divide the problem domain.

Construction of the Smoothing Domain
Constructing the smoothing domain based on the meshing of the problem domain is one of the key tasks of the S-FEM.According to the methods of constructing the smoothing domain and storing model information in Refs.[35,45], the smoothing domain of the mesh is constructed, and the model information after dividing the smoothing domain is output.To get the best performance out of the Julia language, the following calculations can be looped in the unit of column, and the model information is stored based on the column.In this paper, we address the mesh details by integrating the features of ES-FEM and Julia parallel computation and then utilize five matrices to save the mesh details in an appropriate way; see Figure 5.
The "Node" matrix stores the x and y coordinates of the mesh nodes.The "Centroid" matrix stores the x and y coordinates of the center of the cell.The node numbers corresponding to the mesh cells are stored in the "Element".The three node numbers of the triangular cells are stored in the three rows of "Element" in a counterclockwise order, and the number of columns is the number of mesh cells.
In the ES-FEM, the smoothing domain is constructed by using edges as the basis.We divide all the edges of the model into two categories: the outer edges are saved in the "Edge_out" matrix, and the inner edges are saved in the "Edge_in" matrix.For the matrix "Edge_out", the two node numbers of the outer edge are stored in the first two rows, the serial number of the triangle is appended to the third row, and the rest point of the triangle is appended to the fourth row.Because one inner edge belongs to two triangles, the first two rows store the node numbers of the inner edges, the third and fourth rows of "Edge_in" are the serial numbers of neighboring triangles and the last two rows are the numbers of the other points in the triangle.

Solver
The incremental S-FEM is utilized to address the elastoplastic problem, choosing the implicit constitutive integration algorithm of the linear kinematic hardening Von Mises constitutive model and the corresponding consistent tangent modulus.First, the elastic predicted stress is calculated according to the strain of the equilibrium iteration, and then the modified stress is calculated according to a certain direction to make the stress return to the updated yield surface [46,47].The nonlinear equations are solved by employing the semismooth Newton method.
The solution process is composed of two major procedures: (1) assembly of the elastic stiffness matrix and (2) incremental loading and semismooth Newton method iterations.The second procedure is composed of multiple incremental step cyclic calculations.Each incremental step can be divided into three steps: (1) assembly of tangent stiffness matrix, (2) solving of equations and (3) updating of hardening variables and plastic strain.According to the characteristics of parallel computing, the calculation of the latter step cannot be dependent on the previous step, so when assembling the elastic stiffness matrix, the tangent stiffness matrix can be calculated in parallel to improve efficiency.Distributed computing is used in Julia to calculate the elastic stiffness matrix and the tangent stiffness matrix for multiple elements in parallel.When solving the overall nonlinear system equations, we utilize the semismooth Newton method for each iteration.For the set of equations in each iteration, a parallel sparse equation solver, PARDISO, is used [40].The detailed procedure of the solver in epSFEM will be presented in the subsequent sections.

Assembly of Elastic Stiffness Matrix
After the model is preprocessed, it needs to be assembled with an elastic stiffness matrix first.In epSFEM, we calculate the stiffness matrix of the associated smoothing domain by dividing the outer edge and the inner edge, and the calculation process is basically the same.Taking the internal edge as an example: (1) The areas of the two triangle elements that share the inner edge is attached are computed.This process is conducted by procedure "area.jl"according to Equations ( 12)-( 16): where xi and yi is the coordinate of the node i.
(2) The length of each edge of the smoothing domain is calculated in Equation ( 17).For the smoothing domain of the inner side, there are four edges.This process is realized in the file "lp.jl": (3) The normal outward vector v1 is calculated for each side of the smoothing domain.This is performed in "vectorin.jl",according to Equations ( 18)-( 20): (4) Assembly of the global elastic stiffness matrix and the global smoothed strain matrix.First, the stiffness matrix of the smoothing domain element is computed and then assembled according to the smooth domain nodes.Due to the large number of zero elements in the matrix, to reduce the memory occupation, the matrix is stored in a sparse form.There are three common ways to construct sparse arrays: Compressed Sparse Row (CSR), Compressed Sparse Column (CSC) and COOrdinate (COO).
First, the COO format is used to construct the global elastic stiffness matrix, since the multi-dimensional arrays in Julia are stored according to column-based sequence.Then, for the convenience of solving the subsequent system equations, we replace it with the CSC format.To construct a sparse array according to COO format, we first need to construct three one-dimensional arrays, that is, IK_elast, JK_elast and VK_elast.
IK_elast, JK_elast and VK_elast denote the row number, column number and value of each entry in the global stiffness matrix according to the order of each row.Since the sparse functions can accumulate the entries at the same position automatically, the magnitudes of IK_elast, JK_elast and VK_elast can be predetermined.
The assembly method of the global smoothed strain matrix is basically the same as the stiffness matrix, except that when it is assembled, the rows are carried out according to the elements, and the columns are carried out according to the nodes.Three one-dimensional arrays, IB, JB and VB, are constructed in advance.
For parallel computing, the six arrays of IK_elast, JK_elast, VK_elast, IB, JB and VB need to be converted to "SharedArrays" in advance, and the elastic stiffness matrix is assembled in parallel using the "@distributed" macro in Julia.Because the stiffness matrix calculation of each element has no data dependence, there will be no data interference when performing parallel computing.
The number of processes needs to be added using the function "addprocs" before all parallel computing starts.In the parallel elastic stiffness matrix assembly, we use the "@distributed" macro to automatically allocate tasks to each process according to the number of processes and the total number of tasks for parallel computing of the loop.The total number of tasks currently is equal to the total number of smoothing domains.The "@distributed" macro is executed asynchronously on the loop; it will generate independent tasks on all available processes and return immediately without waiting for the computing to complete.To wait for the computing task to complete, the "@sync" macro must be used before the call.The procedure of assembling the elastic stiffness matrix for the internal edges by distributed parallel computation is illustrated in Algorithm 1.After the global stiffness matrix and the global smoothed strain matrix are assembled, the "Sparse" function is used to convert them into the CSC format.

Algorithm 1 Parallel calculation and assembly of the elastic stiffness matrix
Input: Node, Centroid, Element, Edge_in, shear, bulk Output: K_elast, B 1: Set the number of CPU cores for the Julia program.2: Set IK_elast, JK_elast, VK_elast, IB, JB and VB to SharedArrays.
3: @sync @distributed begin 4: for every internal edge do 5: Compute the area of the interior quadrilateral.

6:
Compute the side lengths of the interior quadrilateral.

7:
Compute the normal unit vectors of the four sides of the interior quadrilateral.

8:
Compute the smoothed strain matrix of an interior quadrilateral.

9:
Compute the stiffness matrix of an interior quadrilateral. 10: Compute the elastic coefficient matrix.

Assembly of the Tangent Stiffness Matrix
The tangent stiffness matrix of the model needs to be calculated when solving the elasticplastic problem using incremental theory.Equation ( 21) is used instead of Equation ( 22) to calculate the global tangent stiffness matrix.Among them, elastic stiffness matrix K elast , smoothed strain matrix B and elastic matrix D elast can be obtained in advance at the stage of assembling the elastic stiffness matrix; only elastoplastic matrix D tangent depends on the plastic model, and must be partially reorganized or modified in each Newton iteration.When most portions of the model are in the elastic stage, D tangent − D elast is more sparse than D tangent [48,49].
D tangent is calculated by the constitutive integral.The implicit discrete method is used to solve the constitutive integral, that is, elastic prediction and plastic correction.For the constitutive relation, the linear kinematic hardening Von Mises model is employed.
The steps to calculate the tangent stiffness matrix are as follows: (1) Calculation of the smoothed strain field.Since the global smoothed strain matrix B has been calculated and assembled in the stage of assembling the elastic stiffness matrix, the smoothed strain field ε can be acquired according to the strain coordination Equation ε = Bu.
(2) The implicit Von Mises constitutive integral algorithm is used to obtain the stress S and tangent operator DS of the model, by procedure "constitutive_problem1.jl".The formula, according to [48][49][50][51], is: where T k (ε k ) represents the stress-strain operator, , a is the hardening parameters and Y is the yield stress.
where T 0 k (ε k ) is the derivative of the stress-strain operator, C = KI ⊗ I + 2GI D , I ⊗ I is the unit second-order tensor, I D = I − I⊗I 3 , K = E/3(1 − 2µ) is the bulk modulus and G = E/2(1 + µ) is the shear modulus.
The modification of hardening variable β k and plastic strain ε p k is: where β k−1 hardening tensor from the previous incremental step and ε p k−1 plastic strain tensor from the previous incremental step.
To check whether plastic correction is needed, the array CRIT of 1 × s_n_e is defined representing the yield criterion, that is, s tr k − Y , and the corresponding logical array IND_p of 1 × s_n_e with the smoothing domain of plastic behavior, where s_n_e represents the total number of smoothing domains.The parallel implementation of the implicit Von Mises constitutive integral is shown in Algorithm 2.
In the parallel computing of constitutive integrals, all processes can access the underlying data.To avoid conflicts, we first construct a "myrange" function to assign tasks to each process according to the number of CPU cores added.Then, the main computing process is defined as a kernel function "assembly_tangent", and a wrapper "shared_constructive" is defined to encapsulate the kernel function.Finally, the function "constitutive_problem" is constructed to call the packaged kernel function for partition parallel computing.The "con-stitutive_problem" function minimizes the communication between the processes so that each process can continue to compute the allocated part for a period of time, and improve the efficiency of parallelism.The "@async" macro is used to wrap arbitrary expressions into tasks.For any content within its scope, Julia will start to run this task and then continue to execute the next code in the script without waiting for the current task to complete before executing it.The "@sync" macro means that the next task will not be executed until the dynamic closure defined by the macro "@async" is completed.
(3) Calculation of the global tangent stiffness matrix.First, the sparse elastoplastic matrix D tangent is constructed according to the tangent operator DS obtained by the constitutive integral and then the global tangent stiffness matrix is calculated according to Equation (21).Check whether the smoothing domain yields according to the yield criterion.

7:
Calculate the consistent tangent operator.

8:
Plastic correction of the stress tensor.

9:
Plastic correction of the consistent tangent operator.10: end for 11: Parallel computing using "remotecall" in Julia language.

Solution of System of Equations
In this process, the internal force of the model is calculated by using the stress obtained from the constitutive relationship and the smoothed strain matrix.Then, the displacement boundary conditions are applied by the direct method; that is, the corresponding rows and columns with displacement boundary conditions of "0" are deleted.A logical array Q is designed, which sets the displacement boundary condition of "0" to "0" and the rest to "1".Then, the stiffness matrix, displacement and force are calculated with a logical array index.After that, the Pardiso.jlpackage is added and the "MKLPardisoSolver" solver in the package is used to solve the system of equations.Finally, the node displacement increment "dU" of one Newton iteration in an incremental step can be obtained.
In this paper, the semismooth Newton method is employed to solve nonlinear system of equations and check whether iteration is convergent according to Algorithm 3. "MKLPar-disoSolver" is the solver in the Pardiso.jlpackage, Q is the logical array corresponding to the displacement boundary conditions, f is the external force vector, F is the internal force vector, the subscript k represents the kth incremental step and the superscript it represents the it-th iteration step, and U 2 e = U T K elast U.In each Newton iteration, the tangent stiffness matrix K tangent is used to solve the linear problem, which corresponds to the system of linear equations: Algorithm 3 Newton iteration terminates judgment 5: dU it e /( U it−1 k e

Update of Hardening Variable and Plastic Strain
After each incremental step is calculated, the hardening variable and plastic strain need to be updated by using Equations ( 25) and (26).Based on the implicit constitutive integration algorithm of Algorithm 2, the modification of the hardening variable and plastic strain is added.The parallel strategy in this part is consistent with Algorithm 2.

Postprocessing
After the execution of the solver, the widely used visualization software ParaView is used to visualize the numerical computational results.The relevant package WriteVTK.jl in Julia can write VTK XML files and use ParaView to visualize multidimensional datasets [44].The VTK format files support include straight line (.vtr), structured mesh (.vts), image data (.vti), unstructured mesh (.vtu) and polygon data (.vtp) [52].
An unstructured mesh "vtu" format file is designed.Its implementation steps are as follows: (1) we need to define a cell type, which is defined in this paper as "VTKCell-Types.VTK_TRIANGLE", representing the linear triangular element; (2) the "MeshCell" function is used to define the mesh model and obtain an array containing all mesh cells; (3) to generate a "vtu" format file, we need to initialize the file with mesh nodes and element information and then add node displacement data and other information to the file; (4) we can save the file as a "vtu" format file.

Validation and Evaluation of epSFEM
In this section, two sets of benchmark tests are performed on a powerful computational platform to evaluate the correctness and efficiency of epSFEM.The details of the workstation computer used are shown in Table 3.

Validation of the Accuracy of epSFEM
To validate the correctness of epSFEM, we use the model shown in Figure 6a to perform elastoplastic analysis and compare its calculation accuracy with traditional finite element software.In this example, a symmetric displacement boundary condition is set up on the left and bottom of the computational model.The traction force of F t = 200 N/m acts on the top of the model along the normal direction, and the traction force is added in increments through the cyclic load shown in Figure 6b.The elastic parameters are: E = 206,900 (Young's modulus) and µ = 0.29 (Poisson's ratio).The parameters related to plastic materials are specified as follows: a = 1000 , Y = 450 (2/3) .The mesh computational model with 150 triangular elements is illustrated in Figure 7a, and the computational model after constructing the smoothing domain is shown in Figure 7b.
To demonstrate the accuracy of the calculation, the displacement calculation of the model in Figure 6a is conducted, and comparisons are made in the three following cases.
(1) epSFEM is employed to calculate the displacement of a mesh model, which includes 341 nodes and 600 triangular elements (T3 elements); see Figure 8a.
(2) According to Ref. [49], the conventional FEM is used to calculate the displacement of a mesh model, which includes 341 nodes and 600 triangular elements (T3 elements); see Figure 8b.
(3) According to Ref. [49], the conventional FEM is employed to calculate the displacement of a highly accurate mesh model that includes 231,681 nodes and 76,800 eight-node quadrilateral elements (Q8 elements).
The displacements of the top node of the model calculated by the above three methods are compared in Figure 9.As shown in Figure 9, the displacement calculated by epSFEM has higher accuracy than FEM-T3 and slightly lower accuracy than FEM-Q8.Hence, the correctness of epSFEM is proven.

Evaluation of the Efficiency of epSFEM
To better analyze the computing efficiency of epSFEM, the computational efficiency of the serial and parallel versions of the epSFEM are recorded and compared.Five mesh models were created based on the same size model, shown in Figure 6a.Detailed information on the mesh is shown in Table 4.In epSFEM, the calculation procedure can be composed of two steps: (1) assembly of the elastic stiffness matrix and (2) incremental loading and semismooth Newton method iterations.In this paper, we focus on the solution of elastic-plastic problems, so the time consumption is predominantly in the second step, which is composed of multiple incremental cyclic loading steps.Each of the incremental steps can be composed of three stages: (1) assembly of tangent stiffness matrix, (2) solving of system of equations and (3) updating of hardening variables and plastic strain.Since the Pardiso.jlpackage is employed to solve equations in serial and parallel code, the efficiency of solving equations in serial and parallel ways are not discussed.For the assembly of the elastic stiffness matrix, its time consumption accounts for a small proportion in the whole elastic-plastic analysis, which is not discussed in this paper.The parallel method of the hardening variable and plastic strain update part is consistent with the parallel method of tangent stiffness matrix assembly.Therefore, we mainly evaluate the computing efficiency of assembling the tangent stiffness matrix in this paper.
As shown in Figure 10, the time to compute the parallelizable section of the tangent stiffness matrix in the serial and parallel versions for five different scale mesh models is compared.As shown in Figure 10, it takes only approximately 335 s to compute a mesh model, including 2.45 million elements on the parallel version, while it takes approximately 3537.6 s to compute the same model on the serial version.On the 24-core CPU, the parallel speedup can reach 10.6.
To reflect the computational efficiency of epSFEM, we also made a comparison between commercial software and epSFEM in terms of the time required to calculate the five scale models, as shown in Table 4.The total time required for the solver computing is recorded for comparison.As shown in Figure 11, for a model containing 2.45 million elements, ABAQUS requires 10,619 s to compute, while the parallel version of epSFEM needs only 5876.3 s to complete the computation.The parallel version of epSFEM is approximately 1.8 times faster than ABAQUS.ABAQUS was also used to calculate the displacements for a mesh model with 341 nodes and 600 triangular cells and to compare the displacements obtained by ABAQUS with those obtained by epSFEM_T3 and FEM_Q8 in Section 4.1.Using the displacement solution of FEM_Q8 as the reference solution, it can be seen that the displacement calculation accuracy of epSFEM is higher than that of ABAQUS; see Figure 12.It can be seen from the above results that the calculation time of epSFEM is shorter than that of ABAQUS when calculating the same mesh model, and the calculation accuracy of epSFEM is higher than that of ABAQUS, so the calculation efficiency of epSFEM is higher than that of ABAQUS.

Discussion
In this section, the capability, strengths and weaknesses of the epSFEM software package, as well as the future direction of work, are discussed.

Computational Accuracy
The accuracy of the calculation is the first guarantee for whether a software package can be used.To verify the correctness of the epSFEM calculation, a numerical example is used in Section 4.1.As listed in Table 5, the total displacement results of the six nodes at the top y = 10 m of the model are selected for comparison.Taking the displacements of FEM-Q8 as the baseline and comparing the displacements of epSFEM-T3 and FEM-T3 with them, it can be shown that the displacements of epSFEM-T3 are significantly closer to the baseline.The result difference is expressed by relative error.As seen in Table 5, for the node displacement at x = 0 and y = 10 m, the error of FEM-T3 compared with FEM-Q8 is 25.24%, while the error of epSFEM-T3 compared with FEM-Q8 is only 2.96%.This is because the S-FEM is based on the smoothing domain calculation that optimizes the system stiffness matrix and enables the displacements to be closer to the reference values.In this paper, the efficiency of computation is contrasted in two aspects: parallel speedup of parallelizable code and solver computation time; see Figures 10 and 11.
In this paper, we recorded the time required to compute the parallelizable portion of the tangent stiffness matrix, that is, constitutive integral algorithm, for seven different size mesh models using serial and parallel epSFEM.As shown in Table 6, the parallel speedup is 10.2 for the computing model with 38,400 elements, increases to 14.8 for the computing model with 0.6 million elements and decreases to 10.0 for the computational model with 1.2 million elements, after which the parallel speedup increases slightly with the increase of the computational model size and basically stabilizes.The reasons why the parallel speedup shows a pattern of increasing then decreasing and finally converging as the mesh scale increases are analyzed are as follows: (1) Parallel computing includes the time to allocate tasks; the amount of computation allocated to each process cannot be exactly the same, and there is the problem of load imbalance for each process, so the parallel speedup cannot reach the ideal parallel speedup.(2) When the mesh scale is small, such as 38,400 to 614,400, the total computation time increases as the mesh scale increases, the percentage of assigned tasks in the total time decreased, and the parallel speedup increases.(3) When the mesh scale increases to 1.2 million, the performance of the code decreases due to the larger memory allocation required and the increased garbage collection time during the code run.In the benchmark tests of this paper, the above effects do not have a significant impact on the overall performance of epSFEM as the scale continues to increase.On the contrary, it tends to a steady state.
TimerOutputs.jlpackage is used to test the time consumption and memory allocation in each part of the calculation process and generate the formatted table to output [53].As listed in Table 7, the allocation of time and memory for each part of the parallel epSFEM solver when the number of elements is 600,000.Table 7 shows that the time proportion of the elastic stiffness matrix is very small, which is only 0.07% when the number of elements is 600,000.Therefore, we focus on the time and memory consumption of each part of the incremental loading and the semismooth Newton iteration, which is the plastic section in Table 7. Figure 13 presents the time occupancy of the tangent stiffness matrix assembly, solving equations, hardening variables and plastic strain updating when calculating the model with 2.45 million elements using the serial and parallel versions of epSFEM.Because the hardening variable and plastic strain only need to be updated once for each incremental step, the time proportion is the smallest.The tangent stiffness matrix assembly and solving equations need to be calculated not only for each incremental step, but also for each iteration, so the time proportion is longer.As shown in Figure 13, the proportion of time spent solving the equations in parallel computing is considerably larger than in serial computing, accounting for approximately 80%.In summary, epSFEM combines the features of incremental theory and the parallel strategy of the Julia language to achieve a parallel and efficient incremental S-FEM for solving the elastoplastic problem.Although epSFEM can take full advantage of multicore processors, it still requires a considerable amount of time to solve linear system equations for large sparse matrices.Moreover, due to the use of incremental theory, the calculation of the latter incremental step depends on the previous incremental step, and multiple incremental steps cannot be calculated in parallel, which also limits the computational efficiency of the code.

Comparison with Other S-FEM Programs
Compared with the S-FEM packages implemented with C++, epSFEM code is more readable and convenient for further development, and has lower requirements for programming ability.In contrast with the S-FEM packages implemented by MATLAB, epSFEM does not require the payment of licensing fees for the Julia language; additionally, the computational efficiency of the Julia language is higher than that of MATLAB.Moreover, epSFEM has a clear structure and modular implementation, and each calculation step is highly customized and has the characteristics of high efficiency and simplicity.
The epSFEM is suitable to more common and complex elastoplastic mechanical problems in practical engineering and has a wider range of applications than the elastic S-FEM package implemented using the Julia language.In contrast with the elastic-plastic S-FEM package with total strain theory realized by the Julia language, epSFEM uses incremental theory suitable for most loading situations to solve elastic-plastic problems, and the calculation results are more reliable and accurate.

Outlook and Future Work
epSFEM is an incremental ES-FEM to solve two-dimensional elastoplastic problems.The next step is to expand it to an incremental FS-FEM to solve three-dimensional elastoplastic problems.Currently, the S-FEM has been commonly utilized in material mechanics and biomechanics, but it is still less applied in the field of geotechnical mechanics [54,55].We plan to extend epSFEM to use the Mohr-Coulomb criterion combined with the strength reduction method to analyze the deformation and failure of slopes.With the maturity of artificial intelligence technology such as machine learning and deep learning, mechanical analysis and numerical simulation methods can be well integrated with machine learning, which provides a new direction for computational mechanics [56][57][58][59].In the future, the authors wish is to use machine learning combined with epSFEM to solve partial differential Equations (PDEs), or study parameter inversion.

Conclusions
In this paper, a parallel incremental S-FEM package epSFEM for elastic-plastic problems has been designed and implemented by the Julia language on a multicore CPU.epSFEM has a clear structure and legible code and can be easily developed further.epSFEM utilizes incremental S-FEM to solve elastic-plastic mechanics problems for complex load cases more common in practical engineering, and the calculation results are more accurate and reliable.A partitioned parallel strategy was designed to improve the computational efficiency of epSFEM.This strategy can avoid conflicts when accessing the underlying data in parallel computing.To demonstrate the correctness of epSFEM and assess its efficiency, two sets of benchmark tests were performed in this paper.The results indicated that (1) when calculating the same mesh model, the calculation accuracy of epSFEM is higher than that of the traditional FEM; (2) it requires only 5876.3 s to calculate an elastoplastic model, consisting of approximately 2.45 million T3 elements using the parallel epSFEM software package, while it needs 10,619 s to calculate the same model using the commercial FEM software ABAQUS; (3) on a 24-core CPU, the parallel execution of epSFEM is approximately 10 times faster than the corresponding serial version.

Figure 1 .
Figure 1.Flow chart of the FEM and ES-FEM.

Figure 2 .
Figure 2. Polygon element mesh and the edge-based smoothing domain in ES-FEM.

Figure 3 .
Figure 3. Illustration of the interpolation distribution of the Gaussian integration points of the ES-FEM shape function.

Table 1 .
The shape function entries at different points on the boundary of the smoothing domain connected to the outer edge 1-2 in Figure3.

Figure 4 .
Figure 4. Illustration of the calculation procedure of incremental loading and semismooth Newton method iterations.

Figure 6 .Figure 7 .
Figure 6.(a) Simplified 2D geometry of the elastic-plastic problem and (b) history of the traction force.

Figure 8 .
Figure 8.(a) The contour of displacement calculated using epSFEM and (b) the contour of displacement calculated using FEM-T3.
p l a c e m e n t ( m ) T h e x -c o o r d i n a t e o f t h e n o d e ( m )

Figure 9 .
Figure 9.Comparison curves of node displacements at the top of the model calculated by different methods.

Figure 10 .
Figure 10.Comparison of serial and parallel epSFEM computing time of the parallelizable section of the tangent stiffness matrix.

Figure 11 .
Figure 11.Comparison of the computation time of serial and parallel epSFEM and ABAQUS solvers for elastic-plastic problems.

Figure 12 .
Figure 12.Comparison curves of node displacements at the top of the model calculated with ABAQUS and epsFEM.

Figure 13 .
Figure 13.The proportion of time in each part of the epSFEM solver when calculating the model with 2.45 million elements using (a) the serial version of epSFEM and (b) the version of epSFEM.

Table 2 .
The shape function entries at different points on the internal smoothing domain connected with the inner edges 3-5 in Figure3.

:
E, Ep_prev, Hard_prev, shear, bulk, a, Y, S, DS, IND_p Output: S, DS, IND_p 1: Set the number of CPU cores for the Julia program.2: Set E, Ep_prev, Hard_prev, S, DS, IND_p to ShareArray.3: Assign the number of tasks for each process according to the number of processes.
4: for number of tasks in each process do 5:

Table 3 .
Specifications of the workstation computer for performing the benchmark tests.

Table 4 .
Details of the used five mesh models.

Table 5 .
Validation of the accuracy of the epSFEM by comparison of calculated displacements.

Table 6 .
The parallel speedup of the parallelizable section of the tangential stiffness matrix.

Table 7 .
Time and memory allocation of each part of the parallel epSFEM solver when the number of elements is 600,000.