Design and Computational Efficiency of a GPU-Resident Integrated Execution Pipeline for Explicit Large-Deformation Finite Element Analysis

Kim, Honglae; Hong, Seokmoo; Kim, Naksoo

doi:10.3390/jmmp10060197

by Honglae Kim ¹, Seokmoo Hong ²,* and Naksoo Kim ¹,*

¹ Department of Mechanical Engineering, Sogang University, Seoul 04107, Republic of Korea
² Department of Future Automotive Engineering, Kongju National University, Cheonan 31080, Republic of Korea
* Authors to whom correspondence should be addressed.

J. Manuf. Mater. Process. 2026, 10(6), 197; https://doi.org/10.3390/jmmp10060197

Submission received: 24 April 2026 / Revised: 25 May 2026 / Accepted: 30 May 2026 / Published: 3 June 2026


Abstract

We describe a GPU-resident execution pipeline for explicit large-deformation finite element analysis in which every stage of the timestep—internal force evaluation, contact processing, nodal update, time integration, and minimum edge-length reduction—operates on arrays that remain in device memory, so per-step bulk transfers across PCIe are avoided. Contact is handled on the device through a shared-memory brute-force proximity search with warp-ballot stream compaction. We exercise the solver on a hemisphere compression benchmark at six mesh resolutions (83 K–1.89 M elements). On an NVIDIA L40, per-step speedups over a single CPU core range from about 99× to 138×, increasing with problem size and approaching a plateau near 137× for the largest meshes (above roughly 1 M elements); the contact-enabled configuration adds a net ON/OFF overhead of +13% to +21% to the step time. Against LS-DYNA running in SMP mode on the same problem, the proposed solver is roughly 94× faster than the best 8-core configuration, a margin consistent with the multicore saturation observed in the SMP measurements. The remaining limitations—single-GPU execution, FP32 arithmetic, and rigid-body contact search without a BVH broad phase—are identified as specific targets for multi-GPU, mixed-precision, and scalable-contact extensions.

Keywords:

GPU computing; explicit finite element method; large-deformation contact; contact algorithm; CUDA; computational efficiency; benchmark validation

1. Introduction

Explicit time integration has long been the workhorse for large-deformation, high-rate problems in metal forming, crash, and blast analysis [1,2,3]. The conditional stability of the central-difference scheme forces very small increments, so a single analysis routinely runs for hundreds of thousands to millions of steps. This regime is consistent with GPU throughput characteristics—short, repetitive, memory-bound work with abundant element- and node-level parallelism. Yet, most GPU implementations reported to date still straddle the host and the device: the internal-force kernel runs on the GPU while contact search and time integration stay on the CPU [4,5,6,7]. That arrangement forces mesh state to cross PCIe every timestep, which turns bus bandwidth, rather than arithmetic throughput, into the limiting factor.

In this study, the complete timestep is kept on the device. Internal force computation, contact processing, nodal update, time integration, and minimum edge-length reduction are scheduled as a single consecutive sequence of CUDA kernels that all read and write device-resident arrays. The only quantities that cross PCIe per step are two scalars: the minimum edge-length squared (4 bytes, synchronous, needed for the CFL update) and a fracture flag (4 bytes, asynchronous). Per-step host traffic, therefore, totals 8 bytes, which is negligible against the bandwidth of any modern interconnect and, more importantly, no longer a function of mesh size.

The contributions of this work are the following.

A GPU-resident integrated execution pipeline (Figure 1) in which the five core kernels form a single device-side sequence, removing the bulk mesh and contact round-trips that dominate partial-offload schemes (Section 3.5).
A contact stage that is fully resident on the device: proximity search is performed with shared-memory tiling, the penetrating-node list is built by warp-ballot stream compaction, and contact forces are written back via atomicAdd into the device external-force array, eliminating every PCIe transfer associated with contact (Section 3.4).
A displacement-Jacobian (H-formulation) update for the deformation gradient that avoids catastrophic cancellation in single precision when F is close to I, keeping FP32 viable for large portions of the analysis (Section 3.6).

Alongside these architectural contributions, we contrast the GPU-integrated path against the SMP multicore scalability of LS-DYNA on the same benchmark, which gives an empirical reading of how far CPU parallelization can be pushed under the conditions we tested before synchronization and memory-contention overheads take over.

2. Related Work

2.1. GPU-Accelerated Finite Element Analysis

GPU-accelerated finite element analysis has been pursued since the late 2000s. Early work reduced explicit-dynamics cost at the formulation level [4], ported TLED to the GPU for real-time surgical simulation [5], and implemented nonlinear FEM kernels in CUDA-based frameworks [8,9].

For structural mechanics, Huthwaite [6] and Bartezzaghi et al. [7] demonstrated GPU acceleration for elastodynamics and thin-shell explicit dynamics, while Simpson et al. [10] and Shi et al. [11] highlighted both the scalability potential and the remaining bottlenecks in nonlinear structural dynamics on modern GPU hardware.

Full-pipeline studies are fewer and differ from the present explicit contact setting. Fu et al. [12] focused on assembly and an AMG-preconditioned CG solve for elliptic/implicit problems, whereas Johnsen et al. [13,14] and Cao et al. [15] advanced GPU contact and multi-GPU explicit FEM in application domains and architectures distinct from the single-GPU, fully resident timestep pipeline examined here.

2.2. Contact Algorithms

Contact is typically the single most expensive stage in an explicit analysis. Hallquist et al. [1] established the sliding-interface and contact-impact machinery used in large-scale Lagrangian codes, including the lineage that LS-DYNA inherits. Zhong [16] set out finite element procedures for contact-impact in a unified way, and Belytschko et al. [2] treat the theoretical and numerical sides in detail; the LS-DYNA theory manual [3] remains the canonical reference for penalty-based node-to-segment contact in production explicit FEM codes.

At production scale, contact search is almost always split into a broad phase (candidate-pair identification) and a narrow phase (penetration and force evaluation), with bounding-volume hierarchies and spatial hashing used to keep the broad-phase candidate set tractable. Attaway et al. [17] already showed parallel contact search with dynamic load balancing in PRONTO3D. Johnsen et al. [14] combined BVH construction and update with penalty-based response in a matrix-free explicit FEM setting and made the case that data-structure design is where contact bottlenecks are either resolved or entrenched.

2.3. Differentiation from Prior Work

Relative to this body of work, the present study addresses a gap that has been noted but not directly examined in the existing literature. Most GPU implementations of explicit FEM fall into one of three categories. The first offloads only the internal-force kernel and leaves contact search and time integration on the CPU [5,6,7]; the analysis then pays a PCIe transfer on every step, and contact remains a host-side cost. The second is domain-specific and has been driven by real-time soft-tissue or surgical simulation [8,9], with constitutive choices that do not translate directly to industrial metal forming. The third is explicitly contact-free, covering linear or weakly nonlinear elastodynamic problems [6,7]. Commercial vendors have also reported GPU acceleration in LS-DYNA [18,19], but the publicly documented effort has concentrated on linear algebra in the implicit solution path rather than on the explicit dynamics loop. What has not been reported, to our knowledge, is a single GPU-memory-resident execution pipeline in which internal force, contact, nodal update, and time-increment evaluation are scheduled as one consecutive device-side sequence, combined with a direct comparison against the SMP multicore scalability of a production explicit FEM code on the same problem.

The present work addresses that gap along two axes. First, the five core stages of the timestep—internal force evaluation, contact processing, nodal update, time integration, and minimum edge-length reduction—are arranged so that they execute consecutively in device memory inside a single CUDA stream (Section 3.5). Second, the resulting performance is characterized through three targeted DOE studies that vary mesh scale, contact activation, and multicore CPU thread count, so that the contribution of each factor to the per-step cost can be read directly rather than inferred (Section 4).

The novelty of this work is not the acceleration of an isolated element or contact kernel. It is the full-timestep residency of the explicit loop: internal-force evaluation, contact processing, nodal update, time integration, and minimum-edge reduction remain in device memory and are scheduled as a consecutive CUDA pipeline. Table 1, therefore, compares prior work primarily by GPU-resident scope and contact treatment, not only by reported speedup. The revised validation further supports this architecture through mesh-scaling measurements, contact ON/OFF decomposition, LS-DYNA SMP comparison, and a GPU FP32 versus CPU FP64 field-quantity accuracy check.

3. Methods

3.1. Governing Equations

The formulation is updated Lagrangian with explicit central-difference time integration. The semi-discrete form of the equation of motion is

$$ M a = f_{ext} + f_{contact} - f_{int} \qquad (1) $$

where $M$ is the lumped mass matrix, $f_{int}$ is the internal force vector, $f_{ext}$ is the external force/load vector, and $f_{contact}$ is the contact force vector. The central-difference scheme yields

$$ v^{n+\frac{1}{2}} = v^{n-\frac{1}{2}} + a^{n} \Delta t^{n}, \qquad x^{n+1} = x^{n} + v^{n+\frac{1}{2}} \Delta t^{n+\frac{1}{2}} \qquad (2) $$
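The update in Eq. (2) can be illustrated on a single-degree-of-freedom spring-mass system. The sketch below is not the solver's implementation: the mass, stiffness, and function name are hypothetical, the time increment is held constant (so $\Delta t^{n}$ and $\Delta t^{n+\frac{1}{2}}$ coincide), and the only force is a linear internal force. It shows the conditional stability mentioned in the Introduction: the response stays bounded below the critical increment $2/\omega$ and diverges just above it.

```python
import math

def central_difference(m, k, x0, v0, dt, n_steps):
    """Integrate m*a = -k*x with the half-step form of Eq. (2) at constant dt."""
    x = x0
    a = -k * x / m
    v_half = v0 - 0.5 * dt * a          # start-up value v^{-1/2}
    history = [x]
    for _ in range(n_steps):
        a = -k * x / m                  # a^n from the equation of motion
        v_half = v_half + a * dt        # v^{n+1/2} = v^{n-1/2} + a^n * dt
        x = x + v_half * dt             # x^{n+1} = x^n + v^{n+1/2} * dt
        history.append(x)
    return history

m, k = 1.0, 100.0                       # hypothetical mass and stiffness
omega = math.sqrt(k / m)
dt_crit = 2.0 / omega                   # stability limit of the scheme for this system

stable = central_difference(m, k, 1.0, 0.0, 0.5 * dt_crit, 200)
unstable = central_difference(m, k, 1.0, 0.0, 1.05 * dt_crit, 200)
print(max(abs(x) for x in stable), max(abs(x) for x in unstable))
```

In a mesh-based analysis the same limit is set by the smallest element, which is why the pipeline tracks the minimum edge length every step.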

3.2. Element Formulation

Four-node tetrahedral (Tet4) and ten-node tetrahedral (Tet10) elements are supported. Tet4 uses single-point integration and Tet10 uses four-point integration. Mean dilatation can be optionally applied to alleviate volumetric locking.

3.3. Constitutive Model

J2 elastoplasticity and rigid-plastic constitutive models are supported. The J2 model employs a radial return-mapping algorithm, and isotropic hardening follows Swift/Voce laws or tabular input.
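As an illustration of the radial return-mapping step, the following sketch performs one small-strain J2 update with linear isotropic hardening. It is a simplified stand-in, not the solver's constitutive routine: linear hardening replaces the Swift/Voce or tabular laws, no objective stress rate is applied, and the function names are hypothetical. The material values are those quoted for the benchmark in Section 5.4, with the tangent modulus interpreted (as an assumption) as a bilinear elastoplastic tangent.

```python
import math

def von_mises(s):
    """Von Mises equivalent stress of a 3x3 stress tensor given as nested lists."""
    p = (s[0][0] + s[1][1] + s[2][2]) / 3.0
    return math.sqrt(1.5 * sum((s[i][j] - (p if i == j else 0.0)) ** 2
                               for i in range(3) for j in range(3)))

def radial_return(d_eps, stress, eqps, E, nu, sigma_y0, H_iso):
    """One J2 radial-return update; returns (new stress, new effective plastic strain)."""
    G = E / (2.0 * (1.0 + nu))                 # shear modulus
    K = E / (3.0 * (1.0 - 2.0 * nu))           # bulk modulus
    tr = d_eps[0][0] + d_eps[1][1] + d_eps[2][2]
    # Elastic trial stress: sigma + K*tr(d_eps)*I + 2G*dev(d_eps)
    trial = [[stress[i][j]
              + 2.0 * G * (d_eps[i][j] - (tr / 3.0 if i == j else 0.0))
              + (K * tr if i == j else 0.0)
              for j in range(3)] for i in range(3)]
    q_trial = von_mises(trial)
    f = q_trial - (sigma_y0 + H_iso * eqps)    # yield function at the trial state
    if f <= 0.0:
        return trial, eqps                     # step is elastic
    d_gamma = f / (3.0 * G + H_iso)            # closed form for linear hardening
    scale = 1.0 - 3.0 * G * d_gamma / q_trial  # radial scaling of the deviator
    p = (trial[0][0] + trial[1][1] + trial[2][2]) / 3.0
    new = [[scale * (trial[i][j] - (p if i == j else 0.0)) + (p if i == j else 0.0)
            for j in range(3)] for i in range(3)]
    return new, eqps + d_gamma

E, nu, sigma_y0, E_t = 210.0e3, 0.3, 250.0, 1.0e3   # MPa, values from Section 5.4
H_iso = E * E_t / (E - E_t)                          # plastic modulus (bilinear assumption)
zero = [[0.0] * 3 for _ in range(3)]
d_eps = [[0.01, 0.0, 0.0], [0.0, -0.005, 0.0], [0.0, 0.0, -0.005]]  # hypothetical increment
stress, eqps = radial_return(d_eps, zero, 0.0, E, nu, sigma_y0, H_iso)
print(von_mises(stress), sigma_y0 + H_iso * eqps)
```

After the return, the equivalent stress lies on the hardened yield surface, which is the consistency condition the algorithm enforces.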

3.4. Contact Algorithm

Contact between rigid surfaces and deformable-body nodes is handled entirely on the device. The kernel proceeds in three stages.

1. Proximity search: For every deformable node, the squared distances to all rigid-surface triangles are computed via shared-memory tiling. Within each warp, ballot/shuffle-based stream compaction constructs the penetrating-node list.
2. Penetration and contact-force computation: For each penetrating node, the penetration depth, normal direction, and contact force (penalty-based) are computed.
3. Force accumulation: Contact forces are accumulated directly into the device-memory external-force array $d_{f_{ext}}$ using atomicAdd.

As the contact algorithm runs to completion in device memory, no PCIe round-trip is required during contact processing.
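The compaction and accumulation steps can be emulated serially to show the indexing logic. The sketch below is a CPU-side analogue, not the CUDA kernel: the Python `ballot` function stands in for the `__ballot_sync` intrinsic, the bit count stands in for `__popc`, and the signed gaps, penalty stiffness, and scalar force are hypothetical placeholders for the triangle-distance search and the vector contact force.

```python
WARP_SIZE = 32

def ballot(predicates):
    """Emulate a warp ballot: bit i of the mask is set when lane i's predicate is true."""
    mask = 0
    for lane, flag in enumerate(predicates):
        if flag:
            mask |= 1 << lane
    return mask

def warp_compact(node_ids, penetrating):
    """Build the compacted list of penetrating node ids, one 32-lane warp at a time."""
    out = []
    for base in range(0, len(node_ids), WARP_SIZE):
        ids = node_ids[base:base + WARP_SIZE]
        flags = penetrating[base:base + WARP_SIZE]
        mask = ballot(flags)
        warp_offset = len(out)                  # on a GPU: one atomic reservation per warp
        out.extend([None] * bin(mask).count("1"))
        for lane, (nid, flag) in enumerate(zip(ids, flags)):
            if flag:
                # Rank of this lane = number of set bits below it in the ballot mask
                rank = bin(mask & ((1 << lane) - 1)).count("1")
                out[warp_offset + rank] = nid
    return out

gaps = [0.5, -0.1, 0.2, -0.3] * 10              # hypothetical signed gaps; negative = penetration
node_ids = list(range(len(gaps)))
penetrating = [g < 0.0 for g in gaps]
compacted = warp_compact(node_ids, penetrating)

k_pen = 1.0e3                                    # hypothetical penalty stiffness
f_ext = [0.0] * len(gaps)
for nid in compacted:
    f_ext[nid] += k_pen * (-gaps[nid])           # stands in for the device-side atomicAdd
print(len(compacted), f_ext[1], f_ext[3])
```

Each lane derives its output slot from the shared mask alone, so the warp needs a single offset reservation rather than one atomic operation per penetrating node.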

3.5. GPU-Resident Integrated Execution Pipeline

Figure 1 shows how the pipeline is laid out. A single timestep on the GPU executes, in order, (1) internal-force computation with the constitutive update and stress evaluation; (2) contact search and contact-force accumulation; (3) the nodal update and time integration for acceleration, velocity, and displacement; and (4) minimum edge-length evaluation for the adaptive time step. An optional asynchronous output submission is triggered on an external cadence and runs off the critical path; it is described in Section 3.7 as an implementation convenience.

Device-to-host traffic per timestep is restricted to two scalars: the minimum edge-length squared (4 bytes, synchronous, needed before the next CFL update) and a fracture flag (4 bytes, asynchronous, inspected one step late). The per-step host-transfer budget is, therefore, 8 bytes in total, and the bulk mesh state never leaves device memory.

This 8-byte count refers only to the mandatory per-step solver-control exchange and excludes optional result output, HDF5/XDMF field output, logging, profiler traces, debug diagnostics, and post-processing transfers.
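To put the 8-byte budget in perspective, the arithmetic below compares it with the size of a single nodal vector field for the L2 mesh, using the node count reported in Section 5.4. Which arrays a partial-offload scheme would actually move each step is not specified in the text, so one three-component FP32 field is taken here as a hypothetical lower bound; the function name is illustrative.

```python
BYTES_FP32 = 4

def nodal_vector_bytes(n_nodes, components=3, bytes_per_value=BYTES_FP32):
    """Size of one nodal vector field (for example, positions) in bytes."""
    return n_nodes * components * bytes_per_value

n_nodes_l2 = 29_872                      # L2 mesh node count from Section 5.4
resident_bytes = 4 + 4                   # min edge-length squared + fracture flag
one_field_bytes = nodal_vector_bytes(n_nodes_l2)
ratio = one_field_bytes / resident_bytes

print(one_field_bytes, ratio)
```

The resident budget stays at 8 bytes regardless of mesh level, whereas the per-field size grows linearly with the node count.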

In the present implementation, the minimum edge-length reduction itself is performed on the GPU, but the final CFL timestep decision is kept host-visible to preserve the existing timestep-control, termination, and output-scheduling logic. This design removes the mesh-size-dependent host transfer while retaining only a scalar synchronization. Moving the final CFL decision fully onto the device is technically possible and is planned together with GPU-resident output scheduling, but it was not required to eliminate the dominant PCIe mesh-state transfer targeted in this study.
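The host-side use of the returned scalar can be sketched as follows. This is an assumed form of the CFL update, not the solver's documented formula: the characteristic length is taken as the minimum edge length, the wave speed as the dilatational speed of an isotropic elastic solid, the density (7850 kg/m³) is a typical steel value not given in the text, and the scale factor 0.67 is the one reported for the LS-DYNA deck in Section 6.

```python
import math

def dilatational_wave_speed(E, nu, rho):
    """Dilatational (P-wave) speed of an isotropic elastic solid, in SI units."""
    K = E / (3.0 * (1.0 - 2.0 * nu))
    G = E / (2.0 * (1.0 + nu))
    return math.sqrt((K + 4.0 * G / 3.0) / rho)

def cfl_time_step(min_edge_sq, wave_speed, scale):
    """Stable increment from the minimum edge-length squared returned by the device."""
    return scale * math.sqrt(min_edge_sq) / wave_speed

c = dilatational_wave_speed(E=210.0e9, nu=0.3, rho=7850.0)    # rho is an assumed value
dt = cfl_time_step(min_edge_sq=1.0e-6, wave_speed=c, scale=0.67)  # hypothetical 1 mm edge
print(c, dt)
```

Returning the squared length keeps the square root off the device reduction; the host takes it once per step.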

3.6. Single-Precision Numerical Stability

The GPU path operates in single precision (FP32). When the deformation gradient F is close to the identity, evaluating F − I directly risks catastrophic cancellation, so we instead work through a displacement-Jacobian:

$$ H = J_{x} \cdot J_{0}^{-1} - I \qquad (3) $$

where $J_{x}$ is the Jacobian of the current coordinates and $J_{0}$ is the Jacobian of the initial coordinates. Recovering F from H preserves the significant digits of H even in the F ≈ I regime. This approach is implemented for the Tet4 element on both CPU and GPU paths; the Tet10 GPU SymGradU path currently uses an (F − I)-based computation rather than H.
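The cancellation risk can be reproduced in one dimension with emulated single precision. The sketch below makes an assumption the text does not state: nodal displacements are available separately from the coordinates, so that H can be formed from displacement differences. Coordinates, displacements, and the helper `f32` are hypothetical; `f32` rounds a Python float to the nearest FP32 value through the standard `struct` module.

```python
import struct

def f32(x):
    """Round a Python float to the nearest IEEE-754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

# Hypothetical 1D element: reference nodes at 100 and 101, nodal strain of 1e-6
X1, X2 = f32(100.0), f32(101.0)
u1, u2 = f32(0.0), f32(1.0e-6)

# Route 1: form current coordinates first, then evaluate F - 1
x1 = f32(X1 + u1)
x2 = f32(X2 + u2)                       # the displacement is below the FP32 spacing near 101
F = f32(f32(x2 - x1) / f32(X2 - X1))
H_from_F = f32(F - 1.0)

# Route 2: form H directly from displacement differences
H_direct = f32(f32(u2 - u1) / f32(X2 - X1))

print(H_from_F, H_direct)
```

In this example the first route loses the strain entirely, because the displacement is absorbed when it is added to a coordinate of magnitude 100, while the second route retains it to FP32 relative accuracy.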

3.7. Asynchronous I/O Pipeline

Result output is optional and off the critical path. When enabled, nodal fields are transferred asynchronously to pinned host buffers and written by a separate host thread through the HDF5 library [20]. This mechanism is an implementation convenience and is not treated as a performance contribution in this paper.

4. Experiments

4.1. Problem Setup

The benchmark case is a quarter-symmetry hemisphere compression. A rigid quarter-spherical punch compresses a deformable specimen, which exercises both contact and the large-deformation regime. Figure 2 shows the quarter-symmetry hemisphere compression setup and a representative deformed configuration.

The constitutive model is J2 elastoplasticity, and contact is treated with a penalty-based node-to-surface method. Detailed material parameters, boundary conditions, contact parameters, and run scripts are provided in the Supplementary Materials.

4.2. Hardware and Software Environment

Table 2 lists the hardware and software environment used for the benchmark runs.

4.3. Experimental Design

Three design-of-experiments (DOE) studies isolate the factors that matter for performance. Table 3 summarizes the mesh levels used in the benchmark.

DOE-A (mesh scaling, CPU vs. GPU): Compares CPU and GPU step-time scalability as a function of element count. Fixed conditions: contact = ON, output disabled during timing measurements, material = J2. Variable: mesh level (L1–L6) × backend (CPU/CUDA). Cases: 6 meshes × 2 backends = 12.

DOE-B (contact overhead): Measures the net GPU step-time overhead associated with enabling contact. In the revision, DOE-B was regenerated using the latest solver build with contact_v2 and CUDA Graph disabled. Fixed settings were output disabled during timing measurements, material = J2, backend = CUDA, and GPU = NVIDIA L40. The variables were contact state (ON/OFF) and mesh level (L2, L3, and L5). The L3 case was added in the revision to avoid drawing the contact-overhead trend from only two mesh levels.

DOE-C (CPU multicore scalability, LS-DYNA comparison): Compares the multicore CPU scalability of the commercial code LS-DYNA (SMP mode) with the single-GPU result. Fixed: contact = ON, output disabled during timing measurements, material = J2. Variable: CPU threads (1, 2, 4, 8, 16, 32), mesh = L6. Cases: 6.

After deduplication, the original DOE-A and DOE-C studies comprised 12 and 6 unique runs, respectively; the revised DOE-B contact-overhead study contributes a further 6 latest-solver ON/OFF timing cases (mesh levels L2, L3, and L5 with contact ON and OFF).

4.4. Measurement Protocol

The run-to-run variation of the revised DOE-A and DOE-B L40 timing measurements is reported in Supplementary Table S5 as min–median–max, standard deviation, and coefficient of variation. The main figures retain median values for readability.

All benchmarks follow the same measurement protocol.

Warm-up exclusion: The first 200 steps are excluded from measurements to allow GPU cache/scheduling stabilization.
Measurement interval: At least 2000 steps were executed after warm-up, and per-step timing logged by the solver was aggregated. For small GPU meshes (L1–L3), the step count was further increased for statistical stability.
Repetition: Three independent runs; the median value is reported.
Metrics: Mean step time (µs/step), integration kernel time (µs/step), contact processing time (µs/step), and total wall time (ms).
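The protocol above can be expressed as a short post-processing routine. The sketch is illustrative: the function names and the synthetic timings are hypothetical, and the use of the sample standard deviation for the coefficient of variation is an assumption, since the text does not say which estimator was used.

```python
import statistics

WARMUP_STEPS = 200
MIN_MEASURED_STEPS = 2000

def run_mean_step_time(step_times_us, warmup=WARMUP_STEPS):
    """Mean per-step time of one run after discarding the warm-up steps."""
    measured = step_times_us[warmup:]
    if len(measured) < MIN_MEASURED_STEPS:
        raise ValueError("need at least 2000 measured steps after warm-up")
    return statistics.fmean(measured)

def summarize_runs(run_means):
    """Min, median, max, standard deviation, and coefficient of variation over runs."""
    stdev = statistics.stdev(run_means)          # sample standard deviation (assumed)
    return {
        "min": min(run_means),
        "median": statistics.median(run_means),
        "max": max(run_means),
        "stdev": stdev,
        "cv": stdev / statistics.fmean(run_means),
    }

# Synthetic example: slow warm-up steps followed by steady-state steps, three runs
runs = [[500.0] * 200 + [250.0 + offset] * 2000 for offset in (0.0, 1.0, 2.0)]
summary = summarize_runs([run_mean_step_time(r) for r in runs])
print(summary)
```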

5. Results

5.1. Mesh-Scaling Performance (DOE-A)

Table 4 summarizes the key results of the DOE-A mesh-scaling benchmark (CPU vs. GPU). Figure 3 shows the CPU and GPU step times and the speedup ratio as a function of element count.

The per-step speedup of the GPU path rises from about 99× at L1 (83 K elements) to about 127× at L3 (384 K elements) and continues to a peak of about 138× at L5 (998 K elements), remaining about 137× at L6 (1.89 M elements) (Figure 3). The speedup increases with problem size and approaches a plateau near 137× for the largest meshes. The lower speedup at L1 reflects kernel-launch and host-side overheads that remain fixed as element count drops.

On the CPU path, step time increases nearly linearly with element count (L1: 24,053 µs, L6: 463,434 µs), whereas the GPU path exhibits a markedly slower rate of increase (L1: 244 µs, L6: 3378 µs). The GPU, therefore, makes proportionally better use of its parallel-compute capacity as the problem grows.
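The quoted speedups follow directly from the step times above; the short calculation below reproduces them and compares how CPU and GPU step times grow between L1 and L6. The element counts are the rounded values given in the text (83 K and 1.89 M), so the growth factors are approximate.

```python
step_time_us = {
    "L1": {"cpu": 24_053, "gpu": 244},
    "L6": {"cpu": 463_434, "gpu": 3_378},
}
elements = {"L1": 83_000, "L6": 1_890_000}      # rounded counts from the text

speedup = {lvl: t["cpu"] / t["gpu"] for lvl, t in step_time_us.items()}
element_growth = elements["L6"] / elements["L1"]
cpu_growth = step_time_us["L6"]["cpu"] / step_time_us["L1"]["cpu"]
gpu_growth = step_time_us["L6"]["gpu"] / step_time_us["L1"]["gpu"]

print(speedup, element_growth, cpu_growth, gpu_growth)
```

For roughly 23 times more elements, the CPU step time grows by about 19 times and the GPU step time by about 14 times, which is the origin of the rising speedup.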

5.2. Contact Overhead (DOE-B)

Figure 4 and Table 5 summarize the revised DOE-B contact-overhead study. To address the reviewer concern that the original DOE-B study used only two mesh levels, the revision adds an intermediate L3 mesh and reports L2, L3, and L5 results. The revised DOE-B runs used the latest solver build with contact_v2 and CUDA Graph disabled. The measured net ON/OFF overhead decreases monotonically with mesh size, from 20.9% for L2 to 16.6% for L3 and 13.0% for L5. This trend supports the interpretation that, for this benchmark, contact remains a bounded surface-related cost relative to the volume-dominated internal-force computation. As in the original study, the contact OFF timing is not zero because, under the present instrumentation, the contact-labeled region also absorbs the minimum-edge/CFL synchronization wait; the reported overhead is, therefore, the net ON/OFF difference.

The revised DOE-A and DOE-B GPU timings use the same latest solver build, contact_v2 backend, CUDA Graph disabled mode, and NVIDIA L40 hardware. The DOE-B contact ON values for L2, L3, and L5 are, therefore, consistent with the corresponding GPU entries in the revised DOE-A mesh-scaling table (Table 4), agreeing to within the run-to-run variation (coefficient of variation below 0.7%).

The decreasing overhead fraction with increasing mesh size (+20.9% → +13.0%) is consistent with the contact computation scaling as $O(n^{2/3})$ (proportional to the number of surface nodes) while the internal-force computation scales as $O(n)$ (proportional to the volume-element count), causing the relative weight of contact to diminish.

The $O(n^{2/3})$ argument assumes geometrically similar mesh refinement of the present three-dimensional benchmark, so that the number of potential surface-contact nodes scales with surface area while the number of volume elements scales with volume. This assumption is not valid for arbitrary geometries, thin structures, highly localized contact patches, self-contact, or multibody contact where the candidate-pair count depends on the broad-phase search strategy.

Because the current timer groups the contact-labeled region together with the CFL/minimum-edge synchronization wait, the DOE-B value should be interpreted as a net ON/OFF overhead under the present instrumentation rather than as a pure narrow-phase contact-kernel time. The three tested mesh levels, therefore, support a benchmark-specific observation: enabling the device-resident contact path increased the measured step time by 13.0–20.9% for this geometry and implementation. They should not be read as a general contact-scaling law for arbitrary multibody or self-contact problems.
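As a rough consistency check on the surface-to-volume argument, a contact cost scaling with $n^{2/3}$ against an internal-force cost scaling with $n$ implies an overhead fraction that decays as $n^{-1/3}$. The calculation below compares that idealized decay with the three measured values. The element counts for L3 and L5 are the rounded figures quoted in Section 5.1, and the two-point exponent fit is only indicative.

```python
import math

elements = {"L2": 162_000, "L3": 384_000, "L5": 998_000}
overhead_pct = {"L2": 20.9, "L3": 16.6, "L5": 13.0}

# Exponent implied by the end points, to compare with the idealized -1/3
fitted_exponent = (math.log(overhead_pct["L5"] / overhead_pct["L2"])
                   / math.log(elements["L5"] / elements["L2"]))

# Overhead predicted from the L2 value under a pure n^(-1/3) decay
predicted_pct = {lvl: overhead_pct["L2"] * (n / elements["L2"]) ** (-1.0 / 3.0)
                 for lvl, n in elements.items()}

print(fitted_exponent, predicted_pct)
```

The implied exponent is about −0.26, so the measured overhead decays somewhat more slowly than the idealized −1/3. That is in line with the caveat that the host-labeled region also contains the synchronization wait.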

5.3. Kernel-Time Decomposition

Two timing views are distinguished here. In the host-labeled timing used for DOE-B, the contact-labeled region includes the blocking minimum-edge/CFL synchronization and can, therefore, appear as a large measured region. In contrast, the CUDA-event kernel-level decomposition isolates device kernel execution and shows that the tetrahedral internal-force kernel dominates the device-side work. Thus, the two observations are not contradictory; they reflect different timer boundaries.

The host-labeled breakdown and the CUDA-event breakdown are, therefore, reported for different purposes. The former quantifies the net ON/OFF timing impact observed by the full timestep driver, including synchronization effects; the latter separates the device kernels and is the appropriate view for identifying where GPU work is actually spent. Supplementary Figure S1 provides the host-labeled timing breakdown for the representative L2 and L5 cases.

The principal CUDA launch mappings, data layouts, timing regions, and output-handling details are summarized in Supplementary Table S4. In the implementation used for the reported timings, the tetrahedral internal-force kernels are element/integration-point parallel, while the nodal update and minimum-edge evaluation use node- and edge-parallel mappings, respectively. Mesh coordinates, velocities, accelerations, lumped masses, connectivity, material state variables, contact flags, and force arrays are stored as contiguous device arrays to maintain coalesced access in the dominant kernels. The contact stage uses shared-memory triangle tiles and warp-ballot compaction to form penetrating-node work items before device-side atomic accumulation into the external-force array. CUDA events bracket individual device kernels, whereas the host-labeled timers bracket higher-level timestep regions and, therefore, include synchronization boundaries.

The CUDA-event per-kernel decomposition summarized in Supplementary Table S4 shows that the per-step device cost is dominated by the tetrahedral internal-force kernel, which rises from about 72% of the device–kernel time at the smallest mesh to about 93% at the largest and scales almost linearly with element count. By contrast, the rigid-surface contact kernel contributes about 5–22% and scales sub-linearly with the contacting surface. The host-labeled contact region appears larger only because it contains the blocking minimum-edge copy that synchronizes the stream and can absorb previously launched asynchronous device work.

Taken together, these measurements support a narrower interpretation than a pure contact-kernel scaling law: for the present benchmark and instrumentation, enabling the contact path adds a bounded net ON/OFF overhead, while the GPU device work remains dominated by element integration at larger mesh sizes. The exact split is implementation-dependent; the profiler traces in Supplementary Figure S1 and the implementation summary in Supplementary Table S4 are provided as the underlying evidence.

The natural next steps are, therefore, twofold: introduce a BVH or spatial-hashing broad phase to reduce contact-search work for complex contact cases, and move the remaining synchronization-sensitive timestep/output bookkeeping further toward device-resident execution.

5.4. Field-Quantity Accuracy: GPU FP32 Versus CPU FP64

Because the GPU implementation uses single-precision (FP32) arithmetic whereas the CPU reference path is evaluated in double precision (FP64), we performed a direct field-quantity comparison on the same hemisphere compression benchmark used for the timing study (Section 5.1). The comparison used the L2 mesh (162,000 Tet4 elements; 29,872 nodes in the displacement output), the J2 elastoplastic material (E = 210 GPa, ν = 0.3, $\sigma_{y}$ = 250 MPa, tangent modulus 1 GPa), and the automatic node-to-surface penalty contact between the rigid hemisphere-shaped punch and the deformable block, with matched material, contact, boundary-condition, and output settings. Because the original benchmark binary predates the separation of the single- and double-precision build targets, the paired comparison was generated with an FP32/FP64-capable baseline build that was verified to reproduce the contact behavior of the paper benchmark; both binaries were compiled from the same solver-source commit so that the cross-binary difference reflects the precision setting. The GPU FP32 run completed in 20.6 s of GPU time and the CPU FP64 reference run in 50.2 min of single-thread CPU time, with the auto-penalty contact stiffness agreeing between the two paths to a relative difference of 1.9 × 10⁻⁵.

Table 6 summarizes the field errors at the final output step and the force-history error over all recorded time steps. The displacement-magnitude field shows a relative L2 error of 0.99% (RMS 0.67 mm); its maximum absolute difference of 2.38 mm is localized on the rim of the rigid hemisphere shell, where kinematic round-off accumulates, rather than on the deformable block. The element-centered von Mises stress field shows a relative L2 error of 0.27%, with a maximum absolute difference of 6.81 MPa and a peak-stress difference of 1.37 MPa (0.18%). The effective plastic strain shows a relative L2 error of 0.81% (maximum absolute difference 6.8 × 10⁻³). The punch reaction-force history differs by 0.31% in curve-relative L2 norm and by 0.22% at the peak value. The reported solver scalars (kinetic energy, maximum von Mises stress, and maximum effective plastic strain) agree between the FP32 and FP64 runs to four to five significant figures (relative differences of order 10⁻⁵).

To examine the large-deformation and near-rigid regions separately, the von Mises error was also evaluated by region (Supplementary Table S2). At the final output step, the global relative L2 error was 0.27% over 162,000 elements, the contact/plastic-region error (effective plastic strain > 1 × 10⁻³) was 0.23% over 151,593 elements, and the near-rigid/low-strain-region error was 0.83% over 10,407 elements. The near-rigid region was interpreted using both absolute errors and denominator-regularized relative errors, because relative measures can be amplified when the reference stress approaches zero; the 95th-percentile denominator-regularized relative error in the near-rigid region was 2.0%, confirming that the elevated relative L2 there is a denominator-regularization effect rather than a precision-loss signature. Per-frame breakdowns are provided in Supplementary Table S3, and field-error histograms and the force–displacement overlay are provided in Supplementary Figures S2–S6.
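The two error measures can be sketched as follows. The relative L2 norm is standard; the denominator-regularized form shown here, which clamps the reference magnitude from below by a floor value, is one plausible choice and is an assumption, since the exact regularization used for the supplementary tables is not given in the text. The data and the 1 MPa floor are hypothetical.

```python
import math

def relative_l2(test, ref):
    """Relative L2 error ||test - ref||_2 / ||ref||_2 over a field."""
    num = math.sqrt(sum((t - r) ** 2 for t, r in zip(test, ref)))
    den = math.sqrt(sum(r ** 2 for r in ref))
    return num / den

def regularized_relative(test, ref, floor):
    """Pointwise relative error with the denominator clamped from below by `floor`."""
    return [abs(t - r) / max(abs(r), floor) for t, r in zip(test, ref)]

# Hypothetical von Mises values (MPa): two loaded elements and one nearly unloaded element
ref = [250.0, 100.0, 0.01]
test = [250.5, 100.2, 0.02]

global_error = relative_l2(test, ref)
plain_pointwise = [abs(t - r) / abs(r) for t, r in zip(test, ref)]
regularized = regularized_relative(test, ref, floor=1.0)
print(global_error, plain_pointwise[2], regularized[2])
```

The nearly unloaded element shows a 100% plain relative error from an absolute difference of only 0.01 MPa, while the regularized measure reports 1%. This is the amplification effect described for the near-rigid region.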

This comparison demonstrates that, for the tested L2 hemisphere benchmark, the GPU FP32 solution reproduces the CPU FP64 displacement, stress, plastic-strain, and force-history quantities within the reported tolerances. It should not be interpreted as a universal guarantee of FP32 accuracy for all contact and large-deformation problems; more severe near-rigid regimes, the Tet10 strain evaluation, and more complex contact configurations remain targets for mixed-precision and broader validation work.

6. Commercial Code Comparison (DOE-C)

Experimental Setup

DOE-C uses the commercial explicit analysis code LS-DYNA (R13.0 MPP-S, double precision, Intel MPI runtime) running in SMP mode to measure CPU scalability as a function of core count (1, 2, 4, 8, 16, 32) for the same hemisphere compression model (1.89 M elements), and compares these results with the present GPU solver result from Table 4, L6.

For reproducibility, the LS-DYNA model used one-point tetrahedral solid elements (ELFORM 10) with a kinematic elastic–plastic material (E = 210 GPa, ν = 0.3, initial yield 250 MPa, tangent modulus 1000 MPa), a rigid punch, an automatic node-to-surface penalty contact, and a timestep scale factor of 0.67. The model was integrated to a termination time of 2.0 × 10⁻³ s using the same 1.89 M-element hemisphere keyword deck employed by the proposed solver. The complete setup, including the per-solver output-cadence settings used during timing, is listed in Supplementary Table S1. Table 7 summarizes the LS-DYNA SMP timing results used in DOE-C.

The CPU scalability of LS-DYNA peaks at 1.80× with 8 cores, then degrades at 16 cores (1.44×) and 32 cores (1.21×). The regression past 8 cores is consistent with the structure of the explicit loop: the work per step is small under lumped mass and single-point Tet4 integration, so the overheads of inter-thread synchronization and memory-access contention outweigh the contribution of additional cores. Table 8 compares the proposed GPU solver with the LS-DYNA SMP configurations.

Compared with the present GPU solver result (Table 4, L6: 3378 µs/step), the proposed solver achieves approximately 168× speedup over LS-DYNA single-core (569,000 µs/step) and approximately 94× speedup over the optimal 8-core configuration (316,800 µs/step). This margin is consistent with the GPU-resident architecture being insulated from the CPU-side scalability ceiling visible in the tested SMP configuration. Figure 5 visualizes the LS-DYNA CPU-core scaling and the direct comparison with the proposed GPU solver.
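The ratios quoted above follow from the reported step times, as the short calculation below shows. Only the single-core and 8-core LS-DYNA step times given in the text are used; the variable names are illustrative.

```python
gpu_step_us = 3_378                               # proposed solver, Table 4, L6
lsdyna_step_us = {1: 569_000, 8: 316_800}         # LS-DYNA SMP, cores -> us/step

gpu_speedup = {cores: t / gpu_step_us for cores, t in lsdyna_step_us.items()}
smp_scaling_8 = lsdyna_step_us[1] / lsdyna_step_us[8]

print(gpu_speedup, smp_scaling_8)
```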

7. Discussion

7.1. CPU Multicore Scalability Saturation and the Advantage of GPU Integration

The saturation visible in the LS-DYNA SMP results is consistent with the structure of the explicit loop. The arithmetic per step in an updated Lagrangian formulation with lumped mass and single-point Tet4 integration is small, and every step still requires a global synchronization for the time-increment decision and for the contact phase. The parallelizable portion of each step is, therefore, limited, which manifests as a classic Amdahl ceiling. Beyond eight cores on the test machine, NUMA-crossing memory access and cache-coherence traffic plausibly come to dominate, and additional cores spend more time synchronizing and waiting on the memory subsystem than doing useful work.
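The Amdahl reading can be made quantitative with the measured 8-core value. The sketch below backs out the parallel fraction that a pure Amdahl model would need in order to give 1.80× on 8 cores, then evaluates what the same model predicts at 16 cores. It is an interpretive aid under the idealized Amdahl assumption (no contention or synchronization cost), not a fit reported by the study.

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Ideal Amdahl speedup for a given parallel fraction and core count."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_cores)

def parallel_fraction_from(speedup, n_cores):
    """Invert Amdahl's law for the parallel fraction."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n_cores)

p = parallel_fraction_from(1.80, 8)               # measured 8-core speedup
ceiling = 1.0 / (1.0 - p)                         # asymptotic limit for infinitely many cores
predicted_16 = amdahl_speedup(p, 16)

print(p, ceiling, predicted_16)
```

Under these assumptions the parallel fraction is about 0.51, the asymptotic ceiling is about 2.0×, and the model still predicts about 1.9× at 16 cores. The measured drop to 1.44× therefore goes beyond a fixed serial fraction and points to overheads that grow with core count, such as the contention effects discussed above.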

The GPU-integrated architecture does not aim to outperform CPU-optimized solvers in their own operating regime; it targets a different throughput profile. Thousands of CUDA lanes provide element- and node-level SIMT parallelism; peak memory bandwidth on the L40 (864 GB/s) exceeds that of a server DDR5 socket (on the order of 100 GB/s) by close to an order of magnitude; and because every analysis array resides on the device, the PCIe bottleneck is not on the per-step critical path. This behavior is enabled by reducing per-step host–device data exchange and by matching the computation pattern to the GPU memory hierarchy, replacing the inter-thread cache contention of a CPU multicore run with a throughput-oriented parallelization profile.

The GPU speedup rises from about 99× at L1 to about 127× at L3 and to about 135–138× for the larger meshes, plateauing near 137× at the largest sizes, which is consistent with kernel-launch and host-side overheads being amortized once per-step compute dominates; there is no sign of bandwidth or compute saturation up to 1.89 M elements.
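The amortization argument can be made concrete by normalizing the Table 4 step times by element count. The sketch below is only arithmetic on the published table; the helper name is hypothetical.

```python
# Table 4 (DOE-A): element count, CPU step time, GPU step time (microseconds).
table4 = {
    "L1": (82_944, 24_053, 243.6),
    "L2": (162_000, 44_966, 386.0),
    "L3": (384_000, 104_012, 822.5),
    "L4": (750_000, 187_375, 1391.0),
    "L5": (998_250, 251_752, 1829.0),
    "L6": (1_886_592, 463_434, 3378.0),
}

def per_element_ns(elements: int, step_us: float) -> float:
    """Step time divided by element count, in nanoseconds per element."""
    return step_us * 1000.0 / elements

rows = {}
for level, (n_elem, cpu_us, gpu_us) in table4.items():
    speedup = cpu_us / gpu_us
    gpu_ns = per_element_ns(n_elem, gpu_us)
    rows[level] = (speedup, gpu_ns)
    print(f"{level}: speedup = {speedup:.1f}x, GPU = {gpu_ns:.2f} ns/element")
```

The GPU cost per element falls from roughly 2.9 ns at L1 to roughly 1.8 ns at L6 and changes only slightly between L4 and L6, which matches the picture of fixed per-step overheads being amortized at larger mesh sizes.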

7.2. Current Efficiency and Future Directions for the Contact Algorithm

DOE-B places the net contact overhead at +13% to +21% (L5 and L2, respectively). Because the contact-labeled host timer includes the CFL-related synchronization, this percentage is reported as a net ON/OFF overhead rather than as a pure narrow-phase contact-kernel cost.
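The net ON/OFF overhead is the relative difference between the two step times in Table 5. The short sketch below recomputes it and also reports the absolute difference; the function name is hypothetical.

```python
# Table 5 (DOE-B): contact ON / OFF step times in microseconds (3-run medians).
table5 = {
    "L2": (386.3, 319.6),
    "L3": (822.4, 705.1),
    "L5": (1826.4, 1616.9),
}

def net_overhead_percent(on_us: float, off_us: float) -> float:
    """Relative step-time increase when contact is enabled."""
    return 100.0 * (on_us - off_us) / off_us

overhead = {}
absolute_us = {}
for level, (on_us, off_us) in table5.items():
    overhead[level] = net_overhead_percent(on_us, off_us)
    absolute_us[level] = on_us - off_us
    print(f"{level}: +{overhead[level]:.1f}% ({absolute_us[level]:.1f} us/step)")
```

The relative overhead shrinks with mesh size while the absolute difference grows (about 67 µs at L2 and about 210 µs at L5), so the contact-enabled path costs more in absolute terms on larger meshes but takes a smaller share of the step.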

As a supplementary check on the contact-force output, we also evaluated a rounded flat-punch normal-contact problem against the analytical reference solution of Willert [21]. This comparison is included in Appendix A because it verifies the pressure reconstruction and force accumulation of the current rigid–deformable contact path under a controlled elastic contact condition, rather than the full hemisphere field solution. For the coarse Tet4 case at δ = 0.1 mm, the full normal force differed from the reference by 1.69%, the active contact radius by 1.0%, and the mean contact pressure by 2.6%; the pressure-area integral and the accumulated nodal contact force agreed to a relative difference of 6.3 × 10⁻¹¹. This result supports the consistency of the contact-output path within the implemented rigid–deformable formulation; however, it does not address deformable–deformable contact, self-contact, frictional sliding, multiple contact bodies, or the FP32-versus-FP64 field-accuracy question.

A BVH-based broad phase would bring the contact-search complexity down toward O(N log N) and shift the relative weight of the pipeline back onto the integration kernels. Higher-fidelity contact formulations (for example, frictional response with a regularized tangential law or geometrically consistent surface-to-surface couplings) complement rather than replace the penalty-based stage in use here [1,16]; porting such formulations to a GPU-resident implementation is left to subsequent work.

7.3. Limitations

Three limitations should be stated plainly. First, the GPU path uses single-precision (FP32) arithmetic, which leaves a residual significant-digit risk in near-rigid regimes and will likely require a mixed-precision strategy before the Tet10 path can be pushed to production. Second, all measurements reported here were taken on a single NVIDIA L40 up to L6 (1.89 M elements); problems substantially larger than that need a multi-GPU extension, which is beyond the scope of the present study. Third, the LS-DYNA comparison is restricted to SMP (shared-memory) mode; the MPP (distributed) scalability picture is substantively different and warrants a separate study in its own right. The present study focuses on execution architecture and computational efficiency; broader accuracy validation against experimental data or commercial-code field quantities is left to subsequent work.
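The significant-digit risk of FP32 can be illustrated with a generic example of increment absorption: when a small update is added to a much larger stored value, the update is lost if it is below half a unit in the last place. The sketch below emulates FP32 rounding with the standard `struct` module. The coordinate, increment, and step count are hypothetical values chosen for illustration, and the FP64 accumulator is only one common mixed-precision option, not necessarily the strategy the solver will adopt.

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

coordinate_mm = 1000.0  # hypothetical nodal coordinate
increment_mm = 1.0e-5   # hypothetical per-step displacement increment
steps = 10_000

x32 = f32(coordinate_mm)  # FP32 accumulator
x64 = coordinate_mm       # FP64 accumulator fed with the same FP32 increments
for _ in range(steps):
    # FP32 spacing near 1000 is about 6.1e-5, so a 1e-5 increment is
    # below half a ULP and is rounded away on every step.
    x32 = f32(x32 + f32(increment_mm))
    x64 = x64 + f32(increment_mm)

print(f"FP32 accumulator: {x32!r}")
print(f"FP64 accumulator: {x64!r}")
```

In this example the FP32 accumulator never leaves 1000.0, while the FP64 accumulator reaches approximately 1000.1 from the same increments.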

Although the added L2 comparison shows close agreement between the GPU FP32 and CPU FP64 paths for displacement, von Mises stress, effective plastic strain, and force history (Section 5.4), the result is still benchmark-specific. It does not remove the need for mixed-precision treatment in more severe near-rigid regimes or in the Tet10 strain path, nor does it address deformable–deformable contact, self-contact, frictional sliding, multibody contact, or multi-GPU execution.

The present contact implementation should also be interpreted within its intended scope. It treats rigid-surface to deformable-node penalty contact using a shared-memory tiled search and does not yet include a BVH or spatial-hashing broad phase. Consequently, the present contact-overhead result should not be extrapolated to deformable–deformable contact, self-contact, multiple interacting bodies, or frictional contact. These cases require a separate broad-phase data structure, candidate-pair management strategy, and validation campaign.
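To indicate what a spatial-hashing broad phase changes relative to an exhaustive search, the sketch below compares the two on random points. It is a CPU-side illustration under strong simplifying assumptions: the rigid surface is represented by point samples rather than facets, the cell size equals the search radius, and all function names are hypothetical. It does not reproduce the shared-memory tiled kernel used in this study.

```python
import random
from collections import defaultdict
from itertools import product

def dist2(p, q):
    """Squared Euclidean distance between two 3D points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def brute_force_pairs(nodes, surface_pts, radius):
    """All (node, surface point) index pairs closer than `radius`: O(N*M)."""
    r2 = radius * radius
    return {(i, j)
            for i, p in enumerate(nodes)
            for j, q in enumerate(surface_pts)
            if dist2(p, q) < r2}

def cell_of(point, cell_size):
    """Integer grid cell containing `point`."""
    return tuple(int(c // cell_size) for c in point)

def spatial_hash_pairs(nodes, surface_pts, radius):
    """Same pairs, but each node only tests the 27 cells around its own."""
    grid = defaultdict(list)
    for j, q in enumerate(surface_pts):
        grid[cell_of(q, radius)].append(j)
    r2 = radius * radius
    pairs = set()
    for i, p in enumerate(nodes):
        cx, cy, cz = cell_of(p, radius)
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            for j in grid.get((cx + dx, cy + dy, cz + dz), ()):
                if dist2(p, surface_pts[j]) < r2:
                    pairs.add((i, j))
    return pairs

rng = random.Random(0)
nodes = [(rng.random(), rng.random(), rng.random()) for _ in range(300)]
surface = [(rng.random(), rng.random(), rng.random()) for _ in range(300)]

reference = brute_force_pairs(nodes, surface, 0.1)
hashed = spatial_hash_pairs(nodes, surface, 0.1)
print(len(reference), reference == hashed)
```

Because two points closer than the cell size can differ by at most one cell index per axis, the hashed search returns the same candidate set while testing only nearby points. Deformable–deformable or self-contact cases would additionally need the grid to be rebuilt as the mesh moves, which is part of the candidate-pair management noted above.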

8. Conclusions

The solver described in this paper targets, within a single coherent execution architecture, the two factors that set the wall-clock cost of explicit large-deformation analysis: contact search and stable time-increment control. All analysis data remain on the device; internal-force evaluation, contact processing, and the nodal update are scheduled as one continuous pipeline; and the only synchronous host exchange per step is an 8-byte scalar pair used for the CFL decision and fracture signaling.
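A minimal host-side sketch of such a per-step decision is given below. It uses the safety factor and growth limit listed in Table 2, but the rest is assumed for illustration: the stable increment is taken in the common form Δt = tssfac · L_min / c, the 8-byte payload is assumed to be one FP32 minimum edge length plus one 32-bit fracture flag, and the wave speed and function names are hypothetical.

```python
import struct

TSSFAC = 0.9      # CFL safety factor (Table 2)
DT_GROWTH = 1.05  # maximum growth of the time increment per step (Table 2)

def next_time_increment(min_edge_mm, wave_speed_mm_per_s, dt_prev_s):
    """CFL-limited increment, capped at 5% growth over the previous step."""
    dt_cfl = TSSFAC * min_edge_mm / wave_speed_mm_per_s
    return min(dt_cfl, DT_GROWTH * dt_prev_s)

def pack_step_scalars(min_edge_mm, fracture_flag):
    """Assumed layout of the per-step exchange: FP32 value + int32 flag."""
    return struct.pack("<fi", min_edge_mm, fracture_flag)

def unpack_step_scalars(payload):
    """Inverse of pack_step_scalars."""
    return struct.unpack("<fi", payload)

payload = pack_step_scalars(0.5, 0)
min_edge, fractured = unpack_step_scalars(payload)

# Hypothetical wave speed of 5.0e6 mm/s and previous increment of 1.0e-7 s.
dt = next_time_increment(min_edge, 5.0e6, 1.0e-7)
print(len(payload), min_edge, fractured, dt)
```

With these illustrative inputs, the CFL bound (9 × 10⁻⁸ s) is the active limit; if the previous increment were much smaller, the 5% growth cap would apply instead.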

On the hemisphere compression benchmark at 1.89 M elements, the solver is approximately 137× faster than a single CPU core and on the order of 94× faster than the best SMP configuration of LS-DYNA (eight cores) on the same problem. The per-step speedup over a single CPU core rises from about 99× at 83 K elements to a peak of about 138× at 998 K elements and is about 137× at 1.89 M elements, with the curve approaching a plateau above roughly 1 M elements once kernel-launch and host-side overheads are amortized. Enabling contact adds a net ON/OFF overhead of +13% to +21% to the step time, which remains within the throughput budget of the pipeline.

The results suggest several directions for future work. A BVH-based broad phase would attack what the kernel-time decomposition identifies as the remaining dominant cost. A multi-GPU extension is needed to reach problem sizes beyond the single-L40 regime studied here. A mixed-precision variant of the Tet10 path would close the FP32 limitation noted in Section 7.3. Finally, a comparison against LS-DYNA in MPP mode, and against other distributed-memory commercial codes, would sharpen the picture of where the GPU-resident approach has a decisive advantage and where it does not.

The next technical objectives are, therefore, (i) a mixed-precision extension of the Tet10 strain update and near-rigid regimes; (ii) a multi-GPU implementation with distributed time-step reduction and inter-GPU contact-candidate exchange; (iii) a BVH or spatial-hashing contact broad phase for deformable–deformable contact, self-contact, and multiple contact bodies; and (iv) a balanced LS-DYNA MPP comparison using matched output cadence, timestep controls, and deck settings.

The present study deliberately uses the hemisphere compression case as a controlled contact-dominated large-deformation benchmark. Extending the validation to more complicated load conditions—bending-dominated load paths, multistage forming sequences, and frictional sliding—is planned as a separate validation study and is not claimed here.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/jmmp10060197/s1. Supplementary_Materials.docx contains Figure S1: host-labeled timing breakdown for representative L2 and L5 cases; Figures S2–S6: GPU FP32 versus CPU FP64 field-error histograms and force–displacement overlay for the L2 hemisphere benchmark; Table S1: LS-DYNA DOE-C solver setup; Table S2: FP32/FP64 region-specific von Mises error at the final output step; Table S3: FP32/FP64 per-frame field-error breakdown; Table S4: CUDA implementation details, including kernel launch mappings, data layouts, timing regions, and output handling; and Table S5: revised DOE-A mesh-scaling and DOE-B contact-overhead L40 three-run timing variation. The supplementary reproducibility package additionally provides benchmark decks, Windows batch scripts, executable/runtime files, processed timing data, profiler logs, field-accuracy comparison data, analysis scripts, figure-generation scripts, build/run provenance files, and supporting reference assets for the DOE-A, DOE-B, DOE-C, GPU FP32 versus CPU FP64, CUDA implementation/timing, run-to-run variation, and Appendix A Willert validation materials.

Author Contributions

Conceptualization, H.K. and N.K.; methodology, H.K.; software, H.K.; validation, H.K. and S.H.; formal analysis, H.K. and S.H.; investigation, H.K.; writing—original draft preparation, H.K.; writing—review and editing, S.H. and N.K.; project administration, S.H. and N.K.; supervision, N.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data supporting the findings of this study are included in this article, Appendix A, and the supplementary ZIP package provided with the revised manuscript. The ZIP package contains benchmark inputs, run scripts, processed timing tables, profiler logs, figure-generation scripts, FP32/FP64 field-accuracy tables and figures, build/run provenance files, and the revised DOE-B L40 three-run timing data with the corresponding contact-overhead figure. The source code of the GPU solver is not publicly available due to proprietary restrictions.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Rounded Flat-Punch Normal-Contact Validation Using the Willert Analytical Solution

The rounded flat-punch validation was added to make the boundary conditions and analytical comparison visible in the revised manuscript package. The case follows the rounded flat-punch geometry used by Willert [21] and evaluates only the normal-contact response. The finite element model uses a quarter domain, symmetry constraints on the two symmetry planes, and a fixed bottom support. The validation is, therefore, a controlled contact-output consistency check for pressure reconstruction and force accumulation, not a replacement for the hemisphere field-accuracy comparison in Section 5.4.

The analytical values in Table A1 were computed directly from the rounded-flat-punch solution used in Willert [21], not by fitting the finite element result. In this notation, $b$ is the flat-punch radius, $R$ is the rounding radius, $d$ is the prescribed normal approach, $a$ is the final contact radius, and $\tilde{a}$ is a dummy contact-radius variable used only inside the MDR integral. The Heaviside function $H(\tilde{a} - b)$ restricts the rounded-punch contribution to $\tilde{a} \geq b$.

Following Willert [21] (Section 5.2) and the MDR identity, the final contact radius is obtained from the normal approach condition:

$$d = g(a)$$

The auxiliary function used for the rounded flat punch is

$$g(\tilde{a}) = \frac{\tilde{a}}{R}\left[\sqrt{\tilde{a}^{2} - b^{2}} - b\cos^{-1}\left(\frac{b}{\tilde{a}}\right)\right] H(\tilde{a} - b)$$

After solving $d = g(a)$ for the final contact radius $a$, the normal force is evaluated from the MDR force identity:

$$F_{N} = \int_{b}^{a} 2E^{*}\,\tilde{a}\,g'(\tilde{a})\,\mathrm{d}\tilde{a}, \qquad E^{*} = \frac{E}{1 - \nu^{2}}$$

Thus, the analytical reference in Table A1 is obtained in two steps: first, solve $d = g(a)$ for $a$, and then insert that value of $a$ into the force integral above. Substituting $b = 20$ mm, $R = 10$ mm, $d = 0.1$ mm, $E = 70{,}000$ MPa, and $\nu = 0.3$ gives $a = 20.38$ mm and $F_{N} = 311{,}211$ N. As an independent check of the analytical reference, integrating the reference pressure field over the contact area gives

$$\int_{A} p\,\mathrm{d}A = 310{,}930\ \text{N}$$

This pressure-integral value differs from the force formula by 0.09%.
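The two-step procedure can be sketched numerically as follows. This is an illustrative script, not the one used to produce Table A1. It assumes bisection for the root of d = g(a) and a composite Simpson rule for the force, and it uses the integrated-by-parts form F_N = 2E*(a·d − ∫_b^a g dã), which follows from the force identity because g(b) = 0 and g(a) = d. All function names are hypothetical.

```python
import math

def g(x, b, R):
    """Auxiliary MDR profile of the rounded flat punch; zero for x <= b."""
    if x <= b:
        return 0.0
    return (x / R) * (math.sqrt(x * x - b * b) - b * math.acos(b / x))

def solve_contact_radius(d, b, R, tol=1e-12):
    """Bisection on g(a) = d, assuming the root lies in [b, 2b]."""
    lo, hi = b, 2.0 * b
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid, b, R) < d:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def normal_force(a, d, b, R, E, nu, n=2000):
    """F_N = 2 E* (a d - integral of g over [b, a]); n must be even."""
    e_star = E / (1.0 - nu * nu)
    h = (a - b) / n
    total = g(b, b, R) + g(a, b, R)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * g(b + k * h, b, R)
    integral = total * h / 3.0  # composite Simpson rule
    return 2.0 * e_star * (a * d - integral)

# Parameters from Appendix A: lengths in mm, modulus in MPa, force in N.
a = solve_contact_radius(d=0.1, b=20.0, R=10.0)
F = normal_force(a, d=0.1, b=20.0, R=10.0, E=70_000.0, nu=0.3)
print(f"a = {a:.2f} mm, F_N = {F:.0f} N")
```

With the Appendix A parameters, this sketch is expected to land close to the reference values quoted above (a ≈ 20.38 mm and F_N ≈ 311,211 N); small differences in the last digits would reflect quadrature and root-finding tolerances.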

Figure A1 shows the rounded flat-punch validation setup and boundary conditions used for the Willert analytical comparison.

Figure A1. Rounded flat-punch validation setup and boundary conditions used for the Willert analytical comparison. The schematic is redrawn from the geometry definition of Willert [21] and shows the normal-contact validation used here: $b = 20$ mm, $R = 10$ mm, indentation $\delta = 0.1$ mm, and a quarter finite element model with symmetry constraints and a fixed bottom support. No tangential force or moment is applied in the present normal-contact check.

Table A1. Direct analytical-versus-GPU-solver comparison for the rounded flat-punch normal-contact validation using the Willert analytical solution at δ = 0.1 mm.

Quantity	Analytical Reference	GPU Solver Coarse	Difference
Contact radius a	20.38 mm	20.591 mm	+1.0%
Quarter normal force Fq	77,803 N	79,119 N	+1.69%
Full normal force FN	311,211 N	316,476 N	+1.69%
Mean contact pressure	238.49 MPa	232.25 MPa	−2.6%

Note: The present GPU solver result is the coarse Tet4 validation case. The table reports only direct comparisons between the analytical reference and the finite element result. Two additional solver-side consistency checks were evaluated separately: the pressure-area integral $\sum pA$ and the accumulated nodal contact force differed by $6.3 \times 10^{-11}$, and the accumulated nodal normal force $\sum F_{n}$ and rcforc differed by $3.4 \times 10^{-6}$. These two checks verify force accumulation and output consistency inside the solver; they are not additional analytical reference quantities.

References

1. Hallquist, J.O.; Goudreau, G.L.; Benson, D.J. Sliding interfaces with contact-impact in large-scale Lagrangian computations. Comput. Methods Appl. Mech. Eng. 1985, 51, 107–137.
2. Belytschko, T.; Liu, W.K.; Moran, B.; Elkhodary, K. Nonlinear Finite Elements for Continua and Structures, 2nd ed.; Wiley: Hoboken, NJ, USA, 2014.
3. Hallquist, J.O. LS-DYNA Theory Manual; Livermore Software Technology Corporation (LSTC): Livermore, CA, USA, 2006.
4. Miller, K.; Joldes, G.R.; Lance, D.; Wittek, A. Total Lagrangian explicit dynamics finite element algorithm for computing soft tissue deformation. Commun. Numer. Methods Eng. 2007, 23, 121–134.
5. Joldes, G.R.; Wittek, A.; Miller, K. Real-time nonlinear finite element computations on GPU—Application to neurosurgical simulation. Comput. Methods Appl. Mech. Eng. 2010, 199, 3305–3314.
6. Huthwaite, P. Accelerated finite element elastodynamic simulations using the GPU. J. Comput. Phys. 2014, 257, 687–707.
7. Bartezzaghi, A.; Cremonesi, M.; Parolini, N.; Perego, U. An explicit dynamics GPU structural solver for thin shell finite elements. Comput. Struct. 2015, 154, 29–40.
8. Comas, O.; Taylor, Z.A.; Allard, J.; Ourselin, S.; Cotin, S.; Passenger, J. Efficient nonlinear FEM for soft tissue modelling and its GPU implementation within the open source framework SOFA. In Proceedings of the 4th International Symposium on Biomedical Simulation (ISBMS 2008), London, UK, 7–8 July 2008; Lecture Notes in Computer Science 5104; Springer: Berlin/Heidelberg, Germany, 2008; pp. 28–39.
9. Courtecuisse, H.; Allard, J.; Kerfriden, P.; Bordas, S.P.A.; Cotin, S.; Duriez, C. GPU-based real-time soft tissue deformation with cutting and haptic feedback. Prog. Biophys. Mol. Biol. 2010, 103, 159–168.
10. Simpson, B.G.; Zhu, M.; Seki, A.; Scott, M.H. Challenges in GPU-accelerated nonlinear dynamic analysis for structural systems. J. Struct. Eng. 2023, 149, 04022253.
11. Shi, Y.; Nie, N.; Wang, J.; Lin, K.; Zhou, C.; Li, S.; Yao, K.; Li, S.; Feng, Y.; Zeng, Y.; et al. Large-scale simulation of structural dynamics computing on GPU clusters. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’23), Denver, CO, USA, 12–17 November 2023; ACM: New York, NY, USA, 2023; pp. 1–14.
12. Fu, Z.; Lewis, T.J.; Kirby, R.M.; Whitaker, R.T. Architecting the finite element method pipeline for the GPU. J. Comput. Appl. Math. 2014, 257, 195–211.
13. Johnsen, S.F.; Taylor, Z.A.; Clarkson, M.J.; Hipwell, J.; Modat, M.; Eiben, B.; Han, L.; Hu, Y.; Mertzanidou, T.; Hawkes, D.J.; et al. NiftySim: A GPU-based nonlinear finite element package for simulation of soft tissue biomechanics. Int. J. Comput. Assist. Radiol. Surg. 2015, 10, 1077–1095.
14. Johnsen, S.F.; Taylor, Z.A.; Han, L.; Hu, Y.; Clarkson, M.J.; Hawkes, D.J.; Ourselin, S. Detection and modelling of contacts in explicit finite-element simulation of soft tissue biomechanics. Int. J. Comput. Assist. Radiol. Surg. 2015, 10, 1873–1891.
15. Cao, X.; Zhao, X.; Liu, Z.; Pei, Y.; Cai, Y.; Cui, X. A multi-GPU explicit finite element framework with a parallel contact algorithm for drop testing of electronic products. Adv. Eng. Softw. 2026, 213, 104086.
16. Zhong, Z.H. Finite Element Procedures for Contact-Impact Problems; Oxford University Press: Oxford, UK, 1993.
17. Attaway, S.W.; Hendrickson, B.A.; Plimpton, S.J.; Gardner, D.R.; Vaughan, C.T.; Brown, K.H.; Heinstein, M.W. A parallel contact detection algorithm for transient solid dynamics simulations using PRONTO3D. Comput. Mech. 1998, 22, 143–159.
18. Posey, S.; Kodiyalam, S. Performance benefits of NVIDIA GPUs for LS-DYNA®. In Proceedings of the 8th European LS-DYNA Users Conference, Strasbourg, France, 23–24 May 2011; pp. 1–6.
19. Göhner, U. Usage of GPU in LS-DYNA. In Proceedings of the 11th German LS-DYNA Forum, Ulm, Germany, 9–10 October 2012; DYNAmore GmbH: Stuttgart, Germany, 2012.
20. Folk, M.; Heber, G.; Koziol, Q.; Pourmal, E.; Robinson, D. An overview of the HDF5 technology suite and its applications. In Proceedings of the EDBT/ICDT 2011 Workshop on Array Databases (AD’11), Uppsala, Sweden, 25 March 2011; ACM: New York, NY, USA, 2011; pp. 36–47.
21. Willert, E. Elastic Stress Field beneath a Sticking Circular Contact under Tangential Load. Solids 2024, 5, 14–28.

Figure 1. GPU-resident explicit FEM pipeline. The core kernels execute consecutively in device memory; only mandatory scalar solver-control data are exchanged with the host each timestep. Optional output and diagnostics are excluded from the 8-byte count.

Figure 2. Hemisphere compression benchmark used in this study: (a) initial configuration with a quarter-spherical rigid punch and deformable specimen; (b) representative deformed configuration colored by effective plastic strain.

Figure 3. CPU and GPU step times and speedup ratio as a function of element count for DOE-A. GPU values use the latest solver path with contact_v2 and CUDA Graph disabled; values are the median of three L40 runs. Error bars are omitted because the three-run variation is small at the plotted scale; full min–median–max, standard deviation, and coefficient-of-variation statistics are reported in Supplementary Table S5.

Figure 4. Revised DOE-B contact-overhead study using the latest solver build with contact_v2 and CUDA Graph disabled. Contact ON/OFF step times are shown for the L2, L3, and L5 meshes. Values are the median of three L40 runs. Error bars are omitted because the measured coefficients of variation are 0.02–0.18% and smaller than the plotted markers; full run-to-run statistics are provided in Supplementary Table S5.

Figure 5. LS-DYNA CPU multicore scalability and comparison with the proposed GPU solver. (a) Step time vs. CPU core count with ideal scaling reference. (b) Direct comparison between the optimal LS-DYNA SMP configuration (8 cores) and the proposed GPU solver. DOE-C reports the matched LS-DYNA timing cases used for the direct comparison; the repeated L40 GPU variation statistics for DOE-A/DOE-B are reported separately in Supplementary Table S5.

Table 1. Summary of related work comparison.

Work	GPU-Resident/Contact Scope	Scaling Evidence	Technical Distinction
Miller et al. [4]; Joldes et al. [5]	TLED soft-tissue FEM; formulation-level or internal-force acceleration; contact absent or limited.	Element-level FLOP reduction and real-time clinical-mesh demonstrations.	Partial/domain-specific offload rather than a full explicit contact timestep pipeline.
Huthwaite [6]; Bartezzaghi et al. [7]	Structural elastodynamics and thin-shell explicit GPU kernels; contact-free or limited contact setting.	~100× versus Abaqus and reported shell-mesh speedups.	Element/time-integration focus; no production-style contact loop fully retained on the GPU.
Comas et al. [8]; Courtecuisse et al. [9]	GPU nonlinear FEM in SOFA for soft tissue, cutting, and haptic applications.	Real-time or interactive medical-simulation demonstrations.	Medical real-time target; constitutive and contact scope differ from industrial large-deformation forming.
Simpson et al. [10]; Shi et al. [11]	GPU nonlinear structural dynamics and GPU-cluster structural simulation.	115× at 10⁶ DOF and cluster-scale evidence.	Shows GPU potential and bottlenecks but not a single-GPU fully resident contact timestep.
Fu et al. [12]	GPU FEM assembly plus AMG-PCG solution for elliptic/implicit problems.	Up to 87× acceleration in assembly.	Implicit/global linear-solver profile, different from the explicit contact loop studied here.
Johnsen et al. [13,14]	NiftySim TLED with GPU internal force and BVH-based contact modeling.	Matrix-free explicit simulation with contact in soft-tissue biomechanics.	BVH contact is included, but the domain and pipeline differ from the present industrial benchmark.
Cao et al. [15]	Multi-GPU explicit FEM with parallel contact for electronics drop testing.	Cross-GPU scalability demonstrated.	Multi-GPU scalability emphasis; different scope from the present single-GPU LS-DYNA SMP comparison.
Present study	Internal force, contact, nodal update, and timestep reduction remain device-resident in one CUDA pipeline.	99×–138× versus one CPU core; 94× versus LS-DYNA 8-core SMP.	Full-timestep GPU residency with contact ON/OFF decomposition and FP32/FP64 field-accuracy check.

Table 2. Hardware and software configuration.

Item	Specification
GPU	NVIDIA L40 (48 GB GDDR6, Ada Lovelace, 300 W TDP)
CPU	AMD EPYC 75F3 32-core
CUDA Toolkit	12.4
GPU Driver	595.71 (WDDM)
OS	Windows 10
Build toolchain	MSVC + MSYS2 (bash)
CFL safety factor (tssfac)	0.9
Δt growth limit	1.05 (5% per step)
Precision	FP32 (GPU), FP64 (CPU)

Table 3. Mesh levels used in the benchmark.

Level	N	Elements	Nodes	Abbreviation
L1	24	82,944	15,706	~83 K
L2	30	162,000	29,872	~162 K
L3	40	384,000	69,042	~384 K
L4	50	750,000	132,820	~750 K
L5	55	998,250	175,812	~998 K
L6	68	1,886,592	328,833	~1.89 M

Table 4. DOE-A mesh-scaling summary using the latest solver GPU path (NVIDIA L40, contact_v2, CUDA Graph disabled, 3-run median, unit: µs/step).

Level	Elements	CPU (µs/step)	GPU (µs/step)	Speedup
L1	82,944	24,053	243.6	98.7×
L2	162,000	44,966	386.0	116.5×
L3	384,000	104,012	822.5	126.5×
L4	750,000	187,375	1391.0	134.7×
L5	998,250	251,752	1829.0	137.6×
L6	1,886,592	463,434	3378.0	137.2×

Table 5. DOE-B contact-overhead summary for the revised latest-solver L40 study (contact_v2, CUDA Graph disabled, 3-run median, unit: µs/step).

Mesh Level	Elements	Contact ON (µs/step)	Contact OFF (µs/step)	Net Overhead	CV, ON/OFF
L2	162,000	386.3	319.6	20.9%	0.03%/0.18%
L3	384,000	822.4	705.1	16.6%	0.03%/0.06%
L5	998,250	1826.4	1616.9	13.0%	0.06%/0.02%

Note: All revised DOE-B timings were obtained with CUDA Graph disabled, which was the validated execution mode for the contact_v2 path used in this revision.

Table 6. GPU FP32 versus CPU FP64 field-quantity error summary for the L2 hemisphere benchmark.

Quantity	Relative L2 Error	Maximum Absolute Error
Displacement magnitude |u|	0.99%	2.38 mm
von Mises stress	0.27%	6.81 MPa
Effective plastic strain	0.81%	6.8 × 10⁻³
Force–time history	0.31%	7.7 × 10⁶ N

Table 7. DOE-C: LS-DYNA CPU multicore scalability (hemisphere, 1.89 M elements, SMP mode).

CPU Cores	Total Time (s)	Step Time (µs/step)	CPU Speedup
1	115	569,000	1.00×
2	87	430,700	1.32×
4	72	356,400	1.60×
8	64	316,800	1.80×
16	80	396,000	1.44×
32	95	470,300	1.21×

Table 8. Step-time comparison between the proposed GPU solver and LS-DYNA CPU multicore for the 1.89 M-element hemisphere compression model.

Configuration	Step Time (µs/Step)	Relative to GPU
Proposed solver, GPU (L40)	3378	1.0× (baseline)
LS-DYNA 1-core	569,000	168.4× slower
LS-DYNA 2-core	430,700	127.5× slower
LS-DYNA 4-core	356,400	105.5× slower
LS-DYNA 8-core (optimal)	316,800	93.8× slower
LS-DYNA 16-core	396,000	117.2× slower
LS-DYNA 32-core	470,300	139.2× slower

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
