1. Introduction
In numerical relativity, accurately modeling astrophysical systems such as neutron star mergers [1,2,3,4,5,6,7,8,9,10,11,12,13,14] relies on solving the equations of relativistic hydrodynamics, which involve the inversion of conservative-to-primitive (C2P) variable relations [15,16,17]. This process typically requires computationally expensive root-finding algorithms, such as Newton-Raphson methods, and interpolation of complex, multi-dimensional equations of state (EOS) tables [18,19]. These methods, while robust, incur significant computational costs and can lead to inefficiencies, particularly in large-scale simulations, where up to billions of C2P calls may be required per time step. The inherent complexity of this mapping, however, often conceals underlying symmetries and lower-dimensional relationships that a machine learning model can be trained to recognize and exploit.
In view of these considerations, and taking into account the advent of GPU-based exascale supercomputers such as Aurora and Frontier and ongoing efforts to port relativistic hydrodynamics software to GPUs [20,21,22], this work explores the use of machine learning (ML) algorithms that leverage GPU-accelerated computing for C2P conversion. CPU-based algorithms for C2P conversion typically involve an iterative non-linear root finder, for which the number of iterations required to achieve a given target accuracy depends on the input data, resulting in different runtimes for different points of the numerical grid. This limits the potential to use SIMD (for CPUs) or SIMT (for GPUs) parallelism, reducing the effective rate of conversion achievable using these schemes. An ML approach, with its more predictable runtime and regular memory access pattern, may help alleviate these issues. Indeed, this work is motivated by recent studies that have explored the potential of ML to replace traditional root-finding approaches for C2P inversion [23]. Specifically, neural networks have shown promise in accelerating the C2P inversion process while maintaining high accuracy [23]. Building on this, the present work introduces a novel approach that leverages ML to accelerate the recovery of primitive variables from conserved variables in relativistic hydrodynamics simulations, with particular focus on hybrid piecewise polytropic and tabulated EOS. These EOS models provide more realistic descriptions of the dense interior of neutron stars, yet their complexity makes the traditional C2P procedure very computationally expensive.
To help address these computational challenges, we present a suite of feedforward neural networks trained to directly map conserved variables to primitive variables, bypassing the need for traditional iterative solvers. In particular, we employ a hybrid approach, utilizing the flexibility of neural networks to handle the challenges posed by complex EOS models. Our models are implemented using modern deep learning tools, such as PyTorch, and optimized for GPU inference with NVIDIA TensorRT [24]. Through comprehensive performance benchmarking, we demonstrate that our approach significantly outperforms traditional numerical methods in terms of speed, particularly when using mixed-precision deployment on modern hardware accelerators like NVIDIA A100 GPUs in the Delta supercomputer.
We evaluate the scalability of our ML models by comparing their inference performance against a single-threaded CPU implementation of a traditional numerical method from the RePrimAnd library [25]. The benchmark was conducted on a Delta supercomputer compute node, featuring dual AMD 64-core 2.45 GHz Milan processors, 8 NVIDIA A100 GPUs (40 GB HBM2 RAM), and NVLink. For dataset sizes ranging from 25,000 to 1,000,000 points, the numerical method exhibited linear scaling of inference time. In contrast, TensorRT-optimized and TorchScript-based neural networks achieved substantially faster inference, typically demonstrating sub-linear scaling. We investigate two feedforward neural network architectures: a smaller network (NNC2PS) and a larger one (NNC2PL). Notably, mixed-precision TensorRT engines delivered impressive performance, with the NNC2PS engine processing 1,000,000 points in 8.54 ms, compared to 3490 ms for the numerical method. Ideal parallelization across the entire node (64 CPU cores that support up to 128 threads and 8 GPUs) suggests a 25-fold speedup for TensorRT over the optimally parallelized numerical method when processing 8 million points. These results demonstrate the scalability and efficiency of our ML-based methods, offering significant improvements for high-throughput numerical relativistic hydrodynamics simulations.
This article is structured as follows. Section 2 introduces the EOS considered in this study, along with the methodologies employed for designing, training, validating, and testing the ML models. In Section 3, we present our key results, including an assessment of the accuracy of the ML models across different model types and quantization schemes. Additionally, we provide a comparison of the computational performance of the ML models relative to traditional root-finding methods. Finally, Section 4 offers a summary of the findings and outlines potential avenues for future research.
2. Methods
We present an ML-based model with the potential to accelerate the recovery of primitive variables from conserved variables in general relativistic hydrodynamics (GRHD) simulations, specifically focusing on scenarios employing hybrid piecewise polytropic EOS and tabulated EOS. As in traditional approaches, this conversion requires inverting the conservative-to-primitive map, a process often reliant on computationally expensive root-finding algorithms. While previous work has demonstrated the success of machine learning for this task with the $\Gamma$-law EOS [23], here, we investigate its application to hybrid piecewise polytropic EOS, which offers a more realistic representation of neutron star interiors, as well as the tabulated EOS, which incorporates the current nuclear physics model of neutron matter. To evaluate the performance of our neural network, we use a traditional CPU-based root-finding algorithm (provided by the RePrimAnd library) as a baseline for comparison. Our aim is to demonstrate the speed advantages of the neural network approach for conservative-to-primitive variable conversion. Our network is implemented using PyTorch (2.0+), and the inference speed tests are performed using libtorch and the C++ API of NVIDIA TensorRT (8.4.1). While our numerical experiments are conducted in flat spacetime for simplicity, the C2P inversion is a local operation. Therefore, our method is directly applicable to general relativistic hydrodynamics simulations without loss of generality, as one can always perform the inversion in a local inertial frame.
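In general relativity, the equations of relativistic hydrodynamics can be expressed in a conservation form suitable for numerical implementation. Specifically, in a flat spacetime, they constitute the following first-order, flux-conservative hyperbolic system:

$$\partial_t \left( \sqrt{\gamma}\, \mathbf{U} \right) + \partial_i \left( \sqrt{-g}\, \mathbf{F}^i \right) = 0, \tag{1}$$

where $g$ is the metric determinant, and $\gamma$ is the determinant of the three-metric induced on each spacelike hypersurface. The state vector of the conserved variables is $\mathbf{U} = (D, S_j, \tau)$, and the flux vector is given by

$$\mathbf{F}^i = \left( D \left( v^i - \frac{\beta^i}{\alpha} \right),\; S_j \left( v^i - \frac{\beta^i}{\alpha} \right) + p\, \delta^i_j,\; \tau \left( v^i - \frac{\beta^i}{\alpha} \right) + p\, v^i \right), \tag{2}$$

where $\alpha$ is the lapse function and $\beta^i$ the spacelike shift vector: two kinematic variables describing the evolution of spacelike foliations in spacetime as in a typical $3+1$ (ADM) formulation.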
The five quantities satisfying Equation (1), all measured by an Eulerian observer sitting at a spacelike hypersurface, are the relativistic rest-mass density, $D$, the three components of the momentum density, $S_j$, and the energy density relative to the rest-mass density, $\tau$, respectively. These are related to the primitive variables (rest-mass density, $\rho$; three-velocity, $v^i$; specific internal energy, $\epsilon$; and pressure, $p$) through

$$D = \rho W, \qquad S_j = \rho h W^2 v_j, \qquad \tau = \rho h W^2 - p - D, \tag{3}$$

where $W = (1 - v_i v^i)^{-1/2}$ is the Lorentz factor, and $h = 1 + \epsilon + p/\rho$ is the specific enthalpy.
Incorporating the EOS into the picture provides the thermodynamical information linking the pressure to the fluid’s rest-mass density and internal energy, which, combined with the definitions above, closes the system of equations given in Equation (1) [26,27,28].
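To make the role of the inversion concrete, the following minimal sketch builds the conserved variables from the primitives using the standard flat-spacetime relations ($D = \rho W$, $S = \rho h W^2 v$, $\tau = \rho h W^2 - p - D$) and then recovers the pressure with a one-dimensional root search. Several choices here are assumptions made only for illustration: a simple $\Gamma$-law closure with $\Gamma = 5/3$ stands in for the EOS, plain bisection is swapped in for Newton-Raphson because it is easier to keep robust in a short example, and all function names are hypothetical. It is not the algorithm implemented in RePrimAnd.

```python
import math

GAMMA = 5.0 / 3.0  # assumed adiabatic index of the illustrative Gamma-law EOS


def eos_pressure(rho, eps, gamma=GAMMA):
    """Gamma-law closure p = (gamma - 1) * rho * eps (illustrative choice)."""
    return (gamma - 1.0) * rho * eps


def prim_to_cons(rho, v, eps, gamma=GAMMA):
    """Map primitives (rho, v, eps) to conserved (D, S, tau); flat spacetime, 1D velocity."""
    p = eos_pressure(rho, eps, gamma)
    W = 1.0 / math.sqrt(1.0 - v * v)  # Lorentz factor
    h = 1.0 + eps + p / rho           # specific enthalpy
    D = rho * W
    S = rho * h * W * W * v
    tau = rho * h * W * W - p - D
    return D, S, tau


def prims_from_pressure(D, S, tau, p):
    """Primitives implied by a pressure guess p, using tau + D + p = rho * h * W^2."""
    v = S / (tau + D + p)
    W = 1.0 / math.sqrt(1.0 - v * v)
    rho = D / W
    h = (tau + D + p) / (rho * W * W)
    eps = h - 1.0 - p / rho
    return rho, v, eps


def cons_to_prim(D, S, tau, gamma=GAMMA, tol=1e-12, max_iter=200):
    """Find p such that the EOS pressure of the implied primitives equals p (bisection)."""
    def residual(p):
        rho, _, eps = prims_from_pressure(D, S, tau, p)
        return eos_pressure(rho, eps, gamma) - p

    # Lower bound keeps |v| < 1; the small offset avoids landing exactly on v = 1.
    lo = max(0.0, abs(S) - tau - D) + 1e-14
    hi = max(1.0, 2.0 * lo)
    while residual(hi) > 0.0:  # expand the bracket until the residual changes sign
        hi *= 2.0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if residual(mid) > 0.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    p = 0.5 * (lo + hi)
    rho, v, eps = prims_from_pressure(D, S, tau, p)
    return rho, v, eps, p


# Round trip: primitives -> conserved -> primitives
D, S, tau = prim_to_cons(rho=1.0, v=0.5, eps=0.5)
rho_rec, v_rec, eps_rec, p_rec = cons_to_prim(D, S, tau)
print(rho_rec, v_rec, eps_rec, p_rec)
```

The number of bisection steps depends on the initial bracket and tolerance, which illustrates why iterative solvers have input-dependent runtimes, in contrast to a single fixed-cost network evaluation.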
We will first focus on the hybrid piecewise polytropic EOS, which was introduced for simplified simulations of stellar collapse to model the stiffening of the nuclear EOS at nuclear density and to include thermal pressure during the postbounce phase [29]. In gravitational-wave science, it is more commonly used as described in Read et al. [30], where it enables gravitational-wave parameter estimation and waveform modeling by effectively capturing macroscopic neutron star observables with minimal parameters. The structure of this EOS consists of multiple cold polytropes, defined by parameters $K_i$ and $\Gamma_i$ with $i = 0, \dots, n_{\mathrm{segments}} - 1$, where $n_{\mathrm{segments}}$ denotes the total number of segments. Additionally, it includes a thermal $\Gamma$-law component characterized by $\Gamma_{\mathrm{th}}$. Continuity of pressure and internal energy across segments, in accordance with the first law of thermodynamics, is ensured after appropriately setting initial values for the polytropic indices, density breakpoints (denoted $\rho_i$), and other relevant parameters. For this EOS, the polytropic indices ($\Gamma_i$), the density breakpoints ($\rho_i$), and the first segment’s polytropic constant ($K_0$) are treated as free parameters. Subsequent constants ($K_i$ for $i \geq 1$ and all $a_i$) are then determined by enforcing continuity of pressure and internal energy across the breakpoints. In this context, pressure and specific internal energy components in each density interval are given by

$$p_{\mathrm{cold}} = K_i \rho^{\Gamma_i}, \qquad \epsilon_{\mathrm{cold}} = a_i + \frac{K_i}{\Gamma_i - 1} \rho^{\Gamma_i - 1}, \qquad p = p_{\mathrm{cold}} + (\Gamma_{\mathrm{th}} - 1)\, \rho\, \epsilon_{\mathrm{th}}, \tag{4}$$

where $a_i$ is the segment-specific constant, $\epsilon_{\mathrm{th}} = \epsilon - \epsilon_{\mathrm{cold}}$ is the thermal part of the specific internal energy, and the rest-mass density, $\rho$, is assumed to fall into the segments specified by each of the $\rho_i$. These equations apply to segment $i$, where the rest-mass density $\rho$ is in the range $\rho_{i-1} \leq \rho < \rho_i$.
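The continuity conditions described above can be sketched in a few lines. The example assumes the common hybrid form, in which the cold pressure in segment $i$ is $K_i \rho^{\Gamma_i}$, the cold specific internal energy is $a_i + K_i \rho^{\Gamma_i - 1} / (\Gamma_i - 1)$, and the thermal pressure is $(\Gamma_{\mathrm{th}} - 1) \rho \epsilon_{\mathrm{th}}$. The symbol `a_i`, the zero-based segment indexing, the function names, and all numerical parameter values are placeholders chosen for illustration; they are not the SLy parameters used in this work.

```python
def build_piecewise_polytrope(K0, gammas, rho_breaks):
    """Return per-segment constants (K_i, a_i) fixed by continuity at each breakpoint."""
    Ks, a_s = [K0], [0.0]  # first segment: K_0 is free, a_0 = 0
    for i, rho_b in enumerate(rho_breaks):
        g_lo, g_hi = gammas[i], gammas[i + 1]
        # Pressure continuity: K_i rho_b^g_lo = K_{i+1} rho_b^g_hi
        K_hi = Ks[i] * rho_b ** (g_lo - g_hi)
        # Internal-energy continuity at the same breakpoint
        eps_lo = a_s[i] + Ks[i] / (g_lo - 1.0) * rho_b ** (g_lo - 1.0)
        a_hi = eps_lo - K_hi / (g_hi - 1.0) * rho_b ** (g_hi - 1.0)
        Ks.append(K_hi)
        a_s.append(a_hi)
    return Ks, a_s


def segment_index(rho, rho_breaks):
    """Index i such that rho lies between breakpoints i-1 and i."""
    return sum(1 for rho_b in rho_breaks if rho >= rho_b)


def cold_parts(rho, Ks, a_s, gammas, rho_breaks):
    """Cold pressure and cold specific internal energy at density rho."""
    i = segment_index(rho, rho_breaks)
    p_cold = Ks[i] * rho ** gammas[i]
    eps_cold = a_s[i] + Ks[i] / (gammas[i] - 1.0) * rho ** (gammas[i] - 1.0)
    return p_cold, eps_cold


def hybrid_pressure(rho, eps, Ks, a_s, gammas, rho_breaks, gamma_th):
    """Total pressure: cold polytropic part plus thermal Gamma-law part."""
    p_cold, eps_cold = cold_parts(rho, Ks, a_s, gammas, rho_breaks)
    eps_th = eps - eps_cold
    return p_cold + (gamma_th - 1.0) * rho * eps_th


# Placeholder three-segment example (values are illustrative only)
GAMMAS = [1.5, 2.5, 3.0]
RHO_BREAKS = [0.5, 1.0]
GAMMA_TH = 5.0 / 3.0
KS, A_S = build_piecewise_polytrope(K0=1.0, gammas=GAMMAS, rho_breaks=RHO_BREAKS)

p_cold, eps_cold = cold_parts(0.3, KS, A_S, GAMMAS, RHO_BREAKS)
p_hot = hybrid_pressure(0.3, eps_cold + 0.1, KS, A_S, GAMMAS, RHO_BREAKS, GAMMA_TH)
print(KS, A_S, p_cold, p_hot)
```

Only the indices, the breakpoints, and the first polytropic constant are inputs; every other constant follows from the two continuity conditions, which mirrors the parameter counting in the text.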
In addition to the hybrid piecewise polytropic EOS-based model, we will train a separate network to infer the conservative-to-primitive transformation utilizing the tabulated EOS data. Specifically, we will use the Lattimer-Swesty EOS with a compressibility parameter $K = 220$ MeV (hereafter referred to as LS220 EOS), due to its prevalence and historical significance. Our training dataset is based on a modern, updated version of LS220 EOS constructed and made available by Schneider, Roberts, and Ott in a more recent study [31].
Below, we outline the dataset preparation, model architecture, training process, and methods used in inference speed testing with libtorch and NVIDIA TensorRT to evaluate computational efficiency.
2.1. Data
2.1.1. Piecewise Polytropic EOS-Based Model Data
We generate a dataset of 500,000 samples using geometrized units where
. Without loss of generality, we furthermore use a Minkowski metric
. The rest-mass density,
, is sampled uniformly from
, and the fluid’s three-velocity is assumed one-dimensional along the
x-axis, sampled uniformly from
. These ranges are chosen to be representative of the conditions found in binary neutron star mergers and to facilitate a direct comparison with the previous work in [23]. Following Ref. [30], we use an SLy four-segment piecewise polytropic EOS with segment-wise polytropic indices
. The first segment’s polytropic constant,
, is set to
. Subsequent polytropic constants,
, are determined by enforcing pressure continuity. Similarly, the first segment’s constant,
, is set to zero, while subsequent
values ensure continuity of internal energy. The density breaks for the segments are specified at
,
, and
. The thermal component has an adiabatic index of
. Additionally, the thermal component of the specific internal energy,
, is sampled uniformly from
(where
). A structured dataset is then constructed by converting the primitive variables to conserved variables using the standard relativistic hydrodynamic relations given in Equation (3). In this dataset, conserved variables serve as input features, and the pressure is the target variable. The resulting dataset is then split into training, validation, and test sets, with each set fully standardized to zero mean and unit variance to ensure equal contribution of all features during neural network training (Figure 1).
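The split-and-standardize step can be sketched as follows. This is an illustration under stated assumptions: the 80/10/10 split fractions, the random placeholder rows, and the function names are hypothetical, and the choice to reuse the training-set mean and standard deviation for the validation set is one common convention rather than a statement of the exact procedure used here.

```python
import random
import statistics


def split_dataset(rows, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle rows and split them into train/validation/test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(train_frac * len(rows))
    n_val = int(val_frac * len(rows))
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]


def fit_scaler(rows):
    """Per-column mean and population standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return means, stds


def standardize(rows, means, stds):
    """Shift and scale each column to zero mean and unit variance."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def destandardize(rows, means, stds):
    """Inverse of standardize: map scaled values back to physical units."""
    return [[x * s + m for x, m, s in zip(row, means, stds)] for row in rows]


# Placeholder rows standing in for (D, S_x, tau); values are random, not physical.
rng = random.Random(1)
data = [[rng.uniform(0.0, 10.0), rng.uniform(0.0, 5.0), rng.uniform(0.0, 3.0)]
        for _ in range(1000)]

train, val, test = split_dataset(data)
means, stds = fit_scaler(train)
train_z = standardize(train, means, stds)
val_z = standardize(val, means, stds)
print(len(train), len(val), len(test))
```

Keeping the scaler statistics alongside the trained model matters in practice, because the same inverse transformation is needed to turn a network output back into a physical pressure.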
2.1.2. Tabulated EOS-Based Model Data
To generate the training data for the tabulated EOS-based model, we sample from a provided EOS table and follow a procedure similar to the one described in Section 2.1.1. We begin by reading in the EOS table, which contains the variables electron fraction ($Y_e$), temperature ($T$), rest-mass density ($\rho$), specific internal energy ($\epsilon$), and pressure ($p$). These quantities are stored in logarithmic form in the table and are extracted accordingly. For each data point, a random one-dimensional three-velocity, $v_x$, is sampled uniformly on a linear scale from the interval . Values for electron fraction and temperature are also sampled uniformly on a linear scale from their respective ranges in the table. The rest-mass density is chosen by randomly selecting one of the grid points from the table, which are logarithmically spaced. For this study, we fetched the corresponding values of $p$ and $\epsilon$ directly from the table without interpolation to ensure the training data perfectly represents the tabulated EOS. Using these, the corresponding values of $\rho$, $\epsilon$, and $p$ are then fetched from the EOS table. The primitive variables are then converted into conserved variables using standard relativistic hydrodynamics relations given in Equation (3). A total of 1,000,000 data points are generated using this process [32]. Similarly to the hybrid piecewise polytropic EOS-based model, the data is split into training, validation, and test sets, with each set fully standardized to zero mean and unit variance before being used for neural network training.
2.2. Model Architecture
2.2.1. Piecewise Polytropic EOS-Based Model
For the hybrid piecewise polytropic EOS-based model, we tested two feedforward neural networks of varying complexity to represent the conservative-to-primitive variable transformation. Each network takes as input the three conserved variables $(D, S_x, \tau)$ (Equation (3)) and outputs the pressure $p$ (Equation (4)), assuming the remaining momentum density components are zero for simplicity. This architecture is designed to effectively learn the hidden symmetries in the relationship between the conserved and primitive variables, approximating the intricate C2P transformation without explicit root-finding. After experimenting with multiple multi-layer perceptron (MLP) architectures, as detailed in Appendix A, we identified two models that offered an optimal balance between accuracy, speed, and trainability. The smaller model, NNC2PS, features two hidden layers with 600 and 200 neurons, while the larger model, NNC2PL, contains five hidden layers with 1024, 512, 256, 128, and 64 neurons (Figure 2). ReLU activation functions were applied to the hidden layers to introduce nonlinearity, with the output layer kept linear. We found these models strike an effective balance between complexity and performance, making them well-suited for our task.
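To give a sense of the relative size of the two networks, the short calculation below counts trainable parameters for fully connected layers with three inputs and one output. It assumes every layer carries a bias term (the default for a standard dense layer); the helper name is hypothetical.

```python
def mlp_parameter_count(layer_sizes):
    """Weights plus biases for a fully connected network with the given layer widths."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix + bias vector
    return total


# Layer widths: 3 conserved inputs, hidden layers as described, 1 pressure output
NNC2PS_LAYERS = [3, 600, 200, 1]
NNC2PL_LAYERS = [3, 1024, 512, 256, 128, 64, 1]

n_small = mlp_parameter_count(NNC2PS_LAYERS)
n_large = mlp_parameter_count(NNC2PL_LAYERS)
print(n_small)  # 122801
print(n_large)  # 701441
```

Under this assumption the larger network has roughly 5.7 times as many parameters as the smaller one, which is consistent with its somewhat longer inference times.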
2.2.2. Tabulated EOS-Based Model
For the tabulated EOS-based model, we use a single feedforward neural network, NNC2P_Tabulated, to achieve an inherently equivalent task with minor differences. This model takes as input the log-scaled variables and outputs the log-scaled pressure (Equation (4)), assuming $S_y$ and $S_z$ are zero for simplicity as before. Using log-scaled inputs and outputs aligns with the format of the tabulated EOS values, which are also stored in logarithmic form to accommodate the typically large values of these physical quantities. This approach reduces the range of feature magnitudes, facilitating more stable learning dynamics and better alignment with the source data.
We explored several MLP architectures, varying in parameters, layers, and training strategies, to identify an optimal design for our task. Among these, an architecture identical to NNC2PL, featuring five hidden layers with 1024, 512, 256, 128, and 64 neurons, respectively, detailed in Section 2.2.1 above, emerged as a robust choice. This architecture effectively balanced capacity and efficiency, enabling accurate learning of log-scaled pressure from tabulated EOS data (Figure 2).
2.3. Training Approach
We use a similar procedure to optimize all neural networks (NNC2PS, NNC2PL, and the tabulated baseline model, NNC2P_Tabulated), with minor tweaks. Training was performed on a single NVIDIA A100 GPU on the Delta cluster. For the hybrid piecewise polytropic EOS-based models (NNC2PS and NNC2PL), we employed a custom, physics-informed loss function that penalizes negative pressure predictions. This loss function is a modified mean-squared error:
where
represents the network’s estimation for feature
i,
is the corresponding target value, ReLU is the familiar rectified linear unit defined by $\mathrm{ReLU}(x) = \max(0, x)$, and
represents an inverse normalization procedure based on the training data statistics. The penalty factor,
q, was optimized for each model, with
for
NNC2PS and
for
NNC2PL. These values consistently suppressed negative pressure predictions on the test set. For the tabulated EOS model (NNC2P_Tabulated), the structure of the data precluded negative predictions, so a standard mean-squared error loss function was used.
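One plausible way to combine a mean-squared error with a negative-pressure penalty is sketched below. The specific functional form is an assumption made for illustration: the penalty is taken as $q$ times the batch mean of $\mathrm{ReLU}(-\hat{p})$, where $\hat{p}$ is the prediction mapped back to physical units with the training-set mean and standard deviation. The function names and the example numbers are hypothetical, and the exact expression used for training may differ.

```python
def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)


def penalized_mse(pred_std, target_std, q, p_mean, p_std):
    """MSE on standardized values plus q * mean(ReLU(-p_physical)).

    pred_std, target_std: standardized predictions and targets
    q: penalty factor for negative pressures (assumed form of the penalty)
    p_mean, p_std: training-set statistics used to invert the standardization
    """
    n = len(pred_std)
    mse = sum((yh - y) ** 2 for yh, y in zip(pred_std, target_std)) / n
    # Undo the standardization so the sign test applies to the physical pressure
    penalty = sum(relu(-(yh * p_std + p_mean)) for yh in pred_std) / n
    return mse + q * penalty


# Positive physical pressures (1.0 and 2.0): the penalty term vanishes
loss_ok = penalized_mse([0.0, 1.0], [0.0, 1.0], q=10.0, p_mean=1.0, p_std=1.0)

# A prediction that maps to a physical pressure of -1.0 is penalized
loss_bad = penalized_mse([-2.0], [-2.0], q=10.0, p_mean=1.0, p_std=1.0)
print(loss_ok, loss_bad)
```

Because the penalty is zero wherever the de-standardized pressure is non-negative, it leaves the ordinary regression objective untouched in the physically admissible region.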
All models were trained using the Adam optimizer with an initial learning rate of . A learning rate scheduler reduced the learning rate by a factor of 0.5 if the validation loss failed to improve for five consecutive epochs. NNC2PS and NNC2PL were trained for 85 epochs, while NNC2P_Tabulated required 250 epochs. These epoch counts were determined empirically by monitoring the validation loss, with training stopped once the loss had clearly converged. The use of a learning rate scheduler, which reduces the learning rate when the validation loss plateaus, also serves as a form of early stopping. For each epoch, the model was set to training mode, and data was loaded in batches of 32 onto the GPU. This batch size was chosen based on experimentation to balance the number of epochs and overall time to convergence. While training with larger batches and multiple GPUs (using PyTorch’s DataParallel module or other approaches) is possible, we found no significant advantage regarding the total time to convergence and ultimately opted for this simpler, more portable approach. For each batch, optimizer gradients were reset before generating predictions, and the loss was computed using respective loss functions. Backpropagation was then performed to update the model parameters.
After completing the training phase for each epoch, the model’s performance is evaluated on the validation dataset, accumulating the validation loss similarly to the training loss. Both losses are normalized by the size of the respective datasets and stored for further analysis, specifically for clues of potential overtraining.
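The plateau-based schedule described above can be illustrated with a small, framework-free sketch. It is a simplified stand-in rather than PyTorch's own scheduler: the class name is hypothetical, the starting learning rate of `1e-3` is a placeholder, and the rule implemented here halves the rate as soon as five consecutive epochs pass without a new best validation loss.

```python
class PlateauHalver:
    """Halve the learning rate after `patience` epochs without improvement."""

    def __init__(self, lr, factor=0.5, patience=5):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Update the state with this epoch's validation loss; return the current rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs >= self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr


scheduler = PlateauHalver(lr=1e-3)  # placeholder initial learning rate
history = []
for val_loss in [1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]:
    history.append(scheduler.step(val_loss))
print(history)
```

In this trace the validation loss improves twice and then stalls for five epochs, so the rate is halved once, at the final step.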
2.4. Inference Speed Tests
In our inference speed tests, we evaluated two main approaches for efficient deployment: a TorchScript model and NVIDIA’s TensorRT optimized engines. These tests were conducted to measure and compare inference speed under typical deployment conditions, aiming to take advantage of the A100 GPU on Delta.
2.4.1. TorchScript Deployment
To prepare models for inference with TorchScript, we first saved a scripted version of the model, which is compatible with PyTorch’s JIT compiler, optimizing runtime execution without modifying the model’s core structure. TorchScript’s scripting provides some degree of optimization, enabling faster model execution than standard PyTorch models but without the hardware-level optimizations that TensorRT offers.
2.4.2. TensorRT Deployment
For TensorRT, we explored both FP32 (unquantized) and FP16-quantized engines, ultimately deciding not to pursue INT8 quantization due to accuracy degradation observed in initial tests. After extensive testing, we opted for dynamic engine building with a batch size determined by the total size of the expected dataset, as this approach provided the best balance between performance and flexibility for our hardware and model structure. It must be noted that constructing an optimal engine in TensorRT is a nuanced process, influenced by multiple factors including model architecture, hardware specifications, intended batch sizes during inference, and input data. Therefore, achieving the best results often involves iterative tuning and profiling to adapt the engine to the specific deployment environment and workload requirements. Below, we summarize the overall engine-building process we followed in detail:
Model Export to ONNX: First, we exported the PyTorch model to the ONNX format. This conversion enables interoperability with TensorRT, which uses ONNX as its primary model input format.
TensorRT Engine Building: Using TensorRT’s Python API, we constructed both FP32 and FP16 engines. A logger was initialized for verbose logging to capture potential issues during engine building. With the TensorRT Builder, we created a network definition with explicit batch handling, which is essential for dynamic batching configurations.
Parsing and Validating the ONNX Model: We loaded the ONNX model into TensorRT, where the OnnxParser validated and parsed the model. Parsing errors, if any, were logged for troubleshooting, ensuring a valid model structure before optimization.
Configuration and Optimization Profiles: The BuilderConfig was set with a 40 GB workspace memory limit, providing more than enough headroom for dynamic batch sizes while maintaining stable performance. We set up a dynamic optimization profile specifying minimum, optimal, and maximum batch sizes within a 10 percent margin of our typical usage, granting flexibility to handle both smaller and larger input volumes efficiently.
Engine Serialization: Finally, we serialized and saved the engine, creating a portable and optimized binary that can be loaded for deployment. This step encapsulates the model’s architecture, weights, and optimizations, ensuring it is ready for fast inference.
To ensure we measure the maximum possible performance for each point in our benchmark, we build a specialized, yet flexible, TensorRT engine for each combination of model and dataset size. The dynamic optimization profile for each of these engines is configured with a tight margin around its target dataset size ($N$), as detailed in Table 1.
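As an illustration of how such a profile might be derived, the helper below computes minimum, optimal, and maximum input shapes for a target dataset size. It assumes a symmetric 10 percent margin, following the description of the optimization profiles above, and three input features; the function name is hypothetical and the values listed in Table 1 may differ. The resulting tuples are the kind of shapes one would pass to an optimization profile when building a dynamic-shape engine.

```python
def profile_shapes(n_points, n_features=3, margin_percent=10):
    """Return (min, opt, max) input shapes for a dynamic-batch profile.

    Integer arithmetic is used so the bounds are exact for the benchmark sizes.
    """
    lo = max(1, n_points * (100 - margin_percent) // 100)  # rounded down
    hi = -(-n_points * (100 + margin_percent) // 100)       # rounded up
    return (lo, n_features), (n_points, n_features), (hi, n_features)


for n in (25_000, 1_000_000):
    min_shape, opt_shape, max_shape = profile_shapes(n)
    print(n, min_shape, opt_shape, max_shape)
```

Keeping the range narrow lets the builder tune its kernels for batch sizes close to the one actually used, at the cost of needing one engine per dataset size.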
Overall, the process of optimizing and saving models using both TorchScript and TensorRT gave us insight into balancing flexibility, accuracy, and performance. For larger batch sizes and greater computational demands, TensorRT’s dynamic engine approach in FP16 is often more effective, even for models as simple as ours, while TorchScript remains a reliable fallback and simpler alternative.
For the actual inference speed test procedure, we implemented two distinct workflows on a single GPU for both approaches. The TorchScript-based approach allowed for a straightforward configuration, primarily requiring the definition of batch sizes and the pre-loading of data onto the GPU. It then used libtorch for efficient GPU deployment and batch execution.
In contrast, the TensorRT-based approach demanded several additional configurations. The model, after being converted into an optimized engine, was loaded using TensorRT’s C++ API. This included the manual pre-loading of input data into GPU memory before execution and was followed by manual setup of input and output buffers for TensorRT’s executeV2 function and careful management of CUDA resources. While this setup was more involved, it leveraged hardware-specific optimizations to deliver substantial gains in inference speed.
3. Results
3.1. Accuracy
We evaluate the model accuracy using two standard metrics for regression problems: the $L_1$ error (mean absolute error) and the $L_\infty$ error (maximum absolute error), both calculated over the entire test dataset. Table 2 summarizes the accuracy results based on $L_1$ and $L_\infty$ error metrics for each model variant (NNC2PS, NNC2PL, and NNC2P_Tabulated), including both the unquantized and quantized TensorRT engines built from them.
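For reference, the two metrics can be computed as in the following sketch, where the mean absolute error plays the role of the $L_1$ error and the maximum absolute error that of the $L_\infty$ error. The function names and the sample values are illustrative only.

```python
def l1_error(pred, target):
    """Mean absolute error over all test points."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def linf_error(pred, target):
    """Maximum absolute error over all test points."""
    return max(abs(p - t) for p, t in zip(pred, target))


# Toy predictions and targets (not data from this study)
pred = [1.0, 2.0, 4.0]
target = [1.0, 2.5, 3.0]
print(l1_error(pred, target))    # 0.5
print(linf_error(pred, target))  # 1.0
```

The mean captures typical behavior across the test set, while the maximum exposes the single worst-case point, which is why both are reported side by side.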
The NNC2PS model trained in PyTorch achieves very high accuracy with an $L_1$ error of and an $L_\infty$ error of . When the model is converted to a TensorRT engine, the accuracy remains nearly identical, with an $L_1$ error of and an $L_\infty$ error of , indicating minimal loss in precision due to TensorRT optimization. However, when FP16 quantization is applied, the error rates increase to an $L_1$ error of and an $L_\infty$ error of , revealing an obvious side effect of reduced precision. This highlights the classic trade-off between computational performance and numerical precision, a critical consideration for selecting the appropriate model for a given scientific application where the tolerance for numerical error may vary.
The larger NNC2PL model, rather expectedly, achieves lower $L_1$ and $L_\infty$ errors than NNC2PS, with an $L_1$ error of and an $L_\infty$ error of . The corresponding TensorRT engine preserves this high level of accuracy, showing only a slight and negligible increase to an $L_1$ error of and an $L_\infty$ error of , respectively. The FP16 quantized version, however, sees a notable rise in error metrics, with an $L_1$ error of and an $L_\infty$ error of .
The NNC2P_Tabulated model exhibits an $L_1$ error of and an $L_\infty$ error of . It is important to clarify that this larger error does not indicate a failure of the ML model but is a direct consequence of the model learning from a completely different dataset constructed from the LS220 EOS table to estimate the logarithmic pressure values. The TensorRT engine version also shows only a slight increase in error to . With FP16 quantization, the error rises again, more noticeably, to .
Additionally, we examined the relative accuracy of the
NNC2P_Tabulated model for parameters
,
,
, and
with
(see Figure 3). The relative error, defined as the absolute error divided by the true value for each point in a specific parameter set, was not uniform across the parameter space. Larger relative errors were observed in the lowest density and temperature regions of the EOS table, while slightly smaller errors occurred in the high-temperature regions. This accuracy trend was consistent across all tested Lorentz factor ($W$) values and was even more pronounced for the FP16 precision TensorRT engine. The LS220 EOS, as provided by [31], transitions from detailed treatment at high densities to simplified approximations at lower densities, which may contribute to these disparities. Low-density regions are inherently challenging due to the dominance of thermal effects, non-uniform phase transitions, and the treatment of nuclear matter surfaces, which can exacerbate modeling errors [31,33]. These characteristics likely explain the reduced accuracy in these regions, where variations in the nuclear matter’s phase state are more pronounced.
The overall results show that TensorRT’s optimizations maintain accuracy across models when using full precision. FP16 quantization, while accelerating inference (as will be discussed further below), introduces higher error rates, particularly in certain models. The potential trade-off between the inference speed and precision can be especially important in relativistic hydrodynamics simulations, where the accuracy of small-scale structures and wave propagation can critically impact the fidelity of predictions. For such simulations, even slight deviations due to quantization can influence results, making full-precision TensorRT inference particularly valuable when accuracy is paramount. Conversely, FP16 quantization may be suitable for faster, lower-fidelity simulations where minor accuracy trade-offs are acceptable.
3.2. Inference Speed Analysis
The inference performance of various methods was evaluated using a single NVIDIA A100 GPU for neural network models and a single-threaded CPU implementation of the traditional numerical method from the RePrimAnd library. The CPUs used in this study were dual AMD 64-core 2.45 GHz Milan processors on the Delta cluster, which can support up to 128 threads. Each configuration was tested across five dataset sizes, ranging from 25,000 to 1,000,000 data points, with ten inference runs conducted per configuration to ensure result stability and consistency. For the RePrimAnd numerical solver, we set the target accuracy for the relative error in the root-finding algorithm to . This is a standard, high-precision value used in production codes. We chose to compare our ML models against this robust baseline rather than tuning the numerical solver’s accuracy to match that of the NNs, ensuring a conservative performance comparison.
The numerical method exhibited linear scaling of inference time with respect to the dataset size. In contrast, both TensorRT and TorchScript models generally maintained relatively stable inference times across the dataset sizes. Notably, the full-precision TensorRT engine for the smaller network, NNC2PS, showed a faster-than-expected processing time at certain intermediate dataset sizes, as observed in Figure 4a. This behavior may be attributed to favorable thread block utilization and the kernel selection mechanism of TensorRT for this particular network size. A more detailed profiling study is needed to fully elucidate the underlying cause. The accuracy characteristics of these models remained consistent, as indicated in Table 2.
The numerical method required significantly more time than the neural network-based approaches. On average, the numerical method took 103.8 ms to process 25,000 data points, with runtime scaling almost linearly to 3490 ms for 1,000,000 data points. In contrast, the neural network models demonstrated substantially faster inference times. Specifically, the mixed-precision TensorRT engine built from NNC2PS required 7.92 ms for 25,000 data points and 8.54 ms for 1,000,000 data points. Its full-precision counterpart showed similarly flat scaling but was roughly 2.5 to 3 times slower, with runtimes of 25.17 ms for 25,000 data points and 21.06 ms for 1,000,000 data points. The TorchScript variant showed slower performance but still maintained sub-linear scaling, with runtimes averaging 72.79 ms for 25,000 points and 101.74 ms for 1,000,000 points.
A similar trend was observed for the NNC2PL models, with TensorRT engines consistently outperforming their TorchScript counterparts. The mixed-precision TensorRT engine for NNC2PL processed 25,000 data points in 8.32 ms and 1,000,000 points in 14.35 ms. In comparison, the full-precision TensorRT engine required 25.85 ms for 25,000 points and 23.87 ms for 1,000,000 points. The TorchScript model averaged 73.18 ms for 25,000 points and 102.04 ms for 1,000,000 points.
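The single-device numbers quoted above can be condensed into two derived quantities per method: how much the runtime grows when the dataset grows 40-fold, and the speedup over the numerical solver at 1,000,000 points. The short script below performs only this arithmetic on the reported timings; the dictionary keys are descriptive labels chosen for this example.

```python
# Reported runtimes in ms: (25,000 points, 1,000,000 points)
timings_ms = {
    "RePrimAnd, 1 CPU thread": (103.8, 3490.0),
    "NNC2PS TensorRT mixed precision": (7.92, 8.54),
    "NNC2PS TensorRT full precision": (25.17, 21.06),
    "NNC2PS TorchScript": (72.79, 101.74),
    "NNC2PL TensorRT mixed precision": (8.32, 14.35),
    "NNC2PL TensorRT full precision": (25.85, 23.87),
    "NNC2PL TorchScript": (73.18, 102.04),
}

baseline_large = timings_ms["RePrimAnd, 1 CPU thread"][1]

summary = {}
for name, (t_small, t_large) in timings_ms.items():
    summary[name] = {
        "growth": t_large / t_small,            # runtime ratio for 40x more points
        "speedup_1M": baseline_large / t_large,  # relative to the numerical solver
    }
    print(f"{name}: growth x{summary[name]['growth']:.2f}, "
          f"speedup at 1M x{summary[name]['speedup_1M']:.1f}")
```

The growth factor stays close to one for the TensorRT engines while it exceeds 30 for the single-threaded solver, which is the quantitative content of the sub-linear versus linear scaling contrast.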
Figure 4 presents a theoretical performance benchmark based on ideal scaling under the assumption of perfect parallelization. This scenario assumes optimal workload distribution, minimal communication overhead, and negligible synchronization delays, representing the upper bound of scalability. For the numerical method, the figure reflects the full computational capacity of a single CPU node on the Delta cluster, utilizing 128 threads. For the neural networks, it represents the use of 8 A100 GPUs within a single GPU node. Under these ideal conditions, the processing time of the numerical method per data point is projected to decrease by a factor of 128, allowing for the processing of 8 million points in approximately 218 ms (Figure 4b). Similarly, all neural network methods are expected to achieve linear inference scaling with similar per-GPU efficiency. Under this scenario, TensorRT-based methods, particularly the mixed-precision engine for NNC2PS, show a 25-fold reduction in processing time for 8 million points compared to the numerical method running at full capacity on the CPU node. Furthermore, the scaling trend strongly favors TensorRT for even larger datasets.
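The ideal-scaling projection can be reproduced directly from the single-device timings. The sketch below assumes perfect parallel efficiency on both sides, with the 8 million points divided evenly across 128 CPU threads or 8 GPUs, and uses the measured 1,000,000-point runtimes of the numerical solver and of the mixed-precision NNC2PS engine.

```python
N_POINTS = 8_000_000
N_THREADS = 128      # CPU threads on the node
N_GPUS = 8           # A100 GPUs on the node

T_CPU_1M_MS = 3490.0  # numerical solver, one thread, 1,000,000 points
T_GPU_1M_MS = 8.54    # NNC2PS mixed-precision TensorRT, one GPU, 1,000,000 points

# CPU: linear scaling in the number of points, ideal division across threads
cpu_node_ms = T_CPU_1M_MS * (N_POINTS / 1_000_000) / N_THREADS

# GPU: each device handles an equal share, here exactly 1,000,000 points
points_per_gpu = N_POINTS // N_GPUS
gpu_node_ms = T_GPU_1M_MS * (points_per_gpu / 1_000_000)

speedup = cpu_node_ms / gpu_node_ms
print(f"CPU node: {cpu_node_ms:.1f} ms, GPU node: {gpu_node_ms:.2f} ms, "
      f"speedup: {speedup:.1f}x")
```

The result is about 218 ms for the CPU node against 8.54 ms for the GPU node, a ratio of roughly 25, in line with the figures quoted in the text.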
The results presented above underscore the substantial performance gains achievable through the use of TensorRT-optimized neural networks, particularly in the context of conservative-to-primitive inversion in relativistic hydrodynamics simulations. By leveraging the parallel processing power of modern GPUs, these methods offer significant speedups compared to traditional CPU-based numerical approaches, even in large-scale simulations involving millions of data points. As demonstrated, TensorRT optimizations enable more efficient and scalable solutions, with the potential to dramatically reduce the computational cost of C2P operations. This work highlights the clear advantage of integrating ML-driven methods with GPU acceleration to address the computational challenges of high-throughput simulations. Moving forward, the next step is to incorporate these optimized approaches into full-scale hydrodynamics simulations, where their impact on both performance and scalability can be fully realized.
It is important to contextualize the comparison between the fully utilized CPU component (128 threads) and the fully utilized GPU component (8 GPUs) of a single compute node. This ‘node-to-node’ benchmark is designed to answer the practical question of how best to utilize the co-located and often cost-equivalent hardware resources of a modern heterogeneous compute node. While a formal cost-normalized analysis is complex, this approach compares the optimal-use scenario for each hardware type available to a researcher on a typical allocation. The resulting 25-fold speedup therefore reflects a combination of the algorithmic shift (from iterative root-finding to direct mapping) and the architectural advantage of GPUs for the massively parallel workload presented by the neural network.
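To make the algorithmic side of this shift concrete, the sketch below shows an iterative C2P inversion whose cost depends on the input state. It is an illustration only, not the numerical solver benchmarked in this work: it assumes one-dimensional special-relativistic hydrodynamics with an ideal-gas EOS, p = (Γ − 1)ρε, and a Newton iteration on the pressure. The function names, the value of Γ, the initial guess, and the sample states are all hypothetical choices made for the example.

```python
import math

GAMMA = 5.0 / 3.0  # assumed ideal-gas adiabatic index (illustrative only)


def prim_to_cons(rho, v, eps):
    """Map primitives (rho, v, eps) to conserved (D, S, tau) in 1D."""
    p = (GAMMA - 1.0) * rho * eps
    W = 1.0 / math.sqrt(1.0 - v * v)   # Lorentz factor
    h = 1.0 + eps + p / rho            # specific enthalpy
    D = rho * W
    S = rho * h * W * W * v
    tau = rho * h * W * W - p - D
    return D, S, tau


def state_from_pressure(p, D, S, tau):
    """Primitives implied by a trial pressure p for fixed conserved variables."""
    v = S / (tau + D + p)
    W = 1.0 / math.sqrt(1.0 - v * v)
    rho = D / W
    eps = (tau + D * (1.0 - W) + p * (1.0 - W * W)) / (D * W)
    return rho, v, eps, W


def cons_to_prim_newton(D, S, tau, tol=1e-12, max_iter=50):
    """Newton iteration on f(p) = (GAMMA - 1) * rho * eps - p.

    Returns (rho, v, eps, p, iterations); the iteration count depends
    on how far the initial guess is from the solution.
    """
    p = (GAMMA - 1.0) * tau  # crude initial guess (an assumption)
    for it in range(1, max_iter + 1):
        rho, v, eps, W = state_from_pressure(p, D, S, tau)
        f = (GAMMA - 1.0) * rho * eps - p
        # Analytic df/dp for this EOS at fixed (D, S, tau).
        df = (GAMMA - 1.0) * v * v * (1.0 - D * W / (tau + D + p)) - 1.0
        p_new = max(p - f / df, 1e-15)  # keep the trial pressure positive
        if abs(p_new - p) <= tol * p_new:
            rho, v, eps, _ = state_from_pressure(p_new, D, S, tau)
            return rho, v, eps, p_new, it
        p = p_new
    raise RuntimeError("Newton iteration did not converge")


# Hypothetical sample states (rho, v, eps); iteration counts may differ per state.
for rho, v, eps in [(1.0, 0.1, 0.5), (1.0, 0.5, 1.0), (1.0, 0.9, 0.1)]:
    *_, n_iter = cons_to_prim_newton(*prim_to_cons(rho, v, eps))
    print(f"rho={rho}, v={v}, eps={eps}: {n_iter} Newton iterations")
```

Because each grid point may need a different number of iterations, threads executing such a loop in lockstep wait for the slowest point. A feedforward network instead evaluates a fixed sequence of matrix operations for every input, which is the direct-mapping property referred to above.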
4. Conclusions
This work introduces a novel ML-driven method for accelerating C2P inversions in relativistic hydrodynamics simulations, with a focus on hybrid piecewise polytropic and tabulated equations of state. By employing feedforward neural networks optimized with TensorRT, we achieve substantial performance improvements over traditional CPU solvers, offering a compelling alternative to computationally expensive iterative methods while maintaining high accuracy. Our results demonstrate that the TensorRT-optimized neural networks can process large datasets significantly faster, achieving up to 25 times the inference speed of traditional methods in the idealized node-to-node comparison. The success of this approach is rooted in the neural network’s ability to efficiently learn and represent the inherent symmetries and complex functional relationships within the EOS, effectively creating a direct mapping that bypasses iterative numerical solvers.
Future work will explore several key directions to refine and expand this approach. First, adapting the models to handle a broader range of equations of state will improve the versatility of this method across different simulation contexts. Second, exploring alternative network architectures, such as those incorporating physics-informed layers or adaptive activation functions to better handle physical discontinuities like phase transitions, could further enhance both accuracy and inference speed. Third, the models must be extended to handle full three-dimensional velocities before they can be fully integrated into production-level GRMHD codes. Additionally, continued optimization of the TensorRT engines, including advanced parallelization strategies and scaling across multiple GPUs, together with careful exploration of lower-precision formats like INT8, potentially with quantization-aware training, promises even greater reductions in computational time, enabling simulations of larger and more complex astrophysical systems. These improvements will be critical for advancing high-resolution simulations in numerical relativistic hydrodynamics.
We believe that ML-driven methods, particularly those incorporating TensorRT optimization, will play an essential role in advancing the field of general relativistic hydrodynamics and numerical relativity more broadly. To facilitate further validation and extension of these findings, we have made the software developed for this study publicly available at:
https://github.com/semihkacmaz/C2PNets (accessed on 27 August 2025).