The Parallel Compact Object CALculator: An Efficient General Relativistic Initial Data Solver for Compact Objects

Every numerical general relativistic investigation starts from the solution of the initial value equations at a given time. Astrophysically relevant initial values for different systems lead to distinct sets of equations that obey assumptions specific to the particular problem. Therefore, a robust and efficient solver for a variety of strongly gravitating sources is needed. In this work we present the OpenMP version of the Compact Object CALculator (COCAL) for shared-memory processors. We performed extensive profiling of the core COCAL modules in order to identify and address efficiency bottlenecks. Using modest resources, the new parallel code achieves speedups of approximately one order of magnitude relative to the original serial COCAL code, which is crucial for parameter studies of computationally expensive systems, such as magnetized neutron stars, as well as for its further development towards more realistic scenarios. As a novel example of our new code we compute a binary quark-star system in which each companion has a dimensionless spin of $0.43$ aligned with the orbital angular momentum.


Introduction
Gravitational wave astronomy was launched in 2015 with the first-ever gravitational wave detection of the inspiral and merger of a binary black hole system, event GW150914, as reported by the LIGO Scientific Collaboration [1]. Two years later, the simultaneous detection of gravitational waves from an inspiraling binary neutron star system, event GW170817, and of its postmerger electromagnetic emission spurred the era of multimessenger astronomy [2,3,4,5,6]. Merging binary neutron stars and black hole-neutron stars are not only important sources of gravitational radiation, but also promising candidates for coincident electromagnetic counterparts, which could give new insight into their sources. To understand these observations and, in particular, the physics of matter under extreme conditions, it is crucial to compare them to predictions from theoretical modeling, which, due to the complexity of the underlying physical phenomena, are largely numerical in nature.
Numerical modeling of strongly gravitating systems typically involves two steps: first, the ab initio calculation of the initial values that describe the astrophysical system at a given time, and second, the evolution of these initial data in order to describe the system at later times. These two general relativistic calculations involve two distinct codes: one that solves the elliptic problem to find the initial data, and another that solves the hyperbolic one to simulate their evolution. For the former purpose we have developed the Compact Object CALculator (COCAL), a serial code that solves for any compact object in equilibrium or quasiequilibrium in a variety of settings and/or mathematical formulations. In particular, COCAL has been used to calculate: (i) rotating neutron stars and quark stars (axisymmetric or triaxial, uniformly or differentially rotating) [7,8,9,10,11,12]; (ii) binary black holes in quasicircular orbits [13,14,15]; (iii) binary neutron stars and quark stars (irrotational or spinning) in quasicircular orbits [16,17]; (iv) magnetized rotating neutron stars with generic mixed poloidal and toroidal magnetic fields and a magnetosphere [18,19,20]; (v) self-gravitating, tilted black hole-disks (where the angular momentum of the disk is tilted with respect to the angular momentum of the black hole) [21]. The main characteristics of the COCAL code are the use of finite differences and of a Green's function approach, as first developed in [22] for neutron stars and in [23] for black holes, to achieve a convergent solution through a Picard-type iteration, known as the Komatsu-Eriguchi-Hachisu (KEH) method [24,25]. The field equations are solved in spherical coordinates in multiple patches, and a smooth solution is obtained everywhere through boundary surface integrals. For the fluid equations, surface-fitted coordinates are implemented that allow an accurate representation of the neutron star surface, which is important in order to impose boundary conditions.
Similar to COCAL, there exist a number of other codes whose purpose is to solve for initial data in general relativity. For binary as well as single compact objects these include TwoPunctures [26], LORENE [27], KADATH/FUKA [28,29,30], SGRID [31], Spells [32], Elliptica [33], and NRPyElliptic [34]. Each code uses its own numerical methods and, depending on the problem, may also use different mathematical formulations.
Compact systems (i)-(v) employ different mathematical formulations that nevertheless share common characteristics. For example, one well-known method for the calculation of initial data [35] is the Isenberg-Wilson-Mathews (IWM) formulation [36,37,38,39], in which the spatial three-dimensional metric is assumed to be conformally flat and the trace of the extrinsic curvature is zero. This method is applied from rotating neutron/quark stars [7,11] to binary black holes [13,14,15] and binary neutron stars [16,17], and results in five elliptic equations, i.e., a truncated system of the Einstein field equations. Notwithstanding the approximate nature of the method, its robustness and accuracy (especially for neutron stars) make the IWM method a workhorse in initial data construction. On the other hand, more intricate methods have been developed that can treat the full Einstein system both for neutron stars and black holes. In [8] rotating axisymmetric or triaxial neutron stars were computed using the waveless method [40,41], while in [18,19,20] this method is employed to calculate generic magnetized equilibria with mixed poloidal and toroidal magnetic fields as well as with a force-free magnetosphere. In [42,43] the same method has been used to compute binary neutron stars in quasicircular orbits. This method is referred to as the "waveless" formulation. For spacetimes that contain a black hole, a new method has been developed in [21] that solves the complete initial data equations. Contrary to the IWM method, the formulations that address the whole Einstein system [40,21] result in a significantly larger number of elliptic equations. For neutron star spacetimes [40] we have fourteen nonlinear Poisson-like equations, four of which result from imposing gauge conditions. For black hole spacetimes we have a total of seventeen equations, four of which result from imposing gauge conditions and another three from an additional decomposition of the extrinsic curvature [21]. The inclusion of magnetohydrodynamics will increase the total number of elliptic equations even further. It is therefore evident that, in order to study these astrophysically realistic systems in a systematic manner, an efficient parallelization scheme for the COCAL code needs to be developed.
In this paper we present the OpenMP version of the COCAL code, which we call the Parallel Compact Object CALculator, or PCOCAL for short. For the first time we have systematically analyzed the whole serial COCAL code and identified the bottlenecks both for single objects (e.g., rotating neutron stars) and for binaries (e.g., binary neutron stars), using either the IWM or the waveless formulation. In order to implement the OpenMP directives for parallelization we had to revise, and sometimes completely rewrite, many of the original subroutines. Using modest resources of 30-40 cores, both the single rotating neutron star module and the binary neutron star one achieve a speedup of an order of magnitude. In the binary case we observed a smaller speedup than in the single star case, due mainly to the communication between different coordinate systems, which is absent in the latter calculation. As an application of our new code we computed for the first time a spinning binary quark star system where each companion has a dimensionless spin of 0.43 aligned with the orbital angular momentum.
In the equations presented, Greek indices run from 0 to 3 and Latin indices from 1 to 3. We use the signature (−, +, +, +) for the spacetime line element, and a system of units in which $c = G = M_\odot = 1$ (unless explicitly shown).

The initial value equations
In this section, we summarize the 3+1 decomposition of Einstein's equations for two representative problems that will be discussed below: (i) single rotating neutron stars in the waveless formalism [40,8] and (ii) spinning binary neutron stars in the IWM formalism [16,17]. While details on these methods can be found in the aforementioned papers, here we give a broad overview of the equations that need to be solved, in order to discuss the time footprint of each one in the iteration procedure, the possible parallelization bottlenecks, and the optimizations we performed.
We assume that a spacetime M is foliated by a family of spacelike hypersurfaces $\Sigma_t$, parametrized by a time coordinate $t \in \mathbb{R}$ as $M = \mathbb{R} \times \Sigma_t$. The future-pointing unit normal one-form to $\Sigma_t$, $n_\alpha = -\alpha \nabla_\alpha t$, is related to the generator of time translations $t^\alpha$ by $t^\alpha = \alpha n^\alpha + \beta^\alpha$, where $t^\alpha \nabla_\alpha t = 1$. Here, $\alpha$ and $\beta^\alpha$ are, respectively, the lapse function and the shift vector, the latter being spatial, $\beta^\alpha \nabla_\alpha t = 0$. The projection tensor $\gamma^\alpha{}_\beta$ onto $\Sigma_t$ is introduced as $\gamma^\alpha{}_\beta := g^\alpha{}_\beta + n^\alpha n_\beta$. The induced spatial metric $\gamma_{ij}$ on $\Sigma_t$ is the projection tensor restricted to it. Introducing a conformal spatial geometry with spatial metric $\tilde\gamma_{ij} = \psi^{-4}\gamma_{ij}$, where $\psi$ is the conformal factor, we can write the line element on a chart $\{t, x^i\}$ of $\Sigma_t$ as $ds^2 = -\alpha^2 dt^2 + \psi^4 \tilde\gamma_{ij}(dx^i + \beta^i dt)(dx^j + \beta^j dt)$. The conformal rescaling is determined from the condition $\tilde\gamma = f$, where $\tilde\gamma$ and $f$ are the determinants of the rescaled spatial metric $\tilde\gamma_{ij}$ and the flat metric $f_{ij}$, respectively. We denote by $h_{ij}$ and $h^{ij}$ the differences between the conformal and the flat metric, $h_{ij} := \tilde\gamma_{ij} - f_{ij}$ and $h^{ij} := \tilde\gamma^{ij} - f^{ij}$. The extrinsic curvature of each slice $\Sigma_t$ is defined by $K_{ij} := -\frac{1}{2}\mathcal{L}_n \gamma_{ij}$, where $\mathcal{L}_n$ is the Lie derivative along the normal vector to $\Sigma_t$. Hereafter, we denote the trace of $K_{ij}$ by $K$, and the tracefree part by $A_{ij} := K_{ij} - \frac{1}{3}\gamma_{ij}K$. We define the conformal extrinsic curvature $\tilde A_{ij} = \psi^{-4}A_{ij}$, similar to the conformal spatial metric.
In this paper, we consider perfect-fluid spacetimes whose stress-energy tensor is written as $T^{\alpha\beta} = (\epsilon + p)u^\alpha u^\beta + p\, g^{\alpha\beta}$, where $\epsilon$ is the energy density and $p$ the pressure as measured by the comoving observer, i.e., an observer with 4-velocity $u^\alpha$. As discussed in [18,19], in the case of magnetized neutron stars the total stress-energy tensor includes, in addition to the perfect-fluid contribution, Eq. (5), the contribution of the magnetic field. We decompose Einstein's equations $E_{\alpha\beta} := G_{\alpha\beta} - 8\pi T_{\alpha\beta} = 0$ along the hypersurface as well as along its normal $n^\alpha$, obtaining the projections $E_{\alpha\beta}n^\alpha n^\beta = 0$, $\gamma^\alpha{}_i E_{\alpha\beta} n^\beta = 0$, $\gamma^{\alpha\beta}E_{\alpha\beta} = 0$, and the tracefree part of $\gamma^\alpha{}_i \gamma^\beta{}_j E_{\alpha\beta} = 0$, which correspond, respectively, to the Hamiltonian constraint, the momentum constraint, the spatial trace part (combined with the Hamiltonian constraint), and the spatial tracefree part. As shown in [40], the above set of field equations (6)-(9) reduces to elliptic (Poisson) equations for $\{\psi, \tilde\beta^i, \alpha\psi, h^{ij}\}$, respectively, where $\Delta_\circ$ is the flat-metric Laplacian (in arbitrary coordinates), $\Delta_\circ := f^{ij} \overset{\circ}{D}_i \overset{\circ}{D}_j$. The sources $S_H$, $S^i$, $S_{\rm tr}$, and $S^{ij}$ [21] depend nonlinearly on the unknown potentials $\{\psi, \tilde\beta^i, \alpha\psi, h^{ij}\}$ and are written in Appendix A. In the COCAL code we use the Cartesian components of the elliptic equations (10)-(13) on spherical grids (see Section 3). Equations (10)-(13) must be supplied with conditions on the boundary of our computational region. Since we consider only isolated single or binary stars, the boundary conditions for the gravitational equations are imposed only at spatial infinity, where we require asymptotic flatness.
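Collecting the statements above, the elliptic system and its boundary conditions can be summarized schematically (a reconstruction from the surrounding text rather than the paper's exact display; the detailed sources are those of Appendix A):

```latex
\Delta_\circ \psi = S_H , \qquad
\Delta_\circ \tilde{\beta}^i = S^i , \qquad
\Delta_\circ (\alpha\psi) = S_{\rm tr} , \qquad
\Delta_\circ h^{ij} = S^{ij} ,

\psi \to 1 , \qquad \tilde{\beta}^i \to 0 , \qquad
\alpha\psi \to 1 , \qquad h^{ij} \to 0
\qquad \text{as } r \to \infty .
```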

Rotating neutron star in the waveless formalism
Coordinate gauge conditions, namely maximal slicing, $K = 0$, and the generalized Dirac gauge, are assumed for rotating neutron stars. As described in Appendix A, $H^i := \overset{\circ}{D}_j \tilde\gamma^{ij} = 0$. These conditions simplify Eqs. (51)-(54) significantly; in particular $RLI_{ij} = 0$. The asymptotic behavior of the metric potentials becomes a Coulomb-type falloff. Our choice of $\beta^a$ is the shift in an (asymptotically) inertial frame. The Bianchi identity implies $\nabla_\alpha T^{\alpha\beta} = 0$. Under the assumption of rest-mass conservation, $\nabla_\alpha(\rho u^\alpha) = 0$, and an isentropic law, $\nabla_\alpha s = 0$, it leads to the relativistic Euler equation $u^\beta \omega_{\beta\alpha} = 0$, where $\omega_{\alpha\beta} := \nabla_\alpha(h u_\beta) - \nabla_\beta(h u_\alpha)$ is the relativistic vorticity and $h := (\epsilon + p)/\rho$ the relativistic specific enthalpy. We consider two cases of single rotating stars, both of which are described by a 4-velocity of the form $u^\alpha = u^t k^\alpha$. Here $k^\alpha = t^\alpha + \Omega\phi^\alpha$ is the helical symmetry vector, $\Omega$ the constant angular velocity of the star, and $\phi^\alpha$ the generator of rotational symmetry. For uniformly rotating stars it is $u^\beta \omega_{\beta\alpha} = -u^t \nabla_\alpha (h u_\beta k^\beta)$.

Axisymmetric rotating neutron stars
For stationary and axisymmetric systems, we impose time symmetry on both the three-dimensional metric and the extrinsic curvature, $\partial_t \tilde\gamma_{ij} = 0$ and $\partial_t K_{ij} = 0$. Note that we do not explicitly impose axisymmetry in our formulation. Under these assumptions $\tilde u_{ij} = 0$ in Eqs. (52) and (54). Note that the last terms involve second-order derivatives of the shift vector $\tilde\beta^i$. The stationarity condition for the fluid variables reduces the Euler equation (19) to the simple algebraic equation $h/u^t = C$, where $C$ is a constant to be determined. Note that $u^t$ is computed from the normalization condition $u_\alpha u^\alpha = -1$. In this work we also assume that neutron stars are "cold", i.e., they can be described by a zero-temperature equation of state (EOS), or, equivalently, $p = p(\epsilon)$. Such a one-parameter EOS is called barotropic. Equations (10)-(13) under the gauge conditions (15)-(16), together with Eq. (23), determine all gravitational and fluid variables.

Triaxial rotating neutron stars
For nonaxisymmetric rotating neutron stars we impose time symmetry on the three-metric but helical symmetry (stationarity in the rotating frame) on the extrinsic curvature. Under conditions (25) we have $\tilde u_{ij} = 0$ in Eq. (52), where $\omega^\alpha = \beta^\alpha + \Omega\phi^\alpha$ is the so-called corotating shift vector. The helical symmetry condition for the fluid variables reduces the Euler equation (19) to Eq. (23), as in the stationary and axisymmetric case.

Binary neutron stars in the IWM formalism
Binary neutron stars are computed within the IWM formalism, where the spatial metric is conformally flat, $h_{ij} = 0$, and the slicing is maximal, $K = 0$. This means that Eq. (13) is absent, while in Eqs. (10)-(12) the terms in the sources (51)-(53) associated with $h_{ij}$ (highlighted in magenta in Appendix A) are zero. The spacetime helical symmetry implies that Eq. (60) can be written in terms of the corotating shift $\omega^i = \beta^i + \Omega\phi^i$. The 4-velocity of the fluid here can be written as $u^\alpha = u^t(k^\alpha + V^\alpha)$, where $V^i$ is the fluid velocity in the corotating frame, and the helical Killing vector $k^\alpha$ refers to a binary system with orbital angular velocity $\Omega$. For irrotational binaries [44,45,46,47], we have $\omega_{\alpha\beta} = 0$, so the specific enthalpy current $hu_\alpha$ can be derived from a potential, $hu_\alpha = \nabla_\alpha\Phi$. In order to allow for arbitrarily spinning binary configurations, a 3-vector $s^i$ is introduced according to [48], $\hat u_i = D_i\Phi + s_i$, where the $D_i\Phi$ part corresponds to the "irrotational part" of the flow and $s_i$ to the "spinning part". Equations (30) and (31) then determine $u^t$. For arbitrarily spinning binaries, under the assumption of helical symmetry and an additional assumption for the spin of the neutron star, the Euler equation reduces to the first integral, Eq. (35), where $C$ is a constant to be determined. The conservation of rest mass together with Eq. (32) for the fluid velocity produces an extra elliptic equation, Eq. (36), for the fluid potential $\Phi$. Note that since $h_{ij} = 0$, the conformal geometry is flat, $\tilde\gamma_{ij} = f_{ij}$, and therefore $\partial_i\Phi = \tilde D_i\Phi$. The boundary of the fluid is the surface of the star; hence the boundary condition for Eq. (36) is of Neumann type, i.e., it is expressed in terms of derivatives of the rest-mass density and of $\Phi$. Equations (10)-(12) under the maximal slicing gauge condition, together with Eqs. (35) and (36), determine all gravitational and fluid variables. As in the rotating star case, $u^t$ is determined from the normalization of the 4-velocity. A barotropic EOS is assumed.

Numerical methods
All elliptic equations are solved using the representation theorem for partial differential equations through a Picard-type iteration. Starting from $\Delta_\circ f = S$, where $S$ is a nonlinear function of $f$ (here $f$ can be any of $\{\psi, \alpha, \tilde\beta^i, h^{ij}, \Phi\}$ and $S$ any combination of them), and using an appropriate Green's function $G(x, x')$ that satisfies certain boundary conditions, a solution for $f$ can be written as a volume integral of $G\,S$ over $V$ plus surface integrals over its boundary $\partial V$, Eq. (40), where $V$ is the domain of integration, $x, x' \in V \subseteq \Sigma_t$, the initial spacelike hypersurface. The volume $V$ and its boundary $\partial V$ depend on the coordinate system used and differ between isolated rotating neutron stars and binary neutron stars; we explain this difference in the next sections. This method is widely known as the KEH method [24,25].
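The KEH scheme can be illustrated with a one-dimensional toy problem. The sketch below is ours, not COCAL code: it solves $f'' = S(f, x)$ on $[0,1]$ with homogeneous Dirichlet data using the exact Green's function of that problem, a midpoint quadrature, and an under-relaxed Picard update; the nonlinear test problem and grid size are arbitrary choices for illustration.

```python
import numpy as np

# 1D toy version of the KEH/Picard scheme: solve f'' = S(f, x) on [0,1]
# with f(0) = f(1) = 0 using the Dirichlet Green's function
# G(x, x') = x_<(x_> - 1), so that f(x) = \int_0^1 G(x, x') S(x') dx'.
def keh_solve(source, n=200, lam=0.4, tol=1e-10, max_iter=200):
    x = (np.arange(n) + 0.5) / n           # midpoint grid
    dx = 1.0 / n
    xl = np.minimum.outer(x, x)            # x_<
    xg = np.maximum.outer(x, x)            # x_>
    G = xl * (xg - 1.0)                    # Green's function matrix
    f = np.zeros(n)
    for _ in range(max_iter):
        f_new = G @ source(f, x) * dx      # representation formula (midpoint rule)
        if np.max(np.abs(f_new - f)) < tol:
            f = f_new
            break
        f = lam * f_new + (1.0 - lam) * f  # under-relaxed Picard update
    return x, f

# Nonlinear test problem with exact solution f = sin(pi x):
#   f'' = -(pi^2 - 1) sin(pi x) - f
x, f = keh_solve(lambda f, x: -(np.pi**2 - 1.0) * np.sin(np.pi * x) - f)
err = np.max(np.abs(f - np.sin(np.pi * x)))
```

The relaxed update is a contraction here, so the iteration converges geometrically; the residual `err` is dominated by the second-order quadrature error.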
For the evaluation of the integrals in Eq. (40), a multipole expansion of $G(x, x')$ in associated Legendre functions, $P_\ell^m$, in spherical coordinates is used,
$$G(x, x') = \sum_{\ell=0}^{\infty} \sum_{m=0}^{\ell} \epsilon_m \frac{(\ell-m)!}{(\ell+m)!}\, g_\ell(r, r')\, P_\ell^m(\cos\theta) P_\ell^m(\cos\theta') \cos m(\phi - \phi') ,$$

Table 1: Summary of parameters used for single rotating star configurations. R(θ, ϕ) denotes the neutron star surface.
where the radial Green's function $g_\ell(r, r')$ depends on the boundary conditions and the coefficients $\epsilon_m$ are $\epsilon_0 = 1$ and $\epsilon_m = 2$ for $m \geq 1$. In practice the sum over $\ell$ is truncated at $\ell = L$. The standard value used for single rotating stars as well as binaries is $L = 12$. On the other hand, rapidly rotating black holes and magnetized neutron stars require a larger $L$ ($\lesssim 60$).
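The effect of the truncation can be checked directly in the axisymmetric special case of the expansion (only the $m = 0$ terms survive), where $1/|x - x'| = \sum_\ell (r_<^\ell / r_>^{\ell+1}) P_\ell(\cos\gamma)$ with $\gamma$ the angle between the two points. The following check is illustrative, not COCAL code; the sample radii and angle are arbitrary:

```python
import numpy as np

# Truncated multipole expansion of the Green's function 1/|x - x'| for two
# points with radii r, rp and separation angle gamma:
#   1/|x - x'| ~= sum_{l=0}^{L} (r_<^l / r_>^(l+1)) P_l(cos gamma)
def green_multipole(r, rp, cos_gamma, L=12):
    rl, rg = min(r, rp), max(r, rp)
    total, p_prev, p = 0.0, 0.0, 1.0        # P_0 = 1; p_prev holds P_{l-1}
    for l in range(L + 1):
        total += rl**l / rg**(l + 1) * p
        # Bonnet recurrence: (l+1) P_{l+1} = (2l+1) x P_l - l P_{l-1}
        p, p_prev = ((2*l + 1) * cos_gamma * p - l * p_prev) / (l + 1), p
    return total

# Compare with the exact distance formula
# |x - x'|^2 = r^2 + r'^2 - 2 r r' cos(gamma).
r, rp, gamma = 1.0, 2.5, 0.7
exact = 1.0 / np.sqrt(r**2 + rp**2 - 2.0*r*rp*np.cos(gamma))
approx = green_multipole(r, rp, np.cos(gamma), L=12)
```

The truncation error decays like $(r_</r_>)^{L+1}$, which is why well-separated source and field points need few terms while nearby ones (or sharp features such as magnetized-star current sheets) need larger $L$.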

Single rotating neutron stars
For a single rotating neutron star the only elliptic equations to be solved are the gravitational Eqs. (10)-(13). We employ a single spherical coordinate system that covers the region $r_a \leq r \leq r_b$, where $r_a = 0$ and $r_b \sim O(10^6 M)$, $M$ being the total mass of the system. The boundary conditions (14) are applied at $r = r_b$. Coordinate grids $(r_i, \theta_j, \phi_k)$ with $i = 0, \ldots, N_r$, $j = 0, \ldots, N_\theta$, and $k = 0, \ldots, N_\phi$ are freely specifiable except for the points at the boundary of the computational domain. The grid setup is the same as in [8]: the radial grid intervals $\Delta r_i := r_i - r_{i-1}$ are constant in an inner region, which typically extends up to $\sim 1.25 R_e$, where $R_e$ is the equatorial radius of the star, and increase thereafter in geometric progression, $\Delta r_i = k \Delta r_{i-1}$, with $k > 1$ a constant. For the angular coordinate grids $(\theta_j, \phi_k)$ we choose equally spaced points. Definitions of the parameters for the grid setups are listed in Table 1.
When applying Eq. (40) we use the Green's function $G(x, x') = 1/|x - x'|$. For the quadratures, a 2nd-order midpoint rule is used in the $r$ and $\phi$ integrations, and a 4th-order midpoint rule is used for the $\theta$ integration. We also use 2nd-order finite difference formulas for the $r$, $\theta$, and $\phi$ derivatives evaluated at the midpoints, except for the first radial derivative, for which a 3rd-order formula is used at the midpoints. For derivatives at the grid points $(r_i, \theta_j, \phi_k)$ a 4th-order finite difference formula is used. Note that since we use the midpoint rule for the numerical quadratures, the source terms, Eqs. (51)-(54), are computed at the midpoints of the grids.
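The quoted convergence order is easy to verify empirically. The following sketch (ours, not from COCAL) confirms the 2nd-order behavior of the composite midpoint rule by halving the grid spacing and measuring how fast the error shrinks:

```python
import numpy as np

# Composite midpoint rule on [a, b] with n equal cells.
def midpoint(fn, a, b, n):
    x = a + (np.arange(n) + 0.5) * (b - a) / n
    return np.sum(fn(x)) * (b - a) / n

exact = 2.0                                  # integral of sin on [0, pi]
e1 = abs(midpoint(np.sin, 0.0, np.pi, 64) - exact)
e2 = abs(midpoint(np.sin, 0.0, np.pi, 128) - exact)
order = np.log2(e1 / e2)                     # observed convergence order, ~2
```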
A relaxation parameter $\lambda$ is used when updating a newly computed variable $f \in \{\psi, \alpha, \tilde\beta^i, h^{ij}, \Phi, \rho\}$. If $f^{(n)}(x)$ is the value at the $n$-th iteration, and $f(x)$ the current result of the Poisson solver, Eq. (40), then the $(n+1)$-th iteration value is $f^{(n+1)} = \lambda f + (1 - \lambda) f^{(n)}$, where $0.1 \leq \lambda \leq 0.4$. Usually $\lambda = 0.4$ for all variables except $\Phi$, for which $\lambda = 0.1$. The criterion used by COCAL to stop the iteration for a variable $f$ is based on the relative error between two successive iterations, $2|f^{(n+1)} - f^{(n)}| / (|f^{(n+1)}| + |f^{(n)}|) < \epsilon$, required at all points of the grids and for all variables.
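A minimal sketch of the stopping test, assuming Eq. (43) is the pointwise relative difference between successive iterates as described above (the tolerance value below is illustrative, not COCAL's actual setting):

```python
import numpy as np

# Stopping test: the relative difference between successive iterates,
# maximized over all grid points, must fall below eps for every variable.
def converged(f_new, f_old, eps=1e-6):
    rel = 2.0 * np.abs(f_new - f_old) / (np.abs(f_new) + np.abs(f_old))
    return bool(np.max(rel) < eps)

x = np.linspace(0.5, 1.5, 101)       # sample grid where f stays away from 0
f_old = np.sin(x)
ok_small = converged(f_old + 1e-9, f_old)   # tiny update -> converged
ok_big = converged(f_old + 1e-3, f_old)     # large update -> keep iterating
```

Note that the symmetrized denominator makes the measure ill-behaved where a variable passes through zero, which is why the sample grid above avoids that region.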

Binary systems
For binary configurations the hypersurface $\Sigma_t$ is covered by at least three coordinate systems, as explained in detail in [13,14,15,16]. As an example of a general binary system, we show in Fig. 1 a black hole-neutron star ($x-y$ cross section), which consists of the following: (i) Coordinate system COCP-BH (top left) centered on the black hole, extending from an inner sphere $S_a$ (small black circle at its center) of radius $r_a$ to a sphere $S_b$ of radius $r_b$ (red outer circle). The surface $S_a$ denotes the black hole region, which is excised from the grid. If $M$ is the mass of the system, we have $r_a \sim O(M)$ while $r_b \sim O(100M)$. (ii) Coordinate system COCP-NS (top right) centered on the neutron star, extending from $r_a = 0$ to a sphere $S_b$ of radius $r_b$ (blue outer circle). Both COCP-BH and COCP-NS contain an excised sphere $S_e$ (although we use the common symbol $S_e$, these spheres are different in the two coordinate systems), which is introduced to improve the angular resolution and reduce the number of multipoles needed to resolve the companion object in a given coordinate system. (iii) Coordinate system ARCP, positioned at the center of mass of the system, extending from an inner sphere $S_a$ of radius $r_a \sim O(10M)$ (green inner circle) to an outer sphere $S_b$ of radius $r_b \sim O(10^6 M)$ (outer black circle).
In binary systems the elliptic equations (10)-(12) are solved separately in all three coordinate systems (Fig. 1), and a smooth solution emerges in an iterative process. On the other hand, Eq. (36) is solved only in coordinate systems where a neutron star exists. Therefore the computational cost of a binary system is significantly larger than that of a single rotating star. Convergence to a smooth solution is achieved because, in any given coordinate system, the surface integrals in Eq. (40) use the values of the potential functions $f$ from another coordinate system, as indicated by the red, green, and blue arrows in Fig. 1. For example, when we calculate the contribution of the surface integral on the red sphere $S_e$ inside the COCP-NS coordinate system (top right), we use the values of the potentials on the corresponding red sphere from the COCP-BH coordinate system (top left), as indicated by the red curved arrow. In this way a given potential ($\alpha$, $\psi$, $\beta^i$) from one coordinate system is communicated to all others, and at the end of the iterative process a smooth solution is obtained over the whole computational domain.

Speedup and efficiency results
The original serial COCAL consists of more than 400,000 lines of Fortran 90 code, and in order to use OpenMP parallelization many of its loops had to be rewritten in a way that efficiently utilizes the OpenMP capabilities. In particular, in nested loops, the code inside the innermost one had to be made independent between threads. For example, in a typical triple loop over the $r$ (loop 1), $\theta$ (loop 2), and $\phi$ (loop 3) coordinates, any code between loop 1 and loop 2, and/or between loop 2 and loop 3, is moved inside the innermost loop 3. The reason is to utilize the collapse clause, which collapses the multiple loops into a single one that is then divided among the threads. Most of our loops are three-dimensional, but many are four- or even five-dimensional, which results in large speedups through OpenMP parallelization. For example, in a binary system (Fig. 1), the calculation of the surface integral on the excised sphere $S_e$ uses the Legendre functions with respect to the coordinate system at the center of $S_e$. If $\theta_e$ is the spherical angle with respect to the $z$-axis at the center of $S_e$, the term $\cos(\theta_e)$ depends on all coordinates $r$, $\theta$, $\phi$ of that particular patch, and therefore these functions become five-dimensional, since they depend on the Legendre indices $\ell$, $m$ too.

Table 2: Summary of the grid parameters used for the binary systems, and in particular the binary neutron stars computed here. R(θ, ϕ) denotes the neutron star surface. Every patch has its own set of the parameters above.
Another commonly used clause is reduction, which is used to perform summations, as for example in the calculation of the volume and surface integrals. In this case the code is written so that the summation appears only inside the innermost loop, so that a combination of the collapse and reduction clauses can execute the operation across multiple threads. The reduction clause is also used in finding the maximum error in Eq. (43) for every variable in each coordinate system. The private clause is used extensively for local nested-loop variables so that multiple threads executing a parallel region operate independently. Listing a variable as private causes each thread that executes the construct to receive a new temporary variable of the same type, so the collapsed loop can be performed independently in parallel.
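What the collapse and reduction clauses accomplish can be sketched in serial Python (conceptual only; in PCOCAL this is done by the OpenMP runtime in Fortran): the triple $(i, j, k)$ loop is flattened into one index space, that space is split into per-thread chunks, each chunk accumulates a private partial sum, and the partials are combined at the end.

```python
import numpy as np

# Serial emulation of `collapse(3)` + `reduction(+:total)` on a triple loop.
def collapsed_reduction(src, n_threads=4):
    nr, nth, nph = src.shape
    n_total = nr * nth * nph                    # collapsed iteration space
    bounds = np.linspace(0, n_total, n_threads + 1).astype(int)
    partials = []
    for t in range(n_threads):                  # each pass stands in for a thread
        s = 0.0                                 # thread-private accumulator
        for idx in range(bounds[t], bounds[t + 1]):
            i, rem = divmod(idx, nth * nph)     # de-linearize back to (i, j, k)
            j, k = divmod(rem, nph)
            s += src[i, j, k]
        partials.append(s)
    return sum(partials)                        # the "reduction" step

rng = np.random.default_rng(0)
src = rng.random((6, 5, 4))
total = collapsed_reduction(src)
```

Moving all work inside the innermost loop, as described above, is what makes the flattened index space free of cross-iteration dependencies, so the chunks can run concurrently.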
Below we describe the two main modules of the PCOCAL code: (a) the parallelized single rotating star module (Sec. 2.1), and (b) the parallelized binary neutron star module (Sec. 2.2). For the single star solver (a), the computational infrastructure is straightforward (one spherical grid), but the mathematical formalism requires the solution of 14 elliptic equations. On the other hand, for the binary solver (b), a simpler mathematical formalism requires the solution of 6 elliptic equations in a more intricate computational domain involving at least three coordinate systems (see Fig. 1).
In order to quantify the speedup and efficiency of PCOCAL we define the following measures:
• $T_p(n)$: CPU wall-clock time of the parallelized part of the code using $n$ threads.
• T s : CPU wall-clock time of serial (nonparallelized) part of the code.Input/Output (IO) and copying between arrays are excluded.
• T IO : CPU wall-clock time of serial IO.
• T m : CPU wall-clock time of memory copy between arrays.
The reason for timing $T_m$ is that in COCAL/PCOCAL every main variable (e.g., the 3d conformal factor $\psi$) exists in all coordinate systems (which are at least three) and is stored in a higher-dimensional array (for $\psi$, a 4d array), where the extra index determines the patch (COCP-1, COCP-2, or ARCP). When calculating quantities in one patch while accessing a quantity in another patch, an array copy is necessary. The time for these copies is measured by $T_m$.
The total time for a calculation is $T = T_p + T_s + T_{IO} + T_m$. We define the speedup of the code using $n$ threads as $S_p(n) = T_p(1)/T_p(n)$ and $S(n) = T(1)/T(n)$, while the corresponding efficiencies are $E_p(n) = S_p(n)/n$ and $E(n) = S(n)/n$. The speedup $S_p(n)$ and efficiency $E_p(n)$ refer to the parallelized part of the code (which constitutes the large majority of the solver), while $S(n)$ and $E(n)$ refer to the total speedup and efficiency, which include the serial parts of the code, the input/output, and memory handling. Note that for the tests performed in this paper the input/output routines were neither optimized nor minimized (for diagnostic purposes), and thus the total speedup shown underestimates the real one.
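The definitions above can be sketched as follows; the timing numbers are made up for illustration, and only the formulas $S_p(n) = T_p(1)/T_p(n)$, $E_p(n) = S_p(n)/n$ (and their totals) follow the text:

```python
# Speedup and efficiency measures. T_p(n) is the wall-clock time of the
# parallelized part on n threads; Ts, Tio, Tm (serial, I/O, memory-copy
# times) do not shrink with n, so they cap the total speedup.
def speedups(Tp1, Tpn, n, Ts=0.0, Tio=0.0, Tm=0.0):
    Sp = Tp1 / Tpn                                       # S_p(n)
    Ep = Sp / n                                          # E_p(n)
    S = (Tp1 + Ts + Tio + Tm) / (Tpn + Ts + Tio + Tm)    # total S(n)
    E = S / n                                            # total E(n)
    return Sp, Ep, S, E

# e.g. a run whose parallel part drops from 100 s to 5 s on 36 threads,
# with 3 s of serial work and 2 s of I/O left untouched:
Sp, Ep, S, E = speedups(100.0, 5.0, 36, Ts=3.0, Tio=2.0)
```

The gap between `Sp` and `S` in this toy case shows the Amdahl-type penalty of the serial remainder, which is why the total speedup in the text underestimates the parallel-part speedup.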
To make sure that the parallelized PCOCAL code produces the same results as COCAL itself, we perform a pointwise check for every parallelized subroutine against the serial code and confirm that the differences between the two codes are ≲ 10 −12 in all variables.
Most of the runs were performed on server A, which has 36 cores with 2 threads per core (72 threads total) and Intel(R) Xeon(R) Gold 6254 CPUs at 3.10 GHz. We also used server B, which has 40 cores with 2 threads per core (80 threads total) and Intel(R) Xeon(R) Gold 6242R CPUs at 3.10 GHz. The Intel compiler (2021.3.0, 20210609) is used.

Single rotating neutron star
Following the methods of Sec. 2.1 and Sec. 3.1, we test our new code at three grid resolutions, as listed in Table 4.1. Resolution H2 has $N_r \times N_\theta \times N_\phi = 442368$ intervals (the numbers of points in the $r$, $\theta$, $\phi$ directions are $N_r + 1$, $N_\theta + 1$, and $N_\phi + 1$, respectively), while resolutions H3 and H4 have twice and four times as many intervals in every dimension, i.e., 8 and 64 times more intervals in total, respectively. The default number of terms in the Legendre expansions is $L = 12$. Below we also investigate the performance of the code with respect to $L$.
In Algorithm 1 we sketch the most salient steps taken for a solution. In parentheses (green fonts in Algorithm 1) we show the percentage of time needed for a given calculation per iteration, using a single core and the H3 resolution.

Algorithm 1: Rotating star in the waveless formalism, procedure RNS. [Listing not reproduced here; step 12 evaluates the convergence criterion, Eq. (43) (1%), and step 13 exits the loop on convergence.]

Lines that do not have a number indicate that the time taken for completion was less than 1%. The rest of the time is spent in the calculation of diagnostics as well as further output. Steps 2-12 are repeated iteratively until condition 13 is fulfilled, and hence convergence to a solution is achieved. As we can see, the most time-consuming routines are the computation of the momentum constraint sources $S^i$ (15%) and the calculation of the sources $S^{ij}$ (13%), followed by the calculation of the conformal Ricci tensor $\tilde R_{ij}$ (11%) and the Poisson solvers themselves (11%). The reason that the three source arrays $S^i$ of the momentum constraint take more time to compute than the six source arrays $S^{ij}$ is mainly the modular way in which COCAL implements the various mathematical formulations. In particular, for the waveless formulation the sources (right-hand sides of the Poisson equations) are split into two parts: (i) the part that comes from the conformally flat part of the metric, and (ii) the part that comes from the nonconformally flat one.
In this way, one has the flexibility of choosing a specific method (conformally flat vs. nonconformally flat) without rewriting code. The disadvantage is that triple loops over all grid points may be repeated; speed is thus sacrificed for modularity. In the computation of the momentum constraint sources $S^i$ for the waveless formulation (which is not conformally flat), both the fluid part and the gravitational part of the source computation are done twice, which leads to a larger time footprint than for the six source arrays $S^{ij}$. Overall, the bottleneck for the rotating neutron star module in the waveless formulation is the computation of the sources (lines 2-8) rather than the Poisson solvers (line 9). As we will see, this is in contrast with the binary neutron star module in the conformally flat approximation, where the latter dominates over the former. In Fig. 2, on the left panels, we plot the speedup $S_p$ (top) and efficiency $E_p$ (bottom) of the parallelized part of the code (solid lines), as well as the corresponding measures $S$ and $E$ for the total code (dashed lines). As mentioned above, $S$ underestimates the real speedup since the input/output routines were neither optimized nor minimized (for diagnostic purposes); thus in the discussion below we focus on the speedup $S_p$ and efficiency $E_p$. One characteristic of the speedup is that for all resolutions it reaches a maximum at $\sim 36$ threads, drops slightly afterwards, and from that point continues to increase. The maximum speedup of the parallelized part of the code is $18-20$ times the serial one (when $\lesssim 60$ threads are used), which can be achieved with a minimum of $\sim 36$ threads. At that point the efficiency is $\sim 50\%$. At maximum speedup the whole rotating neutron star code (including the serial routines that are not parallelized, such as the calculation of the parameters $\Omega$ in Eq. (20), $C$ in Eq. (23), and the coordinate scaling $R_0$ [8,16]) is $\sim 12$, 13, and 14 times faster than the serial analogue for resolutions H2, H3, and H4, at 36, 34, and 35 threads respectively. This can be seen in Fig. 2 on the right panels, where blue bars signify the parallelized routines of the code, red bars the serial part of the code, and green bars the memory copies between arrays. From this plot we see that the parallelized code constitutes the vast majority of the total number of subroutines and is responsible for the achieved speedup. We also observe that the higher the resolution (H4), the larger the total speedup, which is a promising result for high-resolution campaigns.

Figure 2: Left panels: Speedup (top) and efficiency (bottom) of the parallelized (solid lines) and total code (dashed lines) of the single rotating star module in the waveless formalism for three different resolutions H2, H3, and H4. In all cases the maximum number of terms in the Legendre expansion is L = 12. The vertical dashed line denotes the number of cores/CPU of server A. Right panels: Performance time of the whole code at the three resolutions.
In all tests presented in Fig. 2 we used L = 12 terms in the Legendre expansion, Eq. (41), and therefore in all the integrals, Eqs. (40). We find experimentally that this number of terms leads to accurate results without compromising the speed of the code. L = 12 is commonly used in all nonmagnetized rotating star calculations [7,8,9,10,11,12], as well as in the binary neutron star calculations [16,17] that we discuss in the next section. On the other hand, accurate magnetized rotating neutron star or black hole-disk solutions require a larger number of terms (≲ 60) to be included in the Legendre expansions. In Fig. 3 we plot the time of a single iteration (blue dots), as well as the time for the Poisson solvers alone (red dots), using the H3 resolution and a single core. Increasing the number of Legendre terms L increases the time spent on the Poisson solvers as

Time ≈ 0.32 (L − 12)^1.28 + c,    (46)

where c = 9.52. This fitting function is plotted with a solid green line. At the same time, the time spent on the whole iteration is fitted by the same relation, Eq. (46), but with c = 88.63, and is plotted with a cyan curve. In other words, for a given resolution, increasing the number of Legendre terms affects the Poisson solvers as well as the whole iteration in the same manner and scale, as ∼ L^1.28. In order to evaluate our new code with respect to the number L, we performed the same tests using the H4 resolution for three different numbers of Legendre terms, L = 12, 24, 48. The results are shown in the left panels of Fig. 4 with red, blue, and green curves for server A. In addition, we show with a cyan curve the speedup and efficiency of the L = 24 run on server B. The first conclusion is that the speedup is approximately preserved when the number of Legendre terms is increased. All runs on server A have a maximum at ∼ 35 threads. Beyond that point, and similarly to Fig. 2, the speedup drops and then starts to increase again. This behavior is qualitatively the same as for the run with L = 24 on server B, with the maximum now occurring at 40 threads and a speedup larger than 20. Given that the server A CPUs have 36 cores, while the server B CPUs have 40, we conclude that peak speedup with the minimum number of threads occurs approximately at the number of cores per CPU. The larger the number of cores, the larger the speedup. On the right panels we plot the total performance time for L = 12, 24, 48 in the H4 resolution on server A. We find that the maximum speedup of the total rotating neutron star code (including the routines that are not parallelized) is ∼ 14, 13, and 14 times that of the serial analogue in the H4 resolution for L = 12, 24, 48, occurring at 35, 36, and 35 threads respectively. Therefore the speedup is preserved for the whole code as well when the number of Legendre terms is increased, which is expected given that the parallelization scheme covers all the main components of PCOCAL.
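The fit of Eq. (46) can be encoded directly; the helper name below is ours, and only the fitted coefficients are taken from the text:

```python
def iteration_time(L, c, a=0.32, p=1.28):
    """Fit of Eq. (46): Time ≈ 0.32 (L - 12)^1.28 + c (seconds, H3, 1 core).
    c = 9.52 fits the Poisson solvers alone; c = 88.63 the full iteration."""
    return a * (L - 12) ** p + c

# Predicted single-core H3 times at L = 48, per the quoted fit only:
poisson_48 = iteration_time(48, c=9.52)   # Poisson solvers
full_48 = iteration_time(48, c=88.63)     # whole iteration
```

Since the L-dependent term is identical for both fits, the absolute time difference between a full iteration and its Poisson solvers stays constant (≈ 79 s here) as L grows, while the ratio tends to one.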

Binary neutron stars
Following the methods of Sec. 2.2 and Sec. 4, we test our new code in the three resolutions of Table 4. Resolution E2.5 has N_r × N_θ × N_ϕ = 1492992 intervals in each of the three coordinate systems (two COCPs and one ARCP in Fig. 1), while resolutions E3.0 and E3.5 have ∼ 1.3 and 2 times as many intervals in every dimension, i.e., ∼ 2.4 and 8 times as many intervals in total, respectively. In all binary cases we use L = 12 terms in the Legendre expansions. Notice that the BH coordinate system (COCP-BH, top left, red patch in Fig. 1) is now replaced by a NS coordinate system, similar to the blue, top right patch COCP-NS. Therefore there is no inner surface S_a (no boundary conditions), and r_a = 0 for the red patch in Fig. 1 in the binary neutron/quark star case.
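The total interval counts follow from cubing the per-dimension scale factors. A small consistency check (the factor 1.34 is our rounding of the quoted "∼ 1.3" so that its cube matches the quoted ∼ 2.4):

```python
# Total interval counts per coordinate patch, from the per-dimension
# scale factors quoted for Table 4 (E2.5 is the base resolution).
base_total = 1492992                                 # N_r * N_theta * N_phi for E2.5
scale = {"E2.5": 1.0, "E3.0": 1.34, "E3.5": 2.0}     # per-dimension factors (approximate)

totals = {res: base_total * f**3 for res, f in scale.items()}
# E3.0 has ~1.34^3 = 2.4x and E3.5 has 2^3 = 8x the intervals of E2.5.
```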
In Algorithm 2 we sketch the most important steps taken for a binary neutron star solution using a single core in the E3.0 resolution. As in Algorithm 1, we report the percentage of time needed for the completion of each step in green fonts inside parentheses; when such a number is absent, the time for completion was less than 1%. The rest of the time is spent on the calculation of diagnostics as well as further output. Steps 2-14 are repeated iteratively until the condition in line 15 is fulfilled. The main differences between the RNS and NSNS modules are: i) Steps 3-15 are performed in three coordinate systems (two COCP-NS and the ARCP), as seen in Fig. 1, instead of one in the RNS module. ii) The calculation of every potential ψ, α, β^i in the two COCP-NS coordinate systems is significantly more involved because of the excised sphere S_e2. iii) The additional calculation of the velocity potential Φ (∼ 4%), absent in the RNS module, is another important difference between the NSNS and RNS modules. iv) Since for binary neutron stars we solve for conformally flat initial data, the potentials h_ij = 0, and in addition the source terms (Eqs. (51), (52), (53)) are significantly simpler since the magenta terms are zero.

Figure 4: Same as in Fig. 2 but only for resolution H4 with L = 12, 24, 48. Speedup and efficiency on a second server (server B) are also shown for L = 24 (cyan curve in the left panels). As in Fig. 2, solid lines in the left panels correspond to S_p and E_p, while dashed lines to S and E. Vertical dashed lines denote the number of cores/CPU of servers A and B.
As mentioned above, surface integrals (see Eq. (40)) on S_e do not exist in the single coordinate system of the RNS module. When computing those integrals, as well as the surface integrals at S_b and S_a in a given coordinate system (e.g. on COCP-NS-1), the integrands are computed using the variables of another coordinate system (e.g. from COCP-NS-2). In this way the solutions on the different coordinate systems communicate with each other in order to achieve a smooth solution everywhere. Therefore, to compute the integrands of the surface integrals at S_a, S_b, and S_e, three-dimensional interpolations from nearby points of another coordinate system are needed. In total, the Poisson solvers take ∼ 41% of the time of an iteration. The time of the Poisson solvers in each COCP-NS patch is ∼ 17%, while in the ARCP it is ∼ 7%. The large difference between them is due to the fact that the ARCP patch does not have an excised sphere S_e. The surface integral on the off-center sphere S_e is the most time-consuming operation (∼ 11%), and it involves the computation of the corresponding associated Legendre functions, which are five-dimensional arrays in (r, θ, ϕ, ℓ, m). In the current implementation this time-consuming operation is done on the fly every time such an integral is calculated, in order to have a smaller memory footprint. Note that these functions do …

Table 4: Three different grid structure parameters used for the circular binary computation in COCAL. The amount of RAM used in each resolution is written in the first column. All variables are explained in Table 3.2 and the distances are in normalized quantities [16]. There are two COCP grids centered around each neutron star, and one ARCP centered at the center of mass of the system (see Fig. 1). Resolutions E2.5, E3.0, and E3.5 use 5.0, 9.9, and 30.6 GB of RAM, respectively.
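The patch-to-patch transfer of integrand values relies on three-dimensional interpolation. The following is a minimal trilinear sketch of that kind of operation on a uniform grid; COCAL's actual interpolation routine (stencils, non-equidistant grids) differs:

```python
import numpy as np

def trilinear(field, xs, ys, zs, p):
    """Trilinear interpolation of `field` (shape (nx, ny, nz), sampled on
    the 1-d coordinate arrays xs, ys, zs) at the point p = (x, y, z).
    A minimal sketch of 3-d interpolation between coordinate patches."""
    idx, wts = [], []
    for coord, ax in zip(p, (xs, ys, zs)):
        # index of the cell containing `coord`, clipped to the grid
        i = int(np.clip(np.searchsorted(ax, coord) - 1, 0, len(ax) - 2))
        t = (coord - ax[i]) / (ax[i + 1] - ax[i])  # fractional position in cell
        idx.append(i)
        wts.append(t)
    i, j, k = idx
    tx, ty, tz = wts
    val = 0.0
    # weighted sum over the 8 corners of the enclosing cell
    for di, wx in ((0, 1 - tx), (1, tx)):
        for dj, wy in ((0, 1 - ty), (1, ty)):
            for dk, wz in ((0, 1 - tz), (1, tz)):
                val += wx * wy * wz * field[i + di, j + dj, k + dk]
    return val
```

Trilinear interpolation reproduces any linear field exactly, which makes a convenient correctness check for such transfer routines.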
In Fig. 5, on the left panels, we plot the speedup S_p (top) and efficiency E_p (bottom) of the parallelized part of the binary NSNS code (solid lines), as well as the corresponding measures S and E for the total code (dashed lines). Similarly to the RNS module, Fig. 2, we find that for all resolutions the speedup increases until a certain number of threads, then exhibits a sudden small drop, from which point it continues to increase with the number of threads. One difference with respect to Fig. 2 is that the speedup curves are distinct for resolutions E2.5, E3.0, and E3.5, while for the RNS module we observe a broad overlap of these curves between resolutions H2, H3, and H4. The reason for this behavior is the communication between the different coordinate systems in the binary code, which is absent in the single neutron star module. Despite this, the speedup of the parallelized code reaches values of 12-16 using just ∼ 36 threads. The efficiency at that point is ∼ 40%. This speedup can increase further if more than 60 threads are used. Another finding in Fig. 5 is that for the highest resolution E3.5 the speedup is broadly constant from ∼ 26 to ∼ 35 threads, which is due to the domination of data communication between the different grids in the calculation process. On the left panels of Fig. 5 we also see that the maximum speedup of the total binary code (including the routines that are not parallelized) is 6, 5, and 5 times that of the serial analogue for resolutions E2.5, E3.0, and E3.5, and occurs at 33, 34, and 26 threads respectively. The distribution of times in the whole code can be seen on the right panels of Fig. 5. In the future we will address the data communication and memory handling in order to make the binary speedup scale similarly to that of the RNS code. In practice, a binary neutron star in quasicircular orbit that took approximately six days of continuous computation in the E3.0 resolution with the COCAL code needs approximately one day with PCOCAL, using a modest number of ∼ 30 threads.
The differences between the PCOCAL and COCAL solutions for a binary neutron or quark star are ≲ 10^−12 in all coordinate systems for the gravitational variables ψ, α, β^i and the fluid variables ρ, Φ, as well as for the constants Ω, C, and R_0 that appear in the calculation.

Spinning binary quark stars
Given that the state of matter at supranuclear densities is an open question, there exists the possibility that conventional neutron stars contain in their core a phase transition to strange-quark matter, i.e. up, down, and strange free quarks. A quark-gluon plasma has been observed in heavy-ion collisions at CERN and RHIC [49]. Also, Bodmer and Witten [50,51] pointed out that it is possible that the ground state of matter at zero pressure and large baryon number is not iron but strange quark matter. Thus, from a theoretical standpoint, one can have pure quark stars. In [11,52,53,12,54] we have computed isolated axisymmetric, triaxial, and differentially rotating quark stars, while in [55] irrotational binary quark stars have been evolved in grid-based hydrodynamics. Here we present an example, using the PCOCAL code, of a spinning binary quark star, where each companion has a spin along the axis of the orbital angular momentum.
As in our previous studies, the quark EOS is the MIT bag model, p = (1/3)(ϵ − 4B), or equivalently in a parametric form, where B = ϵ_s/4 is the bag constant, set to 52.5 MeV fm^−3, and ϵ_s is the surface energy density.
The parameter ρ can be regarded as the rest-mass density, and K is a constant. As explained in [54], changing K does not affect the EOS; thus we choose this constant such that at the surface ρ_s = ϵ_s, i.e. h_s = 1. This is accomplished for a specific value of K. Assuming ρ_s = 1.4 ρ_nuc, this EOS predicts a maximum spherical quark star mass of M_sph^max = 2.1 M_⊙ with R_sph^max = 11.5 km. Using the formalism described in Sec. 2.2 with a spin vector (x_s, y_s are coordinates with respect to the quark star center) in Eq. (31), and solving Eqs. (10), (11), (12) and (36), (31), (35), we compute here a spinning binary quark star at 37.3 km separation with Arnowitt-Deser-Misner mass M = 2.84. The quasilocal spin [17] of each companion is J_ql = 0.858 along the axis of orbital motion. This spin angular momentum corresponds to an approximate dimensionless spin of J_ql/(M/2)^2 = 0.43. In Fig. 6 we plot the rest-mass density (left quark star), the velocity potential (right quark star), contour plots of the conformal factor ψ (green dashed lines) and the velocity potential (brown dashed lines), as well as the fluid velocity with respect to the corotating frame. The direction of the velocity arrows shows that the spin angular momentum is along the positive z axis (contrary to the irrotational case), in accordance with Eq. (50). From the density colorbar we see that the central density ρ_c is approximately 1.55 times the surface density ρ_s. The nonzero surface density is characteristic of self-bound quark stars, contrary to their neutron star analogues. For nonspinning (irrotational) binaries of the same surface density (and rest mass), the central density is higher, ρ_c/ρ_s = 1.63. This behavior is similar to binary neutron stars as well as to isolated rotating neutron/quark stars. Note that in order to find the quasiequilibrium solution, a root-finding method over the central quark star density is needed, and therefore a number of cycles (∼ 10) of Algorithm 2 is performed. Pointwise comparison between the PCOCAL and COCAL solutions shows differences ≲ 10^−12 for all variables.
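The two scalar relations used above are simple to verify. A sketch with the quoted values (the helper names are ours; units follow the text, with the spin in geometric units G = c = M_⊙ = 1):

```python
def mit_bag_pressure(eps, B=52.5):
    """MIT bag model, p = (eps - 4B)/3, with eps and B in MeV fm^-3.
    B = 52.5 MeV fm^-3 is the value used in the text, so eps_s = 4B = 210."""
    return (eps - 4.0 * B) / 3.0

def dimensionless_spin(J_ql, M):
    """Approximate dimensionless spin chi = J_ql / (M/2)^2 per companion,
    for a binary of total ADM mass M."""
    return J_ql / (M / 2.0) ** 2

# Quoted configuration: J_ql = 0.858, ADM mass M = 2.84.
chi = dimensionless_spin(0.858, 2.84)   # ≈ 0.43
p_surface = mit_bag_pressure(4 * 52.5)  # pressure vanishes at the surface
```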

Conclusions
In this work we presented a new efficient parallelized code, PCOCAL, for general relativistic compact object initial data. PCOCAL is based on the serial COCAL code (which has been tested in a wide range of astrophysical scenarios) and employs extensive OpenMP practices. To optimize PCOCAL, a thorough profiling of the serial COCAL code was done for the first time. Both the single and the binary neutron star modules have been examined in detail, and the timing of each subroutine was measured. The parallelized module that solves the full set of Einstein equations for a single rotating neutron star is found to be 12-14 times faster than the corresponding serial code, with the parallel part being 18-20 times faster than the serial one. The parallelized module that solves for a binary neutron star system in quasiequilibrium is 5-6 times faster than the corresponding serial code, with the parallel part being 12-16 times faster than the serial one. As a novel example, we computed initial data for a spinning binary quark star system that could be a binary neutron star mimicker. The achieved speedup will help us explore a larger set of problems that require intensive calculations, as for example the calculation of magnetized neutron stars. As a next step, we plan to revisit the data communication and memory handling of the binary module in order to achieve better speedups at the highest resolutions, as in the single rotating star module. In addition, we will perform domain decomposition in order to implement a combined MPI-OpenMP scheme.

Declarations
The authors declare that there are no data associated with this manuscript.

A Source terms
The source terms S_H, S^i, S_tr, and S_ij in Eqs. (10)-(13) are given below, where with magenta color we signify terms that are due to the non-conformally-flat contribution h_ij. Such terms are zero in the IWM formulation used in the binary neutron star calculations of Sec. 2.2. We also define the conformally rescaled shift as β̃^i = β^i, and therefore β̃_i = γ̃_ij β̃^j = ψ^−4 β_i. The covariant derivatives associated with γ_ij, γ̃_ij, and f_ij are, respectively, D, D̃, and D̊.

The matter sources that appear on the right-hand side of Eqs. (51)-(54) are S := T^μν γ_μν, (63) among others. Another important term in our system is the one that involves R̃_ij, the 3-d Ricci tensor associated with the conformal geometry γ̃_ij. One can show [56] that it splits into a linear part R̃^LI_ij and a nonlinear part R̃^NL_ij, where Γ̃^i := −D̊_j γ̃^{ij}. Notice that the terms R̃^LI_ij and R̃^NL_ij enter into Eq. (54), which we discuss below. The nonlinear term R̃^NL_ij is second order in h_ij and therefore smaller than the first-order terms R̃^LI_ij and Δ̊ h_ij in Eq. (65). The term R̃^LI_ij involves the gauge functions Γ̃^i, which are the conformal connection functions in the BSSN formulation [57,58,59]. For initial data, in order for the whole system (51)-(54) to converge, these functions must be fixed [56,21] as Γ̃^i = −H^i, (69) where H^i is known. Eqs. (69) are related to our freedom in choosing spatial coordinates. In order to impose Eqs. (69) on the solutions of the system Eqs. (51)-(54), and thereby have a self-consistent iteration scheme, an adjustment of the h_ij is necessary. Following [43] (or [8], Eqs. (29)-(32)), gauge vector potentials ξ^i are introduced through a transformation that is used to adjust the h_ij. The adjusted h′_ij are chosen to satisfy the condition D̊_i h′^{ij} = H^j given by (68). The gauge vector potentials ξ^i are then solved for from elliptic equations, and subsequently the h_ij are replaced by the adjusted values of Eq. (71).
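The gauge-vector adjustment described above can be written schematically as follows. This is a sketch in the standard form of such transformations; the precise trace handling follows [43] and [8]:

```latex
% Adjustment of h_ij by a gauge vector xi^i (schematic form):
h'_{ij} = h_{ij} + \mathring{D}_i \xi_j + \mathring{D}_j \xi_i ,
\qquad
\mathring{D}^i h'_{ij} = H_j
\;\Longrightarrow\;
\mathring{\Delta}\, \xi_j + \mathring{D}_j \mathring{D}_i \xi^i
  = H_j - \mathring{D}^i h_{ij} ,
```

where Δ̊ := D̊^i D̊_i is the flat Laplacian (flat derivatives commute, so D̊^i D̊_j ξ_i = D̊_j D̊_i ξ^i). The elliptic equations for ξ_j can then be handled with the same Poisson infrastructure as the field equations.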

r_a : Radial coordinate where the grid r_i starts.
r_b : Radial coordinate where the grid r_i ends.
r_c : Radial coordinate between r_a and r_b where the grid changes from equidistant to non-equidistant.
N_r : Total number of intervals Δr_i between r_a and r_b.
N^m_r : Number of intervals Δr_i in [r_a, r_c].
N^f_r : Number of intervals Δr_i in [0, R(θ, ϕ)].
N_θ : Total number of intervals Δθ_i for θ ∈ [0, π].
N_ϕ : Total number of intervals Δϕ_i for ϕ ∈ [0, 2π].
L : Number of multipoles in the Legendre expansion.
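A radial grid of this kind (equidistant up to r_c, non-equidistant beyond) can be sketched as follows. The geometric stretching law and the parameter values in the example call are our assumptions; only the equidistant/non-equidistant split at r_c is taken from the table:

```python
import numpy as np

def radial_grid(r_a, r_c, r_b, N_m, N_r, ratio=1.1):
    """Sketch of a COCAL-style radial grid: N_m equidistant intervals in
    [r_a, r_c], then N_r - N_m stretched intervals in [r_c, r_b].
    The geometric stretching law (each width `ratio` times the previous)
    is an assumption, not COCAL's actual prescription."""
    eq = np.linspace(r_a, r_c, N_m + 1)
    n = N_r - N_m
    w = ratio ** np.arange(n)          # widths dr * ratio^k, k = 0..n-1
    w *= (r_b - r_c) / w.sum()         # scale widths to fill [r_c, r_b]
    noneq = r_c + np.cumsum(w)
    return np.concatenate([eq, noneq])

# Hypothetical parameters: equidistant out to r_c = 1.25, then stretched
# out to a far outer boundary.
grid = radial_grid(0.0, 1.25, 1.0e6, N_m=80, N_r=192)
```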

Figure 1 :
Figure 1: A typical setup with multiple coordinate grid patches in the COCAL code for a black hole-neutron star system. The left and right top patches are compact object coordinate patches (COCP), centered at the black hole and the neutron star, respectively. The smallest circle with a thick curve in COCP-BH is the sphere S_a, where the interior region is excised and certain boundary conditions are imposed. The ovals drawn in COCP-NS denote the neutron star. The bottom patch is the asymptotic region coordinate patch (ARCP), centered at the mass center of the system. The arrows represent maps of potentials between the multiple patches. Note that the spheres S_a, S_b, and S_e of these coordinate patches are distinct despite our use of a common symbol. The radius of each coordinate patch does not reflect the size used in actual computations.

r_a : Radial coordinate where the radial grids start. For the COCP-NS patch it is r_a = 0.
r_b : Radial coordinate where the radial grids end.
r_c : Center of mass point. The excised sphere is located at 2r_c in the COCP patch.
r_e : Radius of the excised sphere. Only in the COCP patch.
N_r : Number of intervals Δr_i in [r_a, r_b].
N^f_r : Number of intervals Δr_i in [0, R(θ, ϕ)] for the COCP-NS patch, or in [r_a, r_a + 1] for the ARCP patch.
N^m_r : Number of intervals Δr_i in [r_a, r_c].
N_θ : Number of intervals Δθ_j in [0, π].
N_ϕ : Number of intervals Δϕ_k in [0, 2π].
L : Order of included multipoles.

Figure 3 :
Figure 3: Time for the Poisson solvers (red dots), as well as for a single iteration (blue dots), for the H3 resolution on a single core, for various numbers of Legendre terms L. Solid lines represent fitting functions.

Figure 5 :
Figure 5: Same as in Fig. 2 but for binary neutron stars. The resolutions used are shown in Table 4 and are E2.5, E3.0, and E3.5. As in Fig. 2, solid lines in the left panels correspond to S_p and E_p, while dashed lines to S and E.

Figure 6 :
Figure 6: A spinning binary quark star. The rest-mass density ρ is shown for the star on the left, while the velocity potential Φ is plotted for the star on the right. Contour plots of the conformal factor ψ (green dashed lines) and the velocity potential Φ (brown dashed lines in the right quark star), as well as the fluid velocity with respect to the corotating observer, are shown.

Table 3 :
Three different resolutions used for the single rotating star tests. Parameters are shown in Table 3.1. The number of points that covers the largest star radius is N^f_r. Resolutions H2, H3, and H4 use 2, 10.3, and 80.1 GB of RAM, respectively.