Continuous Gravitational-Wave Data Analysis with General Purpose Computing on Graphic Processing Units

: We present a new approach to searching for Continuous gravitational Waves (CWs) emitted by isolated rotating neutron stars, using the high parallel computing efﬁciency and computational power of modern Graphic Processing Units (GPUs). Speciﬁcally, in this paper the porting of one of the algorithms used to search for CW signals, the so-called FrequencyHough transform, on the TensorFlow framework, is described. The new code has been fully tested and its performance on GPUs has been compared to those in a CPU multicore system of the same class, showing a factor of 10 speed-up. This demonstrates that GPU programming with general purpose libraries (the those of the TensorFlow framework) of a high-level programming language can provide a signiﬁcant improvement of the performance of data analysis, opening new perspectives on wide-parameter searches for CWs.


Introduction
Gravitational waves are a phenomenon described by Albert Einstein in his Theory of General Relativity [1,2] as perturbations of space-time generated by a mass distribution with time varying quadrupole moment: they propagate as variations of the space-time metric, changing, in time, the proper distance between space-time points.
The first direct detection of gravitational waves by LIGO and Virgo collaborations [3] has opened a new window into observing the Universe. All signals detected so far have been produced by the inspiral and coalescence of binary systems made of two black holes or neutron stars (e.g., [4]). These signals are transient by nature, that is, their duration is much shorter (the order of seconds) than the detector typical observation time, whose observation runs typically last for months. Neutron stars, however, can emit Continuous gravitational Waves (CWs), as a consequence of an asymmetry with respect to the rotation axis.
The search for this class of signals is challenging, mainly because they are much weaker than those from compact binary coalescences (see [3][4][5][6]). Moreover, for these long lasting signals, the characteristic frequency is modulated by the Doppler effect due to the Earth's motion and the source orbital motion (for sources in binary systems), further complicating the analysis [7,8]. 2 of 12 The amplitude of the wave emitted by a neutron star, which is modelled as an asymmetric ellipsoid-shaped body, rotating with frequency f r around one of its principal axes of inertia, is given by [7,9] h 0 = 4π 2 f 2 0 G c 4 r I where is the star's ellipticity, which is a measure of its degree of asymmetry, I is its moment of inertia with respect to the rotation axis, r is the distance from the source to Earth, G is the gravitational constant and c the speed of light in vacuum. In the second line of Equation (1), a parametrization of the signal amplitude is given with typical parameters for a neutron star. The emission mechanism we are considering in this paper is driven by the asymmetry of the source with respect to the rotational axis, given by the parameter, and is such that the signal frequency is twice the rotational frequency: f 0 = 2 f r [10,11]. Other emission mechanisms exist but are not considered for the kind of analysis described in this paper [12][13][14][15].
Considering a stationary rotating compact star, which loses energy only via gravitational waves, the emitted gravitational flux L GW is equal to the rotational energy loss, L GW =Ė r = π 2 I fḟ , and the rotational f frequency will decrease in time. Given the smallness of this effect, with a good accuracy we can consider the signal frequency to vary linearly in time, that is, with a constant spindown (i.e., the time frequency first derivativeḟ ). Consequently, the signal shape will be a sinusoidal function with frequency following at the first order the equation: with k being a constant that parametrizes the first derivative of the frequency, usually called the spindown parameter. Equation (2) will appear as a straight line across the time-frequency domain. The number of neutron stars observed by electromagnetic emission is 2500 [16], but a population of about ∼10 8 -10 9 [17] unseen objects of the same kind is expected to exist in our galaxy. Since, in principle, a fraction of them could emit gravitational waves within the sensitivity band of the gravitational-wave detectors, the search for CWs is split into two main classes (plus some intermediate cases): coherent searches for known neutron stars and all-sky searches over a wide portion of the search parameter space (consisting of the position in the sky, the star rotation frequency and frequency derivatives for isolated stars, plus the orbital parameters for the sources in binary systems); see, for example, [18,19] for thorough reviews.
The main challenge in all-sky searches is the computational cost of the data analysis, which is so far unaffordable with respect to the cited compact binary coalescence searches, if we want to adequately cover the search parameter space with a coherent method. To this purpose, hierarchical analysis algorithms (so-called pipelines) have been developed (see [6] and references therein).
Nevertheless, the analyses are still very challenging from a computational point of view. Hence, the fast evolution of parallel computing on Graphical Processing Units (GPUs) appears to be promising: in recent years, GPU technology has been developed well, with devices able to carry out a fast increasing amount of floating point operations per second. Moreover, as in the case of CW searches, they are extremely efficient for the computation of highly parallelizable algorithms.
We present the porting on GPUs of the core of the CW all-sky search algorithm called FrequencyHough [7],using the TensorFlow framework [20] as a high-level, general purpose numerical computation library. We will show that the performances of the porting can reduce by an order of magnitude the time needed for a full all-sky analysis. Thanks to such an improvement, we have already deployed the GPU-based FrequencyHough code to analyze data from the third observing scientific LIGO-Virgo detectors, and the related results will be disclosed in an upcoming observational paper.

Continuous-Wave Search
The search for CWs is challenging because the signals are weak compared to the noise floor. If we do not know the sky position of the source, as happens with all-sky searches, detecting CWs is even more difficult because of the dimension of the search parameter space: the Doppler modulation due the Earth's motion will modulate the characteristic frequency time law of a signal (Equation (2)) from an isolated neutron star, depending on the source position [21].
In this paper, we consider one of the standard search pipelines to search for CWs, that is, the mentioned FrequencyHough [7]. In particular we focus on the section that identifies the signal candidates and estimates their parameters, through an implementation of the so-called Hough transform (see Figure 1 which summarizes the core of the pipeline).
Briefly, the pipeline starts from a database of short Fourier transforms (FFT), computed from chunks of the data with a given duration T FFT [7]. The most significant peaks are selected in the equalized spectra from each FFT, producing a collection of time-frequency peaks that we refer to as peakmap. The peakmaps are stored in files that will be used as input for the FrequencyHough step of the pipeline, and they cover the full time of the detector observational run and a frequency interval of 1 Hz for f < 128 Hz and of 5 Hz for f ∈ [128, 2048] Hz.
For every point in the sky, which is also discretized, the Doppler effect is corrected: a double sinusoidal pattern in the peakmap, due to the superposition of orbital and rotational Earth Doppler-shifts, becomes a straight line over the whole observation time. Figure 1. A scheme that shows the steps of the full FrequencyHough pipeline. They are summarized in Section 2 and described in detail in [7]. The FrequencyHough transform and the candidate selection, highlighted in red, are the parts of the pipeline that have been ported on GPUs.
The peaks are then fed to the FrequencyHough transform algorithm (for details see Section 2.1). This step of the analysis is crucial because it produces the set of candidates, that is, significant points in the signal frequency-frequency time derivative space, and represents the main computational burden of the whole pipeline. Once we have selected the candidates from both detectors, they are stored in files where their parameters are saved (sky coordinates, frequency, spindown), and this is the final output of the FrequencyHough transform step.
After the candidates collection and storage, their parameter space position is refined and their number is reduced, by clustering those below a certain distance in the parameter space and, most importantly, by matching the coincidence scheme between two or more detectors. The selected candidates are fed to a follow-up analysis in order to finally discard them or claim a detection [7].

Hough Transform
The FrequencyHough transform is a special implementation of CW searches of the so-called Hough transform, a pattern-recognition method [22] that was conceived for the study of subatomic particle tracks in bubble chambers, where curved tracks are divided, with good approximation, in sufficiently small line segments.
Considering an image where the line is formed by a series of co-linear black points with a pure white background, the detection of a track using the Hough transform consists of the transformation of the input image into another image (that we refer to as "the Hough map"), where a point with coordinates (x, y) is converted into a straight line, whose slope and intercept are given by the point coordinates in the input image. If we take a sequence of points along a straight track with equation y = mx + q, they will be represented in the Hough map by straight lines with changing slopes and intercepts, forming an intersecting family of lines. The coordinates of the incidence point in the Hough space will be the parameter values (m, q) of the line in the coordinate space (see Figure 2 for an example). Then, the Hough transform translates a straight line in the input coordinate space into points in the parameter space and allow us to measure the parameters of straight lines in the input images. Since we work with digitized images, the parameter space is discretized. The Hough map will then be a 2D histogram, where each pixel will have a count depending on how many lines are entered into that parameter space region. A given pixel in the map records the number of lines passing through it and the number count in each pixel will provide information on the significance of a point in the parameter space, within the error given by the image resolution (i.e., the bin sizes of the input space) .
The Hough transform can be generalized in the following ways: (i) enhancing or reducing the resolution of the Hough map (to have either better precision on the parameter estimation or to reduce the computation load); (ii) using any N-dimensional manifold as input space and an M-dimensional manifold as parameter space, searching for curves rather than straight lines [23][24][25].

FrequencyHough Transform
The FrequencyHough transform is particularly well suited to searching for continuous gravitational waves [7,26]. As already stated, it starts with an input peakmap where the Doppler effect due to the Earth's motion has been corrected. The transformation is from the peaks of the detector frequency/time plane to the gravitational wave frequency/spindown parameters plane.
With intrinsic frequency f 0 and spindownḟ as parameters of a given neutron star waveform, and t 0 as an arbitrary reference time, the expected path in the peakmap at first order is given by the time law in Equation (2) [26]: Keeping in mind that the frequency bins in the peakmap have a width ∆ f = 1/T FFT , we enhance the resolution of the parameters in the FrequencyHough map by ten times: ∆ f = ∆ f /10. A peakmap point is then transformed into a stripe between two parallel straight lines with a width given by the relation Then, for every row (i.e., along the spindown dimension), the FrequencyHough map is computed with a differential method: all the elements of one of the two edges, that is, those that match the relation f 0 = f − ∆ f /2 − tḟ , are increased by 1. Once this step is carried out, the elements on the other edge of the stripe are decreased by 1, straightforwardly using array slicing to immediately identify the elements matching f 0 = f + ∆ f /2 − tḟ (we just take the elements found in the first step and move by the number of bins corresponding to ∆ f ). Finally, the row is cumulatively summed. An example is shown in Figure 3, where a straight line has been superimposed onto a portion of a peakmap and then the FrequencyHough has been run on it. Exactly like the naive example in Figure 2, every point in the input is transformed into a straight line (in this case into a belt between two lines), and the aligned points of the straight line produce belts in the FrequencyHough map that converge in the parameters of the input line, allowing us to identify it when searching for the pixel in the FrequencyHough map with the highest number count.
For all-sky searches, the size of the parameter space given by the sky grid is crucial. For each input peakamp, the number of sky points depend on its maximum frequency and on the FFT duration [7]. In Figure 4, we show the sky resolution versus the frequency for the four bands used for the standard FrequencyHough analysis.
To cover all parameter space across the full frequency range, millions of core hours are necessary for a typical analysis over several months of data. GPU parallelism could support a significantly higher computation efficiency, allowing us to run a much deeper analyses.

General Purpose Computing on GPUs
A GPU is a device created and developed to perform computations with an extremely high parallelism and the best possible efficiency.
After very fast technological development, GPUs currently have hundreds and even thousands of cores and several GB of dedicated RAM: thanks to GPUs, the processing power for floating point calculations has exploded in the last ten years, and the use of GPUs in many different fields, from scientific research to economics and so on, gave birth to General Purpose computing on GPUs (GPGPU). At the present time, GPUs are no longer created with the sole purpose of building 3D animation, but specifically for big data computations, and GPUs for data centers dedicated to massive scientific calculations are an established reality. Within this context, with the creation of the CUDA [28] and OpenCL [29] frameworks, software development of algorithms for scientific research that is able to exploit the computation power of this new technology is finally affordable.

TensorFlow
TensorFlow [20] is a framework for high-level GPGPU programming that works with a symbolic paradigm and a syntax similar to high-level scientific programs/languages, such as Matlab [30], or numerical Python libraries like Numpy [31]. Despite having been originally developed for machine learning and neural networks, TensorFlow works well for a wide variety of purposes and, specifically, for scientific data analysis. The main reasons are that, apart from the neural network dedicated part, it has many functions for numerical computations, statistics and tensorial operations, which are developed to run very efficiently on GPUs with a high scalability for multi-GPU systems.
TensorFlow uses the dataflow paradigm, where a program is modeled as a graph of operations where the data flow through. The central units of data in TensorFlow are N-dimensional arrays called tensors. Operations on tensors are represented by nodes in a graph, while the edges are the input/output tensors, which link the operations of the flow.
A simple example of the performance of TensorFlow on a single GPU, compared to Numpy on a single multi-thread CPU system, is shown in Figure 5, where the time execution of the random generation of two matrices with increasing size, and a matricial product between them, has been plotted. Taking into account that the used devices are not of the same class of performance, this example shows that, with well parallelizable algorithms, the improvement in performance using the GPU high-level TensorFlow functions can be up to one order of magnitude in terms of computation time.
All tests shown in this paper have been performed on the GPUs of the Cineca Mar-coni100 cluster [32] and on the CPUs of the LIGO-Caltech CIT cluster [33] (we refer readers to the institutional web pages for details of the hardware specifications). The test is based on the generation of two random 2-rank arrays and a matrix multiplication between them. The green and orange dots show the benchmark conducted, respectively, with 64 and 32 bit data with TensorFlow, the blue ones come from a test with 64 bit data with Numpy. The test with the CPUs has been performed keeping the multithreading enabled as by default for Numpy.

FrequencyHough on TensorFlow
The standard FrequencyHough algorithm [7] is written in Matlab and is based on the SNAG toolbox [34]. The new code has been written in Python with TensorFlow APIs up to version 13.1, using CUDA libraries up to 10.0 1 . The first step of the GPU porting has been to write a fully vectorized version of the code. This is a crucial step with high-level languages because, using a library with functions that are well developed and compiled in a low level language, the code will often be faster than a custom function with similar instructions or, even worse, loop cycles.
The greatest challenge in the GPU vectorization of the code has been the parallelization over the spindown values. Every spindown row in the FrequencyHough map algorithm is created independently. To limit the memory usage, we used a TensorFlow built-in function to map a single spindown iteration along the full search interval in a parallel and efficient way, rather than running a loop 2 . The integration part is instead intrinsically sequential, so it becomes rapidly the most inefficient part of the code when the frequency resolution is enhanced 3 .
Another delicate part of the analysis is the candidate selection. Once the Frequency-Hough map for the selected sky positions has been generated, usually the matrix is split along the frequency dimension in an arbitrary number of vertical stripes, where the local maxima are selected [7]. The number of frequency stripes depends on how many candidates the search will produce; usually that number is fixed on 1 candidate for every 0.05 Hz.
We note that, in terms of programming languages, this can create some issues: if in a stripe there were more candidates with the same number count value but at different spindown values, the algorithm would select only the first occurrence, discarding the others. To bypass this technical limitation, we split the whole spindown range into different spindown sub-intervals and, for each of them, the candidate selection routine was run independently.
Thanks to the vectorization method used for the generation of the Hough map, with TensorFlow we can also naturally parallelize the candidate selection over the spindown sub-intervals, rather than running it sequentially. In this way, since increasing the number of sub-intervals makes the increase in computation time negligible, we can select more candidates, thus improving the search sensitivity.

Benchmarks
The GPU FrequencyHough code can run on devices that have enough memory and computational power like those built for big data centers, with a speed-up that can reach 1-2 orders of magnitude with respect to the standard code of [7], using the same input and parameter space. In Figure 6, we present an estimation of the running time of the GPU FrequencyHough code compared to the CPU version that runs on a single core. The bar plots for the different T FFT used in the analysis (which imply different sizes of the input peakmap) show that the FrequencyHough map computation-which was the dominating part of the execution time of the whole pipeline-thanks to the GPU function becomes lighter than the weighting process of the peakmaps, which is still running on CPUs, using Numpy functions with the multithreading enabled.
The porting was carried out and tested in order to return the same set of candidates of the original code. The peakmaps used for the benchmarks were generated with a time duration of 12 months; between 10 Hz and 128 Hz we have a peakmap for every 1 Hz, while for the other frequencies instead they cover a 5-Hz band. The part where TensorFlow is involved has been split into two pieces: the graph building and variables initialization (orange) and the graph execution (red). The GPU code, after the TensorFlow graph is created, computes the FrequencyHough transform and the candidate selection at once, so it is not possible to split the two steps without the introduction of an overhead caused by the fact that the graph is generated and run in two steps. We remark that the CPU code can run only on a single thread. Hence, in the right plot we show the performance in terms of computing times using a single CPU core.
The spindown range is the same for each frequency band: [−10 −9 , 10 −10 ] Hz/s. T obs being the observation time, the spindown resolution is defined as ∆ḟ := ∆ f /T obs . With this definition, we have, respectively, for the four datasets (see Table 1) the following number of spindown bins: 2566, 1284, 644, 322.
The devices used are an Intel Xeon E5-2650 v4 CPU with 24 2.20 GHz execution threads and a NVIDIA Volta V100 with 5120 cores at 1500 MHz and 16 GB of memory. An example of a FrequencyHough transform applied to a hardware injection in the data from the O2 run of the LIGO Hanford detector is shown in Figure 7.  To provide better insight into how the performance of the code works with different sizes of the input time-frequency space and the frequency-spindown parameter space, we show in Figure 8 the results of another set of tests on the code efficiency: the parallelism efficiency of a GPU is well exploited when its memory and core architecture are filled as much as possible with parallelizable instructions.Due to this, the longer a given observation run is, the more efficient is the search with the new GPU code with respect to the older one. To provide an estimate of the different performances for a dataset spanning 1 year of observing time, 5 million CPU core hours are needed to complete a full all-sky analysis, while the pipeline equipped with the GPU FrequencyHough transform will take only~130 thousand single-GPU hours.

Conclusions
In this work, we have shown the usefulness of GPGPU by developing analysis algorithms for the search for CWs. By parallelizing the FrequencyHough transform computation on GPUs, for a single Doppler corrected peakmap, we obtained a speed-up of the analysis by a factor of 10 in comparison to the original code. As seen in Figure 8, we also showed that by improving the degree of parallelism, that is, analyzing more data and a larger parameter space at once, the efficiency of the analysis can be increased further, by an amount that depends strongly on the devices used, but roughly by another~50%.
With the computational speed of the GPUs, other portions of the pipeline that run on CPUs become the bottleneck of the pipeline, and should be ported on GPUs as well. By updating the code to the newer versions of TensorFlow, the time that TensorFlow takes to generate and initialize the graph is also expected to be reduced.
With the better performances granted to a well developed GPGPU code, one can consider adding higher order spindown parameters and expanding the parameter space in order to increase the probability of a detection. To this purpose, a few arrangements and improvements could be made, such as:

•
Extend the code to run on multi-GPU systems, exploiting the larger memory to load and process larger files at once; • Develop a scalable big data input/output pipeline, which can work out-of-memory, using appropriate modern file formats and libraries, trying to balance serialization and parallelism.
The field of big data computation with GPUs is fast evolving, with new hardware architectures that raise the available computational power, and an increasing number of frameworks, which allow us to develop codes with a high variety of purposes. Due to this, we expect that an increasing amount of scientific analysis in the next years will use such devices and frameworks. Thanks to the availability of clusters made by thousands of data center GPUs, it has become crucial to have codes that can efficiently exploit the computational power of these devices.
Developing GPGPU codes for CW searches is thus of primary importance. An update to TensorFlow 2.x with CUDA 11.x is planned in the near future. 2 The function used is map_fn: it applies a function along elements of a tensor, parallelizing the instructions over the GPUs. 3 For this part, TensorFlow APIs have a cumulative sum function called cumsum.