Testing General Relativity with Gravitational Waves: An Overview

The detections of gravitational-wave (GW) signals from compact binary coalescence by ground-based detectors have opened up the era of GW astronomy. These observations provide opportunities to test Einstein's general theory of relativity at the strong-field regime. Here we give a brief overview of the various GW-based tests of General Relativity (GR) performed by the LIGO-Virgo collaboration on the detected GW events to date. After providing details for the tests performed in four categories, we discuss the prospects for each test in the context of future GW detectors. The four categories of tests include the consistency tests, parametrized tests for GW generation and propagation, tests for the merger remnant properties, and GW polarization tests.


Introduction
The binary evolution in General Relativity (GR) is described differently than in Newtonian gravity (NG). In GR, the binary orbit shrinks due to the emission of energy: angular and linear momenta through gravitational waves (GWs) [1][2][3][4]. Whereas in NG, there is no concept of radiation reaction and the orbital period is constant over time. Even though Albert Einstein predicted the existence of GWs more than a century ago, their detection remained a puzzle due to their weak interaction with matter. The indirect evidence of GWs came from the decades-long observations of orbital decay of a binary pulsar by Russell Alan Hulse and Joseph Taylor [5][6][7][8]. They found that the observed orbital decay of the binary system, known as PSR B1913+16(PSR J1915+1606, or PSR 1913+16), due to the emission of GWs, is consistent with the predictions of GR. That is, the rate of decay of the orbital period (P orb ) from observationsṖ orb ∼ 10 −14 -10 −12 agreed to the GR predicted rate obtained from analytical calculations based on GR, leading the team to win the Physics Nobel Prize in 1993.
Among the many significant contributions to fundamental physics and astrophysics, GW observations test GR at the relativistic, strong-field regime. A set of testing GR analyses conducted by the LIGO-Virgo Scientific Collaboration (LVC) on the GW150914 event established that GW150914 is consistent with a binary black hole (BBH) signal described in GR [9]. This set of tests include consistency tests, parameterized tests, tests to confirm the nondispersive nature of the radiation, and tests on the remnant properties. Consistency tests check for the agreement of observed data with the signal predicted from GR. Parameterized tests introduce model-agnostic parametric GR deviations in the waveform and constrain those from the data to put statistical bounds on these parameters. The list of the GW events was extended further with more binary merger detections by LIGO and Virgo detectors' first, second, and third observation runs. All the tests applied on GW150914 were also performed on these events with appropriate modifications to the above-mentioned tests, including additional tests. Here we will go through them in detail.
The GW-based tests of GR on the BBH coalescence events detected by LIGO and Virgo until 1 October 2019 are available in Reference [30]. A generic binary system evolves from its early inspiral weak-field regime to a highly relativistic merger and then the final ringdown stage. In the case of BBHs in GR, the object formed after they merge (i.e., the merger remnant) is another black hole (BH). On the other hand, for non-BH binaries [31,32], 1 the merger remnant is not necessarily a BH but could be another compact star depending upon the properties of the binary.
In a model agnostic way, there were four broad classes of tests conducted in Reference [30]. These tests aim to look at different regimes of binary evolution or to the full inspiral-mergerringdown signals. The first set consists of the residual analysis and the inspiral-mergerringdown consistency test. Both of these tests check the consistency of GR predictions with the observed data (as in the case of Reference [9]). The second category of tests is the parametrized tests for GW generation and propagation. Here one sets statistical bounds on the parametrized deviations from GR, assuming GR is the correct theory of gravity, employing GR waveform models with parametric deviations present. On the third category of tests, one looks for any violation of GR by analyzing the merger remnant properties. The GW polarization tests look for extra polarization modes present in the data and comprise the fourth set of tests. This analysis provides statistical evidence for alternative theories of gravity that predict vector and scalar polarization modes along with the tensor modes. An overview of these tests is provided in Figure 1. The probability of there being astrophysical origin of a candidate event plays an important role in determining whether that event is considered for the testing GR analyses or not. Usually, a higher threshold is assumed so that the events analyzed have higher chances of being of astrophysical origin. For instance, in Reference [30], events satisfying a false alarm rate (FAR) less than 10 −3 per year are chosen to analyze. Once the set of events is chosen based on the detection significance, additional criteria are applied depending upon the strategies followed by each test. Bayesian formalism-based techniques are employed to get meaningful bounds from each test. The pipelines widely used for this purpose are, LALInference [48] available in the LIGO Scientific Collaboration's algorithm library suite (LALSuite) [49], Bayeswave [50,51], parallel bilby (pBilby) [52][53][54], PYRING [55,56] and Bantam [57]. Reference [30] demonstrated the possibility of performing tests of GR on binary black hole (BBH) events, employing mainly two different waveform models, IMRPhenomPv2 (phenomenological waveform model for a precessing BBH system) [58][59][60] and SEOBNRv4_ROM (reduced-order effective one body (EOB, waveform model for a non-precessing binary system) [61].

Importance of signal assumption
There is a significant increase in the detection rate as the detectors improve their sensitivities through first, second, and third observing runs of LIGO-Virgo detectors. Interestingly, it is possible to infer information from multiple events by combining the data from each event. The combined bounds help to improve our understanding of binary population properties in general. As we combine results, the statistical uncertainty that arises due to instrumental noise lessens. Notice that this instrumental noise does not include the uncertainty contributions from the systematic errors of gravitational waveform modeling [62][63][64][65][66][67][68], calibration of the detectors, and power spectral density (PSD) estimation uncertainties [69][70][71][72][73]. Sometimes, systematic errors can dominate the statistical errors and lead to false identification of GR violations, which we do not discuss here. Previous studies in References [9,68,74] discuss two different statistical approaches to estimate combined information on GR test parameters from multiple events. The first one (also called restricted or simple combining) assumes equal GR deviations across all the events independent of the physical parameters characterizing the binary, and this technique is well described and demonstrated for GWTC-1 events [74]. This assumption is generally incorrect as there are cases when the waveform model can arbitrarily deviate from GR depending upon the binary source properties. The second method, the hierarchical combining strategy, tries to overcome the issue of universality assumption by relaxing it. In this case, instead of assuming uniform GR deviation for all events, a Gaussian distribution models the non-GR parameter. The statistical properties (mean (µ) and standard deviation (σ)) of this distribution are obtained from the data itself, and the estimates are different for different models of gravity. We call the parameters µ and σ hyperparameters. If GR is the correct theory, the Gaussian distribution should center around zero. The astrophysical population properties of sources play a crucial role in estimating these statistical quantities [30].
The organization of this article is as follows. Section 2 is dedicated to various tests of GR performed already on the GW events observed by the LIGO-Virgo detectors, including tests of consistency with GR (Section 2.1), parametrized tests (Section 2.2), tests based on the merger remnant properties (Section 2.3), and tests for GW polarizations (Section 2.4). We conclude with a summary section, Section 3.

Tests of Consistency with General Relativity
Consistency tests do not need to assume any particular alternative theory to GR, nor do they test specific deviations. They address the simple question: can the observed data be fully explained by assuming GR? Or put differently, is there any statistically significant "trace" in the data that is unlikely to be explained as either part of an astrophysical signal (assuming GR) or the terrestrial instruments' noise? So far two different kinds of consistency tests have been performed on the detected GW events [30]: the residual analysis and the Inspiral-merger-ringdown (IMR) consistency test.

Residual Test
The residual test checks for signatures left in the data after subtracting the best-fit GR template. If GR is the correct theory and we have subtracted the astrophysical signal completely, the residuals in each detector should be consistent with instrumental noises. The best fit model of the astrophysical signal is obtained by a detailed parameter estimation analysis using a stochastic sampling of the signal's parameter space. Typically, the best-fit parameters are taken to be those that maximize the likelihood of observing the recorded data assuming this signal is present in the data. This set of parameters is not necessarily the one that describes the most probable source configuration a posteriori, as the likelihood alone does not take prior assumptions into account. Nevertheless, the maximum-likelihood parameters are those that minimize the difference between the data d i and template h i by definition of the likelihood Λ, Here, the index i enumerates the different detectors; h i is the signal projected onto each detector, respectively, and the norm · 2 = ·, · is induced by the following inner product The detector noise spectral density S n ( f ) acts as a weight in an integral over the Fourier transformed functionsã andb; * denotes complex conjugation.
By this construction, the residuals d i − h i are small in the sense of Equation (1), but they could still contain a coherent signal that cannot be captured by the GR model. To look for such a potential signal, the method employed in References [9,30,68] is Bayeswave [50]: a transient search algorithm looking for coherent excess power in the (residual) detector data. This part of the analysis is model-independent. Bayeswave uses Morlet-Gabor wavelets to look for coherent, elliptically polarized signatures that rely on no further model assumption. In addition to generic signals, it employs models for stationary and non-stationary noise simultaneously.
To quantitatively explain the results, we require various definitions. First, the optimal network signal-to-noise-ratio (SNR) of a signal h is derived from its norm, using the inner product Equation (2). If we take h as the best-fit GR template, we obtain SNR GR . The residual modelled by Bayeswave is not a single signal. The uncertainty in what constitutes a coherent residual signal and what is instrument noise leads Bayeswave to provide a discretized probability distribution in the parameter space of wavelets. However, each point in this distribution corresponds to a residual signal that has a well-defined SNR. Consequently, we can map the probability distribution of residual signals to a probability of their optimal network SNRs.
As is standard, we characterize the probability distribution by credible intervals that enclose a certain amount of probability (we use this quantity more frequently in this article). Specifically, we report the SNR of the residual at which the cumulative probability distribution is 90%. Put differently, we infer a 90% probability that the residual signal after subtracting the best-fit GR template has an optimal network SNR ≤ SNR 90 .
The left panel of Figure 2 shows SNR 90 as a function of the best-fit SNR GR for the observed binary mergers from O1, O2, and O3a. The SNR of the GR signals ranges between 9.24 (GW151012) and 25.71 (GW190521). The 90% upper credible bound of the residual SNR ranges between 4.88 (GW190727_060333) and 9.24 (GW170818). No clear correlation is visible between the two quantities. If our GR models would consistently be unable to capture an ubiquitous deviation from GR, we might expect that stronger signals correlate with stronger residuals. The current data shows no indications of such correlations. To systematically assess if the residual SNRs are consistent with the detector noise, we can define a p-value under the hypothesis that the residual is consistent with detector noise (this is the null hypothesis). This p-value is estimated by running identical Bayeswave analyses on a large number of noise-only data segments around, but not including, each event. The p-value then provides the probability of pure noise producing an SNR n 90 greater than or equal to the residual SNR found after subtracting the best-fit GR template, SNR 90 . That implies, p-value:= P(SNR n 90 ≥ SNR 90 ). A large p-value indicates that there is a high chance that the residual power originates from the instrumental noise. Small p-values, on the other hand, indicate that it is less likely for noise alone to yield such high values of residual SNR.
The p-values for all events considered here are included in Figure 2 as a color scale. They also span a large range between 0.07 (GW90421_213856) and 0.97 (GW190727_060333). This is to be expected in repeated uncorrelated experiments. In fact, assuming the residuals are pure noise, the p-values of the population should be uniformly distributed between [0, 1]. As further discussed and quantified in [30], the p-values found for O1, O2, and O3a events are consistent with this expectation.
As a final interpretation of the residuals, one can ask: how well does the GR model fit the signal in the data? Obviously, if the model would be perfect, and we could unambiguously identify the coherent signal in the data, then the agreement between the model and data should be perfect, too, assuming GR is the correct theory. In reality, our GR models are very accurate, but not perfect, and we only have a probabilistic measure of the signal in the data. Therefore, we can only expect to obtain a lower bound on the fitting factor FF between the model h and the signal s by calculating Here we used that the coherent signal s is the sum of the GR model h and any residual s r that is perpendicular to h. The latter assumption is justified because h was chosen by maximizing the agreement between the data and the model. FF 90 = 1 would indicate perfect agreement.
We plot 1 − FF 90 on the right panel of Figure 2 for the observations we considered so far. They show more clearly a correlation with SNR GR . Strong signals with large values of SNR GR tend to yield higher fitting factors. This is not because the models describe the actual signal better for louder events. It is because our confidence in how well the model agrees with the actual signal increases with increasing SNR GR .
Another notable fact about the fitting factors is that they are larger than related quantities one often finds in the GW literature. For example, discrete template banks for GW searches are often constructed such that fitting factors of at least 0.97 are guaranteed between any signal and the closest template in the bank [75]. Waveform models for BBHs are commonly tuned to FF 10 −3 between the most accurate predictions and the full model [76,77]. Parameter estimation poses strict demands on waveform accuracy of the order of SNR −2 GR [78,79] (i.e., waveform differences of < 10 −2 ). Is that level of accuracy necessary, given that for most observations, we cannot put stronger constraints on the fitting factor than FF 90 ∼ O(10 −1 )?
The answer is, of course, yes! While we cannot be sure about the true signal for individual events, it is worth emphasizing that the Bayeswave analysis has great freedom and a large parameter space to identify virtually any residual signature as a potential coherent signal. Much more restricted measurements that only look for specific, lower-dimensional deviations are sensitive to significantly smaller signal differences, because only those signal differences that are consistent with both the assumed variation and the data are considered. Therefore, the residuals test is a very generic baseline test for anything that cannot be modeled with GR. It is, however, much less sensitive to specific deviations that can be tested more accurately in a dedicated test (see the rest of this paper). A more detailed discussion of the relation between various tests of GR can be found in [80].

Inspiral-Merger-Ringdown Consistency Test
In GR, the time evolution of BBH mergers is uniquely determined. Hence the final mass and spin of the remnant BH are uniquely determined from the initial mass and spin parameters of Kerr BHs. The inspiral-merger-ringdown consistency test (IMRCT) tests the consistency of inspiral and merger-ringdown parts of the signal by comparing two independent estimates of binary parameters. More specifically, the binary's final mass and spin parameters are measured separately from both low-and high-frequency parts of the signals and then compared to the two measurements to check their agreement.
Given the final mass (M f ) and spin (χ f ), one can estimate the spin-dependent innermost stable circular frequency 2 for a Kerr BH ( f cut ) [81][82][83]. The full BBH signal can be divided into two parts using this frequency, the low-frequency part and the high-frequency part 3 . By restricting the noise-weighted integral in the likelihood calculation to frequencies below ( f < f cut ) and above this frequency cutoff ( f > f cut ), the binary parameters are estimated using stochastic sampling algorithms based on Bayesian inference. The merger remnant properties are calculated by averaging NR-calibrated final state fits given the posterior median values [81][82][83] from the two independent mass-spin estimates above. This calculation assumes an aligned-spin binary system. If the data is consistent with GR, both estimates should agree [12,20,84,85].
The frequency, f cut , roughly divides the signal into inspiral and merger-ringdown (postinspiral) regimes. To calculate f cut , the binary parameters inferred from the full signal are used. As the test relies on independent parameter inference from the low and high frequency parts of the full signal, one requires enough SNR in both these regimes of the signal. For the selected events, a detailed parameter estimation analysis is performed in Reference [30] focusing on the mass-spin parameters. If M insp f and M post−insp f denote the final mass estimates obtained from low and high frequency parts of the signal, we can define a dimension-less quantity that measures the fractional deviation from the final-mass estimate as, where subscript, 'f' denotes merger remnant parameters, and 'insp' and 'post-insp' correspond to the estimates coming from the low and high-frequency regimes, respectively.M f andχ f are the symmetric combinations of M f and χ f estimates from inspiral and post-inspiral regimes. A similar expression can be written for dimension-less spin parameter, In principle, these fractional deviations vanish if the data is consistent with GR 4 . However, one must perform a detailed parameter estimation analysis and estimate the statistical confidence that the GR deviation vanishes. This is illustrated in Figure 3 and see Reference [86] for more details of the method and demonstration on simulated binary signals. See References [87][88][89] for studies projecting the possibilities of IMR consistency tests from combining information from current and future detectors. Especially, Reference [88] found that the multi-band observations can improve the constraints by a factor of 1.7.  We start with the full BBH signal. Analyze the signal by restricting the noise-weighted integral in the likelihood calculation to frequencies below and above the cut-off frequency, f cut . From the two independent mass-spin estimates obtained above, the merger remnant properties are calculated by averaging NR-calibrated final state fits. If the data is consistent with GR, the deviation parameters defined in Equations (6) and (7) will be zero, assuming that the waveform model employed accurately models a BBH evolution in GR. See Reference [86] for more details and examples.
In Reference [30], from the list of events satisfying the detection criteria, based on the false alarm probability of each event and the requirement of enough SNR in both inspiral and post-inspiral regimes, posteriors distributions on δM f and δχ f are obtained assuming uniform priors on these parameters. In terms of the two-dimensional GR quantile, Q GR -is the fraction of the posteriors enclosed by the iso-probability contours that contain the GR value. Reference [30] reports GW190814 as the most consistent event with the quantile Q GR = 99.9%.
From the hierarchical combining method, the hyperparameters describing the Gaussian distribution are estimated to be (µ, σ) = (0.02 +0.11 −0.09 , < 0.17) for δM f and (µ, σ) = (−0.06 +0. 15 −0.16 , < 0.34) for δχ f within the 90% confidence interval in Reference [30]. The details can be found in Figure 4. Assuming that the fractional deviations take the same value for all events, a less-conservative combined 90% confidence interval of −0.08 obtained in Reference [30]. This analysis employed IMRPhenomPv2 or IMRPhenomPv3HM waveform models depending upon the information about the higher-mode content present in the binary signal [90] and so far the analysis finds all events to be consistent with GR. Probability density

Parametrized Tests of GR Based on Generation and Propagation of GWs
The parametrized tests are designed to capture any deviations from GR in the generation and propagation of GWs. This is achieved by introducing model-independent parametric variations in the gravitational waveform models and constrain those from the observed data. If GR is the correct theory, parametric deviations vanish, and the statistical bounds can be used to put constraints on the alternative theory models.

Constraining the Parametrized Deviations from General Relativistic Inspiral-Merger-Ringdown Coefficients
Any generic deviation from GR may modify the binary dynamics and its time evolution. This leads to measurable modifications to the equation of motion through the energy and angular momentum of the source and the flux. However, the inspiral-merger-ringdown dynamics are uniquely determined and well studied in GR through various techniques such as post-Newtonian theory, numerical relativity, and BH perturbation theory once we fix the intrinsic parameters of the binary system [58,[91][92][93][94][95][96][97][98][99][100].
The inspiral coefficients are modeled analytically using post-Newtonian (PN) theory, which finds perturbative solutions to the binary evolution in terms of a velocity parameter, v/c, in the slow-motion limit (v << c, v is the PN parameter and c is the velocity of light). Parametrized tests based on inspiral coefficients are investigated in detail [98,[100][101][102][103][104][105][106][107][108][109][110][111][112][113][114][115][116] and also demonstrated the applicability of the test using Bayesian framework [9,20,[117][118][119] in the past. Moreover, the possibility of constraining these parameters employing multiband observations has been studied in References [87][88][89]120]. For an inspiralling compact binary system, the GW waveform can be schematically represented in the frequency domain as, where A( f ) denotes the amplitude and φ( f ) is the phase of the signal. Now, we introduce parametric deviations of the form, φ i ( f ) → (1 + ϕ i )φ i . If GR is the correct theory,φ i vanishes for the Nth PN order, where i = N/2 denotes the Nth PN order. The parametrized tests for post-Newtonian coefficients (pPN analysis) measure these deviation parameters and are one of the main tests of GR analyses performed for the detected binary signals so far.
For a generic binary system, one needs to put bounds on the inspiral-merger-ringdown parametrized deviation coefficients separately. In this case, a relative deviation is introduced to each coefficient appearing through the inspiral-merger-ringdown regimes as, The set of free parameters, δp i include the inspiral coefficients {ϕ i } and post-inspiral coefficients {α i , β i } [62,106,121]. It is not plausible to represent the post-inspiral coefficients analytically and they are obtained by numerical fits.
The IMRPhenomPv2 waveform model [58,59,122] describes a precessing binary system in frequency domain with inspiral coeffiecients determined by PN theory and intermediate, and merger-ringdown regions by finding appropriate numerical fits. The transition frequency between the inspiral to merger-ringdown is defined as GM(1 + z) f PAR c /c 3 = 0.018 (M is the total mass of the binary system and z is the redshift to the binary). The results for parametrized tests for post-Newtonian coefficients reported in Reference [30] relied on the IMRPhenomPv2 waveform model which allows for parametrized deviations of the phenomenological coefficients describing the inspiral, intermediate β i = {β 1 , β 2 }, and merger-ringdown There is also another equally accepted method to perform the same analysis, which is not based on any particular waveform model but rather on theory-agnostic modifications applied to the inspiral coefficients of any waveform model. These low-frequency modifications (inspiral-only modifications {ϕ i }) are tapered to zero at high frequencies. That is, as the frequency reaches post-inspiral regions these modifications vanish and the signal agrees to a BBH signal described in GR. This test is carried out employing SEOBNRv4_ROM waveform model in References [61,118,123]. One of the main differences between the two approaches is that the deviations are applied to only the non-spinning coefficients for the first method. Still, for the second method, the aligned-spin waveform coefficients are modified. Both approaches provide consistent results when we compare them.
Denoting {ϕ i } as the deviation from N = i/2 PN order, the list of inspiral coefficients we can put bounds from the data are, {ϕ −2 , ϕ 0 , ϕ 1 , ϕ 2 , ϕ 3 , ϕ 4 , δϕ 5 , ϕ 6 , ϕ 7 }. (10) This means the coefficients up to 3.5PN order (i = 7) are available, including the logarithmic terms at 2.5 and 3PN orders. Due to its degeneracy with the coalescence phase, one cannot constrain the coefficient at 2.5 PN (term having no logarithmic dependence). Notice also that ϕ −2 is zero in GR and, so as ϕ 1 , it is introduced to account for specific alternative theories of gravity, especially those that predict dipolar radiation. For the two coefficients, ϕ 1 and ϕ −2 , we have the absolute deviations while all other parameters provide relative deviations from the respective GR coefficients. The absolute deviations are the differences between the true/actual value and the measured value, whereas the relative deviations are the ratio of absolute deviations to the true/actual value.
Among all the coefficients listed above, the best combined bound is obtained for the Newtonian coefficient in Reference [30], |ϕ 0 | ≤ 4.4 × 10 −2 (neglecting the −1PN coefficient, ϕ −2 ). Notice that this bound is weaker than the bounds from the double pulsar measurement by a factor of ∼3. Reference [30] reports that the posterior on δp i is consistent with the GR prediction within the 90% confidence interval for all the events considered. From the hierarchical analysis, the tightest (loosest) bound obtained is ϕ −2 = 0.97 +4.62 −4.07 × 10 −3 (ϕ 6 = 0.42 +1.67 −1.50 ). For both the circumstances, the GR hypothesis is preferred with quantiles Q GR = 68% (Q GR = 69%, which is close to the median values for both the cases. Figure 5 shows the combined posterior distributions obtained from the GWTC-2 events considering both the hierarchical method and a restricted method, where the deviation parameters do not allow for variance between events. Results from both the studies described above are found to be consistent with GR.

Tests of BBH Nature from Spin-Induced Quadrupole Moment Measurements
Spin-induced multipole moments arise due to the spinning motion of the compact object and take unique values for BHs given mass and spin. For the BBH signals, these effects are included in the post-Newtonian modeling, and they appear along with the spin-spin terms and can be schematically represented as, here κ is the spin-induced quadrupole moment coefficient, m is the mass, and χ is the dimensionless spin parameter. For BHs, κ BH = 1 from the no-hair conjecture [124][125][126] and for any other compact objects, the value may vary depending upon the properties of the star. Through numerical relativity simulations of slowly spinning neutron stars, it is found that the value of κ varies between κ NS = 2 and 14 [127][128][129]. On the other hand, for more exotic stars like boson stars, the value of κ can be even larger and found to vary between ∼10 and 100 [40,130]. It has been shown that one can introduce parametrized deviations of the form, κ = (1 + δκ) and put bound on δκ [131][132][133][134][135]. An inspiralling binary system is parametrized by two such parameters, corresponding to both the binary components, δκ 1 and δκ 2 . Simultaneous measurement of both these parameters will end up giving weak constraints on either parameter, hence it is proposed to measure δκ s = 0.5(δκ 1 + δκ 2 ) keeping δκ a = 0.5(δκ 1 − δκ 2 ) = 0. This is a safe assumption if we are testing the BBH nature of the detected signal [136,137]. The analysis performed on the first, second, and third observing runs of LIGO-Virgo detectors provided good constraints on events with non-zero spins which include GW151226, GW190412, GW190720_000836, and GW190728_064510. Employing IMRPhenomPv2 waveform model [58,59,122] and assuming prior distribution on δκ s ranging uniform between [−500, 500] [30]. These events, where the posteriors are very different from the prior knowledge, are highlighted in Figure 6. With the restricted assumption that δκ s take the same value for all events, Reference [30] reported a combined bound on δκ s within the 90% confidence interval as δκ s = −15.2 +16.9 −19.0 . Using the hierarchical analysis, the hyperparameters are constrained to µ = −24.6 +30.7 −35.3 and σ < 52.7 with δκ s = −23.2 +52.2 −62.4 and which is again consistent with the null (µ = σ = 0) hypothesis at 90% confidence interval. The hypothesis stating that the population contains all BBHs is favored by the population containing all the non-BBH hypothesis by a combined Bayes factor of 11.7. The analysis found that the data are consistent with the BBH hypothesis.

Tests of Gravity from GW Propagation
GW propagation in GR is non-dispersive and described by the dispersion relation, Equivalently, the velocity of propagation of GWs in GR is independent of the frequency of the radiation. As a consequence, the graviton is massless with a corresponding infinite Compton wavelength. There are alternative theories of gravity predicting GWs with dispersion where the local Lorentz invariance is not respected [138].
For a generic theory of gravity, the GR dispersion relation may require modifications and the following equation can account for such propagation effects [139][140][141][142], We can re-parametrize the gravitational waveforms so that they also account for the propagation effects given in Equation (13). In Equation (13), A is the dispersion amplitude and has dimension of [Energy] 2−α , and α is a dimensionless constant. These paramtetrized modifications can be constrained from the data and these bounds can be translated into constraints on different alternative gravity models. For example, α = 0 and A > 0 correspond to massive graviton theories [141], α = 2.5 corresponds to multifractional spacetime [142], and α = 3 corresponds to double special relativity [140], etc.
As shown in References [67,139], one can use GW observations to get constraints on the modified dispersion parameters. The first bound on the Compton wavelength (which has a finite value for any massive graviton theory) from GW observations of a BBH signal is λ g > 10 13 km [9] and this has been extended to more generic cases in the subsequent analyses [20,22,74]. This bound on the Compton wavelength translates to a graviton mass, m g ≤ 5 × 10 −23 eV/c 2 , and this is a stronger bound compared to the solar system constraints [74]. From the GWTC-2 data [30], a factor of 2.7 improvements is observed on this bound, and the graviton mass bound correspondingly changes to m g ≤ 1.76 × 10 −23 eV/c 2 with 90% credibility.
Note that the past studies have investigated the possibility of dispersion of GWs described by GR due to specific physical effects. For example, the nonlinear interaction between charged particles and GWs may lead to dispersion of GWs when GWs pass through astrophysical plasma in the presence of magnetic fields [143][144][145][146][147][148][149][150][151][152].

Tests Based on the Merger Remnant Properties
The merger remnant of a BBH system emits GWs to settle down to the stationary state, and this distinct signal from stellar-mass BBHs can be measured using the current GW detectors. Consequently, many tests for remnant nature have been proposed and performed on the detected GW events. We briefly discuss GW tests based on merger remnant properties here.

No-Hair Theorem Based Tests from the Quasi-Normal Mode Ringdown Radiation (BH Spectroscopy)
According to GR, the remnant formed after a BBH coalescence is a perturbed Kerr BH, and this BH attains the stationary state by emitting GWs. This damped sinusoidal signal (BH ringdown radiation) is characterized by quasi-normal-modes (QNMs) with frequency f and damping time τ. Both the damping time and frequency of this oscillation are determined by the mass and spin of the Kerr BH formed after the merger (M f and χ f ) [125,[153][154][155][156][157]. In GR, the ringdown waveform is a superposition of damped sinusoids and takes the form, where z is the cosmological redshift, and the (l, m, n) indices label the QNMs (( , m) are the angular multipoles, whereas n is the order of modes given ( , m). All the f mn and τ mn are determined by the final mass and spin of the binary system (this is called the final state conjecture). For a perturbed Kerr BH, the damping time and frequency of each quasi-normalmode can be calculated from BH perturbation theory as a function of its final mass and spin [158][159][160][161]. Assuming that the mergers we observe are BBHs, from the independent M f and χ f post-merger measurements we can test the final state conjecture (commonly known as the BH spectroscopy) [55,56,83,[162][163][164][165][166][167][168][169]. The complex amplitude A mn is a measure of the mode excitation and the phase of these modes at a reference time [170][171][172].
PYRING is a toolkit to perform BH spectroscopy which is completely implemented in the time domain [55,56]. As both templates and the likelihood are modeled in the time domain, spectral leakage is reduced [173]. Mainly assumed template models for this study are, Kerr 220 ( = |m| = 2, n = 0 contributions of Equation (14)), Kerr 221 ( = |m| = 2, n = 0, 1 contributions of Equation (14)), and Kerr HM (all fundamental prograde modes with ≤ 4, n = 0 contributions of Equation (14) and also taking into account mode-mixing [172]). The frequencies and damping times are predicted in terms of final mass and spin for all these cases. The remnant quantities, M f and χ f , are estimated assuming uniform priors on these parameters. We do not consider the higher overtones (n > 1) as those are not expected to provide constraints with the current sensitivity of detectors.
Another equally established technique for the QNM analysis is the parametrized-SEOBNRv4HM (pSEOB) analysis, which employs a parametrized version of the EOB waveform model [174] accounting for aligned spins and higher modes [30,166,175]. The pSEOB analysis differs from PYRING in that it measures the ringdown frequency within a complete IMR waveform model framework using the full SNR of the signal. It is not dependent on a ringdown time definition. In this framework, one parameterizes the frequency and damping time of the = m = 2 by introducing a fractional deviation from the nominal GR prediction and constrains these fractional deviation parameters directly from the data. That is, f 220 = f GR 220 (1 + δf 220 ) and τ 220 = τ GR 220 (1 + δτ 220 ) [174], where f GR 220 and τ GR 220 are frequency and damping time if GR is the correct theory of gravity. From both these approaches, by performing detailed analyses on the GWTC-2 events, no indication of the presence of non-BH behavior was reported [30].

Testing the Nature of Merger Remnant from the Measurement of Late Ringdown Echoes
One can ask the question as to if the merger remnant is not a BH but instead an exotic compact object (ECO) with a light-ring and reflective surface, instead of an event horizon as in the case of BHs [176][177][178][179]. For these hypothetical cases, the GWs can be trapped in between the effective potential at the centre and the reflective surface, leading to the emission of GWs as a train of repeating pulses known as GW echoes. BHs produce no echo signal as there is no possibility of an out-going boundary condition at the BH event horizon. GW echoes are unique probes of any non-BH compact object formation (especially exotic objects like gravastars, fuss balls) after the binary coalescence [47,176,180,181]. For an illustration of GW echoes originating from a binary merger, we point out Figure 7 (also see Figure 2 of Reference [176] for the original GW150914 signal on top of the best-fit echo template from a template-based echo analysis). Here ∆t echo denotes the time-delay between two consecutive echoes; also, the horizontal line shows the time delay between binary merger and the first echo. (In Reference [176], the time domain template along with the best-fit echoes template for GW150914 from a template-based echo analysis is plotted.) In the template-based framework, the echo signal is modeled with five extra parameters, characterizing the echo: the relative amplitude of the echoes, the damping factor between each echo, the start time of ringdown, the time of the first echo concerning the merger, and the time delay between each echo (∆t echo in Figure 7). Reference [176] studies this method and subsequent discussions on GW150914 in detail. For the GWTC-2 events [30], assuming uniform priors on each of these echo parameters and employing the IMRPhenomPv2 waveform model (except for the case GW190521, where Sur7dq4, is a surrogate model for precessing BBH system directly interpolates the numerical relativity waveforms, is used), a Bayesian analysis is performed to investigate the evidence for echoes. Bayes factor B I MRE I MR (comparing the two hypotheses I MRE data best fits an echo model, and data best fits a model without echoes (I MR)) are computed for each event. The data did not show evidence for echoes [30], except for the event GW190915_235702, which showed the highest value for B I MRE I MR = 0.17 indicating a negligible evidence for echoes [30]. Reference [30] reports that the posteriors on the echoes parameters returned their prior distribution, pointing to a null detection of GW echoes.

Constraints on the Polarization States of GWs
Generic metric theories of gravity predict six independent degrees of freedom for the metric tensor, which can be identified as polarization states of GWs. More than the two tensor (spin 2) degrees of freedom allowed by GR, there is a possibility of two vectors (spin 1) degrees of freedom and two scalars (spin 0) degrees of freedom [141] in such cases. Out of these six modes, three of them are transverse, and three are longitudinal. The first constraints on the polarization states of the GWs from observations are detailed in [9]. Still, the results were uninformative as the data from the two LIGO instruments are not enough to constrain the extra polarization mode. This is possible only if there are data available from one another detector with another orientation. Hence the polarization test was first demonstrated with the three-detector event GW170814, and the improved results compared to Reference [9] are available in Reference [22]. In Reference [22] also the tensor modes hypothesis was favored over scalar and vector modes as shown in Reference [9].
A detailed analysis was performed in Reference [30] considering all GW events observed till the first half of the third observing run of LIGO/Virgo detectors. Reference [30] reports the highest (lowest) Bayes factor for GW190720_000836 (GW190503_185404) with log 10 B T V = 0.139, B T S = 0.138) (log 10 B T V = 0.074 and B T S = −0.072), here B T V and B T S represent the Bayes factors for full tensor versus full-vector and full-scalar hypotheses respectively. The B T S is slightly larger than B T V and this is explained by the intrinsic geometry of the LIGO-Virgo antenna patterns [182]. One should also notice that any of these results do not account for the possibility of mixed polarization. This topic has to be explored in the future when more detectors become operational.

Summary
This review article provides a brief overview of the tests of GR performed during the first three observing runs of the LIGO-Virgo detectors, including tests of consistency with GR (Section 2.1), parameterized tests (Section 2.2), tests based on the merger remnant properties (Section 2.3), and tests for GW polarizations (Section 2.4). Along with some technical details about each test, we also provide a short discussion pointing to the prospects of these various tests.
In this article, we only focused on signals which are consistent with BBHs. The detection of the first binary neutron star merger event opened up different possibilities of testing GR from combined electromagnetic and GW observations [23,118]. Many tests we detailed here may have overlaps or redundant information which is not accounted for here. Though the instrumental noise mainly dominates the current measurement uncertainties, we cannot exclude the possibility of any systematic bias arising due to un-modeled effects present in the waveform models. Another critical point is that the tests discussed here are all modelindependent tests (null tests). In other words, we are not assuming any alternative gravity theory models here, and every test is capturing the deviation from GR in a model-agnostic way. Funding: This research received no external funding.

Data Availability Statement:
This research has made use of data, software and/or web tools obtained from the Gravitational Wave Open Science Center (https://www.gw-openscience.org/ accessed on 8 December 2021), a service of the LIGO Laboratory, the LIGO Scientific Collaboration and the Virgo Collaboration. LIGO is funded by the U.S. National Science Foundation. Virgo is funded, through the European Gravitational Observatory (EGO), by the French Centre National de Recherche Scientifique (CNRS), the Italian Istituto Nazionale della Fisica Nucleare (INFN) and the Dutch Nikhef, with contributions by institutions from Belgium, Germany, Greece, Hungary, Ireland, Japan, Monaco, Poland, Portugal, Spain. The data presented in this study are openly available at https://doi.org/10.7935/903s-gx73 accessed on 8 December 2021.
Acknowledgments: This work was supported by the Max Planck Society's Independent Research Group Grant. We thank Ajit Mehta for carefully reading this article and providing comments. We are thankful to Angela Borchers Pascual and Anuradha Gupta for very useful comments and discussions. This document has LIGO preprint number LIGO-P2100349.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: Non-BH binaries include other compact objects like neutron stars and more exotic objects like boson stars [33][34][35][36][37][38][39][40][41], gravastars [42][43][44][45][46][47] etc. 2 Inner-most stable circular orbit of a Kerr BH is the smallest stable circular orbit in which a test particle can stably orbit around the BH. 3 Current analysis is taking into account for the dominant mode and neglecting any higher-mode contributions to the frequency evaluation.