Statistical Tools for Imaging Atmospheric Cherenkov Telescopes

The development of Imaging Atmospheric Cherenkov Telescopes (IACTs) unveiled the sky in the teraelectronvolt regime, initiating the so-called"TeV revolution", at the beginning of the new millennium. This revolution was also facilitated by the implementation and adaptation of statistical tools for analyzing the shower images collected by these telescopes and inferring the properties of the astrophysical sources that produce such events. Image reconstruction techniques, background discrimination, and signal-detection analyses are just a few of the pioneering studies applied in recent decades in the analysis of IACTs data. This (succinct) review has the intent of summarizing the most common statistical tools that are used for analyzing data collected with IACTs, focusing on their application in the full analysis chain, including references to existing literature for a deeper examination.


Introduction
Any scientific experiment would be incomplete if only the collected data were reported. A statistical analysis is needed in order to interpret the data and to draw conclusions from the experiment. This is the case for experiments that are imaging the Cherenkov light emitted by a cascade of secondary particles produced by the interaction of gamma rays and cosmic rays within the Earth's atmosphere. They are called Imaging Atmospheric Cherenkov telescopes and popular examples are MAGIC [1], HESS [2], VERITAS [3], and CTA [4], which is currently under construction. By "statistic" we mean any function S 1 computed from the observed data assuming the truth of a model. Very well-known examples of such functions are the mean, the variance and the χ 2 . As the observed data consists of random variables, the statistic itself is a random variable whose distribution can be derived either from theoretical considerations or empirically using Monte Carlo (MC) simulations. A statistical analysis is therefore performed by comparing the observed value of the statistic with the frequency distribution of the values of the statistic from hypothetical infinite repetition of the same experiment assuming a given model of interest. This approach is usually called the "classical" or "frequentist" approach. This comparison (referred to as the test statistic) between the observed statistic and its long-run distribution allows the analyzer to draw a conclusion from the observed data with a procedure that is right 2 (1 − α) · 100% of the time. The value (1 − α) · 100% is referred to as the confidence level (CL). It is important to underline that it is the procedure, not the conclusion, which is correct (1 − α) · 100% of the time. To better clarify this point we can consider the following claim: "a flux of 10 −13 · cm −2 s −1 from the observation of a gamma ray burst is excluded at 95% CL". Claiming that such a value of the flux is excluded is obviously always wrong in infinite experiments in which the true flux of the observed gamma ray burst is 10 −13 · cm −2 s −1 . However, in these infinite experiments, the procedure would lead the analyzer to this wrong conclusion only 5% of the time 3 . The procedure, i.e., the test statistic and the value α for its significance, is usually dictated by many factors, such as the assumptions about the underlying model, the way the data have been collected and, sometimes, also by the biased conclusions 4 one is willing to derive from the experiment. A common principle [6] is to choose the test statistic with the maximum power, where the power of the statistic is the probability of rejecting a hypothesis that is false. According to the Neyman-Pearson lemma the most powerful statistic is the likelihood ratio, usually defined in the literature, for reasons that will be clear soon, as follows which by definition can only take values bigger or equal than zero, since byθ we have defined the values of the model parameters that maximize the likelihood L. The likelihood is a function of the model parameters θ defined as the probability of obtaining the observed data D obs assuming θ to be true: L(θ|D obs ) = p(D obs |θ). (2) Searching for the parameter valuesθ that maximize the likelihood in Equation (2) is also referred to as fitting the model to the data. If nuisance parameters 5 π are present in the model, π is maximized to the valueπ in the numerator of Equation (1) letting θ be free, resulting in the so-called likelihood profile L(θ|D obs ) = L(θ,π(θ)|D obs ). ( Taking the log value of the likelihood ratio as done in Equation (1) allows making use of Wilk's theorem [7], which states that under certain circumstances this random variable has a χ 2 distribution with degrees of freedom given by the number of the free parameters. This property makes the likelihood ratio a very appealing statistic, whose usage has a very broad application, and it will indeed appear many times in this manuscript. Yet, it must be used cautiously: the interested reader may refer to Ref. [8] for a critical review of the usage and abuse of the likelihood ratio.
The frequentist theory described so far may be considered unsatisfactory by some [9] with its dependence on long-run distributions from infinite experiments and its arbitrariness in the choice of the statistic. An alternative approach is provided by the "Bayesian" or "probabilistic" approach, in which probabilistic statements about hypotheses and model parameters are made through the Bayes theorem p(θ|D obs ) = p(D obs |θ)p(θ) p(D obs ) .
The prior probability p(θ) captures the available knowledge about the parameters, or, more generally, about the hypotheses under study. The so-called evidence p(D obs ) can be seen as a normalization factor. It follows from probability theory that p(D obs ) = ∑ θ p(D obs |θ)p(θ). (5) In this case nuisance parameters are treated via the marginalization p(θ|D obs ) = ∑ π p(θ, π|D obs ), (6) or, in other words, instead of profiling the likelihood by fixing π toπ(θ), one marginalizes the likelihood by integrating out π. Another way of looking at Equation (4) is to consider the odds of a hypothesis defined as the ratio of its probability of being true and not being true whereH and H are mutually exclusive and collectively exhaustive hypotheses. Using the odds formalism the Bayes theorem takes the following form o(H|D obs ) = BF · o(H), with BF = p(D obs |H) p(D obs |H) , (8) where BF stands for the Bayes Factor, i.e., the likelihood ratio of two competing hypotheses. After the measurement the Bayesian approach leads us to update the odds we assign to a given hypothesis by multiplying it with the BF. Unlike the frequentist approach where the goal is to provide a statement about the long-run performance of a test statistic, in Bayes theory we are not interested in hypothetical infinite experiments but in calculating the probability of hypotheses from the observed data and from our prior knowledge of them. Going deep into the details of the statistical analysis and the difference between the frequentist and Bayesian theory goes beyond the scope of this paper, the interested readers may refer to Refs. [10,11] and references therein. Yet, this brief introduction of the basic principles and definitions used for performing an inference analysis in both the frequentist and Bayesian approach is necessary for reviewing the statistical tools used in IACTs analysis in the next sections.
The typical workflow of an IACTs analysis is schematically shown in Figure 1, where, starting from the shower images, the variable s (the expected number of signal events) is derived, which is then used for inferring the flux Φ of gamma rays by taking into account the instrument response function (IRF) of the telescopes. Schematic workflow of the inference analysis performed in order to estimate the intrinsic flux Φ of gamma rays (and the values of its parameters θ) from the recorded images. The acronym IRF stands for Instrument Response Function (see Section 4). The bold arrows (going from the right to the left) show the relation of cause and effect. The aim of the inference analysis (shown as a thin arrow going from the left to the right) is to invert such relation.
In the remaining part of this section, the structure of the paper is outlined. First we discuss in Section 2 the most common techniques implemented for performing the event reconstruction from the shower images detected with IACTs. The goal of these techniques' yields is to obtain a list with the estimated energy, direction, and discriminating variables for each of the candidate gamma ray events.
Using this event list, it is then shown in Section 3 how to estimate the strength of the signal s and how confidently we can claim that a gamma ray source is producing part of the recorded events.
The final result of the statistical analysis is the differential gamma ray flux Φ, which corresponds to the number N γ of expected photons per unit energy (E), time (t), and area (A): wheren is the photon direction. We denote with Φ the observed flux, i.e., the flux of events actually observed by the telescopes when the IRF of the telescope is included. The expected counts s is connected to Φ by taking into account the exposure of the observation, i.e., by integrating Φ over the temporal, energetic and spatial range in which the events have been collected. This is the topic of Section 4, in which the way the source flux and its model parameters are obtained is also discussed.

Event Reconstruction Techniques
The first statistical analysis one has to face in IACTs is the reduction of the recorded images in the camera of the telescopes to a few parameters of interest. The Cherenkov light from the shower of secondary particles is reflected by mirrors and focused on a camera with photomultipliers composing the pixels of the shower image (see Figure 2). The event reconstruction consists therefore in extracting from the photo-electron (PhE) counts and arrival time of each pixel the following variables: • the energy of the primary gamma ray that initiated the shower, • its arrival direction, • and one or more discriminating variables.
The role of these discriminating variables is to provide information on how likely one event can be associated with a gamma ray or to the background composed mainly by hadronic cosmic rays. The background estimation and the signal extraction are discussed in Section 3, while for the remaining part of this section the most commonly used event reconstruction tools are reviewed.

Hillas Method
The most common event reconstruction technique is based on the moments (up to the second order) of the pixel amplitudes in the camera, referred to as Hillas parameters [13]. This technique can be thought of as fitting an ellipse to the pixels: a likelihood function that depends on the Hillas parameters is maximized under the assumption that the Cherenkov light from a shower initiated by a gamma ray would produce an elliptical shape in the camera. The set of parameters includes variables such as the total PhE counts in all pixels, the PhE-weighted barycenter of the pixel positions, and the time gradient of the pixel arrival times. If more than one telescope is involved, then these single-image parameters can be combined in order to obtain stereoscopic parameters giving a 3-dimensional reconstruction of the event [14]. The calculation of these parameters is easily affected by image noise and night-sky background, which requires a cleaning procedure in order to remove the pixels that do not contain the shower image. Moreover, the dim and small shower images below 100 GeV can result in parameters values affected by large fluctuations and systematic uncertainties, which is the reason why the instrument response function of IACTs deteriorates at lower energies. Techniques based on Hillas parameters have been implemented since the 1980s and are used in a variety of experiments such as MAGIC [14] and HESS [15], demonstrating the robustness and reliability of the method.
After the parametrization of the event is completed, the gamma ray energy is estimated from the shower impact parameter and from the photon density is measured with each telescope. This is done by constructing look-up tables for different observational conditions, filled with MC information about the true energy of the gamma ray as a function of the simulated image amplitude and impact parameter. The arrival direction is obtained from the crossing point of the main ellipse orientations in the individual cameras. A weighted combination of some Hillas parameters can be used as a discriminating variable [15]. More refined techniques have been developed, aimed at improving the inference analysis on the gamma ray properties starting from the Hillas parameters (see Section 2.5).

Semi-Analytical Method
Despite the robustness and stability under different conditions of the Hillas method, additional reconstruction procedures have been explored in order to exploit more information from the recorded image. The so-called semi-analytical method consists of fitting to the shower images a model of the Cherenkov light produced by a gamma ray shower as seen by the camera. A first implementation of this method can be found in Ref. [16] from the CAT collaboration in the late 1990s. In this pioneering implementation, the 2D-models are stored in a look-up table and compared to the observed image via a χ 2 -function of the gamma ray energy, the impact parameter and the source position in the focal plane. This function is defined as the sum of the squared differences over all pixels between the expected PhE content and the actual observed one. This sum is weighted according to the Phe count quadratic error. A χ 2 minimization is performed to obtain the best fit parameter of the gamma ray, while the resulting χ 2 is then used as a discriminating variable. This method has been re-implemented and subsequently improved by the HESS collaboration [17], where the χ 2 minimization has been substituted by the minimization of a log-likelihood defined as The variable n i is the observed PhE count in the pixel i, while θ are the shower model parameters. This method is referred to as semi-analytical because the template library of shower images is produced with MC simulations which are usually carried out with dedicated software such as KASKADE [18] and CORSIKA [19]. Compared with the Hillas method, a more precise estimation of the energy and direction of the primary gamma ray is provided by this reconstruction technique, especially at low energies.

3D-Gaussian Model
An additional approach is given in Refs. [20,21], where the single-pixel PhE counts are fitted to an analytical gaussian air shower model. This method, referred to as a 3D model or 3D Gaussian model, assumes an isotropic angular distribution of the shower, and its rotational symmetry with respect to its incident direction is used to select gamma ray events. As usual a likelihood function is maximized with respect to the shower parameters. This maximization process is rather fast thanks to the simpler assumption of the 3D-Gaussian model. More recently [22] the 3D-model was combined with a multivariate analysis that makes use of the so-called Boosted Decision Tree (see Section 2.5) and adapted to the detection necessity of IACTs, particularly for the discovery of new faint sources.

MC Template-Based Analysis
The previously mentioned methods in Sections 2.2 and 2.3 strongly rely on a model fit that becomes more difficult to describe as we reach higher energies. The more energetic the gamma ray, the more particles are produced, and a large fraction of the latter is capable of reaching the ground. This causes strong fluctuations in the fit model above ∼10 TeV. Moreover, in these approaches, the quality of the model fit is inevitably worsened by instrumental effects and atmospheric conditions which require approximations in order to be taken into account. To overcome these issues and improve the accuracy of the analysis, the authors of Ref. [23] proposed an Image Pixel-wise fit for Atmospheric Cherenkov Telescopes (ImPACT). In this approach, the template shower images are produced using more detailed and time consuming MC simulations. The simulation chain consists in simulating the air shower with CORSIKA [19] which is then combined with sim_telarray 6 [24] for reproducing the instrumental effects of the telescopes. The sensitivity is improved by a factor of 2 when the ImPACT reconstruction method is implemented, relative to the Hillas-based method (see Section 2.1).
Compared with the 2D-model, some improvements were shown at higher energies. A similar implementation for the VERITAS telescopes can be found in Ref. [25]. The role of realistic MC simulations has been very recently emphasized by the authors of Ref. [26], who proposed a new simulation and analysis framework as an alternative to the current way MC templates are obtained. In the existing paradigm, simulations are generated from pre-defined observation and instrument settings, such as the zenith of the observation or the configuration of the camera. Each simulation can be then seen as a grid point in the setting-parameters space. The analyzer willing to use a "run" 7 performed with some given settings has to look for the adjacent grid points either by interpolating them or by taking the closest one. Instead in the run-wise simulation approach described in Ref. [26], simulations are generated on a run-by-run basis. In this way observational conditions are fully taken into account, thus leading to more realistic MC simulations that reduce systematic uncertainties and improve the scientific output of the statistical analysis.

Multivariate Analysis
To date, we described techniques for parametrizing an event detected by IACTs. These parameters are then used for inferring the energy, the arrival direction of the primary gamma ray, and discriminating variables. The latter are used to tell how likely the event can be associated with a gamma ray. The usage of discriminating variables is quite simple: all the events with values of the parameter larger (or smaller, depending on the kind of variable) than a pre-defined threshold are retained and considered to be gamma-like events, i.e., originating from a gamma ray. A different approach that avoids cutting data by exploiting the full probability distribution function (PDF) of the discriminating variable will be discussed in Section 3. Once a discriminating variable is chosen and a fixed threshold is defined, the separation (or discrimination) power can be obtained from the so-called Q value where x is the efficiency of the selection procedure given by the fraction of events belonging to the population x surviving the selection (h stands for hadrons which compose the background population). This classification problem becomes considerably more difficult when more than one parameter can be actually used for discriminating signal events from the background population. Multivariate methods consist of combining several of the shower parameters into one single discriminating parameter. The main advantages of these approaches are that • nonlinear correlations between the parameters are taken into account and • those parameters with no discrimination power are ignored.
A detailed review and comparison of different multivariate methods for event classification in IACTs can be found in Ref. [27]. In this section, we provide a brief description of the currently most used methods, the Boosted Decision Tree (BDT) and the Random Forest (RF) [28]. The BDT approach, implemented for the HESS [29,30] and VERITAS [31] telescopes, is a binary tree where events are sorted into small subsets by applying a series of cuts until a given condition is fulfilled. This condition might be given by requiring that the number of events in a leaf is smaller than a predefined value or that the signal over the background ratio in a leaf must be large enough. The term "boosted" refers to the fact that more than one individual decision trees are combined in a single classifier by performing a weighted average. The boosting allows improving the stability of the technique with respect to fluctuations in the training sample and is able to considerably enhance the performance of the gamma/hadron separation compared to a single decision tree. Like the BDT approach, the RF method, implemented for the MAGIC telescopes [32], also relies on decision trees, which are built up and combined with some elements of random choice. As for the BDT, training samples of the two classes of population (signal and background) are needed for constructing the decision trees. Once the classifier has been properly 8 trained, the algorithm can be used to assign to each event a single discriminating variable whose distribution on a test gamma and hadron population can be seen in Figure 3.

Deep Learning Methods
The multivariate methods described in Section 2.5 for discriminating the signal events from the background have shown a great capability in improving the sensitivity of IACTs. This effort has been recently pushed forward by Deep Learning (DL) [34] techniques for object recognition in images. Such algorithms, which require more computational power, are getting more and more attention thanks to the improvements during recent decades in the usage of CPU and GPUs for matrix operations. When it comes to image processing, the leading DL algorithm is Convolutional Neural Networks (CNNs) whose first application in the context of IACTs can be found in Ref. [35], where a CNN was applied in the simple case of muon-ring events. This work served as a pathfinder for the application of CNNs for gamma/hadron separation from the raw recorded images.
A CNN is made of many connected layers which in turn consist of different nodes. The first layer is the input image whose pixels represent its nodes. The inputs in a new layer are convolved with kernels that have to be trained. Each new layer is in general much smaller than the input one, and allows identifying features in the previous layer. Adding more and more layers, one aims to extract more and more complex features, which can be possibly used to identify those discriminating features in the images that otherwise would not be considered in other event-classifier algorithms. For a more detailed review and description of DL and CNNs algorithms, we refer the reader to Refs. [34,36] and references therein. The training process in CNN can be computationally demanding, due mainly to the very large number of parameters. The main advantages of DL relative to previous event reconstruction methods in IACT is that CNNs do not need the image parameters (such as the Hillas ones), and therefore all the features contained in the image are fully exploited, while they might get lost or suppressed during the parametrization. Recent applications of CNNs in the image processing of IACTs data can be found in Refs. [37][38][39][40], where the algorithm was also implemented for the energy and arrival direction estimation of the gamma rays.

Detection Significance and Background Modeling
The final result of the statistical analysis described in Section 2 is a list containing useful information about the candidate gamma ray events, such as their estimated energy and arrival direction. Neglecting any background contamination, the total number n of events in such a list would be a random variable distributed according to the Poisson PMF where s is the expected number of signal counts. The problem is that the majority of the observed events are actually generated by hadronic cosmic rays, while only a small fraction (which for the case of a bright source such as the Crab Nebula is of the order of 10 −3 ) can be associated with gamma rays. By applying a signal extraction selection on the data based on a discrimination variable, it is possible to reduce the background by a factor of 100 or more, thus increasing the signal-to-noise ratio close to 1 for a bright source. In order to infer the gamma ray flux from the resulting event list it is essential to first estimate the remaining background contamination. We can consider three different scenarios (which are the topic of Sections 3.1-3.3, respectively) where the expected background count b in the target region is: • zero or negligible relative to the source counts, • known precisely, • estimated from an OFF measurement.
The latter case is the most common one and requires the definition of two regions: a region of interest (ROI), also referred to as target, test or ON region, and a background control region, called OFF region. The ON and OFF regions provide, respectively, independent N on and N o f f counts, where the latter is ideally void of any signal event. A normalization factor α is introduced to account for differences (such as the acceptance and the exposure time) between the ON and OFF region. It can be defined as where A(x, t) is the instrument acceptance, which is a function of the observation time t and of all observational parameters (such as the FoV position or the zenith angle) here denoted for simplicity with x. The goal of the background modeling analysis is to provide the values of α and N o f f that are then used for estimating the signal s along with the detection significance.

The Background Is Zero or Negligible
Although very rare, in some analyses the background b in the ON region may be assumed to be zero or negligible relative to the signal s. Given the simplicity of this case, it is worth dwelling on it, discussing with examples the statistical conclusions that can be drawn from a measurement using the frequentist and Bayesian approach. In this scenario the likelihood function is trivially L(s) = P (N on |s), (14) where P is the Poisson distribution (see Equation (12)) with observed and expected counts N on and s, respectively. Using the likelihood ratio defined in Equation (1) as a statistic, and taking into account thatŝ = N on is the value of s that maximizes the likelihood, we get for any s > 0 the following statistic Such a statistic, known in the literature as the Cash statistic or C-statistic [41], has a straightforward meaning: if we measured N on counts in the ON region and assumed that the true signal rate is s, then the value obtained from S according to Wilks' theorem is a random variable that follows a χ 2 distribution with 1 degree of freedom. This can be checked by performing MC simulations as shown in the left plot of Figure 4. The smaller the true value of s the more difficult it is to find an exact distribution for the statistic S which, due to the discreteness of the Poisson distribution, cannot be assumed anymore to be a χ 2 variable. For small expected signal counts s, it is therefore necessary to get the CDF of S from MC simulations. We can be interested, for instance, in the hypothesis H 0 : "the number of expected signal events (for a given temporal and energetic bin and surface area) is s = 65.4". After having performed the experiment, we obtain from the measurement N on = 82 events. In this scenario, by computing the statistic in Equation (15) we get S = 3.9. If H 0 was true we would have observed a value of the statistic equal or greater than 3.9 only 4.8% of the time (see the left panel of Figure 4). The conclusion of the frequentist approach is therefore that H 0 can be excluded with a 95.2% CL or in other words with a 1.98 σ significance. The latter is obtained by expressing the CL in multiples of the standard deviation σ of a normal distribution via the inverse of the error function 9 : The aim of the Bayesian approach is instead to provide a probabilistic statement about s, which is not fixed as in the frequentist approach. By applying the Bayes theorem, we get that the PDF of s is (up to a normalization factor) p(s|N on ) ∝ P (N on |s) · p(s), where p(s) is the prior PDF of s, which encloses the prior knowledge the analyzer has on the source's signal. In the Bayesian context we can compare two hypotheses s = s 1 and s = s 2 as follows where BF = s 1 s 2 N on e −s 1 +s 2 .
The evolution of the BF as a function of the expected counts s 2 , using s 1 = 65.4 (which is our hypothesis H 0 of interest) and the experiment result N on = 82 is shown in the right panel of Figure 4. It is worth noticing that the BF is connected to the statistic in Equation (15) via in which s 1 = s and s 2 = N on . Lastly, it can be shown that assuming a uniform prior 10 the PDF of s in Equation (17) is

The Background Is Known Precisely
This is the case in which we know from theoretical or experimental considerations the true valueb of the background. This happens for instance in the field-of-view-background model, where the entire FoV (excluding positions where γ-ray events are expected) is used for modeling the background. Since the OFF region is composed by the entire FoV and the ON region by a small portion of it, we have α 1 . Therefore the Poissonian fluctuations of the background contamination in the ON region can be neglected, being given by Indeed the detection significance of the "known" and "unknown" background cases coincide for α 1 (see Section 3.3). Thus, in the field-of-view-background model we can make the assumption of knowing precisely the background, given by the product αN o f f .
The likelihood function is L(s) = P (N on |s +b), which reaches its maximum value forŝ = N on −b. The statistic is obtained from the Cash one defined in Equation (15) in which s has been substituted with s +b. An important difference between the previous case, in which the background was assumed to be zero, is that now the statistic is also defined for the hypothesis s = 0. For this no-source hypothesis it is common to slightly modify the Cash statistic by taking the square root of it and by introducing a sign that is arbitrarily chosen to be positive when the excess N on −b is positive, yielding In this way, for large enough counts (N on 10) the statistic is a random variable that follows a normal distribution with mean zero and variance 1 (as shown in the left panel of Figure 5). This allows immediately converting the output of S in a significance level. If for instance we assumeb = 10 and we observe N on = 21 events, then S = 3.0 which means that the no-source hypothesis can be excluded with a significance of 3 σ.  (28) (right panel) from measurements in which the background is known precisely to beb = 10 or must be estimated from the OFF counts N o f f with α = 1, respectively. In both simulations, the true values of s and b are 0 and 10, respectively. In both plots, the PDF of a normal distribution with mean zero and variance 1 is shown in grey as reference. In both cases 10 6 simulations were performed. The distribution in the left panel looks to be less populated due to the fact that having the background fixed to 10 (and not estimated from an OFF measurement) limits the number of possible outcomes of the statistic.

The Background Is Estimated from an OFF Measurement
Let us consider the most common scenario, in which we do not know the background b and therefore need to estimate it by performing OFF measurements, supposedly void of any signal. Such OFF measurements can be performed following one of the below procedures: • On-Off background: the OFF counts are taken from (usually consecutive) observations made under identical conditions, meaning that α is simply given by t on /t o f f with t on and t o f f the exposure time for the ON and OFF observation, respectively. The main advantage of this method is that no assumption is required for the acceptance, except that it is the same for the ON and OFF regions. The drawback of this approach is that dedicated OFF observations are needed, thus "stealing" time from the on-source ones. • Wobble or reflected-region background: the OFF counts are taken from regions located, on a run-by-run basis, at identical distances from the center of the field of view. Each of the OFF regions is obtained by reflecting the ON region with respect to the FoV center. This is the reason this method is called the reflected-region method. If we have n OFF regions then α is equal to 1/n. This technique was originally applied to wobble observations [43] and was later on used also in other observation modes.

•
Ring background: the OFF counts are taken from a ring around the ROI or around the center of the field of view. • Template background: the OFF counts are given by those events that have been discarded in the signal extraction selection based on a discriminating variable. In this method, first developed for the HEGRA experiment [44] and more recently refined for HESS [45], the discarded events are used to template the background.
For a more detailed review of the background modeling and comparison of the different methods see Ref. [46].
From one of the above-mentioned procedures, once we obtained the value of α and N o f f , the inference analysis on s is performed using the following likelihood: We are not directly interested in knowing the background, which is therefore a nuisance parameter. Thus, in the frequentist approach we have to profile the likelihood (see Equation (3)) by fixing b to the valueb that maximizes L for a given s (see for instance Ref. [47] for a derivation ofb) , i.e.,b with N ≡ N on + N o f f − (1 + 1/α)s. Performing as usual the logarithm of the likelihood ratio we have

the values 11 that maximizes the likelihood in
Equation (25), whileb is given in Equation (26) and maximizes L for a given s . The statistic in Equation (27) depends only on the free parameter s and according to Wilks' theorem it follows a χ 2 distribution with 1 degree of freedom. It can then be used for hypothesis testing, in particular the "s = 0" hypothesis 12 from which we can obtain the detection significance. Similarly to Equation (24), we can take the square root of Equation (27) and set s = 0, yielding the statistic where the sign is arbitrarily chosen to be positive when the excess N on − αN o f f is positive. This expression is the well-known "Li&Ma" [48] formula for computing the detection significance in ON/OFF measurements. As shown in the right panel of Figure 5 for large enough counts (N on , N o f f 10) the statistic in Equation (28) distributes according to a normal distribution with mean zero and variance 1. We can again consider the example in which N on = 21 counts have been observed in the ON region, but instead of assuming a known backgroundb = 10, our background is instead estimated from the OFF measurement N o f f = 10 with α = 1. Using the statistic in Equation (28) we get a detection significance of 2 σ, which is smaller than the 3 σ obtained from the Cash statistic where the background is assumed to be known precisely. A comparison of different values of N on between the Cash (Equation (24)) and Li&Ma (Equation (28)) statistic is shown in Figure 6, where one can see that the former becomes bigger than the latter as more events are observed in the ON region. This is due to the fact that the Li&Ma statistics account for the Poissonian fluctuations in the observed counts N o f f . These fluctuations make an association with the gamma ray excess with the source signal less likely. When α 1 the two statistics give the same result.
The Li&Ma expression in Equation (28) based on the likelihood ratio is not the only statistic used for rejecting the background-only hypothesis in Poisson counting experiments. One can find in the literature the so-called "signal-to-noise ratio" which has the disadvantages of following a normal distribution only for values of α close to 1 and for large enough counts [48]. Another approach is to compute the p-value from the observed N on counts, i.e., the probability of observing a total count bigger than N on under the assumption of the only-background hypothesis. If we ignore uncertainties in the background we have written in terms of the incomplete gamma function Γ. When we want to include the fact that the background is estimated from an OFF measurement, it is convenient to introduce the variable N tot = N on + N o f f , and it can be shown [49] that the observed quantity N on follows (for a given N tot ) a binomial distribution B with success probability ρ ≡ α/(1 + α) and total numer of attempts N tot . The p-value is with β the incomplete and complete beta functions (distinguished by the number of arguments). Finally the statistic is defined from the above p-values using A review and comparison of these statistics applied to the ON/OFF measurement can be found in Refs. [48][49][50][51]. Finally, it is worth mentioning that different extensions and refined versions of the Li&Ma expression in Equation (28) were introduced, each one with its application to a particular problem. The problem of including PSF 13 information in the likelihood is addressed in Refs. [52][53][54] , while the problem of including the prior knowledge of the source light curve is considered in Ref. [55] . Assessing the role in the detection significance of systematic uncertainties, especially those rising from the normalization factor α, is performed in the studies of Refs. [51,56,57]. Following the prescriptions of probability theory, in the Bayesian approach, the signal s is estimated by defining its PDF in which the nuisance parameter b has been marginalized: Assuming flat priors p(s) and p(b) (with s > 0 and b > 0) it can be shown [58] that the integral in Equation (33) is where N s is a bound variable whose physical meaning will be clear soon. Thus, the PDF of the expected signal counts s is given by a weighted sum of Poisson distributions with observed counts ranging from 0 to N on . One can recognize (see Refs. [58,59] for a detailed explanation) in Equation (34) a marginalization over the variable N s . The weights in the sum of Equation (34) are indeed the PMF of the number of signal events N s in the ON region: In the left plot of Figure 7 the PDF of s and the PMF of N s from Equations (34) and (35), respectively, are shown. The best estimate of s can be then obtained from the mode of the PDF in Equation (34). The evolution of the Bayesian estimation of s as a function of the excess N on − αN o f f can be found in the right plot of Figure 7. We can now compare the two hypotheses H s+b and H b , respectively, "the observed counts in the ON region are produced by the source and background" and "the observed counts in the ON region are only produced by the background". The BF is (see Ref. [60]) in this case (assuming again flat priors for s and b) where s max is the maximum prior value of s, i.e., p(s) = 1/s max . From the above expression, one can then compute the odds of H s+b following with and p(H s+b ) and p(H b ) the priors of the two competing hypothesis H s+b and H b , respectively. The above odds can be expressed in a "frequentist-fashion" way by converting the posterior probability of H b in a significance value using the inverse error function: as shown for instance in Ref. [61] where both constant and Jeffreys's [62] priors are assumed and a comparison with the Li&Ma significance (see Equation (28)) is shown. More recently this effort has been pushed forward in Ref. [63], where an objective Bayesian solution is proposed and compared to the result of Ref. [61]. The main advantage of these Bayesian solutions is that there are no restrictions in the number of counts N on and N o f f , while the frequentist ones holds only when the counts are large enough. Yet, it is important to not confuse the two approaches, since they aim at finding the solution of two different problems: studying the long-run performance of a statistic in the frequentist approach and deriving the probability of hypotheses in the Bayesian approach.
Lastly, it is worth mentioning that it is possible [59] to extend the PDF of s in Equation (34) by including the information on how the discriminating variables distribute for a signal or background population. The authors of Ref. [59] showed that by performing such extension not only can one avoid discarding data based on a discrimination variable (which inevitably discards also part of the signal events) but one can also increase the resolution of the signal estimation.

Bounds, Confidence and Credible Intervals
We have shown so far how, given the number of events observed in the ROI, one can estimate the source signal s and its significance. However, the statistical analysis would be incomplete without also reporting lower and upper bounds on the inferred parameters. In the former case they are referred to in the literature as lower limits (LLs), while in the latter as upper limits (ULs), with the interpretation that values of the parameters below the LL or above the UL are more unlikely to be true. They are particularly useful when the detection is not significant, for instance when the source is too dim, and therefore one would like to provide an UL on the strength of the signal s.
In the frequentist approach, these bounds are obtained by looking at the log-run behavior of the statistic: a threshold value S * of the statistic is defined such that in infinite experiments with fixed parameterθ, we would have observed S ≤ S * only x% of the time. The lower or upper bound θ x is then defined such that S(θ x ) = S * . In other words, we look for the values of the parameter that are excluded with a x% CL. The statistic S is generally constructed to increase monotonically for values of θ smaller or bigger than the best estimated valueθ, which is by definition the value whose exclusion can be claimed with a 0% CL. For an UL (LL) this means that values of θ bigger (smaller) than θ x are excluded with higher CL and they are therefore less likely to be true. This is schematically shown in the left panel of Figure 8 where the statistic S is shown as a function of the parameter θ for different experiments in which θ is fixed to the true valueθ (vertical line).
By searching for θ x such that S(θ x ) = S * we obtain a LL θ LL x and UL θ UL x . By construction only x% of these curves have S(θ) ≤ S * , which are shown in black in the left plot of Figure 8, while the remaining curves are shown in grey. This implies that the true valueθ lies in the interval θ LL x , θ UL x x% of the time. Such interval is referred to as confidence interval (CI) and it is said to cover the true value of θ x% of the time. shows the threshold S * for the statistic such that S(θ) ≤ S * only x% of the time. Black curves are those that fulfill this condition while grey ones are those that do not. In the left plot the intersection between the curves and the line S = S * defines CIs which by construction cover the true value of θ x% of the time. These CIs cannot be anymore constructed for the curves in the right plot where the statistic has below S * more than one minimum. In both plots the curves shown are not specific on any particular problem but only serve as a schematic representation.
The condition that values more extreme than the obtained bounds are rejected with more CL applies for the analysis described in Sections 3.1-3.3, but in general it is not always true, as shown schematically in the right plot of Figure 8. In such cases LLs, ULs and CIs do not have a straightforward interpretation. This is the reason why it is good practice to report in the result of the statistical analysis also the likelihood shape as a function of the free parameters.
If x is chosen to be 68, the interval between the lower and upper bounds defines the socalled 68% CI. When the background is estimated from an OFF measurement (see Section 3.3) we can use the statistic defined in Equation (27) which is a χ 2 random variable with 1 degree of freedom 14 . By looking for the bounds for which S = 1 we obtain the 68% CI of s, that for large count numbers is given by whereŝ is the estimated signal given by N on − αN o f f . When looking for the UL, the CL is usually set to 95%, with other common values being 90% or 99.9%. In this case the UL s 95 is obtained by solving S(s 95 ) = 3.84. The coverage of the 68% and 95% CI is shown in Figure 9 for different true signal and background counts. As one can see from this figure by imposing S = 1 or 3.84 we have a good coverage (of 68% and 95%, respectively) for large enough counts. Although when the counts are too small the CIs tend to undercover the true value of s. Such problem is well-known and it requires ad hoc adjustments [47,64] in order to recover the desired coverage.
In the Bayesian context, the concept of coverage is meaningless, since the objective of the analysis is not to look for the long-run performance of a given statistic, but to provide a probabilistic statement on the model parameters. In this case, CIs are replaced by credible intervals whose purpose is to provide the analyzer an interval where the model parameter lies with a given probability. Let us assume that we are interested in finding the 68% credible interval [s 1 , s 2 ] of the signal s. By using the PDF of s in Equation (33) such interval is defined as follows (41) where s 1 and s 2 can be chosen 15 to include the values of highest probability density. Similarly the 95% UL s UL on s is obtained from A comparison between the confidence and credible intervals, computed with the frequentist and Bayesian approach, respectively, can be found in Refs. [59,61,63]. When comparing them it is although important to remember that the two approaches are providing the answer to two completely different questions. In the frequentist case the analyzer is given a procedure for computing the interval that in infinite experiments will cover the true value of the parameters a desired fraction of the time. The parameter is fixed in these infinite experiments and the coverage is usually checked by performing MC simulations. In the Bayesian approach instead the model parameters are not fixed and they lie in the computed interval with a given probability.

Flux Estimation and Model Parameter Inference
We have reached the final step of the inference analysis (see Figure 1) which started in Section 2 from the shower image data: estimate the source flux and model parameters. Starting point of this analysis is the expected signal count s, whose estimation from the events list is described in Section 3. Taking into account the exposure of the observation given by the energetic (E), temporal (t) and solid angle (Ω) range (hereafter denote by ∆) in which the events have been collected we have s = ∆ Φ (E r ,n r , t)dE r dn r dt (43) where Φ is the differential observed flux with units of 1/( solid angle × time × energy), while E r andn r are the reconstructed energy and arrival direction (for the time a perfect temporal resolution is assumed being of the order of hundreds of nanoseconds). The observed flux is given by the convolution of the differential source flux Φ with the IRF of the telescope: The IRF can be though as the probability of detecting a photon with energy E and arrival directionn and assigning to it a reconstructed energy and arrival direction E r andn r , respectively. Following the rules of conditional probability the IRF can be expanded as follows: IRF(E r ,n r , E,n) = IRF(E r |n r , E,n) · IRF(n r | E,n) · IRF(E,n).
Since E r andn r are conditional independent variables 16 given E andn, the above expression can be rewritten as where we have identified the first term with the energy dispersion D, the second one with the point spread function (PSF) and the last term with the collection efficiency ε of the telescopes.
To date, we have made the assumption that the observation parameters (such as the zenith angle of the observation) are (or can be assumed to be) constant during the observation. We can further simplify the IRF expression by ignoring (i.e., by integrating out) the information on the arrival directionn, thus reducing the dimension of the problem from 3 to 1. This assumption is justified if, for instance, the observation is performed on a point-like source, which will be also assumed hereafter to be steady. Having simplified our problem 17 , Equations (43) and (44) can be then rewritten, respectively, as In order to get the flux Φ from the expected counts s two approaches are used: unfolding and forward folding.

Unfolding
If we divide the energy range in bins, the expected counts of gamma ray in the i-th bin ∆ i of reconstructed energy is (when combining Equations (47) and (48)) wheres j is the expected number of gamma rays from the source flux in the j-th bin ∆ j , i.e., The matrix R i,j is the response matrix which is the probability of detecting (due to the collection efficiency ε) a photon with energy in the range ∆ j and assign to it (due to the energy dispersion D) a different energy bin ∆ i . From Equation (49) its expression in formula is given by while in practice R i,j is estimated from MC simulations as where N γ j is the total number of gamma ray events simulated (according to an assumed source flux Φ) with true energy in the energy range ∆ j , and n γ i,j the number of those same events that have been detected by the telescope with a reconstructed energy in the energy range ∆ i . Clearly, by summing over i we recover the binned collection efficiency An example of the binned collection efficiency along with the energy dispersion from the same experiment can be found in Figure 10. Goal of the unfolding procedure is to find a solution of Equation (49), by inverting the response matrixs Thus, unfolding is basically a deconvolution problem and shares its typical problems, like the fact that the response matrix is, in general, non-invertible. As all ill-posed problems regularization procedures are required in order to find a solution and to prevent overfitting. In the context of IACTs analysis, common regularization procedures are those of Tikhonov [68], Bertero [69] and Schmelling [70]. For a more detailed discussion and comparison of these approaches with applications to data collected with the MAGIC telescopes see Ref. [67]. It is good practice to show the unfolding result with several of these approaches to cross-check the reconstructed flux and to also report along with the reconstructed flux pointss j their correlation matrix. Such a correlation matrix is needed if one is willing to fit the flux pointss j with a spectral model.
To date, the unfolding approach has been discussed as a geometrical problem: given a known vector s and a known matrix R, we wish to invert R in order to find the unknown vectors as shown in Equation (54). In the Bayesian unfolding approach instead the problem is a probabilistic one: given our prior knowledge I and the expected counts s i in the reconstructed energy bins, we wish to get the probability distribution ofs j p(s j | s i , I) = p(s i |s j , I) · p(s j |I) ∑ i p(s i |s j , I) · p(s j |I) .
The prior p(s j |I) is the binned normalized ( ∑ j p(s j |I) = 1) flux that we initially assumed for the source, while the term p(s i |s j , I) is the probability of measuring an expected signal count in the reconstructed energy bin ∆ i given the true signal counts j in the energy bin ∆ j . This term is related to the response matrix defined in Equation (51).
An iterative method for getting the posterior in Equation (55) that takes as a prior p(s j |I) the posterior obtained from a previous iteration can be found in Ref. [71] and later on revised and improved by the same author in Ref. [72] . More recently the author of Ref. [73] proposed a fully Bayesian unfolding with applications to numerous examples.

Forward Folding
The main advantage of the unfolding algorithm is its ability to show a distribution that is as much as possible equivalent to the observed distribution of events with physical and instrumental effects being removed. Although some assumptions about the flux are inevitable as discussed in Section 4.1, the desired outcome of the unfolding procedure is to "interpret" the data as little as possible.
In the total opposite direction, we can find the forward folding approach. In this case, a parametric model for the intrinsic flux is assumed and the final result is to provide an estimate of the free model parameters θ. When it is reasonable to believe that the source flux can be described by one or a family of parametric functions, the forward folding is always preferable to the unfolding one, being easier to implement and free of those problems that are typical of the unfolding methods which are cured by regularization. The problem falls therefore within the realm of "fitting" problems: searching for the valuesθ that maximize the likelihood function, defined as the probability of getting the observed data given the model parameters θ. The observed data are the list of N on events (obtained from the shower image as discussed in Section 2) with their reconstructed energy, arrival direction and time (hereafter denoted for simplicity x). If the background is estimated from an OFF region (see Section 3.3) one has to take into account also the observed counts N o f f in the OFF region and the normalization factor α. The likelihood function is where π are the nuisance parameters of the model and p(π | θ) their probability distribution given θ. The function f s is the differential source flux with the IRF of the telescopes being taken into account, such that where s is the expected signal counts in the ON region. The expected background count b in the OFF region is instead provided by where f b is the differential background template model. The function P is the Poisson distribution defined in Equation (12). The likelihood function defined in Equation (56) is usually referred to as "unbinned likelihood" to distinguish it from its binned version on , . . . , N where L i is the likelihood of the single i-th bin (in which N The variable b i has to be treated as a nuisance parameter and fixed to the value given in Equation (26) if the frequentist approach is implemented or integrated out in the Bayesian one. The variable s i (θ, π) is instead obtained by the integral in Equation (57) with integration limits on x defined by the bin under consideration.
The forward folding procedure applied in IACTs can be found for instance in fundamental physics studies, such as the search for dark matter [74] or tests of the equivalence principle from the time of flight of cosmic gamma rays [75].

Discussion
From the discovery of the TeV emission from the Crab nebula in 1989 by the Whipple collaboration [76], IACTs have been able in recent decades to give birth to a mature field of gamma ray astronomy. Instruments such as MAGIC, HESS, and VERITAS discovered numerous astrophysical sources at TeV energies, allowing investigation of the physics of remote cosmic objects. Apart from the technological development needed for the construction and maintenance of these telescopes, a huge effort has been carried out in recent decades to adapt and explore statistical tools aimed at extracting all of the information contained in the collected data.
The most challenging issue in the analysis of IACTs data is the predominant presence of background events that require detailed studies such as the estimation of the background from OFF regions as discussed in Section 3.3. Gamma rays only compose a small fraction of the cosmic flux that hits our atmosphere producing the Cherenkov light observed by the telescopes. Techniques such as the "multivariate analysis" (see Section 2.5) or the CNNs (see Section 2.6 ) are the current state of the art for discriminating gamma rays events from the background.
Another important factor that makes the statistical analysis so important and challenging for these instruments is that unfortunately we do not have a pure source of gamma rays, which is steady and under our control, and can then be used for calibrating the telescopes. The closest we have to a steady and bright gamma ray source is the Crab Nebula [77], which is indeed used as a standard for calibrating the instrument whenever an IACT observatory starts its operations. In order to infer instrumental properties, such as the energy threshold, a signal from the Crab Nebula is collected and then compared with the expected (obtained from MC simulations) response. MC simulations are therefore of huge importance for IACTs and, moreover, they also provide a way for training the BTD and RF algorithms briefly discussed in Section 2.5. The problem is that instrumental effects (such as the reflectance of the mirrors) and the atmospheric conditions have to be taken into account in these MC simulations, which in most cases requires approximations and idealized instrumental parametrizations. This is the reason why different efforts were made as discussed in Section 2.4 for making these MC simulations as realistic as possible.
Once the above issues are overcome, we have to quantify, given the list of detected events, how likely it is that a flux of gamma rays has been detected and how confidently we can set some values to such a flux. This has been discussed in Section 3 where we showed, using the frequentist and Bayesian approach, different solutions to this problem, emphasizing with examples their differences and mentioning some of the most recent developments. Lastly in Section 4, the IRF is taken into account in order to provide a flux estimation which is as much as possible similar to the intrinsic flux of gamma rays.
With the construction of CTA [78], the next generation of IACTs has ten times more sensitivity than the current instruments, and the statistical tools described in this review will become more and more indispensable in order to put the capability of the telescopes at its limits.

Acknowledgments:
We would like to thank Julia Djuvsland for useful comments on a early version of this manuscript and the anonymous reviewers for their insightful inputs and suggestions.

Conflicts of Interest:
The author declares no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: Hereafter throughout the paper the symbol S is used for the statistic, while the generic symbol p() is used to indicate all probability density functions (PDFs) and probability mass functions (PMFs) (the former applies to continuous variables and the latter to discrete variables). 2 By convention α is the probability of making a type I error, i.e., rejecting a hypothesis that is true. It is also refereed as the statistical significance or p-value. 3 Here we are assuming that the analyzer would make this conclusion every time that the observed statistic falls above the 95thpercentile of the statistic distribution. 4 The misuse and misinterpretation of statistical tests in the scientific community led the American Statistical Association (ASA) to release in 2016 a statement [5] on the correct use of statistical significance and p-values. 5 By nuisance parameters we mean parameters that are not of interest but must be accounted for. 6 The sim_telarray code is a program that given as input a complete set of photon bunches simulates the camera response of the telescope. 7 In IACTs a run is generally referred to as a data taking (lasting roughly half an hour) performed on a given target under the same conditions throughout the entire observation. 8 On the one hand it is important to train the classifier to maximize the separation between the signal and background, and on the other it is also crucial to avoid overtraining (also referred to as overfitting), i.e., avoiding the classifier to characterize statistical fluctuations from the training samples wrongly as true features of the event classes. 9 One can check that by computing √ S one would get the same value of Equation (16). That is because S is a χ 2 random variable. 10 See for instance Ref. [42] for a review of the problem regarding the choice of the priors. 11 Indeed one can check that Equation (26) (27) vanishes. 13 PSF stands for Point Spread Function. See Section 4 for its definition. 14 The CDF of a χ 2 distribution with 1 degree of freedom is 0.68 for χ 2 = 1 and 0.95 for χ 2 = 3.84. 15 Another way to choose s 1 and s 2 is to guarantee that the mean is the central value of the interval [s 1 , s 2 ]. In principle, one is free to pick up infinitely intervals from the constraint given by Equation (41). A more detailed discussion on this topic can be found in Ref. [65]. 16 Generally the performance of the energy and direction reconstruction only depends on the event true energy and arrival direction, which justifies the assumption that E r andn r are conditional independent variables.