Evolution of Data Formats in Very-High-Energy Gamma-ray Astronomy

Most major scientific results produced by ground-based gamma-ray telescopes in the last 30 years have been obtained by expert members of the collaborations operating these instruments. This is due to the proprietary data and software policies adopted by these collaborations. However, the advent of the next generation of telescopes and their operation as observatories open to the astronomical community, along with a generally increasing demand for open science, confront gamma-ray astronomers with the challenge of sharing their data and analysis tools. As a consequence, in the last few years, the development of open-source science tools has progressed in parallel with the endeavour to define a standardised data format for astronomical gamma-ray data. The latter constitutes the main topic of this review. Common data specifications provide equally important benefits to the current and future generation of gamma-ray instruments: they allow the data from different instruments, including legacy data from decommissioned telescopes, to be easily combined and analysed within the same software framework. In addition, standardised data accessible to the public, and analysable with open-source software, ensure fully reproducible results. In this article we provide an overview of the evolution of the data format for astronomical gamma-ray data, focusing on its progression from private and diverse specifications to prototypical open and standardised ones. The latter have already been successfully employed in a number of publications, paving the way for the analysis of data from the next generation of gamma-ray instruments, and for an open and reproducible way of conducting gamma-ray astronomy.


Introduction
Gamma-ray astronomy, currently observing the non-thermal universe over more than seven decades in energy, is conducted with different classes of instruments operating in two complementary energy ranges [1]. Space-borne telescopes, sensitive in the so-called high-energy regime (HE, 100 MeV < E < 100 GeV), directly detect the gamma rays through their pair-conversion in an instrumented volume [2]. Ground-based telescopes, sensitive in the so-called very-high-energy regime (VHE, E > 100 GeV), detect the particle cascade (or shower) generated by gamma rays interacting with atmospheric nuclei (via e± pair production and bremsstrahlung) using two different techniques [3]. Imaging Atmospheric Cherenkov Telescopes (IACTs) use a large reflector (∼ 10 m) and a photomultiplier camera to image the Cherenkov light emitted by the charged component of the shower. Particle samplers rely on an array of detectors (distributed over a surface up to ∼ 1 km²) to directly sample the charged component using, for example, scintillators or water tanks in which the shower particles produce detectable Cherenkov light (the latter are known as water Cherenkov detectors, WCD).

The remainder of this article is organised as follows. In Sect. 2 we describe the data formats for VHE gamma-ray astronomy and their evolution from private specifications to open and standardised ones. In Sect. 3 we review some projects that have already successfully employed the format, to validate the capabilities of the science tools, to illustrate the possibility of multi-instrument analysis with current gamma-ray instruments, and to extend the format to particle samplers. In Sect. 4 we gather some ideas for the future of the format and its possible expansion. We provide our conclusions in Sect. 5.
Data formats for very-high-energy gamma-ray astronomy

Background: data model in the current generation of VHE instruments

VHE gamma-ray astronomy inherited, along with the hardware techniques, the software solutions of particle physics. In the late 1990s and early 2000s, C++ and the ROOT [20] framework dominated the field; hence, software for VHE data reduction and analysis has mostly been built in this environment. As already mentioned, even if some of these tools are accessible, little documentation is publicly available about the private analysis chains and the data they produce. Nonetheless, from the available material, a common data reduction workflow can be inferred for VHE gamma-ray telescopes, sketched in Fig. 1.
In the case of an IACT, the raw output of the data acquisition typically consists of binary files containing the waveforms of all the camera pixels, sampled at the occurrence of a trigger event. The raw data are reduced to a list of quantities per pixel (e.g. charge and arrival time), aggregated in so-called calibrated files with sizes of several GB for each observational run, typically lasting ∼ 30 min (in what follows, the sizes indicated for each data level are taken from [21] and thus refer to VERITAS; similar figures for MAGIC are reported in [22]). The Cherenkov light of the shower typically illuminates only a few pixels in the camera; this pixelated image, representing the distribution of the Cherenkov photons, can be parametrised with simple geometrical quantities [23] connected to the shower properties. Image parameters can be fed, at the next data level, to algorithms estimating these properties (e.g. energy and direction of the primary) and classifying the showers initiated by gamma rays against those initiated by cosmic rays, the irreducible background of ground-based gamma-ray telescopes. In the case of particle samplers such as WCD, the data reduction workflow is similar, but instead of camera images, the information is extracted from the pattern of the charge deposited by the shower across the array, as well as from its time evolution. Raw parameters derived from this charge distribution are fed into reconstruction algorithms that, in turn, estimate the relevant shower parameters, like those mentioned above (see [24] for an overview of the HAWC data reduction pipeline). Having estimated the properties of the shower and of the primary particle generating it, a list of gamma-ray candidates can hence be assembled at the next data level.
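To make the image-parametrisation step concrete, the following minimal sketch (in Python, with hypothetical pixel coordinates and charges as inputs, and not any collaboration's actual pipeline) computes the charge-weighted moments of a camera image, in the spirit of the geometrical parameters of [23]:

```python
import numpy as np

def image_parameters(x, y, q):
    """Charge-weighted moments of a pixelated shower image.

    x, y : pixel coordinates in the camera (deg)
    q    : integrated charge per pixel (photo-electrons)
    Returns the image centre of gravity and its width/length.
    """
    w = q / q.sum()
    x0, y0 = np.sum(w * x), np.sum(w * y)        # centre of gravity
    cov = np.cov(np.vstack([x, y]), aweights=q)  # second moments of the image
    eigvals = np.linalg.eigvalsh(cov)            # principal axes, ascending order
    width, length = np.sqrt(np.abs(eigvals))     # RMS size along minor/major axis
    return x0, y0, width, length
```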
At this stage, the information stored within the data products, generally denoted as high-level, is independent of the detection technique as well as the calibration and analysis methods. High-level data typically consist of a list of gamma-ray events along with a parametrisation of the response of the system, the so-called instrument response function (IRF). The latter provides the information necessary to perform a statistical analysis estimating, for example, the significance of the signal, the flux spectrum or the light curve of the source, which we refer to as science products.
All along the current-generation closed-source analysis chains, the data, progressively reduced, are stored in the format associated with the ROOT framework, with each collaboration reiterating the effort of defining custom specifications for a data model that shares several commonalities between different experiments. Moreover, even if readable via ROOT, the content of these data products cannot be interpreted by a non-expert analyser. There are noticeable efforts to provide analysis tools wrapping these diverse analysis software, like the Multi-Mission Maximum Likelihood framework [25]. The ultimate limitations of these tools are, though, the willingness of the experiments to expose their closed-source software and data format, and the necessity to implement a new plug-in for each of the instruments considered.

Figure 1. Schematisation of the progressive data reduction and data levels of an IACT. Raw data contain the signal sampled from the photomultipliers at the occurrence of a trigger event (Data Level 0). Calibrated data (Data Level 1) contain the pixelated image of the Cherenkov light of the shower. The latter can be parametrised with a few geometrical quantities and used to determine the observables of the original shower, including its probability of being a gamma-ray shower (Data Level 2). The detected events can be gathered in a list of gamma-ray candidates, together with the functions representing the response of the system (the so-called instrument response function, IRF), e.g. the collection area of the system as a function of energy or the bias of its energy reconstruction (Data Level 3). This information can be used to perform a statistical analysis obtaining the so-called science products, in this case the spectrum of the source (Data Level 4).
Without a common data model or a general software tool oriented to external users, the current generation of VHE instruments faces different concerns on different time scales. At present, multi-instrument analyses simply cannot be performed within a common analysis framework using their proprietary data products. As for the future, as the end of their operations approaches, it is worth starting to consider the access to the wealth of data they have gathered. If their legacy data are to be made public, then a release in their original format will also necessitate a release of the analysis software, which in turn has to be maintained. Besides not being designed for use by a large community, this software may rely on libraries that will eventually become deprecated.

GADF: A unifying effort
In the second half of the 2010s, partly to prototype the high-level data format of the forthcoming CTA and partly to exploit the newly available open-source data-analysis software like ctools and Gammapy, VHE astronomers started to explore several software-independent implementations of these high-level data. In 2016, in order to coordinate the parallel efforts and to foster the definition of a common and standardised data model, the Data Formats for Gamma-ray Astronomy forum (referred to in short as the "gamma astro data formats", GADF) [26] was established. A community-driven initiative, the GADF consists of documentation [27] hosted on GitHub [28] (Fig. 2), specifying the naming scheme, the content, and the metadata of the files containing high-level gamma-ray observations. Though high-level products are the focus of the initiative, specifications for science products are also under discussion. The documentation, openly provided under a Creative Commons Attribution 4.0 license, evolves with the typical GitHub workflow: any interested user can propose changes via issues, which are discussed among the active members of the initiative and implemented via pull requests that are ultimately merged once a consensus is reached. Despite the bias towards IACTs, the flexible development of the format allows it to accommodate data from other types of instruments, such as space-borne telescopes or WCD. The format has achieved a stable definition and already counts two minor releases, the current one being 0.2 [29].

Format specifications
This section illustrates the guiding principles adopted in the development of the GADF specifications, gives an overview of their actual content, and highlights the features that make them generalisable to different gamma-ray instruments. The first version of the GADF was designed for IACTs, since the major contributors were VHE astronomers preparing for CTA. The data model and the breakdown of the data levels foreseen for CTA are presented in [30], introducing the following naming convention (see also Fig. 1): the raw output of the data acquisition is defined as data level 0 (DL0); calibrated files as data level 1 (DL1); reconstructed shower parameters as data level 2 (DL2); sets of selected gamma-ray events and the instrument response as data level 3 (DL3); science products (spectra, light curves, sky maps) as data level 4 (DL4); and observatory results, such as catalogues, as data level 5 (DL5). This nomenclature is used within the GADF and will also be adopted in the following text.
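For reference, this naming convention can be condensed into a short Python sketch; the following enum simply mirrors the definitions given above:

```python
from enum import IntEnum

class DataLevel(IntEnum):
    """CTA data-level nomenclature, as introduced in [30]."""
    DL0 = 0  # raw output of the data acquisition
    DL1 = 1  # calibrated files
    DL2 = 2  # reconstructed shower parameters
    DL3 = 3  # selected gamma-ray events and instrument response
    DL4 = 4  # science products: spectra, light curves, sky maps
    DL5 = 5  # observatory results, e.g. catalogues
```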
As the GADF is currently the only provider of standardised specifications for high-level VHE gamma-ray data, science tools such as ctools and Gammapy base their data structures on them. Compatibility with open-source data-analysis software is not the only objective of the standardisation effort. One of the guiding principles of the GADF is to produce data whose content is clearly documented and easy to interpret. The file format chosen to host the data is the Flexible Image Transport System (FITS) [31], a long-standing standard in astronomy at all wavelengths. Another fundamental requirement in the design of the data specifications was to rely as much as possible on well-established standards already used in the production of other FITS files, such as those by the missions gathered under NASA's High Energy Astrophysics Science Archive Research Center (HEASARC) [32]. NASA's Office of Guest Investigator Program (OGIP) FITS working group [33] already disseminates recommendations on FITS data production to the high-energy astrophysics community. These include standards on metadata keyword usage, on the storage of time information, and on the representation of response functions, which the GADF extensively follows. The adherence of the GADF to widely used standards ensures additional compatibility with tools already in use by high-energy astrophysicists, such as the FTOOLS [34].

As pointed out, the aim of the GADF initiative was to produce specifications for high-level data; therefore, it mostly focuses on the DL3. Nonetheless, the forum discusses data levels higher than the DL3. For example, the OGIP spectral file format [35] is adopted to represent VHE gamma-ray one-dimensional (energy-dependent) spectral data. The compatibility with the OGIP standards ensures that DL3 products can be reduced to spectral data digestible by other established multi-mission analysis tools such as sherpa [36,37]. Prototypical specifications for DL4 (such as sky maps, flux points and light curves) are under discussion and not yet stable.
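As an illustration of this compatibility, the following minimal sketch (the file name is hypothetical) fits a power law to a VHE counts spectrum stored as an OGIP spectral file with sherpa; the associated response files are located through the keywords in the PHA header:

```python
from sherpa.astro import ui

ui.load_pha("crab_obs1_pha.fits")  # OGIP counts spectrum; loads linked ARF/RMF/background
ui.set_source(ui.powlaw1d.pl)      # analytical power-law spectral model
ui.fit()                           # forward-folds the model through the response
ui.plot_fit()                      # inspect counts and folded model per energy bin
```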

GADF DL3 data
The DL3 is the data level that contains a list of gamma-ray event candidates and the response of the system. All the information in the DL3 files is therefore post-calibration, i.e. it already incorporates all the low-level information related to the detector (calibration, gain corrections, digital-count-to-photo-electron conversion), which is hence omitted. A FITS file consists of a sequence of segments, called header data units (HDUs). Each HDU is composed of a header unit, typically containing metadata, and a data unit, containing an n-dimensional array (an image) or a table (in ASCII or binary format). All data units in DL3 files are stored as binary tables.
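In practice, the HDU structure of a DL3 file can be inspected with a few lines of Python; the sketch below uses astropy and a hypothetical file name, with the EVENTS extension name following the GADF specifications:

```python
from astropy.io import fits

with fits.open("hess_dl3_obs_20136.fits.gz") as hdulist:
    hdulist.info()                       # name, type and shape of each HDU
    events = hdulist["EVENTS"].data      # binary table: one row per event candidate
    metadata = hdulist["EVENTS"].header  # header unit with the observation metadata
```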
One of the file extensions contains the event list and, in the associated data unit, a flat table with a column for each event property (see Fig. 3). In the current specifications, columns listing the event identification number (in the DAQ system), energy, sky coordinates (right ascension and declination) and timestamp are mandatory. Optional columns might include the results of the classification algorithms (e.g. a gammaness score) and quantities related to the reconstruction (e.g. image or shower parameters). Each file corresponds to a single observing run; therefore, the header unit of the events extension contains the identification number of the data acquisition run, the type and number of telescopes used in the observation, and information about the location of the instrument and its observation mode, along with the time and duration of the observation. Another HDU is dedicated to a list of good time intervals (GTI), specifying the time periods within the event list with adequate scientific quality.
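A minimal sketch of such an event list, assuming the GADF column names (EVENT_ID, TIME, RA, DEC, ENERGY) and the OBS_ID header keyword, with hypothetical event values, can be written with astropy as follows:

```python
import numpy as np
from astropy.io import fits

columns = [
    fits.Column(name="EVENT_ID", format="K", array=np.array([1, 2, 3])),
    fits.Column(name="TIME", format="D", unit="s", array=np.array([0.1, 4.2, 9.7])),
    fits.Column(name="RA", format="E", unit="deg", array=np.array([83.63, 83.70, 83.58])),
    fits.Column(name="DEC", format="E", unit="deg", array=np.array([22.01, 22.10, 21.95])),
    fits.Column(name="ENERGY", format="E", unit="TeV", array=np.array([0.5, 1.2, 3.4])),
]
events = fits.BinTableHDU.from_columns(columns, name="EVENTS")
events.header["OBS_ID"] = 20136  # identification number of the observing run
fits.HDUList([fits.PrimaryHDU(), events]).writeto("events.fits", overwrite=True)
```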
The response of the system is needed to properly relate the reconstructed events with the astrophysical source properties. It is assumed that this response can be factorised into different components. The components considered are: the effective area, describing the acceptance of the system to gamma-ray events; the energy dispersion (or migration matrix), describing the probability distribution of the energy estimator; and the point spread function (PSF), describing the probability distribution of the direction estimator. The background rate (measuring the rate of cosmic-ray events misclassified as gamma rays) might be included among the IRF components, though it is not mandatory. The IRF components depend on observational (e.g. atmospheric conditions, zenith and azimuth angle of the pointing) and physical quantities (e.g. the energy or direction of the showers). The IRF components considered in the format are valid for a single exposure, which is typically defined by constant observational conditions (e.g. zenith range, atmospheric quality, etc.); any such dependency of the IRF is hence considered averaged out. In the current specifications, the dependencies on physical quantities considered are the photon energy and the offset of its position from the centre of the instrument field of view (the response is assumed to be symmetric in the offset coordinate). As an example, Fig. 4 illustrates the energy and offset dependency of the effective area component for a H.E.S.S. observation stored in the GADF DL3 format. IRF components are not stored in flat tables: energy and offset bin edges are stored in separate columns, and a final column contains a multi-dimensional array corresponding to the response in each bin. OGIP specifications are followed in storing both events and IRF components.
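For example, the effective area component can be read back with astropy as in the following sketch; the file name is hypothetical, and the extension name (here AEFF) and column names (ENERG_LO/HI, THETA_LO/HI, EFFAREA) are assumed to follow the GADF conventions, where each IRF table holds a single row of vector columns:

```python
from astropy.io import fits

with fits.open("hess_dl3_obs_20136.fits.gz") as hdulist:
    aeff = hdulist["AEFF"].data
    e_lo, e_hi = aeff["ENERG_LO"][0], aeff["ENERG_HI"][0]    # energy bin edges (TeV)
    th_lo, th_hi = aeff["THETA_LO"][0], aeff["THETA_HI"][0]  # offset bin edges (deg)
    area = aeff["EFFAREA"][0]  # multi-dimensional array over the offset-energy grid (m2)
```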

Projects successfully using the standardised data format
To illustrate the maturity of the GADF standardisation effort, we review, in the following sections, projects that have successfully employed its specifications.

The H.E.S.S. DL3 public data release

The H.E.S.S. collaboration publicly released a first set of GADF-compliant DL3 data [38,39], not only to promote the standardisation effort but also to allow the open-source science tools in development to be tested with actual IACT data. The data release contains 30 h of observations of sources representing different galactic and extragalactic science cases, and 20 h of observations of fields of view empty of known gamma-ray emitters, also labelled as off data, to be used for background estimation. Table 1 summarises the content of this data release.

The joint-crab project
With multi-instrument analyses being one of the main objectives of the standardisation effort, after the first public release of GADF-compliant DL3 data, the natural next step in the format validation was the combination of data from different experiments. In the so-called joint-crab project [40], Crab Nebula observations from Fermi-LAT and four of the currently operating IACTs (H.E.S.S., MAGIC, VERITAS and FACT), produced in a GADF-compliant format, were combined in the first multi-instrument and fully-reproducible gamma-ray analysis. Among the datasets used were 7 yr of Fermi-LAT observations, obtained in the custom high-level (DL3) format in which they are publicly released and reduced, before the final statistical analysis, to OGIP spectral data; the IACT observations were instead provided directly in the GADF DL3 format. To illustrate a prototypical analysis example, the Crab Nebula spectrum (Fig. 5, right) was estimated by combining all the observations in an energy-dependent (or one-dimensional) joint binned likelihood. In this analysis technique, classically employed by IACTs, source and background events are extracted via aperture photometry (Fig. 5, left), and an energy-dependent analytical flux model is then folded with the response of the system to estimate the number of counts, maximising the Poissonian likelihood describing the counts in each energy bin. The joint-crab project relied only on open-source software for its statistical analyses (Gammapy). Datasets, scripts reproducing all the analysis steps and tutorial notebooks are publicly provided on GitHub [41], along with a conda environment freezing the exact dependencies used in the paper and a Docker container [42] to guarantee long-term reproducibility. The entire package was also archived on Zenodo [43]. Given the approach proposed and the assets openly made available, this work not only implements the first fully-reproducible gamma-ray analysis but also constitutes the first joint public release of IACT DL3 data.
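The structure of such a joint fit can be summarised in a minimal sketch (not the joint-crab code itself): each instrument contributes its own counts and folded model prediction, and a single set of spectral parameters is estimated by minimising the summed negative log-likelihood. All dataset contents below are hypothetical placeholders for DL3-derived products.

```python
import numpy as np
from scipy.optimize import minimize

def predicted_counts(pars, dataset):
    """Fold a power-law flux model with a (simplified) instrument response."""
    amplitude, index = pars
    energy = dataset["energy_centres"]     # energy bin centres (TeV)
    flux = amplitude * energy ** (-index)  # dN/dE of the source model
    return flux * dataset["exposure"] + dataset["background"]

def joint_neg_loglike(pars, datasets):
    """Sum of the per-instrument binned Poisson negative log-likelihoods."""
    total = 0.0
    for dataset in datasets:
        mu = predicted_counts(pars, dataset)
        total -= np.sum(dataset["counts"] * np.log(mu) - mu)  # log(n!) term dropped
    return total

# datasets = [fermi, magic, veritas, fact, hess]  # dicts of per-instrument arrays
# result = minimize(joint_neg_loglike, x0=[3e-11, 2.5], args=(datasets,))
```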

Analysis of the H.E.S.S. public data release with ctools
The open-source science tools, besides evolving in parallel with the GADF, accept data following its specifications as input. In [44], the H.E.S.S. DL3 DR1 (Sect. 3.1) was used to test the capabilities of ctools, until then mainly used to analyse simulated CTA observations and to calculate prospects for its observational capabilities. The authors presented a method to build a parametric model describing the spatial and spectral distribution of the background events in the H.E.S.S. DL3 DR1. The latter was used to perform a spectro-morphological (three-dimensional) analysis estimating the spectrum of the four sources included in the data release. Differently from the one-dimensional analysis described in Sect. 3.2, the source positions and morphologies are included among the parameters of the model used to estimate the flux. Source and background counts are not separated; rather, the background is included among the components of a model that in this case predicts the flux in the entire field of view, allowing multiple sources to be taken into account at a time (see [18], Sect. 2, for a detailed explanation). This approach has been successfully used by the Fermi-LAT collaboration for all its scientific publications. The results of binned and unbinned three-dimensional likelihood analyses are compared against the simpler one-dimensional binned analysis, also implemented in ctools, and against bibliographic references obtained from the same sources. The consistency of the results obtained with ctools across the different statistical methods applied and with the literature (see Fig. 6, left) testifies to the maturity not only of the science tool, but also of the GADF scheme, which correctly encapsulates all the information needed for the correct reproduction of scientific results. The paper finally illustrates the capability of ctools, being built on the gammalib library [18], to simultaneously analyse gamma-ray data with different specifications, i.e. to analyse Fermi-LAT data in their own high-level format (without the reduction described in Sect. 3.2) together with IACT DL3 data compliant with the GADF specifications.
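The essential difference with respect to the one-dimensional case can be sketched as follows (a minimal illustration, not the actual ctools implementation): the predicted counts cube over energy and sky coordinates is the sum of one template per source plus a background component, and all normalisations are fitted simultaneously to the full field of view.

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglike(norms, counts, templates):
    """Binned Poisson likelihood over an (energy, lon, lat) counts cube.

    counts    : observed counts cube
    templates : list of predicted-counts cubes, one per model component
                (several sources plus the parametrised background)
    """
    mu = sum(n * t for n, t in zip(norms, templates))
    return -np.sum(counts * np.log(mu) - mu)

# counts and templates are hypothetical 3D arrays of shape (n_energy, n_lon, n_lat)
# result = minimize(neg_loglike, x0=np.ones(len(templates)),
#                   args=(counts, templates), method="L-BFGS-B",
#                   bounds=[(1e-6, None)] * len(templates))
```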

Validation of open-source science tools and background model construction in γ-ray astronomy
Expanding on the project described in Sect. 3.3, [45] aims at testing both Gammapy and ctools using the H.E.S.S. DL3 DR1. The results of the one-dimensional and three-dimensional analyses provided by the two science tools are validated against each other. For the three-dimensional analysis, a novel background model is used, not parametrised from the off data within the H.E.S.S. DL3 DR1, but built using ∼ 4000 h of H.E.S.S. private observations. In this work, the results of the science tools are validated not only against the literature, but also against the results obtained with one of the closed-source analysis chains of the H.E.S.S. collaboration, performing a classical one-dimensional analysis on the exact same observations included in the H.E.S.S. data release (see Fig. 6, right). The agreement of the results of the different science tools with each other and with the private analysis chain represents a landmark in the validation of analysis tools and data formats for future VHE gamma-ray analyses.

Open and standardized formats for γ-ray analysis applied to HAWC observatory data
The GADF specifications were primarily developed by and for the IACT community. However, due to their generality, it is possible to use them to format data from WCD, such as the HAWC observatory, as shown in [46]. In this work the authors presented the first GADF-compliant production of event lists and instrument response functions for a ground-based wide-field instrument. These data products were then used to reproduce, with excellent agreement, the published spectrum of the Crab Nebula as measured by HAWC. This result, shown in Fig. 7, was obtained using the open-source software Gammapy. As highlighted in Sect. 3.2, a common data format and shared analysis tools allow multi-instrument joint analyses and effective data sharing. This synergy between experiments is particularly relevant given the complementary nature of pointing and wide-field instruments, and will be especially valuable for the joint scientific exploitation of future observatories such as SWGO and CTA.

Discussion
The future of data formats in gamma-ray astronomy will very likely be linked to the future of the GADF initiative. As discussed throughout the text, this community-driven initiative has proposed the first available set of specifications for high-level data for the current and next generation of ground-based gamma-ray instruments. In this section we discuss the main limitations affecting the current specifications, as well as foreseeable ways in which they will evolve over the next decade.
One of the main drivers of the evolution and improvement of the GADF will be the requirements imposed by the future ground-based observatories. These will require high-level data (and especially the IRFs) to be described and parameterised in more complex ways, directly benefiting the current generation of instruments as well. Possible extensions of the format to meet these requirements could include: a finer field-of-view binning approach, removing the assumption of radial symmetry; the inclusion of time dependency in the IRF components; and the distinction between different event types based on the hardware, reconstruction or analysis settings. Mature format specifications will be crucial for defining and testing the legacy data of current instruments, as they face the challenge of digesting decades of data (taken by instruments with evolving capabilities) and ensuring their proper use and interpretation.
In order to confront these challenges and to ensure the long-term sustainability of the GADF specifications, a more formal governance structure is needed. For this reason, a body of representatives from the high-energy ground-based community will be defined to act as a coordination committee. This governance definition effort, currently in progress, will draw on the experience of similar community-driven initiatives (for instance, the Astropy Project role responsibilities [48]).
Even if the GADF specifications were inspired by high-energy satellites and primarily developed by and for the IACT community, they are able to represent high-level data products from other event-based high-energy astrophysical instruments. As shown in Sect. 3.5, other high-energy gamma-ray observatories such as WCD (like HAWC or the future SWGO) naturally fit the GADF specifications, allowing the use of the available open-source data-analysis tools. In the coming years, the inclusion of other observatories will be explored, especially in the context of high-energy multi-messenger astronomy: including data from neutrino or even gravitational-wave observatories would require some changes to the specifications, but would at the same time naturally enable the use of common science tools for joint multi-messenger analyses.

Conclusions
This review presented an outlook on the evolution of the data format in VHE gamma-ray astronomy, from private and diverse specifications to the open and standardised ones proposed under the GADF initiative. The GADF initiative is presented as a community-driven effort to provide a common and open high-level data format for gamma-ray instruments. The specifications proposed within the GADF refer to the high-level data products from which scientific results are produced: they are independent of the particular detection technique, thus making it possible to accommodate data from different telescopes (e.g. IACT and WCD). The format definition was driven by the requirement to operate the next generation of gamma-ray instruments (such as CTA) as open observatories, with the consequent need to provide non-expert external users with open data products that are easy to interpret. Another aspect of this demand was the development of open-source gamma-ray data-analysis tools, whose evolution is now also linked to the data standardisation effort.
The current GADF specifications have been proven robust by several publications analysing GADF-compliant data with these open-source science tools and validating their results against those obtained with the established closed-source software in use by the current collaborations. These publications confirmed not only the correctness of the information incorporated in the format specifications but also the capabilities of this new generation of open-source science tools. Other publications have instead proven the feasibility of multi-instrument and fully-reproducible analyses once the common format and open software are used. Even if future instruments are driving the open data and software development, the current generation can significantly benefit from these advancements: their adoption ensures a larger user and maintainer base for the legacy data of current instruments, and eventually more sophisticated data storage and analysis techniques. The H.E.S.S. collaboration has already pioneered a first public release of GADF-compliant data. All currently operating VHE gamma-ray experiments are nowadays able to produce GADF-compliant data products, though for the moment these have mostly been used internally. Multi-instrument scientific projects using these data products are under way, sharing data among collaborations through memoranda of understanding.
The standardisation effort remains open to the inclusion not only of more gamma-ray instruments but also of telescopes observing the universe through other messengers. With the initiative being community-driven, high-energy astrophysicists in need of new extensions to the format are able to propose them. The recent efforts reviewed here, successfully employing GADF-compliant data and open-source analysis tools, will surely foster their usage in further scientific projects. The GADF does not represent an isolated effort and aims at maintaining compatibility with other established standards in high-energy astronomy, like the OGIP standards (on which the GADF largely draws) or those used for high-level products within the Virtual Observatory [49]. Promoting the use of open-source analysis tools as well as common open data formats will distinguish high-energy astrophysics in the future as one of the few branches of modern science unaffected by the reproducibility crisis troubling many other disciplines [50].