Analysing the Protein-DNA Binding Sites in Arabidopsis thaliana from ChIP-seq Experiments

: Computational genomics aim at supporting the discovery of how the functionality of the genome of the organism under study is affected both by its own sequence and structure, and by the network of interaction between this genome and different biological or physical factors. In this work, we focus on the analysis of ChIP-seq data, for which many methods have been proposed in the recent years. However, to the best of our knowledge, those methods lack an appropriate mathematical formalism. We have developed a method based on multivariate models for the analysis of the set of peaks obtained from a ChIP-seq experiment. This method can be used to characterize an individual experiment and to compare different experiments regardless of where and when they were conducted. The method is based on a multivariate hypergeometric distribution, which ﬁts the complexity of the biological data and is better suited to deal with the uncertainty generated in this type of experiments than the dichotomous models used by the state of the art methods. We have validated this method with Arabidopsis thaliana datasets obtained from the Remap2020 database, obtaining results in accordance with the original study of these samples. Our work shows a novel way for analyzing ChIP-seq data.


Introduction
Computational genomics consists of the use of a wide range of mathematical tools, implemented in specific software, in order to solve challenges such as how the functionality of the genome of the organism under study is affected both by its own sequence and structure, and by the network of interaction between this genome and different biological or physical factors (proteins, metabolites, molecular complexes, electromagnetic radiation, etc.).
One of the main types of experiments included in this field is the so-called chromatin immunoprecipitation (ChIP) experiment [1], which aims to identify and localize in vivo all the binding sites of a given DNA-binding protein throughout the genome of an organism, tissue, or cell line subjected to a specific biological condition (e.g. "wild type" or "stress"). Subsequent bioinformatics analysis of the results of this experiment will be carried out to elucidate the biological implications of the binding protein on the organism under study [2]. ChIP-seq experiments [3] consist of a first ChIP phase in which the immunoprecipitated fragments of the DNA molecule (with a length of between 150 and 1000 nucleotides) to which the protein under study has been attached (hereafter referred to as target protein) are enriched over the immunoprecipitated fragments corresponding to the rest of the genome. This is followed by a phase of identification of these fragments in two steps.
The first one is their initial segmentation into subfragments with a length between 30 and 150 nucleotides and subsequent sequencing of their nucleotide chain (seq) through next generation sequencing (NGS) techniques, which, by means of "base calling" algorithms, obtains the so-called "reads" (sequenced subfragments). In a second step, the reads obtained are mapped onto the reference genome of the organism under study, i.e., the identification of the site(s) where the read is statistically presumed to belong, including both the regions to which the target protein binds and those of the rest of the genome. Once the reads obtained from the experiment have been mapped, "peak calling" algorithms apply statistical enrichment analysis to identify the binding sites of the protein in the genome, which are called "peaks", referring to the shape of the distribution of mapped reads on these regions. Thus, peaks are the enriched regions of mapped reads, and they are considered as the hosts of possible binding sites.
The following are the main challenges facing methods or algorithms used in computational genomics in general and among them those focused on ChIP-seq experiments: (i) algorithms have to process a huge amount of data; (ii) the data have a degree of uncertainty associated [4,5], which is generated and accumulated by the occurrence of technical errors, biases in the algorithms used, the nature of the data being worked with and other unknown and therefore uncontrollable factors; (iii) the inherent complexity of the genomic and biochemical information represented by an extensive network of interacting nodes to carry out complex biological processes within a cell, tissue, or organism.
One of the most important stages is elucidating the biological implications of the peaks obtained in a ChIP-seq experiment, which has the following specific challenges: (i) the noise accompanying each peak, where the length of the peak is on average about 20 times longer than the actual site where the protein under study binds; (ii) obtaining peaks that do not actually harbor any binding site for the target protein; (iii) to establish correct relationships between the peak obtained and the functional elements of the genome. The state-of-the-art methods use heuristic methods in some cases, statistical methods in others, and both at the same time. Although there are two common features of all the tools developed so far, one of them consists of modeling and treating each peak obtained [6][7][8][9] and each functional element [10] belonging to the genome under study individually, while the other consists of working with dichotomous models, where models are created containing regions of different extension, with two possible values, and which in most cases lack a mathematical formalism in the generation of the models used.
The main objective of this work consists of the design, implementation and validation of an analysis method to overcome the two main challenges seen above. Our method treats all the peaks of a ChIP-seq experiment as if they were the result of a random experiment where a multivariate hypergeometric model can be defined. This method reflects the behavior of the target protein, taking into account each and every one of the peaks generated along the entire genome where they have been mapped, as well as all the possible binding sites that this genome harbors.
Our method addresses the two challenges as follows: (i) The uncertainty that is generated at each stage throughout the whole process is handled by applying statistical techniques and applying a standardized methodology, in which each step has a precise, complete definition and setting of the parameters; (ii) the complex nature of the biological data to be interpreted is dealt with through the use of multivariate probabilistic or background models that accurately represent the entire sample space being analyzed; in this way, adjusting the dimensionality of each model (statistical variables) to this complexity. Thereby, the proposed method permits to extract precise information available from the set of peaks obtained from a ChIP-seq experiment, as well as to compare the results obtained from different experiments. The characteristics of this method are: (i) flexibility, allowing the analysis of ChIP-seq experiments for any type of organism, tissue, or cell line, as long as there is sufficient information on the configuration and structure of its genome and its component genes; (ii) extensibility, it allows for the incorporation of new information that is generated over time, thus improving the method; (iii) mathematical formalization, imple-mented both in the generation of the probabilistic models and in the statistical techniques used on them.
Finally, we have validated the method using information from public repositories, both to obtain the datasets of peaks from ChIP-seq experiments and the information necessary for the modeling of the genome of the organism under study, which in this case is Arabidopsis thaliana, a well-characterized plant widely used as an experimental model. The results obtained show the capacity of our method to analyze these types of experiments from a point of view that has not been considered until now.

General Overview
Our basic assumptions are that the results of a ChIP-seq experiment (set of peaks) are a random vector following a multivariate hypergeometric distribution [11][12][13], and we can model the genome according to the expected characteristics of this type of random experiment.
Let S be a finite population formed by m elements which are classified into k mutually exclusive classes, i.e., each element belongs to one and only one of the k classes. Let S i be the subpopulation of all the elements of the ith class, being m i its subpopulation size (i = 1, 2, . . . , k) and m = ∑ k i=1 m i . Then, the random experiment consisting in drawing without replacement n elements of S is represented by the random vector X = (X 1 , . . . , X k ), where each X i denotes the number of elements of the S i class in the sample. The random vector X follows a multivariate hypergeometric distribution with parameters n, (m 1 , . . . , m k ) and whose joint probability mass function is given by where 0 ≤ x i ≤ m i and ∑ k i=1 x i = n. From (1), the ith component of X has a univariate hypergeometric distribution with parameters (n, m i , m), X i ∼ H(n, m i , m) for i = 1, 2, . . . , k. In reality there are only k − 1 distinct marginal random variables, since ∑ k i=1 X i = n [14], and for j = i with j = 1, 2, . . . , k, their means, variances, and covariances are and In our case, S is the finite population of all individual binding sites with a given length, that the genome can house. Each individual binding site must satisfy two requirements: (i) to belong only to one of the S i classes of k functional regions (promoters, upstream, exons, . . . ) assigned, being also named the annotated binding sites; (ii) to have the same probability of being chosen. Finally, E is the set of n peaks obtained from a certain ChIP-seq experiment. To convert an element of E (peak) into an element of S (annotated binding site) with its corresponding assigned functional region S i , its length has to be adjusted to the one given for the elements of S in a process called peak standardization that will be seen below. The finite population S is represented by the background model, which is derived from the genome model. Otherwise, once processed, the set E is represented by the ChIP-seq model.
The generation of the necessary models for data processing consists of two parallel and independent tracks ( Figure 1). On the one hand, the genome model (yellow) contains the necessary information, collected from available resources regarding the structure and organization of (i) both the genome and the genes to be considered, and (ii) the tissue or organism associated with the experiment whose results are going to be analyzed. This information is transformed from a series of parameters for subsequent use in the annotation process, generating the background model (blue). This model is composed of the annotated binding sites, which have been assigned a class of the functional regions that make up the genome model. On the other hand, a peak standardization process (Appendix A) is carried out on the set of peaks obtained from a given ChIP-seq experiment (ChIP-seq dataset). The peaks are standardized with a given length and are grouped in the ChIP-seq model. This length must be equal to the length configured in the background model. In the annotation process, each standardized peak is assigned a single class of functional region according to the annotated binding site with which it overlaps, becoming known as the annotated peaks. The set of all annotated peaks obtained from a specific ChIP-seq model according to a specific background model is called ChIP-seq annotated.
Finally, we obtain the characteristic profile, which includes information on the frequencies, expected values, standard deviations and Z-scores, for this annotated ChIP-seq. The characteristic profile depends on the count of the annotated peaks obtained according to the type of functional region class with which it has been annotated and the background model used in this annotation process. Further analyses are based on this information: (1) a goodness-of-fit test based on the multivariate hypergeometric distribution determines whether the distribution of the annotated peaks associated with the sample can be considered to belong to the population represented by the background model used in the annotation process; (ii) a test of homogeneity based on the multivariate hypergeometric model determines whether the distributions of two ChIP-seq annotated follow the same model.

Model Dimensionality
The dimensionality of the model is given by the number of different classes of functional regions assumed to be part of the genome of the organism under study. These functional regions will depend mainly on the biotype (e.g., protein-coding, non-coding RNA, pseudo-gene or jumping gene) of the genes whose locus in the genome provides a functional product, which will be included in the genome model. Although in this work only protein-coding genes, whose functional product is a protein, have been considered, genes of different biotypes could also be included, either simultaneously or separately, depending on the objective of the research or in order to refine the genome model according to the target protein under study. For this work, a protein-coding gene has been considered to consist of a promoter region [15], a proximal region [16], splice regions including donor, and acceptor regions [17], a cleavage region [18], exons, and introns. Figure 2 shows both the layout as well as the extension from the considered reference point of all functional regions mentioned above. In addition to the functional regions directly related to the different gene biotypes, there are others where their own nucleotide sequence has a characteristic function without encoding any product. These would be the enhancers, insulators, and so on. These types of functional regions are distributed throughout the genome and can also be considered in the model.  It should be noted that protein-coding genes, in the case of eukaryotic organisms, such as Arabidopsis thaliana, can harbor different transcripts (product of the transcription process from the DNA molecule to the RNA) that include different exons, and therefore present different arrangements of the respective functional regions. This method represents for each gene a single transcript, which we call the canonical transcript, and it is usually the most representative transcript of the gene, although a tissue-or cell line-specific transcript, if available, could be chosen.

Genome Model
As mentioned above, this model accounts for all the existing knowledge of the genome of the cell line, tissue, or organism on which the ChIP-seq experiment was performed. This information is collected from available databases. It aims at describing the genome regions to which the experiment reads can be mapped and how they are organized (autosomal, sex, or mitochondrial chromosomes, plasmids, . . . ). Note that the greater the knowledge about the entity to be modeled, the more accurate the model becomes. Therefore, the regions without defined sequences, called gaps, are taken into account. Once the genome fragments with defined sequence have been identified, the next step is to locate the different classes of functional regions according to the genes that are considered to be part of the genome. In the genome model, the intergenic region is the default one, and it includes regions of the genome that have not been annotated with any kind of functional region and represent either regions that have no functionality (at least of interest for the analysis) or regions whose functionality is unknown. In this work, we assume a haploid genome model regardless of the organism's karyotype, because the state-of-the-art mapping and base calling algorithms do not discriminate between the alleles of homologous chromosomes [19]. However, if peaks falling on homologous chromosomes could be distinguished, they could be represented in this model.

Background Model
An annotated binding site is a region located in the genome with a specific length. It is defined by the chromosome to which it belongs, the start and end chromosome coordinates, the strand where it is located, and the annotation of the functional region class assigned to it. This annotation is the result of an exhaustive and mutually exclusive process through which each annotated binding site is assigned to one functional region class of the genome model under study, or the intergenic class. From the point of view of a given genome model, the background model is the set of all the annotated binding sites that it can host, according to a set of parameters. These parameters allow different background models to be defined for the same genome model. From the point of view of a target protein, a background model is the population of all annotated binding sites to which it could randomly bind.
The setting of a background model involves the following attributes: • The sample space determines which portion of the genome model will be part of the background model: the whole genome model, one or several chromosomes, or even a portion of them; • The peak length sets the length of the annotated binding sites. This must match the standardized peaks of the ChIP-seq model to be analyzed through this background model. This makes all annotated binding sites equiprobable; • The overlap rate is the percentage that determines the minimum number of nucleotides of an annotated binding site that must overlap over the extent of a given functional region for its class to be assigned to it. The higher the value sets, the more restrictive the background model will be; • The priority rule resolves cases where two or more classes of different functional regions overlap with each other, determining the class of functional region that should be assigned to an annotated binding site. There is mutual exclusion between the different classes; • The strand determines the functional regions that are considered in the generation of the background model, according to the strand in which they are located. Its value may be forward, reverse, or both. The latter takes into account all functional regions regardless of the strand where they are located. It represents the flexibility of the method. The higher the value of the overlap rate attribute and the lower the value of the peak length, the more accurate the results of the analysis will be. Nevertheless, the settings of these parameters must be consistent with the accuracy and reliability of the peaks obtained in the ChIP-seq experiment under analysis.

ChIP-seq Model
We define a ChIP-seq model as the set of all standardized peaks derived from the peaks obtained from a certain ChIP-seq dataset through a standardization process (Appendix A). The purpose of a standardized peak is to specify both the location and the extent of the binding site that it depicts as accurately and reliably as possible. In order to achieve this, it is necessary to take into account both the discrepancy between its length and that of the real binding site, the so-called summit of a peak most likely being the coordinate of the peak center.
The features of a standardized peak are: • Location, whose values in the genome are chromosome, strand and start and end, both expressed in chromosomal coordinates; • Peak center, which is determined by the standardization process; • Peak length, which may be either manually estimated by the researcher according to prior information about the target protein under study or automatically computed by motif discovery algorithms according to information gained by this analysis itself.
For each match between the set of standardized peaks of a ChIP-seq model and the annotated binding sites of a particular background model, an annotated peak is obtained.
An annotated peak has the additional attribute Annotation, which indicates the class of functional region or alternatively the class "intergenic" that has been assigned to it. Note that the value of this attribute relies on the background model used.

Analysis Models
A ChIP-seq model annotated through a certain background model can be treated as the result of a random vector following a multivariate hypergeometric distribution, and hence different analyses can be applied. On the one hand, each ChIP-seq model can be characterized by the number of annotated peaks in each class, by providing a characteristic profile of the binding sites. A goodness-of-fit test may be developed to determine the degree of similarity between the distribution of functional regions obtained from the ChIPseq model (observed frequencies) and that of the background model used for its annotation as reference (expected frequencies). Furthermore, an homogeneity test can also be applied to analyze the similarity between the distributions of two different ChIP-seq models.

Characteristic Profile
The characteristic profile of a ChIP-seq model is obtained by counting how many standardized peaks are annotated per class, according to a random vector X with a multivariate hypergeometric distribution (1). For each ChIP-seq model, a profile includes a set of values, such as the absolute and relative frequencies and their confidence intervals (CI), expected values, standard deviations (sd), and Z-scores, whose experimental results have been summarized in Appendix B (Tables A1-A8).
In order to obtain a more accurate standardized profile of a protein-DNA binding sites from a cell line, the ratios of bindings over the functional regions p i = m i /m are estimated from all ChIP-seq experiments available under the same target protein and cell line by regarding the relative frequencies R i = X i /n as k − 1 discrete random variables, since ∑ k i=1 R i = 1 and from (2): and hence, the method of moments yields consistent and unbiased estimators of the ratios of bindings, p i = r i [20,21]. Furthermore, a non-parametric bootstrap resampling approach [22] is applied to estimate the empirical variability of the ratio of each functional region so as to calculate the 95% CIs for the ratios of the peak counts into each region by using 10,000 bootstrapped replicates. In particular, for each functional region S i , the percentile method was applied to calculate the 95% CI for the relative frequency by selecting the 2.5th and 97.5th percentiles of its 10,000 bootstrapped replications, i.e., the r (250) and r (9750) values of R i , where r (j) represents the jth value in increasing order. The distribution of the relative frequencies, along with the 95% CIs ( Figures A1A-A8A in Appendix B), represents the profile of the target protein under study with respect to the sites where it is located throughout the selected portion of the genome of the given cell line under a specific biological condition.
In addition, for the functional region S i according to a ChIP-seq annotated through a given background model, a Z i -score can be obtained by standardising the ith component of X from (2), or equivalently, by standardising the ith relative frequency R i from (3), as follows being E(Z i ) = 0, Var(Z i ) = 1, and the Z-scores' covariances are given by The Z-scores obtained for the functional regions quantify the preferences that the target protein has for each one of these classes in such a particular ChIP-seq model ( Figures A1B-A8B in Appendix B).

Goodness-of-Fit Test
Considering the multivariate hypergeometric distribution of the peak counts into the functional regions on the genome as the reference model (background model), i.e., the null hypothesis this test measures the fit of the observed peak counts (ChIP-seq model) to the expected peak counts (background model), which analyzes whether their differences were by chance.
For population size m large relative to n, sampling without replacement is closely approximated by sampling with replacement. Thus, the joint modeling of functional regions will be analyzed by assuming independence among the Z i -scores (4) due to large m, and in addition, each Z i may be approximated by a standard normal distribution [23]. Other approximations related to the hypergeometric distribution or transformations of it have been studied in [20,[24][25][26][27].
In this first approach, the statistic T defined by the sum of squares of Z i -scores can be approximated by a chi-squared distribution with k − 1 degree of freedom: which is similar to the multinomial test of Pearson, by using the binomial approximation to the hypergeometric given by Sandiford [28] for each Z i -score. Some alternative statistics and corrections for continuity have been discussed in the literature, e.g., see [14,23,24,[26][27][28][29][30], although the difference is negligible for large m [24]. Moreover, the maximum, range and rate of the extreme order statistics can also be used for detecting outliers and extreme values [27,31]. From the result of this test (p-value), the degree of similarity between the corresponding ChIP-seq model and its associated background model can be quantified. Thus, the decision can be made on the randomness (or not) of the binding sites where the target protein has been located, i.e., whether the distribution obtained from the classes of the annotated peaks is what would be expected by chance according to the background model used, or otherwise whether there is a specific functional region whose frequency is significantly different from the expected.

Test of Homogeneity
From two ChIP-seq experiments, this test measures the difference between the observed peak counts into the same set of k functional regions under an underlying multivariate hypergeometric distribution, i.e., the null hypothesis Hence, each Z ji -score associated with Y ji given by (4) depends on the parameters (n j , m ji , m) for j = 1, 2, which can be reduced to (n j , m i , m), since m 1i = m 2i under the null hypothesis. Thereby, the mean and variance of Y ji can be written as where the subpopulation size m i is estimated from the observed values y 1i + y 2i of Y 1i + Y 2i ∼ H(n 1 + n 2 , m i , m). Thus, the maximum likelihood estimate of m i is an integer value in the interval i.e., m i is between the greatest integer less than or equal to u i and the lowest integer greater than or equal to l i , which can be found by using the algorithm of Oberhofer and Kaufmann [32]. In particular, if m/(n 1 + n 2 ) is an integer, the maximum likelihood estimates m i can be expressed as (see [20,32,33]) which are as the ones obtained by the method of moments [20,21]. Therefore, under the same assumptions of Section 2.3.2, the statistic T defined by the sum of squares of Z ji -scores can be approximated by a chi-squared distribution with k − 1 degree of freedom: From the result of this test (p-value), the degree of similarity between the two ChIP-seq models annotated through the same background model can be quantified. Thus, we can identify similar characteristic profiles and the functional region class whose frequencies are most significantly different between both profiles, and therefore also the biological conditions where the target protein location is differentially altered.

The Use Case
The validation of our method was carried out using ChIP-seq experiments collected from the ReMap2020 database (https://remap2020.univ-amu.fr, accessed on 30 November 2021) [34]. The peaks of the ReMap2020 ChIP-seq datasets were generated by applying its own pipeline from reads generated by ChIP-seq, ChIP-exo, and DAP-seq experiments collected from public resources such as GEO (https://www.ncbi.nlm.nih.gov/geo, accessed on 30 November 2021), ENCODE (https://www.encodeproject.org, accessed on 30 November 2021) or ENA (https://www.ebi.ac.uk/ena/browser/home, accessed on 30 November 2021). ReMap2020 contains 5798 ChIP-seq datasets performed on cell lines belonging to Homo sapiens, and 795 performed on Arabidopsis thaliana.
In particular, four Arabidopsis thaliana ChIP-seq datasets correspond to the GSE112951 experiment carried out by Nassrallah et al. [35], which analyzed the influence of the lightmediated development protein (DET1) on the pattern of monoubiquitination (chemical modification where a ubiquitin molecule is added to the target molecule) of histone H2B (H2Bub). The DET1 is a component of light signal transduction machinery, involved in the repression of photomorphogenesis in darkness through regulation of the activity of ubiquitin conjugating enzymes, involved in the repression of de-etiolation in developing seedling, and involved in the repression of the blue light responsive promoter in chloroplasts (UniProt Consortium https://www.uniprot.org/uniprot/P48732, accessed on 30 November 2021). In such a study, two replicates of a ChIP-seq experiment using an antibody against H2Bub were performed in each condition, where each peak obtained indicates that histone H2B is monoubiquitinated. For both, a gene had been tagged if a peak fell within the region spanning from −1 kb from the TSS to +1 kb from the TES. The results of Nassrallah et al. [35] revealed that DET1 is directly linked to the histone H2B monoubiquitination pathway. They found that over 6900 genes supported a peak in all four samples, while the number of genes decayed over 20% in the samples with the mutated gene relative to the wild type samples, regardless of the light or dark condition. Table 1 shows the peaks obtained in the experiments performed on the Col-0-seedling cell line for the wild type and the knockout mutant for the DET1 gene in both light and darkness conditions for 5 days.
These four Arabidopsis thaliana ChIP-seq datasets synthesised in Table 1 have been chosen to illustrate the usefulness of the proposed method.  Table A10. We only used the 5 genomic chromosomes, with a total length of 119,146,348 bp, which were the ones used in the original experiment. The gaps were collected from https://genome.ucsc.edu/goldenPath/help/examples/hubExamples/hubAssembly/ plantAraTha1/araTha1/gap.html (accessed on 30 November 2021), which represent 0.156% of the total genome. Note that gaps smaller than 10 nucleotides (nt) were not considered.
We used the regions indicated in Section 2.2.1 as functional regions of this model, and their description was shown in Figure 2, including the intergenic region. Therefore, we used a 7-dimensional genome model. On the other hand, an 8-dimensional genome model was created by including the enhancer as the functional region. The enhancers were obtained from Zhu et al. [37].

The Arabidopsis thaliana Background Model
The values of the parameters set for this background model were the following: • Peak length = 31 nt, which is an average length to deal with inaccuracy in the peaks; • Overlap rate = 20%; an overlap of at least 7 nt of a standardized peak over a certain class of functional region is necessary for that class to be assigned to that peak; • Sample space = {1, 2, 3, 4, 5}, that is, the 5 genomic chromosomes; • Strand = both. All functional regions belonging to the two strands of the DNA molecule, forward and reverse, are considered; • Prior rule = promoter > proximal > enhancer > cleavage > splice > exon > intron > intergenic. A functional region class had higher priority than those to its right and lower priority than those to its left. Thus, the promoter class had the highest priority over the other classes.
Once all the above parameters were applied to the information stored in the genome model, the two background models with 7 and 8 dimensions (dm) were generated ( Table 2). It should be noted that the highest percentage of annotated binding sites corresponds to the intergenic class, followed by exon and splice. The comparison of both background models reveals that 2% out of the annotated binding sites of the enhancer class in the 8 dm are found in regions of the intergenic class in the 7 dm.

ChIP-seq Models for GSE112951
Although in the original study two replicates were performed for each of the four biological conditions, the Remap2020 database unified the reads from both replicates as input data to its pipeline to generate a single ChIP-seq dataset for each biological condition. Thus the ChIP-seq datasets collected from Remap2020 consisted of 12,782 and 12,305 peaks for the wild type condition in light (5d-L) and dark (5d-D) respectively, and 12,165 and 12,173 peaks for the mutant in light (det1_5d-L) and dark (det1_5d-D), respectively ( Table 1).
The peaks of all four ChIP-seq datasets were standardized (Appendix A). The peak center parameter was the summit of each peak and the peak length was 31 nt. The standardized peaks of each sample were annotated across the two background models (7 dm and 8 dm), so two annotated ChIP-seqs were obtained for each ChIP-seq model.

Characteristic Profiles for GSE112951
The characteristic profiles of the four ChIP-seq models of the GSE112951 study annotated through each of the two background models, 7 dm and 8 dm, were obtained (see Table 3). Taking into account the high percentage of annotated peaks (above 98%) that were assigned a class of non-intergenic functional regions in each of the ChIP-seq models for both background models, it can be stated that these background models fit well the characteristics of the protein under study. On the other hand, the difference in the percentages between both background models was in the range [0.016, 0.033]. This, along with the high percentage value for the background model 7 dm, indicated that the incorporation of the enhancer class in the background model 8 dm is not relevant for the protein under study, at least considering the sample space formed by the 5 genomic chromosomes. Table 3. Characteristic profiles and percentage of annotated peaks for the four ChIP-seq models belonging to study GSE112951, according to both background models 7 dm and 8 dm.   If we visually examine both the relative frequencies and the Z-score values obtained in the four samples with respect to both background models ( Figures A1-A8), two patterns could be distinguished depending on the number of monoubiquitinated H2ub sites in the observed values and the expected ones according to the background model used. One pattern was shared between wild type samples and another between those with the mutated DET1 gene, regardless of light or dark conditions. For the background model 7 dm, in the two wild type 5d-D and 5d-L samples, the intron class showed the greatest difference with positive values of 72.5 and 74.

Goodness-of-Fit Test for GSE112951
The results of the goodness-of-fit test (5) for each of the four samples with respect to both background models are shown in Table 4. The p-value was 0 for each of them, which means that the observed H2Bub sites do not fit to the pattern of the background model, but showed a bias towards certain classes, as described in the previous section.

Test of Homogeneity for GSE112951
Based on the profile drawn by the four relative frequencies, two patterns could be distinguished in Figures 3A and A9A. One pattern was exhibited by the two ChIP-seq models with the wild type condition and the other one by the two ChIP-seq models with the mutated DET1 gene, regardless of the light or dark condition. The results of the homogeneity tests (6) carried out on the six pairs formed by the four ChIP-seq models under study are shown in Table 5.  According to the corresponding p-values (see Figures 3C and A9C), significant differences were found between all pairs formed by one wild type sample and one sample with the mutated gene (p-value with magnitudes below 10 −40 ), while p-values above 0.5 were obtained between pairs formed by the two wild type samples or the two samples with the mutated gene. This indicates that the pattern of monoubiquitination is more dependent on the wild type or mutated gene condition than on the light or dark condition.
Further, Figures 3B and A9B display which functional regions contribute the most to the differences between samples. The exon and intron classes were practically the only ones responsible for the difference between the wild type condition and that of the mutated gene. The results presented in this section held for both 7 dm and 8 dm background models.

Discussion
In this work we have presented a statistical method that allows one to generate more accurate and reliable information from ChIP-seq experiments by modeling the peaks through a random vector with a multivariate hypergeometric distribution. We have showed how multidimensional models can improve the analysis of this type of experiments and how these generated models might facilitate the comparison between different experiments, regardless of when and where they have been carried out.
The results obtained are in line with those reported in the earlier study [35], the difference between samples mainly being due to the wild type/mutated gene conditions rather than by the light/darkness conditions. While the earlier study was based on the number of genes, our method was able to determine the most affected classes of functional regions by using multivariate distribution models. Furthermore, the models can be updated, which has been illustrated by the background models with different dimensions.
The sample space of the background models is related to the scalability of the method. For the sake of simplicity, we have not demonstrated such scalability in this work. As future work, we will study how the scalability is achieved by exploiting the grouping property of the multivariate hypergeometric distribution. Both individual genes and chromosomes can be defined as background models, thus allowing one to focus on the extent of the analysis, but always considering the whole genome and from a biologically reasonable point of view. Further work will also show how the use of more functional regions can also have a positive impact on the level of detail of the study.

Conclusions
The design, implementation, and validation of an analytical method have been introduced to deal with some challenging issues assisting in elucidating the biological implications of the peaks obtained in a ChIP-seq experiment. The coherence of the method has also been shown in this work, since the inclusion of an irrelevant functional region for the behavior of the protein under study did not affect the final results.

Acknowledgments:
The authors would like to thank the three anonymous reviewers for their comments and suggestions, which have improved our manuscript.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: ChIP-seq Chromatin immunoprecipitation sequencing DNA Deoxyribonucleic acid NGS Next generation sequencing RNA Ribonucleic acid TSS Transcription start site TES Transcription end site

Appendix A. Peak Standardization
The standardization process consists of combining the peaks of selected ChIP-seq datasets corresponding to the same target protein and cell line, regardless of the biological condition to which the cell line is subjected.
(i) Peak selection consists of a progressive selection process of peaks from different ChIP-seq datasets. From the first peak selected, the peak with the start coordinate immediately above the first one, must meet two conditions to be selected. The first condition is that it must belong to a different ChIP-seq dataset than the raw peaks already selected for the same standardize peak. The second one is that either the summit of each raw peak falls within the extension of the other raw peak, or that they overlap the intervals formed by symmetrically extending the summit of each raw peak until the value of the peak length parameter is reached. In case of a positive selection, both raw peaks are merged, with the start coordinate value being the smaller of the two, and the end coordinate being the larger of the two, keeping the summits of each raw peak for future comparisons. This new merged peak is the one to be compared with the next raw peak. If this third party is also selected, the merged peak is updated, with its start being the lowest of the three and its end being the highest of the three, thereby maintaining their respective summits, and so on. If the selection is negative, the center of the standardized peak representing each of the selected raw peaks in their respective ChIP-seq model would be the position that determines the arithmetic mean of the summits of each of the selected raw peaks; (ii) Peak center determines the center of the standardized peak, which is a position obtained as the arithmetic mean of the summits of the selected peaks from which it is derived; (iii) Peak length is the length of the standardized peak as it extends symmetrically on both sides of its center. The higher the value of this parameter, the more ambiguous the standardized peak obtained. Notice that this value must match the value of the background model with which the ChIP-seq model will be analyzed.