Genome-Wide Study of Colocalization between Genomic Stretches: A Method and Applications to the Regulation of Gene Expression

Simple Summary Addressing a large number of genomic problems requires the comparison of genetic and epigenetic features distributed over the genome (genome tracks). The mutual arrangement of these features determines basic molecular mechanisms related to the dynamics of the genome architecture and gene expression. The analysis of data on the genome tracks stored in numerous databases cannot be performed without suitable bioinformatic tools. A package, the Genome Track Colocalization Analyzer, developed by the authors, is intended for the study of colocalization effects between stretch–stretch and stretch–point genome tracks. Abstract In this paper, we describe a method for the study of colocalization effects between stretch–stretch and stretch–point genome tracks based on a set of indices varying within the (–1, +1) interval. The indices combine the distances between the centers of neighboring stretches and their lengths. The extreme boundaries of the interval correspond to the complete colocalization of the genome tracks or its complete absence. We also obtained the relevant criteria of statistical significance for such indices using the complete permutation test. The method is robust with respect to strongly inhomogeneous positioning and length distribution of the genome tracks. On the basis of this approach, we created command-line software, the Genome Track Colocalization Analyzer. The software was tested, compared with other available packages, and applied to particular problems related to gene expression. The package, Genome Track Colocalization Analyzer (GTCA), is freely available to the users. GTCA complements our previous software, the Genome Track Analyzer, intended for the search for pairwise correlations between point-like genome tracks (also freely available). The corresponding details are provided in Data Availability Statement at the end of the text.


Introduction
The development of next-generation sequencing (NGS) technology (for a review and further references, see, e.g., [1]) caused the explosive growth of experimental data for genomes of various organisms, cellular lines, and tissues. These data are collected in numerous databases (ENCODE [2], EPD [3], GENCODE [4], NCBI GEO [5], FANTOM5 project [6], etc.). A detailed study of the architecture and functioning of the genome in the context of different genetic and epigenetic features cannot be performed without special bioinformatic tools (reviewed in [7]). Formally, the study of the coordination between related genetic and epigenetic features can be reduced to statistical analysis of the point-like and stretch-like objects distributed over the genome. Our previous publication [8] was

Theory and Methods
Below, we present the theory and methods for the statistical analysis of stretch-stretch and stretch-point colocalization. Our general approach, theory, and results are original. We begin with the general definitions.

Characterizing Stretch-Stretch and Stretch-Point Characteristics by Sets of Indices
In most genetic problems, the distribution of stretch lengths as well as the distribution of intervals between the centers of neighboring stretches are rather poorly approximated by standard statistical distributions and should be considered as unique. The stretch lengths and intervals between stretches are strongly variable and inhomogeneously distributed over the genome. For this reason, an integral measure for colocalization between stretches should be built from local pairs, while a local measure for colocalization should be constructed from relative characteristics. To solve this problem, we developed sets of local indices characterizing different aspects of colocalization between stretches and robust to the variations of lengths.
In our approach, the colocalization of pairs of stretches of different types was analyzed. In the theory below, we designate the types as A and B.
The indices are defined for the nearest neighbors in the combinations of different neighboring stretches restricted to BAB and ABA. The nearest neighbors were determined by the positions of stretch centers as described previously [8]. All indices vary within the interval (-1, 1); the value -1 corresponds to the strongest colocalization, while the value +1 corresponds to the absence of colocalization. We used the following set of indices: (i) The index of overlapping (IO) characterizes mutual stretch-stretch colocalization and is defined as: where k refers to the k-th pair of the nearest neighbors, m c (A k ) and m c (B k ) denote the sites of stretch centers over the genome, L A k B k is the distance between the centers of neighboring stretches, and a k and b k denote the lengths of the stretches. (ii) The index of asymmetry (IA) characterizes the skewness between the lengths of the k-th nearest neighbors: (iii) The index of coverage (IC) characterizes the mutual colocalization between stretches (A) and points (B): Figure 1 illustrates the relationships between indices and different geometric characteristics of stretches. The mean indices and their squared deviations for K pairs of the nearest neighbors are defined as:  (1)) is equal to −1 if the centers of A-and B-stretches coincide with each other (complete colocalization). For remote non-overlapping stretches (absence of colocalization), the index of overlapping tends towards +1. (C) Index of asymmetry (Equation (2)) characterizes the difference in the distributions of stretch lengths. If the A-stretches are much shorter in comparison to B-stretches, the index of asymmetry is equal to −1, whereas if the A-stretches are much longer than the B-stretches, it tends towards +1. (D) Index of coverage (Equation (3)) is equal to −1 if the centers of A-stretches and positions of B-points coincide with each other (complete colocalization). For remote non-overlapping positioning (absence of colocalization), the index of coverage tends towards +1.

Statistical Criteria
A non-parametric assessment of statistical significance for genome-wide associations can be conveniently performed via a permutation test [11,[14][15][16]. To exclude the effects related to finite sampling, we used the complete permutation test. For K primary pairs of stretches, the complete permutations produce K(K − 1)/2 additional different pairs.
The consecutive permutations of the type A and type B stretches determine two classes of permuted indices of overlapping: In particular, the index IO  kk , the similar permutation was performed for B-stretches. The criteria for stretch-stretch colocalization were primarily based on the indices (6a) and (6b). In addition to these basic indices, we also used the auxiliary indices obtained by the simultaneous permutations of A and B: For the indices of asymmetry and coverage, the permuted counterparts were defined, respectively, as: The corresponding mean values and variances for all permuted indices were calculated as: (11) In the calculations of the variance for the difference: the covariance terms appear to be crucial because the same stretches participate in many different permutations. After extensive simulations, we found that two generic drawbacks are inherent to the strict calculations of covariance terms. The number of terms in covariance sums grows proportionally to K 3 . Therefore, starting with K about 10 3 and higher, the summation of covariance sums with constraints becomes time-consuming. At the opposite limit of relatively small K within the range 50-500, the resulting variance loses robustness and becomes sensitive to particular random realizations. Instead, we found that such drawbacks were absent in the following simplified approximation for the variance: σ 2 e f f (∆I) ≈ σ 2 (I)/K + 2σ 2 (I kk )/K − 2Cov (13) where K is the total number of stretch pairs; the variances σ 2 (I) and σ 2 (I kk ) are defined by Equations (5) and (11), respectively; and: The resulting criterion was formulated in terms of difference (12) normalized to the effective standard deviation determined by Equation (13): (15) Negative values of ζ I correspond to stronger stretch-stretch and stretch-point colocalization for the indices of overlapping and coverage related to genome tracks under analysis in comparison with permuted configurations, and vice versa (the sign of ζ I is defined as in (1) and (3); see also Figure 1).
The permutation test implies significant variations in stretch lengths. Indeed, if all stretch lengths were identical, the permutations of stretches would become indiscernible.
The restriction for the variability of stretch lengths is formulated in terms of coefficient of variation (CV) for the lengths: (16) where a is the mean length of the type A stretches and σ(a) is the standard deviation for lengths. The criterion for the type B stretches is similarly formulated. The threshold CV thr should be about 0.5-1. If the in Equation (16) is violated for the stretches of a particular type, such stretches should be treated as points by the characteristic starts/centers/ends.

Simulations
Extensive simulations revealed that the statistics for all indices (6)-(9) was approximately universal for random sets. The absolute values of |ζ I | for 5% and 1% empirical probability thresholds shown in Figure 2 indicate their monotonic dependence on the number of pairs ( Figure 2A) and weak dependence (within statistical scattering) on the mean indices ( Figure 2B). The dependence of ζ I on the number of pairs K can, with good accuracy, be approximated by: with the parameters ζ I, max and b depending on the probability threshold. The relevant fitting parameters are summarized in Supplemental Table S1.  At a fixed number of nearest neighbors K, the distributions of ζ I for all indices for the random sets appear to be close to Gaussian statistics N(0, σ) (see Figure 3). The mapping of the thresholds Pr = 0.05 for ζ I onto Gaussian thresholds by the relationship: λ I (K) |ζ I, Pr=0.05 (K)| = |z Gauss, Pr=0.05 | = 1.96 (18) brings λ I (K) ζ I (K) parameters close to the Gaussian z-variables, obeying universal statistics N(0, 1). This similarity provides simple approximation for related p-values. (i) First, the actual normalized deviations ζ I should be determined for the sets under analysis and compared with 5% or 1% probability thresholds for the random sets using interpolation (17). (ii) Then, mapped values λ I (K) ζ I (K) can be used for the approximate assessment of p-values by the Gaussian statistics.   7) and (15)); (B) asymmetry (Equations (2) and (15)); (C) coverage (Equations (3) and (15)).

Extension to AABB/BBAA Patterns
The previous consideration concerned the indices associated with the set of the nearest neighbors in the combinations ABA and BAB. Such a choice of neighbors was essential for the study of correlations between point-point tracks [8] and is needed for uniting previous and current analyses. We found, however, that the approach based on the indices can be extended to the pairs AB and BA in the combinations AABB and BBAA as well. All AB and BA pairs in the sets ABA-BAB and AABB-BBAA are different. Generally, the statistics of indices for the sets ABA-BAB and AABB-BBAA may also be different. Indeed, using the built-in Kolmogorov-Smirnov and Mann-Whitney criteria of the R statistics package, we revealed the statistically significant differences for some of the genome tracks studied below. Therefore, it is better to consider the sets ABA-BAB and AABB-BBAA separately. The simulations for two random tracks showed that the related 5% and 1% probability thresholds for the indices associated with the pairs AB and BA in the combinations AABB and BBAA are close to those found in Section 2.3 up to small statistical scattering. Thus, the set AABB-BBAA can be added to the general scheme and separately studied with a similar method.
The final results (overlapping/asymmetry indices and their statistical significances) depend upon the proportion between ABA-BAB and AABB-BBAA sets in the source data. For example, in most cases from the Results chapter, the ABA-BAB set contains the majority of stretches from the smaller dataset, whereas, for the TSS sets, the AABB-BBAA set contains the majority of TSS. Therefore, in our view, it is expedient in every particular case to study the colocalizations both for the ABA-BAB and AABB-BBAA sets; if the results diverge, the final decision is to be made on the basis of which of these two colocalization classes represents a larger share of the smaller source dataset.

Test: Colocalization between Exons and Random Stretches
The statistical criteria above concern the situation when both sets of stretches are random and independent (i.e., the correlations between positions and lengths of stretches are absent). In this section, we considered the situation when only one of the stretch sets is random, while the other set is non-random. The permutations for the random set should obey the statistics described in Sections 2.2 and 2.3, whereas the statistics related to the permutations for the non-random set may be different. We checked this statement by testing colocalization between exons and random stretches.
A fragment with exons was chosen on human chromosome 1, and the colocalization between exons (A-stretches) and randomly generated sets of stretches (B-stretches) were studied using indices of overlapping (6a) and (6b). As previously seen, the dependence of the statistical thresholds on the mean values IO(see Equation (4)) was weak. The dependence of thresholds on the number of the nearest neighbors was studied for IO held at −0.5 and 0 approximately, with permissible variations ± 0.05. The corresponding depen- IO on the number of the nearest neighbors for 5% and 1% empirical probability thresholds are shown in Figure 4A. As expected, the dependence IO coincided with the random counterparts shown in Figure 2A, whereas the dependence for ζ (a) IO appeared to be strongly different. Only with a very large number of random neighbors (K > 5000), when the combination BAB became dominating, did the thresholds for ζ (a) IO tend towards the random criteria (see curves for IO_A in Figure 4A). This test was also checked with the following software packages: regioneR [11], GenometriCorr [9], StereoGene [17], and Genomic Association Tester [10]. The webservice Coloc-stats [18] is a metaserver that does not offer new statistical methods for assessment of colocalization between genomic tracks and, therefore, was not tested in this work. Test sets were generated in the following way. The set of exons on the forward strand of H. sapiens chromosome 1 was used as a model non-random set. An ad hoc Perl script based upon the Mersenne Twister random number generator was used to generate 1000 random datasets for stretches by the following rules: The stretch centers were distributed randomly and uniformly; the minimum and maximum coordinates of centers were concordant with those for the exons on the forward chain of the human chromosome 1; and the lengths of the stretches were distributed by random uniform distribution and varied from 1 bp to the maximum exon length on the chromosome 1. As the number of nearest neighbors between stretches of two types is most important for the statistics, the number of generated random stretches was adjusted to obtain a fixed number of nearest neighbors. All software packages shared the same datasets during testing.
These 1000 random realizations were used to determine the observable false discovery rate (FDR). For the correct assessments with p-values less than 0.05, FDR should also be 0.05, independent of the number of the nearest neighbors. We found, however, that the actual FDR was about twice as high for regioneR, GenometriCorr, and GAT packages and significantly depended on the reference-query selection for regioneR and GenometriCorr, whereas for StereoGene and for our Genome Track Colocalization Analyzer package, FDRs were correct (see Table 1 for a summary). As is seen from Table 1, typically, FDR exceeds the correct threshold for the packages based exclusively on overlapping stretches [9][10][11], whereas StereoGene [17], assessing the correlations between stretches/profiles, provides correct predictions. Note, however, that StereoGene calculates only correlations between genome tracks and profiles, whereas the information about the character of (non)overlapping between stretches is absent. The Genome Track Colocalization Analyzer provides this latter information as well.
We benchmarked all software packages (Table 1) for the exons on chromosome 1 and random datasets with approximately 500 colocalization pairs. All calculations were performed 100 times for each software package in single-threaded mode. Multithreaded calculations are inherent for regioneR and GAT, while other software packages can be accelerated to run in parallel for many tasks at once. We can conclude that StereoGene and the Genome Colocalization Track Analyzer are the fastest software packages and provide more correct results. We should mention, however, that StereoGene results strongly depend upon the window size parameter. A similar test was also applied to the complete set of human chromosomes (except the Y chromosome). The indices of overlapping (6a) and (6b) were calculated for each chromosome separately for a particular random realization and compared versus 5% and 1% probability thresholds for the random sets. The numbers of the nearest neighbors K for the chosen random realization varied over the chromosomes from 88 to 986. The expected predictions for the random sets were calculated depending on K by Equation (17). The resulting histogram for the distribution of ζ parameters (taken by the absolute values) over the chromosomes is shown in Figure 4B. The probability to observe by chance n events where the ζ values exceed the prediction for a chosen threshold p-value may be assessed by binomial distribution: where N chr is the total number of chromosomes and C N chr n is the binomial coefficient. For p = 0.05 and N chr = 23, the observable number of such events should not exceed 1-3. This is actually the case for ζ (b) IO parameters and is drastically violated for ζ (a) IO parameters, as seen in Figure 4B.
To sum up, stretch-stretch colocalization should be assessed by indices defined by Equations (6a), (6b), and (7) and the respective parameters ζ (a) IO , ζ IO (Equation (15)). The choice between (6a), (6b), and (7) criteria should be made according to the randomness of distribution of stretches over the genome (both sets are non-randomly distributed by the statistical criterion, one set is non-randomly/one set is randomly distributed by the statistical criterion, both sets are randomly distributed by the statistical criterion). The randomness of stretch positions can be assessed by the structural entropy for the centers of stretches similarly to that developed previously for the point-point tracks [8,19]. The difference between structural entropy for the centers of stretches in datasets putatively considered random should be statistically significant in terms of the standard deviations for random entropy [8,19]. Otherwise, the positions of stretches should be considered non-random, i.e.:

•
If structural entropy criterion |z s | ≥ 1.96 for each of the stretch sets (that means the centers of stretches for both sets are distributed non-randomly), then Equation (7) is applied.

•
If the positions of centers for one of the stretch sets are distributed non-randomly (|z s | ≥ 1.96), whereas the other centers are distributed randomly (|z s | < 1.96), the general colocalization should be assessed via either criterion (6a) or (6b) (and the respective ζ IO ) for the random set.

•
If the positions of centers for both stretch sets are random (|z s | < 1.96 for each set), then Equation (7) is applied again.
If the absolute value of the |ζ IO | parameter, as calculated by the selected equation, appeared to be higher than the respective threshold corresponding to Pr = 0.05 for the random sets, the stretch-stretch colocalizations were assumed to be significant; otherwise, colocalizations were assumed to be insignificant. The comprehensive approach consists of the combined set of Equations (6a), (6b), and (7) and the additional estimation of center positioning randomness for subsequent expert assessment as described above.
We implemented the Genome Track Colocalization Analyzer in Perl with the XS extension module to speed up calculations. The current version is a command line utility that depends upon the following CPAN modules (Sort::Key, Math::Random::MT::Auto, Config::Tiny, PerlIO::gzip, Getopt::Long, Math::Round, Math::CDF, Math::Interpolate) and provides output both in the text and HTML formats, facilitating the integration of output data into both bioinformatic pipelines and web servers.

Colocalization between Stretches and Gene Expression
We applied the software we developed to the study of the colocalization and length asymmetry of particular genome tracks related to the regulation of gene expression. All colocalization studies in this work were carried out with the H. sapiens genome build GRCh38/hg38 p.12. The relevant genome tracks data processing methods and accession numbers can be found in Supplemental Methods.
The first three examples that we present in this work include CpG islands. Their colocalization with exons [20], transcription start sites (TSS) [20][21][22][23], and DNAseI Hypersensitive sites (DHSS) [24,25] was previously affirmed experimentally. We observed statistically significant colocalization by our method in all these cases as well. The concluding example concerns the colocalization of H2A.Z (H2AFZ) histone mark and transcription start sites (TSS) in the H. sapiens genome. As a result, we revealed the statistically significant colocalization of the H2A.Z histone mark with the bidirectional promoters in K562 cell line. The colocalization of the H2A.Z histone mark with the bidirectional promoters was compared with that for unidirectional and silent TSS in the same cell line.

Strong Colocalization between CpG Islands and Exons Suggests a Role of CGI in Transcription
The colocalizations between CpG islands (CGIs) and exons were previously studied experimentally [20,26]. The software we developed was applied to the analysis of colocalization between CGIs and H. sapiens exon genome tracks. The results presented in Figures 5A and 6A demonstrate the statistically significant colocalization between CGIs and exons according to all criteria (6a), (6b), and (7) for the vast majority of the chromosomes. The p-values of < 0.01 for each test resulted in much lower integral estimation calculated by Equation (19). The mean values of the colocalization index IO were negative for all chromosomes. The results of length asymmetry analysis presented in Figures 5B and 6B demonstrate that for almost all chromosomes, there was significant length asymmetry. For the correlation pairs ABA/BAB, this asymmetry was significant for all chromosomes, except chrY, whereas for the correlation pairs AABB/BBAA, the asymmetry was significant for 12 chromosomes. For three chromosomes, the number of correlation pairs was too small for statistics. CGI stretches were longer than exon stretches (I A > 0.3, the significance of obtained indices corresponded to p < 0.01 for the vast majority of chromosomes).   (1)) for CpG islands and exons; (B) asymmetry (Equation (2)) for CpG islands and exons; (C) coverage (Equation (3)) for CpG islands and TSS on the forward strand; (D) coverage (Equation (3)) for CpG islands and TSS on the reverse strand; (E) overlapping (Equation (1)) for DNAseI clusters and CpG islands; (F) coverage (Equation (3) (6) and (7); (B) CpG islands and exons, by Equation (8); (C) CpG islands and TSS on the forward strand, by Equation (9); (D) CpG islands and TSS on the reverse strand, by Equation (9); (E) DNAseI clusters and CpG islands, by Equation (7); (F) DNAseI HEK293 peaks and CpG islands, by Equation (9).
Colocalization patterns for ABA/BAB and AABB/BBAA correlation pairs matched completely for the vast majority of chromosomes (excluding chr13). The statistical significance of colocalization for the vast majority of chromosomes was at the level p < 0.01. This result was confirmed by existing experimental data [20], suggesting that CGIs play a role in regulation of transcription. Additional information can be found in Supplemental Table S2.

Strong Colocalization between CpG Islands and Transcription Start Sites Confirms CGIs Take Part in Transcription Regulation
It is known that CGIs are often associated with mammalian promoters and may even be used as alternative promoters for the genes with methylated promoters [27,28]. The colocalization of CGIs with TSS affects gene expression in eukaryotes [28]. The results for the colocalization analysis obtained with our package are shown in Figure 5C,D and Figure 6C,D and confirm statistically significant colocalization between CGIs and TSS. The related colocalization indices IC are negative for almost all chromosomes simultaneously for forward and reverse strands for both ABA/BAB and AABB/BBAA datasets; this means that CpG stretches and TSS points are frequently intersected (IC≈ −0.3 genome-wide). The statistical significance of IC indices corresponds to p < 0.01 for the vast majority of chromosomes for both forward and reverse strands. The difference between the results for ABA/BAB and AABB/BBAA datasets was observed only in the cases when the number of correlation pairs was too small for statistics (in particular, the amount of correlation pairs < 50 was observed for 13, 18, 21, and 22 chromosomes).
The obtained results are supported by existing experimental data [20][21][22][23] and independently confirm the conclusion that CGIs are involved in the regulation of human cells transcription. The additional detailed information for the genome-wide analysis can be found in Supplemental Table S3.

Strong Colocalization between CpG Islands and DNAseI Hypersensitivity Sites Suggests That CGIs Often Correspond to Open Chromatin Regions
The colocalization between CGIs and DNAseI hypersensitive sites (DHSS) was studied earlier and revealed that some DHSS are associated with CGIs [24]. It is also known that DHSS are often located near TSS [25]. This is why we chose to search for colocalizations between CGI and DHSS using our software. We used two genomic tracks for DHSS: clusters of DNaseI hypersensitivity sites derived from assays in 95 cell types (wgEn-codeRegDnaseClustered) and HEK293 DHSS peaks (ENCFF127KSH). Statistical trials were performed in the stretch-stretch mode for DHSS clusters and in the stretch-point mode for HEK293 DHSS peaks (due to the fixed size of HEK293 DHSS peaks).
In both cases, we detected statistically significant colocalizations between DHSS and CGIs. For DHSS clusters (IO≈ −0.5 genome-wide), the significance of colocalization was at the level p <0.01 for the vast majority of chromosomes. Similar results were obtained for HEK293 DHSS peaks (IC≈ −0.25 genome-wide, i.e., the centers of HEK293 DHSS peaks are located within CGI stretches). H. sapiens genome hg38 contains 27,949 CGIs that form 26,741 correlation pairs with DHSS clusters, of which 24,589 pairs have direct overlaps. This means that 87.9% of CGIs overlap significantly with DHSS clusters derived from assays in 95 cell types. We can conclude that the vast majority of CGIs are associated with DHSS. These data strongly suggest that CGIs in human cells almost always correspond to open chromatin regions.
For AABB/BBAA set pairs, the number appears to be lower than the acceptable statistics threshold (<50) for almost all chromosomes for DHSS clusters-CGIs colocalizations. For the HEK293 DHSS peaks-CGIs case, the data amount in ABA/BAB correlation pairs exceeded the data amount in AABB/BBAA correlation pairs by almost a factor of 9 (68.8% in ABA/BAB, 7.9% in AABB/BBAA), which makes AABB/BBAA colocalization results negligible. This case is a good example showing that statistics for ABA/BAB and AABB/BBAA correlation pairs can be quite different, and therefore these datasets should be analyzed separately.
The detailed data can be found in Figure 6E,F and in Supplemental Table S4.

Genome-Wide Study of Colocalization between Promoters and Histone Mark H2A.Z (Isoform H2AFZ) for Cell Line K562
The genomes of eukaryotes contain a fairly large number of gene pairs for which the transcription is moving to non-intersecting directions on the different DNA strands. If a distance between TSSs of corresponding promoters is less than 1000 bp, a DNA fragment between related TSSs is called a "bidirectional promoter". For brevity, we designated TSS associated with bidirectional promoter as bTSS; otherwise, TSS is designated as uTSS. In our previous work [8], we found significant correlations between TSSs on the direct and reverse strands in the genomes of D. melanogaster, C. elegans, M. musculus, H. sapiens, and D. rerio (p < 0.01 for the vast majority of chromosomes). We also found statistically significant positional correlations between them, with TSSs on the reverse strand preceding TSSs on the direct strand. As the factors regulating gene expression are usually located before TSS (on each of the strands), such coordinate and positional correlations between TSSs indicate that regulatory elements such as CpG-islands, enhancers, silencers, etc., can be common and shared for these TSSs. Statistical significance of the detected TSSs correlations supports the positive selection of such expression mode during molecular evolution. Indeed, such a mode of expression regulation was confirmed experimentally [29][30][31]. Despite the progress in the study of bidirectional promoters, the details of the relevant molecular features remain in part controversial and require further investigation.
The analysis was performed for TSSs taken from the following databases: EPD_new [3], Gencode [4], RefSeq [52], and EMBL for the H. sapiens cell line K562. For each database, we first divided TSSs for silent and active genes, and then additionally subdivided TSSs corresponding to bidirectional and unidirectional promoters as defined above. The extended peaks for H2AFZ K562 were taken from EncodeProject accession ENCSR000APC replicated peaks ENCFF921IKK. The colocalization was assessed with indices of coverage, with TSSs being treated as points and extended peaks for H2AFZ being treated as stretches. The detailed description of this procedure can be found in Supplemental Methods.
At the first step, using the Genome Track Analyzer package [8], we studied the correlations between TSSs and the centers of signal peaks for histone modification H2A.Z (isoform H2AFZ), replacing histone H2A in the nucleosome. We found a significant correlation (p < 10 -6 ) for all TSSs related to active genes. Similar results were obtained using StereoGene [17] (Supplemental Table S5).
In the second step, we studied colocalization between bTSSs/uTSSs and extended the signal peaks for H2A.Z using the Genome Track Colocalization Analyzer package (Supplemental Table S6). We found a statistically significant colocalization with overlapping (IC< −0.35, p < 10 -6 ) between bTSSs and H2A.Z peaks for active genes. We found a statistically significant trend for active genes uTSSs to be located near H2A.Z peaks, typically without overlapping these peaks (IC > 0.01, i.e., the overlapping index is positive but close to 0). For uTSSs related to silent genes, we found a statistically significant trend for colocalization with H2A.Z peaks without overlapping (IC > 0.2, p < 10 -6 ). Figure 7 illustrates the clear difference in the trends for colocalizations between bTSSs/uTSSs and H2A.Z peaks for active and silent genes. The trends persisted throughout all four TSS databases and ABA/BAB or BBAA/AABB grouping. The results we obtained demonstrate that in order to reach reliable conclusions about mutual colocalization of genome tracks, it is not sufficient to apply correlationbased GWAS tools. It is necessary, however, to create and implement specific methods for assessing the statistical significance of genomic track colocalization and overlapping (overlapping indices).
For example, it is impossible to make a conclusion about colocalization of the H2A.Z genome track and various TSSs subsets by applying correlation-only GWAS tools (Supplemental Table S5), whereas our new method provides comprehensive results.
Our results indicate that colocalization between TSSs and H2A.Z may facilitate the transcription initiation via freeing promoters from nucleosomes. The replacement of histone H2A by H2A.Z makes the nucleosome less stable [53] and allows RNA-polymerase II to shift or decompose the nucleosome.
Bidirectional promoters have a statistically significant trend to overlap with H2A.Z peaks; however, nucleosome colocalization with one of the TSS of the bidirectional promoter does not obstruct transcription start from the other TSS. We suggest that no nucleosomes containing H2A.Z are to be found near the TSSs of silent genes and, therefore, RNApolymerase II is unable to shift the nucleosome and to start the transcription. Note that for bidirectional promoters, the distance between TSS and the middle of the H2A.Z peak tends to be approximately 100-200 bp (Supplemental Figure S1), matching the promoter NDR (nucleosome-depleted region) width. These conclusions agree with the observations by [54], who proved the relationship between positioning of H2A.Z (H2AFZ) nearby TSS with the expression level. We plan to continue this research in future works by studying the "wide promoters" class and H2A.Z peak colocalizations, as well as by varying the expression threshold for active/silent genes.

Discussion
The method developed in our paper for the study of colocalization between stretchlike genome tracks, as well as the Genome Track Colocalization Analyzer (GTCA) package based on this method, provide an efficient tool for bioinformatic analysis of various genetic mechanisms related to the gene expression. Its reliability and calculation speed proved to be on the level of the top available packages (see Table 1). The mapping of stretch parameters onto the indices varying within the interval (−1, +1) ensures the robustness of the results to inhomogeneity of track distribution over the genome and to variations in positioning and lengths of stretches. The user does not need to tune the parameters related to the selection of representatives in a sampling set and the number of trials, as is typical for partial permutation tests.
In fact, any stretch set subdivides the genome into two sets, original and complementary (or void) ones. In some cases, the mutual absence of features (or colocalization between void sets) may also be of interest. The colocalization between presence/absence of some features can be considered as an equivalent of correlations/anticorrelations.
Certain other problems can also be efficiently reduced through the study of stretches. In particular, coarse-graining of various profiles (e.g., AT/GC content, expression profiles, protein binding profiles) results in a set of stretches within which the characteristics exceed some thresholds. Previously, it was shown that the coarse-grained profiles obtained by transitional automorphic mapping of the genome on itself (TAMGI) contain important information about functional regions in viral genomes [55,56]. The colocalization between stretch sets obtained by coarse-graining of profiles can also be treated by the method developed in our paper. This extends the scope of applications of the methods that we describe.

Conclusions
Combining the method and package developed for the study of colocalization between stretch-like genome tracks with those previously developed for the study of the correlation among point-like tracks [8] provides a complete analysis of relationships between genomic tracks. The united method can be applied to the general scope of genome-wide association studies (GWAS) and/or can be used as a particular option in bioinformatic pipelines. The relevant bioinformatic analysis can be used for data mining and may stimulate further experimental studies.
Supplementary Materials: The supporting information can be downloaded at: https://www. mdpi.com/article/10.3390/biology11101422/s1. Figure S1: Comparison of H2AFZ signals for bi-and unidirectional promoters, both expressing and silent, around K562 TSS; Table S1: The dependence of statistical thresholds for ζ parameters on the number of pairs and on the mean indices; Table S2: Chromosome-and genome-wide colocalizations between CpG Islands and exons for H. sapiens GRCh38p13/hg38 genome build; Table S3: Chromosome-and genome-wide colocalizations between CGIs and TSSs for H. sapiens GRCh38p13/hg38 genome build; Table  S4: Chromosome-and genome-wide colocalizations between CpG Islands and DNAseI clusters/peaks for H. sapiens GRCh38p13/hg38 genome build; Table S5: Genome-wide correlations by the Genome Track Analyzer/StereoGene between H2AFZ and TSSs subsets for K562 H. sapiens cell line; Table S6