Assigning Co-Regulated Human Genes and Regulatory Gene Clusters

Elucidating the role of genetic variation in the regulation of gene expression is key to understanding the pathobiology of complex diseases which, in consequence, is crucial in devising targeted treatment options. Expression quantitative trait locus (eQTL) analysis correlates a genetic variant with the strength of gene expression, thus defining thousands of regulated genes in a multitude of human cell types and tissues. Some eQTL may not act independently of each other but instead may be regulated in a coordinated fashion by seemingly independent genetic variants. To address this issue, we combined the approaches of eQTL analysis and colocalization studies. Gene expression was determined in datasets comprising 49 tissues from the Genotype-Tissue Expression (GTEx) project. From about 33,000 regulated genes, over 14,000 were found to be co-regulated in pairs and were assembled across all tissues to almost 15,000 unique clusters containing up to nine regulated genes affected by the same eQTL signal. The distance of co-regulated eGenes was, on average, 112 kilobase pairs. Of 713 genes known to express clinical symptoms upon haploinsufficiency, 231 (32.4%) are part of at least one of the identified clusters. This calls for caution should treatment approaches aim at an upregulation of a haploinsufficient gene. In conclusion, we present an unbiased approach to identifying co-regulated genes in and across multiple tissues. Knowledge of such common effects is crucial to appreciate implications on biological pathways involved, specifically when a treatment option targets a co-regulated disease gene.


Introduction
In recent years, increasing attention has been given to the non-coding sequences of the human genome, highlighting the importance of common and rare genetic variants that influence disease etiology [1]. An immediate benefit is given for the evaluation of genome-wide association data as most of the genetic variants associated with complex traits are located in intronic or intergenic regions of the human genome [2,3]. In fact, many of these variants are considered to play a prominent role in the regulation of gene expression. Consequently, identifying and characterizing such variants in detail appears key to elucidating the biological mechanisms underlying the association signals [4].
Genome editing allows direct modifications of gene expression regulation using either transcriptional activators [5,6] or repressors [7]. These advances are of paramount importance for gene therapy approaches to human disease [8]. A first study of Matharu et al. (2019) [9] suggests that targeted activation of gene expression is suited to prevent clinical phenotypes attributed to haploinsufficiency. Such strategies, however, require a comprehensive understanding of the regulatory networks involved.

eQTL Calculation
Gene expression data from 49 tissues were downloaded via the GTEx Portal and filtered for EUR individuals based on the genotype PCA ( Figure S1) [29]. The detailed data processing protocols are given elsewhere [18]. The sample sizes varied per tissue from 65 (KDNCTX) to 584 (MSCLSK) ( Table S1).
Local eQTL were calculated for each tissue separately based on linear regression models using FastQTL (version v2.184_gtex) as implemented in the GTEx version 8 analysis pipeline [18,30]. The mapping window to detect local eQTL was defined as 1 Mbp upstream and downstream of the transcription start site. Gender, the WGS platform, the WGS library construction protocol, the first five genotype PCs, and up to 60 PEER factors were included in the models as covariates. These covariates were provided by the GTEx Portal and are explained in [18]. FastQTL was applied in the permute 1000 10000 mode to calculate beta distribution-extrapolated empirical p-values for each potential eGene. The p-values were then used to calculate gene level Q-values with the help of the false discovery rate approach as implemented in the R package qvalue (version 2.6.0) [31]. The lambda parameter was set to 0.85. A Q-value threshold of <0.05 was applied to identify genes regulated by at least one significant eVariant. To detect all significant eVariants, a tissue-specific genomewide empirical p-value threshold was defined as described in [18]. This threshold was used to calculate a nominal p-value threshold for each gene based on the beta distribution parameters from FastQTL. Thereafter, FastQTL was run in normal mode to calculate nominal p-values of all local gene-variant associations. Nominal p-values from the linear regression model below the beforehand defined gene-specific threshold were considered significant.

Identification of Gene Expression Regulation Cluster
Regulatory clusters were defined as genomic regions containing multiple genes regulated by the same eQTL signal. This analysis required a two-step protocol to limit the computational and statistical burden.
First, significant eVariants were filtered for variants regulating two or more eGenes as only these genes have the potential to be regulated by the same genetic signal. By combining variants located on the same chromosome within an arbitrary distance of less than 2 Mbp, the variants were subsequently merged into genomic regions containing potentially co-regulated eGenes.
In a second step, all eGenes within those genomic regions were investigated for colocalization of their associated variants. This was performed using the coloc.abf function of the coloc package in R [21,22]. The method utilized the respective eQTL nominal pvalues, effect sizes, variances of the effect sizes, as well as the SD of the expression for the investigated genes. In addition, the tissue-specific MAF was supplied for all variants as the sample size varied between tissues. coloc.abf further requires the specification of three informative prior probabilities. These are the probabilities that any random genetic variant in the region is associated with exactly gene one (p1), gene two (p2), or both genes (p12). The probabilities p1 and p2 were set to the gene-specific threshold of the corresponding eQTL analysis. The default for p12 was set to 5.0 × 10 −6 as suggested by the author of the coloc package [22]. If either p1 or p2 was below 5.0 × 10 −6 , p12 was also adjusted to this smaller p-value. Next, the posterior probability that the same eQTL signal is regulating both genes (H4) was extracted from the co-localization analysis. If the analyzed gene pair did not share any overlapping variants within the 2 Mbp window, the H4 value was adjusted to 0. The H4 probabilities of all genes within one genomic region were then supplied to the cliques function of the igraph package (version 1.2.6) [32] in R to identify regulatory clusters. The H4 probability threshold for cluster definition was set to at least 0.8. Gene pairs within a single cluster fulfilling this criterion were defined as being regulated by the

Collection of Haploinsufficiency Genes
Two different resources were exploited to generate a list of genes which are known to cause haploinsufficiency phenotypes. The ClinGen database [34] provides a list of genes causing potential dosage sensitivity phenotypes and assigns them to different categories [35]. A list of 312 genes with "sufficient evidence for haploinsufficiency" was downloaded from the ClinGen database website [36]. Furthermore, Matharu and colleagues [9] published a list of 660 genes leading to haploinsufficiency disease (see "Supplemental Table S1" in [9]). Combining the two resources led to a list of 713 unique genes with evidence for causing haploinsufficiency-related phenotypes (Table S5).

Identification of Co-Regulated Genes
Our study aimed to identify co-regulated genes and their organization in gene expression regulation clusters. To this end, we designed a workflow starting with the calculation of local eQTL. We then filtered for all significant variants (eVariants) regulating the expression of two or more genes (eGenes) as only these genes have the potential to be regulated by the same genetic signal. A subsequent pair-wise colocalization analysis revealed coregulated eGenes and facilitated the identification of genes clustered due to underlying genetic signals. The following characterization of the identified regulatory clusters was intended to dissect the results at specified levels to include gene-centered, cluster-centered, and tissue-centered perspectives as well as an example highlighting medical consequences of the data presented ( Figure 1).
Exploiting available data from the GTEx project, eQTL were calculated for 49 different tissues based on gene expression, genotype, and covariate data of 694 samples from donors of European descent ( Figure S1). For each tissue, the sample size varied widely from 65 (kidney cortex, KDNCTX) to 584 (muscle skeletal, MSCLSK) (Table S1). Our analysis included 39,832 expressed genes of which 33,488 (84,1%) were genetically regulated in at least one tissue (Table S2). Remarkably, 14,636 (43.7%) of the latter gene group showed a significant colocalization (coloc probability ≥ 0.80) with at least one other gene, generating a list of 49,637 co-regulated gene pairs (Table S3) across the 49 tissues. A closer look at the underlying eQTL revealed that 37,200 (74.9%) of the 49,637 gene pairs were simultaneously up-or downregulated. The remaining 12,437 (25.1%) gene pairs were regulated by the same genetic signals but in opposite directions.
Of the 49,637 gene pairs, we identified a total of 16,673 unique combinations across multiple tissues (Table S3). Determining the genomic position of the co-regulated genes revealed that in 86.2% (14,368/16,673) the gene loci did not physically overlap ( Figure S2A). Their mean distance was 122,265 bp (standard deviation, SD: 267,498 bp) with 885 colocalizing genes having a physical distance of more than 1 Mbp (Table S3). The remaining co-regulated genes were either partially (4.5%, 759/16,673) or fully overlapping (9.3%, 1546/16,673). Furthermore, we observed that 41.8% (6977/16,673) of the co-regulated gene pairs were identified in multiple tissues with 2469 gene pairs being found in five or more tissues ( Figure S2B). Remarkably, one gene pair-FAM157C/RP11-356C4.5-was expressed in all tissues analyzed and showed a significant co-regulation in 45 of the 49 tissues (not significant for tissues LUNG, SPLEEN, TESTIS, WHLBLD (whole blood)).
While the number of significant eGenes was highly correlated with the sample size of the respective tissue (p-value = 1.6 × 10 −15 , R 2 = 0.744) (Figure 2A), this was only partially the case for co-regulated genes. For tissues comprising less than 200 samples, there was a strong correlation of the sample size with the number of co-regulated genes (p-value = 4.2 × 10 −8 , R 2 = 0.736), whereas no correlation was found above a sample size of 200 (p-value = 0.6, R 2 = 0.010) ( Figure 2B). We therefore considered the 24 tissues with a sample size above 200 (Table S1 and Figure 2C) for our subsequent evaluations, as the findings in these tissues appear not to be biased by the underlying sample size. Still, Tables S1-S6 summarize the findings for all tissues investigated regardless of sample size to facilitate an overall evaluation of tissue-specific effects.

21,
10, x FOR PEER REVIEW 5 igure 1. Schematic workflow of the gene expression regulation cluster analysis. Clusters of co-regulated gene expression ere defined as genomic regions containing multiple genes regulated by the same eQTL signal. In a first step, the eQTL esults were filtered for eVariants regulating at least two eGenes. The corresponding eQTL signals were then analyzed for olocalization (Step 2). Gene pairs with a colocalization probability of 0.80 and higher were considered to be co-regulated ext, eGenes with colocalizing eQTL signals were grouped in regulatory clusters (Step 3). These clusters were subseuently investigated on multiple levels considering a gene-centered, a cluster-centered, and a tissue-centered perspective o enable a comprehensive interpretation of results. Finally, genes with clinical relevance to haploinsufficiency-related isease were investigated for their involvement in gene expression regulation clusters and the impact of potential theraeutic approaches. For example, the CRISPRa technology [9] could affect gene expression of the targeted gene but also of ff-target genes within the same cluster.
Exploiting available data from the GTEx project, eQTL were calculated for 49 d ent tissues based on gene expression, genotype, and covariate data of 694 samples donors of European descent ( Figure S1). For each tissue, the sample size varied w from 65 (kidney cortex, KDNCTX) to 584 (muscle skeletal, MSCLSK) (Table S1). Our ysis included 39,832 expressed genes of which 33,488 (84,1%) were genetically regu in at least one tissue (Table S2). Remarkably, 14,636 (43.7%) of the latter gene g Schematic workflow of the gene expression regulation cluster analysis. Clusters of co-regulated gene expression were defined as genomic regions containing multiple genes regulated by the same eQTL signal. In a first step, the eQTL results were filtered for eVariants regulating at least two eGenes. The corresponding eQTL signals were then analyzed for colocalization (Step 2). Gene pairs with a colocalization probability of 0.80 and higher were considered to be co-regulated. Next, eGenes with colocalizing eQTL signals were grouped in regulatory clusters (Step 3). These clusters were subsequently investigated on multiple levels considering a gene-centered, a cluster-centered, and a tissue-centered perspective to enable a comprehensive interpretation of results. Finally, genes with clinical relevance to haploinsufficiency-related disease were investigated for their involvement in gene expression regulation clusters and the impact of potential therapeutic approaches. For example, the CRISPRa technology [9] could affect gene expression of the targeted gene but also of off-target genes within the same cluster.  Figure 2B). We therefore considered the 24 tissues wi 200 (Table S1 and Figure 2C) for our subsequent evaluations, as the sues appear not to be biased by the underlying sample size. Still, Ta the findings for all tissues investigated regardless of sample size t evaluation of tissue-specific effects.   (Table S1). (A) Number of significant eGenes as a function of sample size. The regression coefficient (R 2 ) of the linear model is 0.744 (p-value = 1.6 × 10 −15 ) (B) Number of significant eGenes, co-regulated with at least one other eGene, as a function of sample size of the respective tissue. A significant correlation was observed for tissues with less than 200 samples (p-value = 4.2 × 10 −8 , R 2 = 0.736), whereas the linear regression model was not significant considering tissues with samples sizes above 200 (p-value = 0.6, R 2 = 0.010). Consequently, the number of eGenes over 200 samples per tissue with at least one co-regulation partner is not correlated with sample size. (C) Visualization of the number of expressed genes (light grey), the number of eGenes (medium grey), and the number of eGenes, which are co-regulated with at least one other eGene (dark grey) in the 49 tissues analyzed. The tissues are ordered from top to bottom by sample size in decreasing order. The black dotted line separates tissues with 200 or less samples (above) from those with sample sizes over 200 per tissue (below). The color code was retrieved from the GTEx project [18]. A detailed list of tissues analyzed and their respective abbreviation is given in Table S1.

Gene Expression Regulation Cluster
We grouped the colocalizing genes into gene expression clusters that were regulated by the same genetic signal. Altogether, 10,482 unique clusters were identified in the 24 tissues with a sample size above 200 (Table S4). Overall, the cluster size varied widely including two to a maximum of nine genes ( Figure 3A). The two largest clusters were found in colon transverse (CLNTRN) tissue and comprised the genes TMEM141, CCDC183, CCDC183-AS1, MAMDC4, LCN12, ABCA2, NPDC1, ENTPD2, and additionally either PHPT1 or RABL6. Furthermore, many clusters were detected in multiple tissues and sometimes included additional genes. For example, the FLG/HRNR cluster in adipose subcutaneous (ADPSBQ) tissue was additionally co-regulated together with the antisense transcript FLG-AS1 in another 18 tissues (Table S4). Nevertheless, the overall distribution of cluster sizes was highly comparable between tissues ( Figure 3B), with a cluster size of two coregulated genes being the most prominent finding (81.2-88.6% of all clusters).
in decreasing order. The black dotted line separates tissues with 200 or less samples (above) those with sample sizes over 200 per tissue (below). The color code was retrieved from the G project [18]. A detailed list of tissues analyzed and their respective abbreviation is given in T

Gene Expression Regulation Cluster
We grouped the colocalizing genes into gene expression clusters that were re by the same genetic signal. Altogether, 10,482 unique clusters were identified i tissues with a sample size above 200 (Table S4). Overall, the cluster size varied including two to a maximum of nine genes ( Figure 3A). The two largest cluste found in colon transverse (CLNTRN) tissue and comprised the genes TM CCDC183, CCDC183-AS1, MAMDC4, LCN12, ABCA2, NPDC1, ENTPD2, and addi either PHPT1 or RABL6. Furthermore, many clusters were detected in multiple tiss sometimes included additional genes. For example, the FLG/HRNR cluster in adip cutaneous (ADPSBQ) tissue was additionally co-regulated together with the a transcript FLG-AS1 in another 18 tissues (Table S4). Nevertheless, the overall dist of cluster sizes was highly comparable between tissues ( Figure 3B), with a cluste two coregulated genes being the most prominent finding (81.2-88.6% of all cluste Only tissues with a sample size above 200 wer ered and clusters, which were identified identically in multiple tissues were included only (B) Relative distribution of cluster sizes within tissues. The grey scale represents the size r from 2 (light grey) to above 5 (5+, dark grey). A detailed list of tissues analyzed, and their tive abbreviations, is given in Table S1.
The number of clusters per tissue ranged from 645 (cells cultured fibrobl BRBLS) to 1,243 in testis (TESTIS) tissue. A comparison between tissues revealed t tissue pair has at least 85 gene expression regulation clusters in common (TES WHLBLD, Figure 4). The mean number of shared clusters between tissues was 21 66.2). Furthermore, some highly similar tissues-such as skin not sun exposed sup (SKINNS) and skin sun exposed lower leg (SKINS)-share up to 439 clusters, w count for approximately 50% of all clusters found in the respective tissues. Rem four tissues showed less shared regulation clusters in comparison to all other  Only tissues with a sample size above 200 were considered and clusters, which were identified identically in multiple tissues were included only once. (B) Relative distribution of cluster sizes within tissues. The grey scale represents the size ranging from 2 (light grey) to above 5 (5+, dark grey). A detailed list of tissues analyzed, and their respective abbreviations, is given in Table S1.
The number of clusters per tissue ranged from 645 (cells cultured fibroblasts, FIBRBLS) to 1243 in testis (TESTIS) tissue. A comparison between tissues revealed that each tissue pair has at least 85 gene expression regulation clusters in common (TESTIS and WHLBLD, Figure 4). The mean number of shared clusters between tissues was 212.4 (SD: 66.2). Furthermore, some highly similar tissues-such as skin not sun exposed suprapubic (SKINNS) and skin sun exposed lower leg (SKINS)-share up to 439 clusters, which account for approximately 50% of all clusters found in the respective tissues.   The heatma plays the number of identical clusters between tissue pairs. Clusters in one tissue, which i additional genes in the other tissue, were also counted in this comparison. The color scale sents the number of shared clusters between tissues. Hierarchical clustering of tissues was formed based on commonly regulated gene expression clusters. Only tissues with a samp above 200 were included in this analysis. A detailed list of tissues analyzed, and their resp abbreviations, is given in Table S1.

Potential Clinical Consequences for Genes within Expression Regulation Clusters
Gene expression regulation cluster may have an impact on clinical applicatio as defined gene therapy approaches. For example, a recent study published by M et al. (2019) applied CRISPR-mediated activation (CRISPRa) to enhance gene ex aiming to ameliorate a phenotype caused by disease-associated haploinsufficie Such an approach could lead to off-target gene expression effects if the target gen of a gene expression regulation cluster ( Figure 5A). The heatmap displays the number of identical clusters between tissue pairs. Clusters in one tissue, which included additional genes in the other tissue, were also counted in this comparison. The color scale represents the number of shared clusters between tissues. Hierarchical clustering of tissues was performed based on commonly regulated gene expression clusters. Only tissues with a sample size above 200 were included in this analysis. A detailed list of tissues analyzed, and their respective abbreviations, is given in Table S1.

Potential Clinical Consequences for Genes within Expression Regulation Clusters
Gene expression regulation cluster may have an impact on clinical applications such as defined gene therapy approaches. For example, a recent study published by Matharu et al. (2019) applied CRISPR-mediated activation (CRISPRa) to enhance gene expression aiming to ameliorate a phenotype caused by disease-associated haploinsufficiency [9]. Such an approach could lead to off-target gene expression effects if the target gene is part of a gene expression regulation cluster ( Figure 5A).
To appreciate the relevance of such a situation, we generated a list of 713 genes known to cause phenotypes due to haploinsufficiency (Table S5) and tested their occurrence in the gene expression regulation clusters across all 49 tissues. Remarkably, 231 of the 713 genes (32.4%) are part of at least one cluster, affecting 429 different clusters altogether ( Figure 5B, Table S6). Besides the genes causing haploinsufficiency, these clusters include an additional 414 potential off-target genes. The 429 affected clusters were distributed across the genome. Furthermore, nearly all of the 49 tissues investigated-except KDNCTX, which had the smallest sample size of all tissues (n = 65)-included at least one cluster harboring a haploinsufficiency gene. Figure 5. The potential role of gene expression regulation cluster in gene therapy. (A) Schematic overview of a gene the apy approach modifying expression of a gene, whose disease-associated deletion causes a haploinsufficiency phenotyp In the regular situation, gene 1 (green) and gene 2 (blue) are expressed but are known to be regulated by the same genet signal (gene cluster). The disease-associated deletion of one copy of gene 1 is leading to haploinsufficiency. A planne gene therapy-e.g., CRISPR/Cas-mediated activation (CRIPSRa)-is intended to enhance expression of the regular cop of gene 1 to ameliorate the disease phenotype. This procedure potentially also increases gene 2 expression as the latt gene is part of a common regulatory gene cluster with gene 1. Gene 2 is therefore a potential off-target, which is current not considered by common off-target prediction approaches. (B) Occurrence of haploinsufficiency genes in gene expre sion regulation clusters as defined in this study. Of 713 genes known for causing haploinsufficiency-related phenotype 231 (32.4%) were found to be part of gene expression regulation clusters.

2021, 10, x FOR PEER REVIEW
To appreciate the relevance of such a situation, we generated a list of 713 known to cause phenotypes due to haploinsufficiency (Table S5) and tested their rence in the gene expression regulation clusters across all 49 tissues. Remarkably, the 713 genes (32.4%) are part of at least one cluster, affecting 429 different cluster gether ( Figure 5B, Table S6). Besides the genes causing haploinsufficiency, these c include an additional 414 potential off-target genes. The 429 affected clusters were d uted across the genome. Furthermore, nearly all of the 49 tissues investigated-KDNCTX, which had the smallest sample size of all tissues (n = 65)-included at lea cluster harboring a haploinsufficiency gene.

Discussion
Here, we report on the co-regulation of gene expression based on comprehensi tasets from 49 tissues that were extracted from the GTEx project [18]. We show that of 33,488 genetically regulated genes (43.7%) are co-regulated together with at lea other gene, resulting in 14,727 unique expression clusters across the tissues analyz these clusters, up to nine genes are co-regulated by the same genetic signal, while pa tissue comparison highlights that each tissue pair shares, on average, 212 clusters co-regulation is such a common phenomenon, we provide an exemplary scenario c ering the field of innovative therapeutics where modifying gene expression within regulation cluster could have significant repercussions on the treatment outcome, o specifically on possible side-effects, and thus need to be acknowledged with great The mechanisms underlying the regulation of local gene expression have been Figure 5. The potential role of gene expression regulation cluster in gene therapy. (A) Schematic overview of a gene therapy approach modifying expression of a gene, whose disease-associated deletion causes a haploinsufficiency phenotype. In the regular situation, gene 1 (green) and gene 2 (blue) are expressed but are known to be regulated by the same genetic signal (gene cluster). The disease-associated deletion of one copy of gene 1 is leading to haploinsufficiency. A planned gene therapy-e.g., CRISPR/Cas-mediated activation (CRIPSRa)-is intended to enhance expression of the regular copy of gene 1 to ameliorate the disease phenotype. This procedure potentially also increases gene 2 expression as the latter gene is part of a common regulatory gene cluster with gene 1. Gene 2 is therefore a potential off-target, which is currently not considered by common off-target prediction approaches. (B) Occurrence of haploinsufficiency genes in gene expression regulation clusters as defined in this study. Of 713 genes known for causing haploinsufficiency-related phenotypes, 231 (32.4%) were found to be part of gene expression regulation clusters.

Discussion
Here, we report on the co-regulation of gene expression based on comprehensive datasets from 49 tissues that were extracted from the GTEx project [18]. We show that 14,636 of 33,488 genetically regulated genes (43.7%) are co-regulated together with at least one other gene, resulting in 14,727 unique expression clusters across the tissues analyzed. In these clusters, up to nine genes are co-regulated by the same genetic signal, while pairwise tissue comparison highlights that each tissue pair shares, on average, 212 clusters. Since co-regulation is such a common phenomenon, we provide an exemplary scenario considering the field of innovative therapeutics where modifying gene expression within a gene regulation cluster could have significant repercussions on the treatment outcome, or more specifically on possible side-effects, and thus need to be acknowledged with great care.
The mechanisms underlying the regulation of local gene expression have been extensively studied although for individual genes only [1,37]. Such studies show that genetic variation can affect promoters [38], enhancers [39], and silencers [40], but also intronic elements such as splice site consensus sequences and consequently the correct splicing of transcripts [18]. At the chromatin level, genetic variation may alter TAD boundaries [41], potentially leading to changes in the regulation of multiple genes [42,43]. Undirected largescale approaches, such as eQTL analyses, can be complicated by complex LD structures. CCC methods, on the other hand, are successful in revealing direct interactions of DNA elements [12,14] and identify TADs [44]. Still, the knowledge of interaction sites per se is often less suited for identifying regulated genes, as the resolution is generally low having to deal with gene loci that position genes in close proximity to each other [14]. Our database of co-regulated genes generated as a result of this study is based on eQTL and colocalization analyses and offers a new perspective to compensate for the shortcomings of existing approaches by pointing directly to the gene of interest. It is therefore recommended to integrate our data directly in the result interpretation of published CCC studies as this will greatly enhance the value of both approaches. Given that CCC data are often generated in a cell-specific context, it is advisable to perform the comparison specifically in a context of a defined research question.
Regarding advantages and limitations of the database compiled in this study, we mainly need to consider two aspects. Firstly, while genes can be regulated by multiple genetic signals, the number of independent signals is strongly correlated with the associated sample size [18]. We therefore only considered the most prominent signal for each gene to not further complicate the comparison of tissues and to avoid an unfavorable dependency on sample size. While this procedure could be prone to miss some of the interactions, the ones we identified can be viewed with high confidence, specifically as they are based on the strongest eQTL signal for each locus [22]. Secondly, most eQTL studies are performed in bulk tissue, which can obviously result in unidentified eQTL by a dilution effect, as it is known that gene expression regulation is frequently a cell type-specific process [45,46]. However, our findings that highlight shared gene expression regulation clusters between tissues point to tissue overlapping mechanisms, even if our database allows no direct conclusion about cell type specificity.
Knowledge about clusters of gene expression regulation should be transferable to generate novel hypotheses and experimental designs. While each cluster needs to be considered individually, a number of mechanisms underlying the co-regulation are conceivable. For one, the effect direction of gene expression regulation provides useful information about the potential reason why genes are co-regulated. Our study revealed that 37,200 of the 49,637 co-regulated gene pairs (74.9%) are regulated in the same direction which brings us to hypothesize that the respective genes could play a direct or indirect role in the same pathway. A prominent example for the coordinated activation of genes are the HOX gene clusters, for which expression was studied extensively in the context of embryogenesis [47,48]. Of note, some of the largest clusters in our study were identified on chromosome two and seven containing up to eight genes of the HOXD and HOXA gene family, respectively. Although our results are based on differentiated tissue cells, the concordance with previously published studies shows the validity of our approach. Of note, the remaining 12,437 of the 49,637 co-regulated gene pairs (25.1%)-which are regulated in opposite directions-also have the potential to directly point to underlying biological mechanisms. A prominent example here is the gene expression regulation cluster including ATE1 and ATE1-AS1, in which simultaneous downregulation of the antisense transcript ATE1-AS1 is associated with increased expression of ATE1. This effect can be observed in various tissues and previous studies have demonstrated that ATE1 is regulated by a bidirectional promoter [49]. Kalinina et al. (2021) suggested that competing RNA structures at this locus affect splicing of the ATE1 gene [50]. These examples demonstrate that our results are in line with published studies, which were performed using different methods and approaches.
A significant strength of our dataset lies in the potential to evaluate the impact of experimental approaches targeting a DNA sequence or the regulation of transcription. For example, the gene FLG encodes profilaggrin whose loss is known to cause impaired keratinization [51]. A hypothetical gene therapy to prevent this phenotype could use the CRIPSR/Cas technology to facilitate an upregulation of FLG expression. Based on our findings, FLG is co-regulated with HRNR although in opposite direction. Thus, manipulating FLG expression may consequently alter HRNR expression, whereby it should be noted that the latter protein plays a role in atopic dermatitis [52]. In this scenario, HRNR needs to be considered as a potential off-target, although classical off-target prediction tools are based on sequence alignment algorithms and do not consider gene expression co-regulation [53,54]. The high proportion of 32.4% (231) of 713 haploinsufficiency genes that are in a regulation cluster illustrates that this phenomenon is in fact common and should be anticipated in any gene therapy designs.

Conclusions
Here, we present our findings in a novel database providing access to thousands of co-regulated genes in 49 tissues. Co-regulated genes are in close proximity or at distances of up to 2 Mbp and many clusters of expression regulation are shared between tissues. We highlight the prevalence of gene expression co-regulation and provide an outset for further follow-up studies. Finally, our results have a strong impact on current and future therapeutic approaches aiming at the modification of gene expression. Specifically, this is demonstrated for gene therapies targeting genes causing disease due to haploinsufficiency. Co-regulated genes are likely affected by such a treatment possibly triggering severe side effects.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/ 10.3390/cells10092395/s1, Figure S1: Genotype principal component analysis (PCA); Figure S2: Characterization of colocalizing gene pairs; Table S1: Tissue-centered perspective on results of the gene expression regulation cluster analysis; Table S2: Gene-centered perspective on results of the gene expression regulation cluster analysis; Table S3: Coregulated gene pairs in 49 tissues; Table S4: Cluster-centered perspective on results of the gene expression regulation cluster analysis; Table S5: Collection of genes known to cause haploinsufficiency related phenotypes; Table S6: Gene expression regulation clusters including haploinsufficiency genes.

Institutional Review Board Statement:
This study used data of the GTEx project. For further specifics on the respective ethics approvals, we refer to [18].

Informed Consent Statement:
The GTEx project obtained informed consent from all participants. For further specifics on the respective forms, we refer to [18]. Data Availability Statement: Genotype, gene expression, and covariate data of the GTEx project are available in dbGaP (accession ID: phs000424.v8.p2) or the GTEx portal (www.gtexportal.org). All data generated or analyzed during this study are included in this published article and its supplementary information files.