Transcriptional Regulation of RUNX1: An Informatics Analysis

The RUNX1/AML1 gene encodes a developmental transcription factor that is an important regulator of haematopoiesis in vertebrates. Genetic disruptions to the RUNX1 gene are frequently associated with acute myeloid leukaemia. Gene regulatory elements (REs), such as enhancers located in non-coding DNA, are likely to be important for Runx1 transcription. Non-coding elements that modulate Runx1 expression have been investigated over several decades, but how and when these REs function remains poorly understood. Here we used bioinformatic methods and functional data to characterise the regulatory landscape of vertebrate Runx1. We identified REs that are conserved between human and mouse, many of which produce enhancer RNAs in diverse tissues. Genome-wide association studies detected single nucleotide polymorphisms in REs, some of which correlate with gene expression quantitative trait loci in tissues in which the RE is active. Our analyses also suggest that REs can be variant in haematological malignancies. In summary, our analysis identifies features of the RUNX1 regulatory landscape that are likely to be important for the regulation of this gene in normal and malignant haematopoiesis.


Introduction
Accurate spatiotemporal and quantitative gene expression is crucial for normal development and, in many cases, is achieved by the interaction of promoters with cis-regulatory elements (REs). REs, such as enhancers, are able to control the expression of genes by long-range chromatin interactions [1,2].
Most REs evolve rapidly and are rarely conserved at the DNA sequence level among species due to positive evolutionary selection [3]. However, clusters of conserved REs surround some highly conserved developmental genes. REs can be highly tissue-specific, and surprisingly remote from their gene targets [4,5]. Although REs often regulate the closest gene, they can also control expression of genes further afield [6][7][8]. Fewer than 50% of enhancers contact the nearest gene promoter [9]. Long-range chromatin interactions between promoters and REs can be mediated by scaffolding proteins and transcription factors (TFs) to regulate gene expression [10]. These factors include those responsible for the three-dimensional organisation of chromatin, such as cohesin and CTCF [10,11].
Recent studies have identified recurrent mutations in REs of genes associated with oncogenesis, such in as the TAL1, ETV1, and PAX5 enhancers [35][36][37]. In AML and other myeloid malignancies where no mutation is found in the coding sequence, there may instead be mutations in REs that affect gene function, including RUNX1 [38]. Therefore, defining the compendium of conserved Runx1 REs is imperative to understanding the regulatory landscape of Runx1 in normal and malignant haematopoiesis.
Here we used bioinformatic methods and functional data to identify possible REs that are conserved between human and mouse. Our analyses also suggest that RUNX1 REs are indeed affected by mutation in haematological malignancies.

Epigenetic Analyses Using ENCODE Data
ENCODE data were used to detect the presence of various histone modifications, DNase I hypersensitivity sites, TFs and cohesin binding sites [27,[41][42][43]. Analysis of ENCODE annotations was carried out using data submitted by Stanford, Yale, University of Washington (UW) and Ludwig Institute for Cancer Research (LICR). To determine the RE locations in different genome assemblies the UCSC 'liftOver' tool was used to convert coordinates within a species [42].To determine the human genome co-ordinates for identified Runx1 enhancers, the sequences were searched using the UCSC BLAT and NCBI BLAST tools [44,45]. Further genomic features of the conserved REs were deduced by analysing the human FANTOM 5 data [46,47] in the UCSC Genome Browser.

Comparative Analysis of Cancer Genomic Datasets
To assess whether AML patients had mutations in the conserved REs, publicly available cancer genomic datasets were explored. Online tools were used to categorise publicly available patient information into different tumour types for analysis.

Prediction the Functional Consequences of Non-Coding Variations
To predict the pathogenicity of SNPs and cancer associated genetic variations within R1REs Functional Analysis through Hidden Markov Models (FATHMM v2.3) (GRCh37/hg19) was used [64,65].
When gathering sequence information about these mouse REs we noted that three regions identified by Schütte  Based on sequence analyses we determined that 12 out of the 29 previously identified mouse Runx1 REs are conserved in human. This increases the number of 'known' putative human R1REs from 9 to 21. The Runx1 enhancers that were identified to be conserved between human and mouse were assigned a common identity, R1RE2-R1RE21, and (other than R1RE1) they are numbered in order of 5' to 3' location relative to the orientation of the Runx1 gene (Table 1). Interaction with Runx1 promoters described by 4C and TF binding motif analysis [25].
Conservation between mice and humans suggests that these REs are fundamental for the correct regulation of RUNX1. To determine possible regulatory function in human haematopoietic cells, bioinformatic analysis of R1REs in K562 chronic myelogenous leukaemia cells was undertaken (Table 2, Figure 2). K562 cells express RUNX1 and are comprehensively annotated in ENCODE. Our analysis showed that not all R1REs appear to be active in K562 cells. Moreover, the accessibility of some REs was altered upon cohesin mutation, indicating that cohesin's role in 3D genome structure might influence the function of some REs.

Chromatin Features of R1REs
ChromHMM indicates that R1RE4 and R1RE5 are in repressed chromatin, and that R1REs 1-3, 6, 8, 10-12, and 14-18 are active enhancers. R1REs 9, 13, 19 and 21 show chromatin features of enhancers; however, they are labelled as transcription-associated alongside R1REs 7, 8 and 20 (Table 2, Figure 2). Transcription association may be due to R1RE proximity with exons when compared to the regions annotated as enhancers or it may indicate that these regions are producing RNA, for instance enhancer RNA (eRNA).
ATAC-seq in K562 cells with a null mutation in the STAG2 gene, which encodes a cohesin subunit [51], showed that R1REs 2, 6, 18 and 21 regions reside in differentially open chromatin regions in STAG2 mutant cells (Table 2). This suggests that cohesin influences the accessibility (and possibly the function) of these REs. R1RE1 binds cohesin subunit SMC3 and CTCF, while cohesin subunit RAD21 binds with CTCF at R1RE12, and in the absence of CTCF with R1RE18. CTCF binding was also observed in R1REs 2, 10-11, 13, 15-17, and 19-21 (Table 2).
FANTOM5 data showed that 16 of the 21 R1REs produced eRNA in haematopoietic cells (Figure 2, Supplementary Table S1), whereas R1REs 3, 6, 7, 9 and 10 do not have any reported eRNA expression. ChromHMM labelled R1REs 7-9, 13, and 19-21 as transcriptionassociated. This may be attributed to enhancer RNA transcription, as mR1REs 8, 13, and 19-21 all produce eRNAs. It remains to be determined whether the transcription association observed in R1REs 7 and 9 is also due to enhancer RNA production.

Single Nucleotide Polymorphism (SNP) Analysis of R1REs
Disease-associated variants are found in REs and in genes with equal frequency [83]. More than 95% of genome-wide association studies (GWAS) single nucleotide polymorphisms (SNPs) are located in intergenic regions, of which over 75% are associated with open chromatin (DNase I HS sites) implying a strong association with REs [84]. A variant in a RE may cause differential regulation of a gene, and could be functionally equivalent to a mutation in the coding sequence of the gene itself. Recent studies have shown that alterations to REs are associated with dysregulation of oncogenic genes [85]. It is therefore possible that mutations in enhancers or insulators may change their function, and subsequently could lead to altered gene expression [86].
Of the 40 SNPs identified in the conserved R1REs, 6 had previously reported functional effects in haematopoietic cells. R1RE2 contains rs2834945, a T to C change that is predicted to affect 11 regulatory motifs including GATA2. rs2834945 affects the expression of kynurenine 3-monooxygenase (KMO) in peripheral blood monocytes [87]. KMO encodes a mitochondrial outer membrane protein that catalyses the hydroxylation of L-tryptophan metabolite, L-kynurenine, to form L-3-hydroxykynurenine [88]. High KMO expression leads to neurodegeneration [89].
R1RE3 harbours two functional SNPs: rs16993221 and rs909143. rs16993221 is associated with white blood cell count and contributes to chronic inflammation [90]. The A to T change in rs16993221 alters two regulatory motifs, BATF and IRF, which work together in immune response. The A to G change in rs909143 is also related to expression of KMO in peripheral blood monocytes [87], and is predicted to affect two regulatory motifs (NR3C1, POU2F2). R1RE18 harbours rs73900579, a T to C variant that alters two regulatory motifs and is associated with red cell distribution width [91]. R1RE21 contains two functional SNPS, rs2249650 and rs2268276; both are associated with AML susceptibility. The different SNPs are in linkage disequilibrium (LD) and change the ability of R1RE21 to act as an enhancer. A (rs2249650) G (rs2268276) bases and AA have greater enhancer capability than GA and GG. GA has the least enhancer capability. The SNPs also affect SPI1 binding capability; AG has strong SPI1 binding, GA has weak SPI1 binding, whereas AA and GG have medium SPI1 binding capability [92].

Cancer Associated Genetic Variation Occurrence in R1REs
Although not all SNPs found in this study have been functionally analysed, some of the R1RE SNPs were reported in patients with AML. Two AML patients had R1RE5 SNP rs2834885. One of the patients who had R1RE5 SNP rs2834885 also had three additional alterations to R1REs; R1RE2, R1RE14 SNP rs9976900 and R1RE19 SNP rs2284613. R1RE14 SNP rs9976900 has previously been associated with long non-coding RNA RUNX1-IT1 eQTL in the brain cortex [93] and was also associated with paediatric asthma [94].
The finding that SNPs and alterations in R1REs associates with human disease raises the possibility that these regulatory elements may be important for regulation of RUNX1 and/or other genes. It is currently unclear whether these variants have a higher or lower frequency than if they occurred by chance, and a more comprehensive catalogue of genomic data would be needed to provide significance. R1RE21 SNPs were found to be associated with AML susceptibility. However, it is possible that mutations in R1REs may explain AML progression in patients who have no known protein-coding mutations. In support of this idea, data available at cBioportal [61,62] shows a wide variability in RUNX1 expression, regardless of RUNX1 mutation status (Supplementary Figure S1).
Functional analyses are necessary to determine whether sequence variation is causative of differential gene expression and consequently, disease or neoplasia.

Prediction the Functional Consequences of Non-Coding Variations
The majority of the R1REs cancer associated genetic variations and SNPs identified in these regions have no functional data. Predictions of the functional consequences of R1REs cancer associated variations and SNPs highlight 7 pathogenic predicted variations (Supplementary Table S3). R1RE1 variation (chr21:g.36399191A>T), R1RE3 (rs116951441), R1RE11 (rs2834756), R1RE18 (rs73900579), R1RE19 (rs2284613) and the previously AML susceptibility associated SNPs R1RE21 (rs2249650 and rs2268276) all have prediction scores of >0.87. While the FATHMM algorithm assigned pathogenicity to previously uncharacterised variants, it did not predict pathogenicity for SNPs in R1RE3, R1RE10, and R1RE14 that were associated with eQTLS or other phenotypes. This highlights the importance of using several approaches to determine the effect of variants in R1REs. Importantly, FATHMM predictions reinforced the pathogenic potential of SNPs in R1RE18 and the leukaemia-associated SNPs in R1RE21. The combined analyses show that the R1RE3, R1RE18 and R1RE21 are affected by changes that have potential to be detrimental to correct gene regulation.

Discussion
In this study, we identified 12 Runx1 regulatory regions that are conserved between human and mouse, and are likely to be important for controlling expression of human RUNX1. This analysis increased the defined total of RUNX1 enhancers in human from 9 to 21, compared with 29 known REs in mice (Figures 1 and 2 and Table 1). Previously, both mouse and human Runx1 REs were named according to their distance from the TSS (or ATG) of the P1 promoter of Runx1. Here we instead assign each of the conserved REs a number, starting with R1RE1 through R1RE21. R1RE1 was assigned to +23/+24/eR1/RE1 because it is the best characterised of the REs, with other REs named in order according to their 5'-3' location on the same strand as the Runx1 gene. The assignation of a common identifier for enhancers conserved between human and mouse may eliminate confusion arising from different nomenclature in the numerous studies that seek to identify and characterise RUNX1 enhancers.
Of the identified R1REs, 8 interact with the RUNX1 promoters; R1REs 1, 2, 4, 10, 15, 18 and 21 have differing interactions with Runx1 P1, P2 and R1RE1 in different mouse or human cell lines (Table 1) [23,25,27,28]. A recent preprint describes a study in which enhancer-promoter contacts at the mouse Runx1 gene were determined step-wise during differentiation of mouse embryonic stem cells (ESC) to mesoderm and then on to HPC [66]. During mesoderm specification, the P2 promoter increased interactions with R1REs 1, 4, 6, 13, and 18 (and additional mouse enhancers not conserved in human). Upon differentiation of mesoderm to HPC, contacts were lost with the more upstream R1REs, while contacts to both P1 and P2 were increased with R1REs 11 and 21. Concomitantly, the Runx1 gene was expressed from the P2 promoter upon differentiation to mesoderm, and both P1 and P2 upon adopting HPC identity [66]. This study therefore reinforces the assumption that conserved REs for Runx1 are important for expression of Runx1 during haematopoietic differentiation.
The regulatory activity of 15 of the 21 human R1REs has been functionally confirmed. Moreover, there is good evidence that R1REs 16 and 21 are directly involved in AML disease progression. Cheng et al. (2018) characterised R1RE15 as a silencer of RUNX1 P2 by measuring the repressive effect of various deletion/mutant R1RE15 luciferase constructs on transcription from the P2 promoter in K562, OCI-AML3 and U937 cells [28]. However, in K562 cells R1RE15 has ENCODE and ChromHMM annotations associated with enhancer activity. Regions that have both enhancer and silencer annotations may have multiple roles depending on DNA interaction partners and the cell type. The discovery of RUNX1 silencer R1RE15 resulted from characterisation of a novel t(5;21)(q13;q22) translocation involving RUNX1 that was acquired during the progression of myelodysplastic syndrome to AML in a paediatric patient [28]. This translocation did not generate RUNX1 fusion, but rather it aberrantly upregulated the P2 isoform of RUNX1. The authors state that the translocation facilitated upregulation of RUNX1 P2 because the silencer at R1RE15 was removed.
R1RE21 contains SNPs that are associated with AML susceptibility, and in addition, R1RE21 is the site of translocation in t (8;21) in AML patients in which RUNX1 and ETO genes recombine. Schnake et al. (2019) identified a long non-coding RNA in R1RE21, at a site of these frequent chromosomal translocations [95]. This long non-coding RNA results in a more relaxed chromatin organisation at R1RE21 (at the location of the breakpoint). Consequently, it is possible that chromatin relaxation leads to a higher probability of double stranded breakages in myeloid cells, resulting in translocations [95].
There were no SNPS with MAF of >1% present in R1RE1. The one detected mutation in R1RE1 (chr21:g.36399191A>T) is predicted by FATHMM to have a pathogenic function. This suggests that there is strong selection pressure to conserve R1RE1. Consistent with this, the region containing R1RE1 appears to be functionally important in haematopoietic cells. To determine the importance of R1RE1 for RUNX1 expression and function, Mill et al.
(2019) deleted the P1-P2 intron of RUNX1 P1 in OCI-AML5 cells [34], which led to a strong selection against edited cells. The P1-P2 intronic region, which contains multiple R1REs, is therefore important for cell survival and RUNX1 expression.
Predictions of functional consequences of genetic variation within the R1REs estimate that 7 variations are likely to be pathogenic. Further functional analysis of the genetic variations will allow researchers to determine the roles of these R1REs in normal haematopoiesis as well as disease progression. These predictions may be an underestimation of the functional consequences of genetic variation, as the they are calculated based on available ENCODE data.
The ability to identify significant SNPs and mutations in haematopoietic cells is limited by the scarcity of whole genome sequence in individual haematopoietic cell types, AML, and other myeloid malignancy samples. Therefore, SNP frequency in R1REs could be an underestimate. The present study is limited by both the paucity of genomic sequence data of REs in normal and leukaemic cells, and the lack of data on cell type-specific chromatin status of REs. Limitations to publicly available data is likely to be the reason why we could not assign statistical significance to SNPs present in REs. Nevertheless, determining and categorising SNPs may help us understand each patient's disease progression, individual responses to drugs, or susceptibility to relapse. Some patients had SNPs at the RUNX1 locus with previously described functional effects on haematopoietic cells. However, the SNP data for myeloproliferative disorders lacks significance because the majority of SNP information comes from analysis of whole blood, rather than each specific cell type.

Conclusions
Other than R1RE15 and R1RE21 (which are linked to leukaemia), the relevance of R1RE mutations to the progression of AML is unknown. This study sets the scene for functional analyses to precisely determine how RUNX1 is regulated, including further RUNX1 chromatin interaction analyses, and CRISPR/Cas9-mediated interference with RE activity. The results of this informatics analysis also provide a rationale for screening patients with mutations in enhancer regions. By analysing deep sequencing of AML patient samples that have no identifiable mutations in the RUNX1 coding region or in other leukaemia genes, mutations in REs that lead to dysregulated RUNX1 expression may be discovered. Sequencing analyses that identify enhancer mutations may not only explain a patient's AML progression, but could provide the basis for future therapeutic targets. Understanding the regulatory landscape of RUNX1 will further increase understanding of haematopoiesis as well as identify potential new regions for driver mutations in myeloid malignancies.