Identification of Enhancers and Promoters in the Genome by Multidimensional Scaling

The positions of enhancers and promoters on genomic DNA remain poorly understood. Chromosomes cannot be observed during the cell division cycle because the genome forms a chromatin structure and spreads within the nucleus. However, high-throughput chromosome conformation capture (Hi-C) measures the physical interactions of genomes. In previous studies, DNA extrusion loops were directly derived from Hi-C heat maps. Multidimensional Scaling (MDS) is used in this assessment to more precisely locate enhancers and promoters. MDS is a multivariate analysis method that reproduces the original coordinates from the distance matrix between elements. We used Hi-C data of cultured osteosarcoma cells and applied MDS as the distance matrix of the genome. In addition, we selected columns 2 and 3 of the orthogonal matrix U as the desired structure. Overall, the DNA loops from the reconstructed genome structure contained bioprocesses involved in transcription, such as the pre-transcriptional initiation complex and RNA polymerase II initiation complex, and transcription factors involved in cancer, such as Foxm1 and CREB3. Therefore, our results are consistent with the biological findings. Our method is suitable for identifying enhancers and promoters in the genome.


Introduction
For cells to utilize genetic information, many genes must be expressed in a coordinated manner. The accessibility of genomic information depends on how DNA is packed into the chromatin. Chromatin is the basis of various biological processes, including cell cycle regulation and, DNA replication, repair, and maintenance [1]. Euchromatin is a genome region consisting of DNA with a relatively loose structure. The open structure allows RNA polymerase and other proteins to access the genome for DNA transcription. Enhancers and promoters also approach the euchromatin region to form DNA loops. Gene expression is controlled by promoters near the gene as well as by gene regulatory sites named as enhancers that are distant from the gene. However, how promoters and enhancers interact with each other to regulate gene expression is not well understood. High-throughput chromosome conformation capture (Hi-C) can be used to analyze the 3D structure of a genome by detecting genomic regions that are spatially close to each other using nextgeneration sequencing [2]. This conventional method led to an approximation of the genome structure from the Hi-C heat map [3]. We demonstrated the potential of using this method for identifying enhancers and promoters by applying multi-dimensional scaling (MDS).

Hi-C Data
The Hi-C dataset is valuable for understanding how chromatin is organized in the nucleus to effectively perform its biological functions; that is, it enables examination of the physical interactions of DNA loops. We downloaded eight Hi-C data during the cell cycle: 0 min (metaphase) and 35 min (anaphase/telophase); 60 min (cytokinesis); and 90, 120, 180, 240, and 360 min (G1) from Series GSE141067 [2]. In this study, we analyzed Hi-C data with a resolution of 50 kbp and found that long-range genomic interactions were reshaped from 60 min and completed within 90-120 min [2]. We assigned the average number of Hi-C detections for each distance between genomic coordinates to the missing values. In addition, the Hi-C data showed a significant difference between the number of detections with a far distance between coordinates and the number of detections with a close distance. Equation (1) is a function that does not affect the number of detections when the distance between the coordinates is close but increases the number of detections when the distance between the coordinates is large. As the purpose of this study was to identify enhancers and promoters, we evaluated those with a large number of Hi-C detections, although the distance between coordinates was large. Therefore, the following steps were performed according to the distance between nucleotides (Supplementary Tables S1-S4).
where i and j are coordinates, and d ij is the number of Hi-C detections (Figures 1 and 2). We processed the Hi-C data with various weights. For example, we multiplied the number of detections by the distance between each coordinate.
where a is a constant. We then averaged the number of detections per coordinate distance and multiplied the number of detections greater than the average by a constant.
where N is the length of the base. However, the results of enrichment analysis obtained using methods (2) and (3) were unsatisfactory. Figure 3 shows the heat map of the Hi-C data, where the genomes are close to each other when the number of Hi-C detections was large. To apply the multidimensional scale construction method, the Hi-C data were transformed as follows: The inverse of the Hi-C data was used as the distance data because the multidimensional scale construction method was used for similarity matrices.

Multidimensional Scaling
MDS (Appendix A) is a method that reproduces the original location of objects based on the distance data between objects [4] according to the following principle. Consider an N × P data matrix X = (x ij ), N data o i = (x i1 , x i2 , ...., x iP ). Then, we consider the N × N matrix B = X X t and the N × N distance matrix D = (||o i − o j ||) created from X. We define D (2) as the matrix of all components of the distance matrix D squared, and multiply D (2) X * is the centered data matrix and B cen is the inner product matrix obtained from the centered data matrix.
G is the orthogonal matrix of the inner matrix B cen . From (5) and (7), From the above equations, the original coordinates can be derived.

Applying MDS to the Hi-C data
We applied MDS to the distance matrix D It is unclear which column of the orthogonal matrix U presents the desired structure. However, in many cases, it is thought that the second and third columns of the orthogonal matrix U present the desired structure. Accordingly, the second and third columns of the orthogonal matrix U were selected as the desired structures. We refer to the structure of the second and third columns of the orthogonal matrix U a hypothetical chromosome. It then acquired the euchromatin region where the enhancer and promoter are predicted to be located. Moreover, the distance between the coordinates of the hypothetical chromosomes based on the Euclidean distance was determined.
The length of the DNA loop in the euchromatin region is only partially understood. In this study, we averaged E i every 50 kbp and used this value as the distance between the coordinates, the number of coordinates is reduced because it is averaged every 50 kbp. Accordingly, we included averages taken every 45, 40, and 35 kbp from both sides.
We proceeded to show the criteria for acquiring the coordinates as DNA loops. Five times the average distance between these genomic coordinates was set as the threshold. We focused on coordinates above the threshold value, which are considered to form DNA loops. However, because enhancers and promoters are at the ends of DNA loops, we also focused on coordinates below the threshold. In this study, the following criteria were established: E 0 was set as the threshold value and E i is the distance between coordinates. n k is the coordinate such that E n k − E n k−1 < 0, and E n k+1 − E n k > 0 is satisfied. Then, in arbitrary i, n k of n k < i < n k+1 , where i satisfying E i > E 0 is defined as n k . Finally, we acquired n k−1 to n k+1 as a DNA loop. Thus, we considered that coordinates from the blue point to the next blue point in Figure 4 form a DNA loop.

Enrichment Analysis
BiomaRt was used to retrieve a list of genes from the obtained coordinates. The gene list obtained by BiomaRt was uploaded to g: Profiler [5] to identify functions, processes, and transcription factors related to enhancers and promoters. The coordinates acquired as euchromatin regions were subjected to enrichment analysis using g: Profiler. Enrichment analysis can reveal the functions of differentially expressed genes.

Hypothetical Chromosomes
The heat map of the Hi-C data after organizing these data by (1) is shown in Figure 3. Pairs with large values in the matrix indicate region pairs with a high contact probability. MDS was applied to the Hi-C data, and the resulting hypothetical chromosomes are shown in Figure 5. The euchromatin region was identified (Figure 4). The acquired euchromatin regions were summarized in Table S1 (Biological replicate 1) and Table S2 (Biological replicate 2).

Enrichment Analysis
We used BiomaRt [6] in R to retrieve genes from the obtained coordinates. Finally, the obtained euchromatin regions were subjected to enrichment analysis using g: Profiler [5]. The results are presented in Tables 1 and 2. The functions and processes involved in transcription were also determined. The results of 0-360 min enrichment analysis were summarized in Table S3 (Biological replicate 1) and Table S4 (Biological replicate 2). The DNA loops obtained from the reconstructed genome structure contained bioprocesses involved in transcription, such as the pre-transcriptional initiation complex and RNA polymerase II initiation complex, and transcription factors involved in cancer, such as CAMP responsive element binding protein 3 (CREB3) and forkhead box M1 (FOXM1). Estrogen receptor 1 (ER-alpha) is involved in regulating gene expression, and is associated with breast cancer [7]. FOXM1 plays an essential role in cell cycle progression; its expression peaks in the S and G2/M phases. FOXM1 upregulation occurs in most solid human cancers [8]. MAF BZIP transcription factor G (MafG) interacts with methionine adenosyltransferase a1 to regulate transcription; MafG is overexpressed in cancer cells [9]. RUNX family transcription factor 3 (AML2) forms a heterodimeric complex core-binding factor (CBF) with CBFB and functions as a tumor suppressor. The gene is frequently deleted or transcriptionally silenced in cancer cells [10]. Progesterone receptor (PR) is involved in regulating gene expression and is associated with breast cancer [7]. Retinoic acid receptor alpha (RARA) regulates transcription in a ligand-dependent manner; diseases associated with acute promyelocytic leukemia [11]. CREB3 encodes a transcription factor that is a member of the leucine zipper family of DNA-binding proteins. This protein binds to the CAMP-responsive element and regulates cell proliferation. The mRNA expression of CREB3 is higher in OS tissues than in normal tissues [12].

Comparison with Previous Studies
Previous studies have focused on identifying euchromatin regions or DNA loops in Hi-C maps [3]. The chromosome forms a topologically associating domain (TAD), and each TAD (A/B) compartment is organized with an average of 880 kb. The A compartment is loosely structured and serves as the transcriptionally active region, whereas compartment B is narrowly structured and serves as the transcriptionally inactive region. At the TAD level, loops and stripes/tracks were formed as sub-TADs (average size 185 kbp). Therefore, the TAD compartment is a rectangle that is approximately 880 × 880 kbp in size in the Hi-C map, and the A compartment is a rectangle that is approximately 185 × 185 kbp in the Hi-C map, where enhancers, promoters, and insulators form DNA loops. Figure 6 shows the existence of DNA loops of such sizes. As described above, previous studies identified these DNA loops. If only a rough TAD compartment must be identified, the method used in the previous study may be sufficient. However, detecting sub-TADs and DNA loops is markedly more complicated and will become more difficult as the resolution of Hi-C is improved. In addition, several studies have used MDS to analyze Hi-C data for accurately reproducing 3D genome structures. For example, miniMDS [13] involves splitting highresolution Hi-C data into several parts to which MDS is applied to low-resolution Hi-C data, and then the split high-resolution Hi-C data are reconstructed. The framework HiCRep evaluates the reproducibility of Hi-C data [14] using the stratum-coordinated correlation coefficient as a similarity measure to quantify differences between Hi-C contact matrices. Furthermore, it has been found that HiCRep/MDS method, which combines HiCRep with MDS, is robust to low per-cell sequence depths and that this robustness is further improved when high and low coverage cells are projected together [15]. Another framework for predicting 3D genomic structures using t-distributed stochastic neighbor embedding (t-SNE) is named as StoHi-C [16]. MDS has inherent problems with very sparse high-dimensional Hi-C datasets, whereas tSNE overcomes these limitations. This method can reproduce the characteristics of chromosome 3D structures more clearly than MDS in yeast Hi-C data, which are considered as suitable for recreating the 3D structure of chromosomes. The distances between the coordinates obtained from the 3D structure reproduced by the StoHi-C method are shown in Figure 7. As shown in Figure 7, attempts to precisely reproduce the 3D structure resulted in no significant difference in the distance between coordinates, even when acquiring DNA loops with a threshold value. Therefore, the enhancers and promoters cannot be precisely identified. We focused on the ones with a large number of Hi-C detections, although the distance between coordinates is large because the goal of this study was to identify enhancers and promoters. Therefore, we added weights as shown in Equation (1).
Based on our results, it is useful to obtain DNA loops by automatically visualizing the chromosome structure using MDS, as performed in this study.

Identification of Eigenvectors in MDS Representing the Actual Structure
We sought to identify the eigenvectors in MDS that represent the actual structure. The column of the orthogonal matrix U presenting the desired structure is unclear. However, in many cases, the second and third columns of the orthogonal matrix U are thought to present the considering structure. Herein, we performed a simple simulation, considering a circle with errors, such as x i1 = cos (2π i N ) + g(λ),x i2 = sin (2π i N ) + g(λ). Here, the function g is the error function and is defined as follows: We used MDS to recover the original circle from the distance matrix; however, the second and third eigenvectors were found to represent the original circle. In such complex data with random numbers, the second and third eigenvectors typically represent the original structure ( Figure 8).