Structural Analysis of microRNAs in Myeloid Cancer Reveals Consensus Motifs

MicroRNAs (miRNAs) are short non-coding RNAs that function in post-transcriptional gene silencing and mRNA regulation. Although miRNAs range from 17 to 27 nucleotides in length, most are 22 nucleotides long. The expression of miRNAs changes significantly in cancer, altering protein levels in cancer cells by preventing some genes from being translated into proteins. In this research, a structural analysis of 587 miRNAs that are differentially expressed in myeloid cancer was carried out. Length distribution studies revealed a mode and median of 22 nucleotides, with a mean of 21.69 and a variance of 1.65. We performed a nucleotide analysis for each position; overall, Uracil was the most observed nucleotide (27.8%) and Adenine the least observed one (22.6%). There was a higher frequency of Adenine at the beginning of the sequences when compared to Uracil, which was more frequent at the end of miRNA sequences. The purine content of each implicated miRNA was also assessed. A novel motif analysis script was written to detect the most frequent 3–7 nucleotide (3–7n) long motifs in the miRNA dataset. We detected CUG (42%) as the most frequent 3n motif, CUGC (15%) as a 4n motif, AGUGC (6%) as a 5n motif, AAGUGC (4%) as a 6n motif, and UUUAGAG (3%) as a 7n motif. In the second part of our study, we further characterized the motifs by analyzing whether they align at certain consensus sequences in our miRNA dataset, whether certain motifs target the same genes, and whether these motifs are conserved in other species. This thorough structural study of miRNA sequences provides a novel strategy to study the implications of miRNAs in health and disease. A better understanding of miRNA structure is crucial to developing therapeutic settings.


Introduction
MicroRNAs (miRNAs) are single-stranded non-coding RNAs made up of short nucleotide sequences with lengths varying between 17 and 27 nucleotides, the vast majority being 20-21 nucleotides long [1,2]. Although miRNAs are relatively short sequences, they are effective enough to function as gene handbrakes and prevent long transcripts from being translated into proteins. MiRNAs interact with specific parts of a transcript by base-pairing [3].
MiRNAs play a crucial role in maintaining the homeostasis of important metabolic pathways and processes [4]. Altered miRNA levels are associated with cancer and developmental biology [5]. Up- or downregulation of miRNA expression is an effective driver of the development and spread of cancer [6]. Changes in cellular protein levels caused by miRNA regulation affect the molecular function and harmony of the cell [7]. While the amount of intracellular protein is directly related to the expression of genes, it can also be affected indirectly by miRNAs, which inactivate target transcripts before they are translated into proteins [8]. In cancer cells, this leads to less protein being synthesized than required, which in turn drives the cells to act according to the cancer's constitution [9]. In a cancer cell, while gene expression is directly altered via mutations or methylations, it is indirectly inactivated by the expression of miRNAs [10]. Research in fields other than cancer, such as cell developmental biology, stem cell, and cardiovascular research, has also shown that a cell's miRNA expression affects the deactivation of some biological mechanisms [11,12]. The impact of such miRNA-based control on gene expression makes miRNAs one of the important epigenetic factors that could be used effectively as therapeutic targets in translational research [13,14].
Although miRNA genes are very short compared to genes coding for proteins, they are transcribed like other genes and fulfill their functions through complementary base pairing. They function as part of the ribonucleoprotein complex RISC (RNA-induced silencing complex), and by binding to the complementary target they potentiate the action of the RISC [15]. The nucleotides at positions 2 to 8 near the 5′ end of miRNAs are the predominant binding sites of miRNAs and are called seed regions [16]. In some forms of binding, seed complementarity is not sufficient by itself, and pairing in the central or end region of the miRNA is also required [17]. In most cases, miRNAs interact with the 3′ untranslated regions (3′UTR) of target mRNAs to induce mRNA degradation and translational repression [2]. This striking positional affinity has led to the development of miRNA target search algorithms that focus on 3′UTRs, further amplifying the bias toward functional 3′UTR sites [18]. However, effective miRNA-binding sites have also been identified in the 5′UTR or the open reading frame (ORF) of target mRNAs [19–22]. Different studies and computational tools measure and report the miRNA-gene relationship in different ways [23]. In this work, our first aim was to find the miRNAs with similar nucleotide sequences, and our second was to ask whether the extent of this similarity gives information about the consensus miRNAs and their target proteins [24].

Materials and Methods
The main purpose of the research was to analyze the nucleotide sequences and specific motifs of microRNAs implicated in myeloid cancer. To achieve this, the miRNA-seq data were collected and structured from different databases for analysis, as described in the workflow chart in Figure 1. We used the GDC Data Portal to find the miRNAs that are most frequently altered in myeloid cancer (https://portal.gdc.cancer.gov/) (accessed on 25 April 2016) [25]. Specifically, the TCGA project for Acute Myeloid Leukemia (TCGA-LAML) was used under the filters transcriptome profiling, miRNA Expression Quantification, miRNA-Seq, and BCGSC miRNA Profiling. The dataset was downloaded in May 2021 and consists of the microRNA expression levels of 197 AML patients determined by Illumina HiSeq 2000 microRNA-seq. Level 3 sequencing data (expression levels of each miRNA), transformed to the log2 scale, were used [26]. The set of isoform.quantification.txt files, which give read counts at base-pair resolution, contained the total read counts for mature miRNAs (corresponding to miRBase v13 MIMAT identifiers), normalized to RPM.
Figure 1. Flowchart of data collection and processing. In brackets is the database/tool used to perform the analysis.
Next, the nucleotide sequences of these miRNAs were retrieved manually from the miRBase database (https://www.mirbase.org/) (accessed on 1 February 2022) [27] for use in sequence and motif analysis. The nucleotide lengths of these miRNAs were plotted in a histogram, and descriptive statistics were computed using RStudio [28]. Next, the nucleotide frequency at each position was assessed together with purine/pyrimidine content using an Excel spreadsheet. In the second part of the project, we wrote a C++ script to analyze the miRNA sequences and identify the motifs in miRNAs in cancer (code deposited in GitHub [29]).
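The length statistics described above can be illustrated with a minimal Python sketch. This is not the authors' RStudio workflow; the function name `length_summary` and the toy sequences are hypothetical, and the use of the population variance is an assumption, since the text does not say which variance estimator was used.

```python
from collections import Counter
from statistics import mean, median, mode, pvariance

# Hypothetical toy input (not the study's 587-miRNA dataset).
sequences = [
    "UGAGGUAGUAGGUUGUAUAGUU",
    "UAGCAGCACGUAAAUAUUGGCG",
    "CAAAGUGCUUACAGUGCAGGUAG",
    "UAAAGUGCUGACAGUGCAGAU",
]

def length_summary(seqs):
    """Histogram and descriptive statistics of sequence lengths."""
    lengths = [len(s) for s in seqs]
    return {
        # length -> number of sequences with that length
        "histogram": dict(sorted(Counter(lengths).items())),
        "mean": mean(lengths),
        "median": median(lengths),
        "mode": mode(lengths),
        # Assumption: population variance; swap in statistics.variance
        # for the sample (n - 1) estimator.
        "variance": pvariance(lengths),
    }

if __name__ == "__main__":
    for key, value in length_summary(sequences).items():
        print(key, value)
```

Reporting the mode alongside the mean and median is useful here because the length distribution is concentrated on a few integer values.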
To identify the target genes of the miRNAs that contain the consensus motifs, we first downloaded the database of validated target genes from miRTarBase [30]. The target genes of all sequences containing the motif of interest were taken from this database (for each motif separately). The total number of target genes (how many genes are targeted by the sequences with the motif) and the gene frequency (the number of sequences with the motif targeting the same gene) were calculated (for each motif separately).
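The two quantities defined above, the total number of target genes and the gene frequency, can be sketched as follows. This is an illustration only: the function `target_summary`, the miRNA names, the gene symbols, and the dictionary layout are hypothetical and do not reflect the miRTarBase file format or the authors' code.

```python
from collections import Counter

# Hypothetical sequences and validated targets (placeholder names).
sequences = {
    "miR-a": "AAGUGCUUCCA",
    "miR-b": "GGUGCUUCAAU",
    "miR-c": "ACGUACGUACG",
}
targets = {
    "miR-a": {"GENE1", "GENE2"},
    "miR-b": {"GENE1", "GENE3"},
    "miR-c": {"GENE1"},
}

def target_summary(motif, sequences, targets):
    """Return (carriers, total_target_genes, gene_frequency) for one motif.

    carriers: miRNAs whose sequence contains the motif.
    total_target_genes: number of distinct genes targeted by the carriers.
    gene_frequency: gene -> number of carriers that target it.
    """
    carriers = [name for name, seq in sequences.items() if motif in seq]
    gene_frequency = Counter()
    for name in carriers:
        gene_frequency.update(targets.get(name, set()))
    return carriers, len(gene_frequency), gene_frequency

if __name__ == "__main__":
    carriers, total, freq = target_summary("GUGCUUC", sequences, targets)
    print(carriers, total, freq.most_common())
```

A gene whose frequency equals the number of carriers is targeted by every miRNA that contains the motif.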
Finally, to search for the conservation of our motifs, we downloaded the mature miRNA sequences of different species from mirbase.org. All of the sequences for the selected species were analyzed for motifs with our program written in C++. Databases of 5n and 6n motif frequencies were built for each species, and the motif frequency was derived as the number of miRNA sequences (from one species) that contain the motif.

Results
The data retrieved from the GDC Portal were from the miRNA profiling of patients with hematopoietic and reticuloendothelial cancer. This yielded a dataset of 587 miRNAs elevated in myeloid cancers. The sequences of these miRNAs were extracted from miRBase; a sample of the data is given in Table 1.
Table 1 note: Positions 1–27 are given in the 5′–3′ direction. The complete data can be found in Supplementary Table S1.
It is not yet known whether the nucleotide length of miRNAs plays an active role in cancer or other diseases. Perfect base-pairing leads to the degradation of mRNA (a mechanism mainly seen in plants), whereas imperfect base-pairing with the target mRNA leads to repression of translation [31]. Following this line of argument, it could be assumed that the longer the miRNA sequence, the higher the probability of complementing the target mRNAs. However, this needs experimental proof. Based on the volume of literature published for each miRNA in miRBase, we noted a higher volume of research carried out on short miRNAs (consisting of 17 and 18 nucleotides) when compared to the ones with 26 and 27 nucleotides (Table 2). Next, the percentage of nucleotides in each position was quantified (Figure 3). The analysis was done from the first (5′) to the last (3′) position of the miRNA's nucleotides. Overall, Uracil was the most observed nucleotide, and Adenine was the least observed one, with 27.8% and 22.6%, respectively.
In the first nucleotide position, 183 miRNAs have an A and 186 miRNAs start with a U. There is a higher frequency of Adenines at the beginning of the sequences when compared to Uracil, which is more frequent at the end of miRNA sequences. In particular, there is a high density of Uracils between nucleotide positions 22 and 25. Purine (A and G) and pyrimidine (C and U) nucleotide bases were analyzed for their frequency in the studied miRNA structures (Supplementary Table S2). The highest and lowest purine-scoring miRNAs are listed in Table 3. hsa-mir-765 (85.71% purines), hsa-mir-1468 (81.82%), and hsa-mir-1910 (80.00%) are among the miRNAs with very high purine content. The lowest purine content is found in hsa-mir-1281 (5.88%) and hsa-mir-483 (9.52%).
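The per-position nucleotide percentages and the purine content can be computed along the following lines. This is a hedged Python sketch rather than the authors' spreadsheet; `position_frequencies` and `purine_percent` are hypothetical names, and normalizing each position by the number of sequences long enough to reach it is an assumption (the text does not state the denominator used).

```python
from collections import Counter

def position_frequencies(seqs):
    """Percentage of A, C, G, U at each position, read 5' to 3'.

    Assumption: each position is normalized by the number of sequences
    that are long enough to have a nucleotide there.
    """
    longest = max(len(s) for s in seqs)
    table = []
    for i in range(longest):
        counts = Counter(s[i] for s in seqs if len(s) > i)
        total = sum(counts.values())
        table.append({nt: 100.0 * counts[nt] / total for nt in "ACGU"})
    return table

def purine_percent(seq):
    """Percentage of purines (A and G) in one sequence."""
    return 100.0 * sum(seq.count(nt) for nt in "AG") / len(seq)

if __name__ == "__main__":
    toy = ["AUG", "AC", "UUGA"]  # hypothetical toy sequences
    for pos, row in enumerate(position_frequencies(toy), start=1):
        print(pos, row)
    print(purine_percent("AAGGCU"))
```

With sequences of unequal length, the last positions are supported by only a few miRNAs, so percentages there should be read with the underlying counts in mind.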

Motifs in microRNA Sequences Implicated in Myeloid Cancer
In addition to the miRNA structure analysis, common motifs were determined according to their length and frequency. To find the most abundant motifs, we searched for nucleotide motifs of 3, 4, 5, 6, and 7 nucleotides shared among the miRNA sequences. We started with 3-nucleotide (3n) motifs, the length of a codon and the smallest motif size considered. Then, the same analysis was repeated for 4n, 5n, 6n, and 7n motifs.
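The motif search described here amounts to counting, for each k-nucleotide substring, how many miRNA sequences contain it. The following Python sketch shows one way to do this; it is not the authors' C++ script, and the names `motif_counts` and `top_motifs` as well as the toy sequences are hypothetical. Counting each motif at most once per sequence is an assumption, chosen to match the "found in N miRNAs" wording of the results.

```python
from collections import Counter

def motif_counts(seqs, k):
    """Number of sequences that contain each k-mer at least once."""
    counts = Counter()
    for s in seqs:
        # A set ensures a motif repeated within one sequence counts once.
        counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return counts

def top_motifs(seqs, k, n=5):
    """The n most widespread k-mers as (motif, count, percent of sequences)."""
    total = len(seqs)
    return [(m, c, 100.0 * c / total)
            for m, c in motif_counts(seqs, k).most_common(n)]

if __name__ == "__main__":
    toy = ["ACUGC", "CUGCA", "GGGGG"]  # hypothetical toy sequences
    for k in range(3, 8):
        print(k, top_motifs(toy, k, n=3))
```

Sequences shorter than k contribute no motifs, and ties between equally frequent motifs are returned in arbitrary order.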

3n miRNA Motifs
In the first round of motif search, we analyzed the sequences for 3n motifs, which are the smallest significant motifs. The most observed and the shortest motifs in the miRNA sequences were CUG, UGC, UGG, UGU, CAG, UUG, CCU, CUU, GUG, AGG, UCU, GCU, CGU, CGC, GCG, UGC, ACG, and CGA. These were found in 91.65% of miRNAs. Only 8.35% did not have any of these 3n motifs in their sequences (for example, hsa-mir-122 and hsa-mir-1181) (Table 4).

4n miRNA Motifs
We divided the 4n motifs identified into two groups, the ones occurring in more than 75 miRNAs (the most detected) and the ones occurring in less than 10 miRNAs (the least detected) (Table 5). CUGC, ACUG, and UGCA were found as the most detected motifs in 87 and 85 different miRNAs, respectively. On the other hand, CGAA, CGAG, CGUA, and UCGA were the least detected 4n motifs in 10 different miRNA sequences. Moreover, 112 sequences of 587 microRNAs (19%) do not have any of the top 4n motifs. In addition, 28 of 112 sequences have the least common motifs, and 84 of the miRNAs do not have any of the listed motifs (Supplementary Table S3).

Longer Motifs
The purpose of searching for longer motifs (5n, 6n, and 7n) was to find potentially conserved or master sites in miRNAs. A total of 271 different 5n motifs were detected (Supplementary Table S3). AGUGC was the most frequent 5n motif, found in 36 miRNAs (6%). A total of 38 different 5n motifs were unique (Table 6). The other frequently detected long motifs are 6n and 7n sequences. The most frequently observed 6n motifs were AAGUGC and GCUUCC (detected in 22 different miRNAs, 4%), while UUUAGAG was the most detected 7n motif in our dataset (found in 19 miRNAs, 3%).

Consensus miRNA Sequences Having Many Motifs
Consensus motifs were analyzed in the miRNA sequences to elucidate how our motifs align consecutively, to different degrees, across different miRNA sequences. In this way, detailed results were obtained about where the identified motifs are located in the miRNAs and how they appear in high-consensus sequences (Table 7).

Target Genes of 7n Motifs
We next analyzed the target genes of the miRNAs that share common motifs, to see whether they give hints on the functional aspects of the motifs we identified in myeloid cancer. For this study, 7n motifs were selected as they are longer and can be more specific in their targets [1]. Using the miRTarBase database, we identified the targets of our miRNAs that have been experimentally validated in different studies. The overlapping genes are listed in Table 8, and all the detected targets are given in Supplementary Table S4. Our 7n motif GUGCUUC is present in 15 different miRNA sequences, and all of them target the same six genes (EIF2S1, SPRED1, HIP1, YOD1, ELK4, ABHD15). We see that this is the case for many motifs to different degrees. This suggests that the identified motifs are an important factor for target recognition and that small changes in the sequences can impact the specificity of binding/regulation.
The results of this analysis show that a miRNA can be associated with one or more mRNA targets through the common motifs in its sequence. Apart from the importance of motifs and consensus sequences in miRNA binding to targets, a secondary importance of our results may lie in these sequences being targets of RNA binding proteins (RBPs), which recognize specific sequence motifs and are key factors in regulating miRNA function. Although transcription factors and epigenetic modifications control the synthesis of miRNAs, their regulation after synthesis is highly controlled by RBPs [32]. Overall, studies on RBP binding and regulation of miRNAs remain insufficient. Among more than 500 identified human RBPs, only a few have been characterized in terms of their function on oncogene and tumor suppressor mRNAs [33]. Much remains to be revealed by future research about miRNA processing by RBPs in healthy and disease states. The complexity of regulation is further increased by indications that miRNAs and RBPs work cooperatively in controlling common mRNA targets [34]. Taking all this into account, a detailed study of the structure of these short RNA molecules, which can perform so many functions, is crucial, and the results presented in our study can serve as a starting point and raw material for such studies, especially in cancer models.

Conserved 5n and 6n Motifs
We further wanted to test our motif-finding script by analyzing the conserved miRNA motifs in different species. For this, 5n and 6n motifs were searched in the available miRNA sequences from different species (Supplementary Table S5). Among the top 15 identified motifs, we combined the ones that were common to all species and derived their percentages for the specific species studied (Table 9).
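The selection of motifs common to all species can be sketched in Python as an intersection of per-species top lists, followed by a per-species percentage. This is an illustrative assumption about the procedure, not the authors' C++ program; `shared_top_motifs`, `motif_percent`, the species labels, and the toy sequences are hypothetical.

```python
from collections import Counter

def shared_top_motifs(species_seqs, k, n=15):
    """k-mers that rank among the n most widespread motifs in every species.

    Assumption: a motif is counted once per sequence; ties at the
    rank-n cutoff are broken arbitrarily.
    """
    shared = None
    for seqs in species_seqs.values():
        counts = Counter()
        for s in seqs:
            counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
        top = {m for m, _ in counts.most_common(n)}
        shared = top if shared is None else shared & top
    return shared or set()

def motif_percent(seqs, motif):
    """Percentage of sequences in one species that contain the motif."""
    return 100.0 * sum(motif in s for s in seqs) / len(seqs)

if __name__ == "__main__":
    toy = {  # hypothetical species and sequences
        "species_1": ["AAGUGC", "AAGUGCU"],
        "species_2": ["CAAGUGC", "GGGGGGG"],
    }
    for motif in sorted(shared_top_motifs(toy, k=5)):
        print(motif, {sp: motif_percent(seqs, motif) for sp, seqs in toy.items()})
```

Because the cutoff at rank n can exclude a motif that is tied with the last retained one, the shared set is sensitive to n when many motifs have equal counts.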
MiRNAs are key regulators of many cellular processes and may be one of the main players in post-transcriptional regulation. Because they also influence vital biological processes, they tend to be conserved between species. However, there have been contradictory reports on the SNP density of these regions when compared to control [35]. Here, we show that the conservation may happen with certain motifs inside the miRNAs and the higher SNP density may be present in other parts of the miRNA, which add to the list of target genes of miRNA without disturbing the main target interactions.

Discussion
More than 50% of human genes are predicted to be regulated by miRNAs [36]. They are powerful post-transcriptional modulators of mRNA translation that have been shown to regulate many important processes in cancer progression as well [37]. More than 70 disease studies report associations with miRNAs [38]. Some miRNAs target oncogene products and others target tumor suppressor gene products [39]. Acute myeloid leukemia is a disease characterized by the buildup of immature myeloid cells, mainly resulting from the genetic background. However, the emerging field of miRNA research has already identified certain miRNA profiles behind the disease that correlate with prognosis [40]. Such studies have generally concentrated on the functional effects of specific miRNAs. In our study, we take a general look at the structural aspects shared by miRNAs elevated in AML patients. We addressed different aspects of the structure of the miRNAs up- or downregulated in myeloid cancer patients.
The first aspect we looked at was the length distribution of miRNAs studied, which had a mode and median of 22 nucleotides, the same as found by a previous study done on overall human miRNAs [1]. In the same study, it was shown that the distribution does not follow a Poisson distribution where the mean and variance would be equal, but rather a Laplace distribution fits better. Next, our analysis of the nucleotide distribution in every position implied the existence of structural patterns in the miRNA sequences. There is a higher occupancy of Adenines at the beginning of the sequences and of Uracil at the end of the sequences. This multinomial distribution of nucleotides in different positions was also noted as significant in the work of Fang et al. for the overall miRNAs studied, except that they did not find a significant difference between the GC and AU content of their samples. In our miRNA dataset of myeloid cancer patients, there was a grouping of miRNAs based on their purine and pyrimidine content, which further supports the pattern presence in their sequences.
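As a quick worked check of the Poisson argument, the mean and variance reported in the abstract for our dataset (21.69 and 1.65) can be compared directly. For a Poisson-distributed variable the variance equals the mean, so the variance-to-mean ratio would be close to 1; the symbols below are introduced only for this illustration.

```latex
\[
X \sim \mathrm{Poisson}(\lambda)
\;\Rightarrow\;
\mathbb{E}[X] = \operatorname{Var}(X) = \lambda ,
\qquad
\frac{s^{2}}{\bar{x}} = \frac{1.65}{21.69} \approx 0.076 \ll 1 .
\]
```

The observed lengths are therefore far more tightly concentrated around 22 nucleotides than a Poisson model would allow. This ratio alone does not establish which alternative distribution fits best.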
MiRNAs can have many mRNA targets due to their ability to exert their function even with imperfect base-pairing. Fang et al. found a positive correlation between the average miRNA length and the number of targets, which may be explained by the higher affinity of longer miRNAs for their targets [1]. In our study, we presented a way to look into the motifs that target the same genes. Further research should address the functional importance of such targeting, in which many miRNAs use the same motifs on the same genes.
Some miRNAs were shown to have conserved functions beginning from mosses and ferns [41]. In a study by Vazquez et al., longer miRNAs were shown to have a more recent history in Arabidopsis. They also found that the bases at certain miRNA sites tend to be conserved [42]. In another study, Lewis et al. noted that the nucleotide positions upstream and downstream of the seed regions of miRNAs were highly conserved [43]. In our research, we have shown that certain motifs are conserved between different species to different degrees.
Overall, our research may serve as a strategy to study the common structural aspects of miRNAs in different human diseases. Furthermore, it can be extended to study the functional outcomes of the presence of motifs and more cancer types to produce an inclusive comparative study.

Conclusions
In conclusion, this research reveals motif sequences of miRNAs implicated in myeloid cancer, which were also shown to be clustered in consensus sequences. Moreover, it was shown that many of these motifs tend to be conserved across species.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/genes13071152/s1. Table S1: List of all identified miRNAs implicated in myeloid cancer; Table S2: Purine and pyrimidine analysis of all miRNAs; Table S3: Motif analysis results of miRNAs studied; Table S4: Target gene analysis for all miRNAs; Table S5: Conserved motif analysis among species.
Author Contributions: Conceptualization, methodology, writing-original draft preparation, S.D.; investigation, data analysis, software, A.C.; data analysis, visualization, writing-review and editing, E.S. All authors have read and agreed to the published version of the manuscript.