Spatial Partitioning of miRNAs Is Related to Sequence Similarity in Overall Transcriptome

RNAs have been shown to exhibit differential enrichment between nuclear, cytoplasmic, and exosome fractions. A current fundamental question asks why non-coding RNA partition into different spatial compartments. We report on the analysis of cellular compartment models with miRNA data sources for spatial-mechanistic modeling to address the broad area of multi-scalar cellular communication by miRNAs. We show that spatial partitioning of miRNAs is related to sequence similarity to the overall transcriptome. This has broad implications in biological informatics for gene regulation and provides a deeper understanding of nucleotide sequence structure and RNA language meaning for human pathologies resulting from changes in gene expression.


Introduction
Much focus in biology is directed toward explaining the regulation of protein-coding genes, but lately, interaction networks with non-coding RNAs (ncRNAs) have been under particular scrutiny [1]. There is a suggestion that broad communication networks concerning competitive endogenous RNAs (ceRNAs) exist whereby ncRNAs could modulate regulatory RNA by binding and titrating from sites of protein-coding messenger RNAs [2]. Generally, RNA molecules and proteins undergo constrained diffusion largely limited by spatial constraints of other molecules and move by a stop-and-go mechanism where free diffusion is interrupted by random association and collision with other cellular structures [3]. Most importantly, the dynamic nature of RNAs is emerging as a means to control physiologic cellular responses and pathways [4]. Brownian motion effects are ubiquitous and play a pivotal role when one infers macroscopic functions from the mesoscopic level of description, a route commonly utilized in the study of complex systems. Dynamics at such mesoscopic level is dictated by a set of Langevin processes or equivalently by the associated N-particle Fokker-Planck equation [5]. We apply that conceptual model basis in the present work to examine miRNAs diffusing in a fluid medium that exhibits global RNA interaction resulting from nucleotide motifs or sequence words. We seek to determine if miRNAs with sequence words in common with the whole transcriptome have enhanced mobility since their transport can be facilitated by common transcript pathways, and, due to their small size of 22 nucleotides (nt), could have an influence on transport to the extracellular space, in the form of exosomes.

Simple Transcriptome Model
As an approximation, we develop a transcriptome model version (simple model) using real highly expressed genes, and for comparison separately, randomized sequences of the transcriptome. The simplest realistic model is composed of 8 real human RNA transcripts as a simple representation of the transcriptome in a cell (Model 1). It is comprised of four of the most prevalent tRNAs with lengths of 71-73 nt (which happens to be slightly smaller than an average tRNA in Table 1), and four of the major subunits of the ribosome with sizes from 121 to 5034 nt. For this simple transcriptome, the total number of nucleotides is the sum of the nucleotides in each transcript, or 7470 nt. A program (TIC-generator) was written in C++ that calculates the frequency of words of length W that are contained in each transcript. For a RNA transcript of length L, the number of possible words would be L´W + 1. In the simple model, for each word length from W = 4 to 22, word count was calculated along with the sum of the frequencies of those words corresponding to the simple eight transcripts RNA = 1 to 8 labels (see Section 4). The output from TIC-generator is a listing of all words contained in each transcript, together with its frequency of occurrence. The lists of words from the eight transcripts were combined, and then duplicates removed. The number of duplicates and unique words resulting from duplicate removal is listed in Figure 1a. The total possible words of length W are 4 W , shown as orange boxes in Figure 1a. The fraction of the possible words presents in the simple model transcriptome decreases for increasing word size. It is interesting that the peak in unique and total duplicate (blue diamonds in Figure 1b) words are maximal at the same size as the miRNA "seed" sequence. This peak seen in Figure 1a for duplicate words in a transcriptome construction would increase for increasing numbers of transcripts.

Simple Transcriptome Model
As an approximation, we develop a transcriptome model version (simple model) using real highly expressed genes, and for comparison separately, randomized sequences of the transcriptome. The simplest realistic model is composed of 8 real human RNA transcripts as a simple representation of the transcriptome in a cell (Model 1). It is comprised of four of the most prevalent tRNAs with lengths of 71-73 nt (which happens to be slightly smaller than an average tRNA in Table 1), and four of the major subunits of the ribosome with sizes from 121 to 5034 nt. For this simple transcriptome, the total number of nucleotides is the sum of the nucleotides in each transcript, or 7470 nt. A program (TIC-generator) was written in C++ that calculates the frequency of words of length W that are contained in each transcript. For a RNA transcript of length L, the number of possible words would be L − W + 1. In the simple model, for each word length from W = 4 to 22, word count was calculated along with the sum of the frequencies of those words corresponding to the simple eight transcripts RNA = 1 to 8 labels (see Section 4). The output from TIC-generator is a listing of all words contained in each transcript, together with its frequency of occurrence. The lists of words from the eight transcripts were combined, and then duplicates removed. The number of duplicates and unique words resulting from duplicate removal is listed in Figure 1a. The total possible words of length W are 4 W , shown as orange boxes in Figure 1a. The fraction of the possible words presents in the simple model transcriptome decreases for increasing word size. It is interesting that the peak in unique and total duplicate (blue diamonds in Figure 1b) words are maximal at the same size as the miRNA "seed" sequence. This peak seen in Figure 1a for duplicate words in a transcriptome construction would increase for increasing numbers of transcripts.

miRNA Datasets Examined with Simple Model Transcriptome
Experimental validation of the simple model transcriptome examined various functions of word similarity using published data sets. Functions tested include tWord for transcriptome words in common with target multiplied by word frequency in the transcriptome. Four studies below examined miRNA where data sources were grouped into high and low study parameter sets and mean values and t-test calculated with 2-tail t-test values under two-sample equal variance assumption models.

Exosome Enriched miRNAs
The Villarroya-Beltri [12] study performed microarrays on cellular and exosome fractions from resting and activated human T lymphocytes. They assessed whether certain RNAs are specifically classified into exosomes, and performed a microarray analysis of activation-induced variations in the miRNA and mRNA profiles of primary T lymphoblast and their exosomes. We used that data found in their Supplementary Data and data publicly available at Gene Expression Omnibus through GEO Series accession number GSE50972. Their microarray analysis showed that in most cases miRNAs modulated upon activation are different in cells and exosomes, either for upregulated or downregulated miRNAs. This shows that miRNA and mRNA loading into exosomes remains not a passive process. Certain miRNAs were more highly expressed in exosomes than in cells and in most circumstances this difference is preserved under resting and activated conditions. Similarly, most miRNAs that are more highly represented in cells than in exosomes keep this tendency free from the activation state of the cell. Then they classified some miRNAs as thus specifically sorted into exosomes (EXOmiRNAs), whereas others are specifically retained in cells (CLmiRNAs). We calculated the tCount of raw counts of words in common with the simple transcriptome and tWord, which factors the expression level of that word. Other measures compared tCount and tWord to a randomized transcriptome (RAN). We used a word size w = 7 roughly equal to the seed sequence length as shown in the peaks in Figure 1a,b. Figure 2 shows a clear tendency for the EXOmiRNA cluster on the right (average Log FC of 2.70) to be greater in value (average tCount of 6.80) than the CLmiRNAs (average Log FC of´1.62) in the left cluster of the data points with an average tCount of 4.32. A t-test between the two clusters gives a p-value of 3.2ˆ10´7, indicating a significant difference between exosome and cytoplasmic miRNAs as measured with the tCount measure calculated from a simple transcriptome. To allow comparison with other classes of RNA, we can normalize the transcript size by dividing by sequence length. Transforming the tCount measure in Figure 2 increases the correlation coefficient to R 2 = 0.185 with y = 0.0236x + 0.242, where x is log FC and y is tCount/Len for word size w = 7.

miRNA Datasets Examined with Simple Model Transcriptome
Experimental validation of the simple model transcriptome examined various functions of word similarity using published data sets. Functions tested include tWord for transcriptome words in common with target multiplied by word frequency in the transcriptome. Four studies below examined miRNA where data sources were grouped into high and low study parameter sets and mean values and t-test calculated with 2-tail t-test values under two-sample equal variance assumption models.

Exosome Enriched miRNAs
The Villarroya-Beltri [12] study performed microarrays on cellular and exosome fractions from resting and activated human T lymphocytes. They assessed whether certain RNAs are specifically classified into exosomes, and performed a microarray analysis of activation-induced variations in the miRNA and mRNA profiles of primary T lymphoblast and their exosomes. We used that data found in their Supplementary Data and data publicly available at Gene Expression Omnibus through GEO Series accession number GSE50972. Their microarray analysis showed that in most cases miRNAs modulated upon activation are different in cells and exosomes, either for upregulated or downregulated miRNAs. This shows that miRNA and mRNA loading into exosomes remains not a passive process. Certain miRNAs were more highly expressed in exosomes than in cells and in most circumstances this difference is preserved under resting and activated conditions. Similarly, most miRNAs that are more highly represented in cells than in exosomes keep this tendency free from the activation state of the cell. Then they classified some miRNAs as thus specifically sorted into exosomes (EXOmiRNAs), whereas others are specifically retained in cells (CLmiRNAs). We calculated the tCount of raw counts of words in common with the simple transcriptome and tWord, which factors the expression level of that word. Other measures compared tCount and tWord to a randomized transcriptome (RAN). We used a word size w = 7 roughly equal to the seed sequence length as shown in the peaks in Figure 1a,b. Figure 2 shows a clear tendency for the EXOmiRNA cluster on the right (average Log FC of 2.70) to be greater in value (average tCount of 6.80) than the CLmiRNAs (average Log FC of −1.62) in the left cluster of the data points with an average tCount of 4.32. A t-test between the two clusters gives a p-value of 3.2 × 10 −7 , indicating a significant difference between exosome and cytoplasmic miRNAs as measured with the tCount measure calculated from a simple transcriptome. To allow comparison with other classes of RNA, we can normalize the transcript size by dividing by sequence length. Transforming the tCount measure in Figure 2 increases the correlation coefficient to R 2 = 0.185 with y = 0.0236x + 0.242, where x is log FC and y is tCount/Len for word size w = 7.  For miRNAs with a resting LogFC, which was positive (average 2.70), values of tWord (mean 12.45) were higher than miRNAs with negative LogFC (average´1.62) for tWord (mean 5.47), and hence tWord was greater with miRNAs enriched in exosomes compared to cytoplasmic miRNAs in Figure 3. For miRNAs with a resting LogFC, which was positive (average 2.70), values of tWord (mean 12.45) were higher than miRNAs with negative LogFC (average −1.62) for tWord (mean 5.47), and hence tWord was greater with miRNAs enriched in exosomes compared to cytoplasmic miRNAs in Figure 3. A common pattern with tCount and tWord seen in Figures 2 and 3 is the greater variance with exosomal miRNAs. Standard deviation of tCount is 32% greater in exosomal mrRNAs (S.D. = 2.9) compared to cytoplasmic (S.D. = 2.2). For the measure tWord, the difference is greater, with exosomal miRNAs having 168% greater standard deviation, 10.7 vs. 4.0. The greater variance with tWord compared to tCount is most likely due to the multiplier of expression level in the tWord calculation, as tCount is a simple count of occurrences of words in common between target miRNA and the transcriptome model.
With eight outliers removed which had tWord scores above 25 in Figure 3, the new regression gives y = 0.85x + 6.88 and R 2 = 0.182. This relationship is closely maintained even for activated cells, as with eight outliers removed gives y = 0.82x − 7.71 and R 2 = 0.164. With six outliers removed for the function tWords-RAN, regression improves to y = 0.89x − 0.92 and R 2 = 0.166 from the whole data set shown in Figure 4.  A common pattern with tCount and tWord seen in Figures 2 and 3 is the greater variance with exosomal miRNAs. Standard deviation of tCount is 32% greater in exosomal mrRNAs (S.D. = 2.9) compared to cytoplasmic (S.D. = 2.2). For the measure tWord, the difference is greater, with exosomal miRNAs having 168% greater standard deviation, 10.7 vs. 4.0. The greater variance with tWord compared to tCount is most likely due to the multiplier of expression level in the tWord calculation, as tCount is a simple count of occurrences of words in common between target miRNA and the transcriptome model.
With eight outliers removed which had tWord scores above 25 in Figure 3, the new regression gives y = 0.85x + 6.88 and R 2 = 0.182. This relationship is closely maintained even for activated cells, as with eight outliers removed gives y = 0.82x´7.71 and R 2 = 0.164. With six outliers removed for the function tWords-RAN, regression improves to y = 0.89x´0.92 and R 2 = 0.166 from the whole data set shown in Figure 4. For miRNAs with a resting LogFC, which was positive (average 2.70), values of tWord (mean 12.45) were higher than miRNAs with negative LogFC (average −1.62) for tWord (mean 5.47), and hence tWord was greater with miRNAs enriched in exosomes compared to cytoplasmic miRNAs in Figure 3. A common pattern with tCount and tWord seen in Figures 2 and 3 is the greater variance with exosomal miRNAs. Standard deviation of tCount is 32% greater in exosomal mrRNAs (S.D. = 2.9) compared to cytoplasmic (S.D. = 2.2). For the measure tWord, the difference is greater, with exosomal miRNAs having 168% greater standard deviation, 10.7 vs. 4.0. The greater variance with tWord compared to tCount is most likely due to the multiplier of expression level in the tWord calculation, as tCount is a simple count of occurrences of words in common between target miRNA and the transcriptome model.
With eight outliers removed which had tWord scores above 25 in Figure 3, the new regression gives y = 0.85x + 6.88 and R 2 = 0.182. This relationship is closely maintained even for activated cells, as with eight outliers removed gives y = 0.82x − 7.71 and R 2 = 0.164. With six outliers removed for the function tWords-RAN, regression improves to y = 0.89x − 0.92 and R 2 = 0.166 from the whole data set shown in Figure 4.

Nuclear Enriched miRNAs
The Park (2010) study [13] compared nuclear and cytoplasmic fractions in hct116 colon cancer cells also by microarray. They recognized various miRNAs that existed in isolated nuclei from human colon cancer HC T116 cells. MicroRNA profiles were correlated between cytoplasmic and nuclear fractions of the HC T116 cell line by multiple microarray analyses. Nuclear confinement of the mature form of miRNAs was validated by controlling RT-PCR excluding the exposure of precipitate forms of miRNA, such as pri-miRNA or pre-miRNA. They established elevated levels of representative miRNAs in purified nuclei that support the notion that notable numbers of mature miRNAs survive not only in the cytoplasm but also in the nucleus.
Again we calculated tCount of raw counts of words in common with the simple transcriptome. The tWord factors the expression level of that word and other measures, comparing them to a randomized transcriptome (RAN). Their data was sorted by N/C ratio and partitioned into two groups: N/C > 0.47 which was nuclear enriched (n = 45), and N/C < 0.47 which are preferentially found in the cytoplasm (n = 33). tCount was 4.02 for nuclear enriched, and 5.00 for cytoplasmic, with a t-test p-value of 0.116 between the groups; while tWord was 4.73 for nuclear and 10.58 for cytoplasmic miRNAs, with a significant t-test p-value of 0.023 between nuclear and cytoplasmic groups. With this Park data set, dividing tCount and tWord by miRNA length yields improved t-test p-values of 0.094 and 0.022. Together this data suggests that nuclear-enriched miRNAs share fewer common words with the overall transcriptome than cytoplasmic miRNAs.

Other miRNA Studies
The Huang (2013) study used RNA/seq on exosomes from human plasma [14]. The top 100 exosomes abundant miRNAs had tCount (mean 4.80) and tWord (mean 6.72) measures compared to those lower 100 with low "rcmm" reads (mean 4.64 and 7.41 respectively). Again, exosome transcripts have more similarity to the simple model transcriptome. Similar results are found with the Cheng (2014) study of exosomes in human blood [15]. There, the 50 most abundant miRNAs in exosome sampled labeled "Plasma UC Exo" had tCount and tWord values of 4.56 and 6.00 compared to 5.58 and 8.80 for low abundance transcripts. Related results also found with Guduric-Fuchs (2012) data on exosomes from HEK293 T cells showed that the ratio of EV to cell reads was significant whereas using read counts "rpmm" was not [16]. These data suggest that the relatedness of tCount and tWord measures to spatial partitioning are a function of enrichment factor and not abundance in the compartment.
As a step towards in-depth understanding the mechanism of selective exportation of miRNAs to EVs, Guduric-Fuchs (2012) employed deep sequencing to discriminate the global expression pattern of small RNAs in HEK293T cells and the EVs that they release [16]. Enrichment of overexpressed miRNA in EVs has been manifested by RT-qPCR in HEK293T cells, mesenchymal stem cells, macrophages and immune cells. Using data from Guduric-Fuchs that was sorted by EV/cell ratio, we compared the 10 top (exosome-enriched) and bottom (cytoplasmic enriched) miRNAs and listed tCount and tWord computed values in Table 2 as the mean and standard deviation in parenthesis. From Table 2 for the various measures examined across the studies, tCount, tWord and their difference (tW-tC), values progress from lower for nuclear, higher for cytoplasmic, and highest for exosomal miRNAs. Thus, under transitivity, EXO > CL > NUC for these transcriptome measures. This suggests that miRNAs with sequence similarity to the overall transcriptome transit furthest from their points of transcription. These conclusions are most significant with the tCount measure, with a p-value close to zero for the Villarroya-Beltri study, and 0.016 for the Guduric-Fuchs study, while the Park study showed little difference (p-value = 0.122) for tCount between nuclear and cytoplasmic enrichment.

Discussion
Much focus in RNA research is directed toward understanding the regulation of protein-coding genes [17]. However, ncRNAs also form well-orchestrated regulatory interaction networks [1]. For example, computational modeling of miRNA target sites suggests a broad network of miRNA-lncRNA interaction [18]. Recently, there have also been reports inferring the feasibility of a broad interaction network comprising competing endogenous RNAs (ceRNAs) where ncRNAs could change regulatory RNA by binding and titrating them off the corresponding binding sites on protein coding messengers [2].
We suggest that miRNA sequence delineates the molecular mechanisms underlying Brownian motion as a broad class of RNA with the transcriptome composed as an RNA language with interactions between transcripts and protein molecules at the same location. Recently, the attention of the relevant research community has been focused on non-coding RNAs and their physiological/pathological implications [19]. As the number of RNA experiments reported rapidly increases and transcriptional units are better annotated, databases indexing RNA properties and function from transcriptome measures become essential tools in this process. This early stage software development effort makes use of a sandbox-oriented software development environment [20] that enables development for miRNA physiology study. This work is generalizable to different sequence technologies, RNA/seq, microarray, etc., and is scalable to different organisms [21], organs, or sub-cellular compartments depending on sample preparation for the libraries. Caution must be exercised with reported studies not adequately controlling for the source of extracellular particles, not differentiating between lipid coated RNAs, exosomes, microparticles, or apoptotic bodies.

Sliding Window Word Generator
A sliding window variable size word generator (TIC-generator) from input sequences was written in C++. A workflow functional diagram is shown in Figure 5. The output from TIC-generator is a listing of all words contained in each transcript, together with its frequency of occurrence. The lists of words from the eight transcripts were combined, and then duplicates removed. The number of duplicates and unique words resulting from duplicate removal is listed in Table 3. We provide program listing for the dissemination of software from the project in Supplementary Materials, including software, tools and related resources, to the relevant research and user communities using open source resources.

Discussion
Much focus in RNA research is directed toward understanding the regulation of protein-coding genes [17]. However, ncRNAs also form well-orchestrated regulatory interaction networks [1]. For example, computational modeling of miRNA target sites suggests a broad network of miRNA-lncRNA interaction [18]. Recently, there have also been reports inferring the feasibility of a broad interaction network comprising competing endogenous RNAs (ceRNAs) where ncRNAs could change regulatory RNA by binding and titrating them off the corresponding binding sites on protein coding messengers [2].
We suggest that miRNA sequence delineates the molecular mechanisms underlying Brownian motion as a broad class of RNA with the transcriptome composed as an RNA language with interactions between transcripts and protein molecules at the same location. Recently, the attention of the relevant research community has been focused on non-coding RNAs and their physiological/pathological implications [19]. As the number of RNA experiments reported rapidly increases and transcriptional units are better annotated, databases indexing RNA properties and function from transcriptome measures become essential tools in this process. This early stage software development effort makes use of a sandbox-oriented software development environment [20] that enables development for miRNA physiology study. This work is generalizable to different sequence technologies, RNA/seq, microarray, etc., and is scalable to different organisms [21], organs, or sub-cellular compartments depending on sample preparation for the libraries. Caution must be exercised with reported studies not adequately controlling for the source of extracellular particles, not differentiating between lipid coated RNAs, exosomes, microparticles, or apoptotic bodies.

Sliding Window Word Generator
A sliding window variable size word generator (TIC-generator) from input sequences was written in C++. A workflow functional diagram is shown in Figure 5. The output from TIC-generator is a listing of all words contained in each transcript, together with its frequency of occurrence. The lists of words from the eight transcripts were combined, and then duplicates removed. The number of duplicates and unique words resulting from duplicate removal is listed in Table 3. We provide program listing for the dissemination of software from the project in Supplementary Material, including software, tools and related resources, to the relevant research and user communities using open source resources.

Duplicate Words in Cloud
Eight transcripts were selected as representative of the most abundant RNAs in a cell. Four were tRNAs with lengths 71 and 73, and four were rRNAs with sizes 121, 156, 1871 and 5034 nucleotides. Individually within these eight transcripts, there is a total of 7470 nucleotides, which collectively have 7422 total words of length w = 7, with 1797 duplicate words leaving 5625 unique words describing the transcriptome as a simple model. If we combine the unique words of these eight transcripts, we find 691 duplicates, leaving a total of 4934 unique words in the simple model in Table 3. The combined total number of duplicates would be 1797 + 691 = 2488 while for the random transcriptome (average of four randomized transcriptomes) total duplicates are 840 + 659 = 1499. If instead we examine words of length w = 8, there are a total of 7400 words with 961 duplicates leaving 6439 unique words in the simple model. The combined total number of duplicates would be 961 + 288 = 1249 while for the random transcriptome (average of four randomized transcriptomes) total duplicates are 199 + 195 = 394.

Conclusions
miRNAs that are enriched in exosomes share greater similarity to the overall transcriptome than miRNAs found preferentially in the cytoplasm or nuclear compartments. Nuclear enriched miRNAs share less similarity to the transcriptome than cytoplasmic miRNAs. From the various measures examined in this study, tCount values progress from lower for nuclear, higher for cytoplasmic, and highest for exosomal miRNAs with the greatest significance.