3.1. Inexact q-mers Improve Prediction of LncRNA Subcellular Localization
First, we tested our (q, k)-mismatch models on lncRNAs from the LncATLAS dataset. For this initial experiment, we worked on four selected cell lines, namely, HT1080, A549, NCI.H460, and SK.N.SH (see Section 5).
Table 2a shows the validation results using our proposed model with latent features on the four cell lines for three classifiers with different numbers of mismatches. We tested different numbers of latent features (512, 256, 128, 64, and 32) and found that 512 latent features gave the best overall results. In the table, “kmiss oa” denotes the overall accuracy when using 6-mers with k mismatches, and “kmiss auc” denotes the corresponding area under the receiver operating characteristic curve; in both cases, k ranges from 0 to 3. In all cases, the highest accuracy for each classifier was observed when we applied inexact 6-mers (i.e., k-mismatch with k > 0).
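As an illustration of what a (q, k)-mismatch feature vector can look like, the following is a minimal sketch. It assumes that each of the 4^q features counts the sequence windows lying within Hamming distance at most k of the corresponding q-mer; the exact counting convention, the function names, and the handling of ambiguous bases are assumptions made for this example rather than details taken from the text.

```python
from itertools import combinations, product


def neighborhood(window, k, alphabet="ACGT"):
    """Return all strings within Hamming distance <= k of `window`."""
    result = {window}
    for d in range(1, k + 1):
        for positions in combinations(range(len(window)), d):
            # At each chosen position, substitute every letter except the original.
            choices = [[c for c in alphabet if c != window[p]] for p in positions]
            for letters in product(*choices):
                chars = list(window)
                for p, c in zip(positions, letters):
                    chars[p] = c
                result.add("".join(chars))
    return result


def mismatch_profile(seq, q=6, k=2, alphabet="ACGT"):
    """Count, for every q-mer, the windows of `seq` within k mismatches of it."""
    index = {"".join(m): i for i, m in enumerate(product(alphabet, repeat=q))}
    profile = [0] * len(index)
    for start in range(len(seq) - q + 1):
        window = seq[start:start + q]
        if any(c not in alphabet for c in window):
            continue  # assumption: skip windows containing ambiguous bases such as N
        for m in neighborhood(window, k, alphabet):
            profile[index[m]] += 1
    return profile


# A single 6-mer window contributes to 1 + 18 + 135 = 154 of the 4096 features.
profile = mismatch_profile("ACGTAC", q=6, k=2)
print(len(profile), sum(profile))  # 4096 154
```

With k = 0 the sketch reduces to the usual exact q-mer count profile, which makes the two settings directly comparable.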
Table 2b shows more detailed results using 1DCNN with 512 latent features.
Supplementary Table S1a,b show the corresponding results when all 4096 features are used, with either SMOTE [62] or class weighting applied during training to handle data imbalance.
The results revealed three key observations: (1) in general, when predicting lncRNA subcellular localization, the 1D-CNN and RF classifiers performed better than MLP in this study, both with all 4096 features and the 512 latent features that were extracted from the 4096 features; (2) working with SMOTE was more effective than weight fitting during training for our lncRNA localization problem; and (3) the (q,k)-mismatch model worked well for predicting subcellular localization. The highest scores occurred when there was some mismatch (k > 0), showing the performance improvement with inexact q-mers over traditional exact q-mers. In particular, 6-mers with two mismatches usually produced the best performance on our dataset. In the following, we will focus on the (6,2)-mismatch model.
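For readers unfamiliar with class weighting as an alternative to SMOTE, the sketch below shows one common way to derive per-class weights from label counts. The "balanced" heuristic used here (total samples divided by the number of classes times the class count) is an assumption for illustration; the text does not state which weighting scheme was applied, and the toy labels are made up.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency (assumed 'balanced' heuristic)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy example (made-up labels): 9 nuclear (0) and 3 cytoplasmic (1) transcripts.
toy_labels = [0] * 9 + [1] * 3
weights = balanced_class_weights(toy_labels)
print(weights[1])  # 2.0
```

In contrast to this reweighting of the loss, SMOTE changes the training set itself by synthesizing new minority-class samples.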
To further evaluate the proposed approach using inexact
q-mers, we applied 6-mers with up to three mismatches on mRNA transcripts from the LncATLAS dataset, following the same general procedure used for lncRNAs (four cell lines: HT1080, A549, NCI.H460, and SK.N.SH).
Table 3a,b show the results for 512 latent features.
Supplementary Table S1c shows the results using all 4096 6-mers. Similar to lncRNAs, we applied SMOTE to handle the data imbalance problem. When using 512 latent features, 6-mers with two mismatches achieved the highest accuracy of 67.63%, with the highest AUC of 0.74 (compared with 63.69% accuracy and an AUC of 0.69 using exact 6-mers). Thus, similar to lncRNAs, using inexact
q-mers (with mismatch) also improved localization prediction performance for mRNAs.
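The two reported metrics can be sketched as follows. Overall accuracy is the fraction of correct predictions, and the AUC is computed here through its rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function names and toy values are illustrative assumptions and are not taken from the study.

```python
def overall_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """AUC as P(score of a positive > score of a negative), ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (made-up scores); labels: 0 = nuclear, 1 = cytoplasmic.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
print(overall_accuracy(y_true, y_pred), roc_auc(y_true, scores))  # 0.75 0.75
```

Because the AUC depends only on the ranking of the scores, it can differ from accuracy when the decision threshold suits one dataset better than another.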
To further evaluate the generality of the proposed approach, we also tested the (q, k)-mismatch model on the APEX-Seq dataset using the same procedure as above. First, we observed the heavily imbalanced nature of the lncRNA data (see datasets in Section 5). With the log fold change (logFC) threshold (log FC ≥ 0.75), APEX-Seq had only 34 lncRNAs (4 cytoplasmic, 30 nuclear). Because this set is too small and too imbalanced to train on, we instead used it as a test set for the model trained on the LncATLAS dataset with a log2(CN-RCI) threshold of 0. The results are shown in
Table 4a. Overall, the results are similar, though generally lower than what we obtained when we trained and tested on the LncATLAS dataset (
Table 2a), except for the 1DCNN model. Although the accuracy was comparable, the AUC values in
Table 4a were relatively lower. We suspect that this disparity in performance may be due to the potential difference in the meaning of the two thresholds used for the two datasets. It is possible that the log fold change threshold (log FC ≥ 0.75) used for APEX-Seq may not necessarily correspond to the CN-RCI threshold (log CN-RCI ≥ 0) used in LncATLAS to define the two classes.
Since the APEX-Seq dataset contains a larger amount of data on mRNAs, we repeated our prior experiments on mRNA subcellular localization. Here, we were able to both train and test the models on APEX-Seq data using the log FC ≥ 0.75 threshold (with one combined dataset rather than four cell lines). Similar to
Table 3a on mRNA results with LncATLAS,
Table 4b shows the results for 1DCNN with 512 latent features when applied on the APEX-Seq mRNA dataset. As with
Table 3a, the best results (highest OA and AUC) in
Table 4b were observed with inexact match, mostly using the (6,1)-mismatch or (6,2)-mismatch models for this dataset. The results from APEX-Seq (
Table 4b) are comparable with (though generally lower than) those from LncATLAS (
Table 3a).
To further investigate the generality of the approach, we also evaluated the performance of the proposed approaches on the Ribosome lncRNA dataset [
63]. Unlike the LncATLAS dataset, this dataset is a non-cellular fractionation dataset, similar to the APEX-Seq dataset [
55]. Given the size of the dataset (272 lncRNA genes, 155 nuclear, and 117 cytoplasmic; see Materials and Methods), we were able to both train a model for lncRNA localization and also validate it using the Ribosome dataset. The results are shown in
Table 5a. (We note that, for the experiments on the Ribosome lncRNA dataset, we report results using the 4096 (6, k)-inexact 6-mer features, that is, without the autoencoder; using the autoencoder generally resulted in lower performance on this dataset.) As can be observed in the table, the results are comparable with those obtained for lncRNA localization using the LncATLAS dataset (
Table 2a).
Table 5a (results for Ribosome dataset) shows a slightly better AUC, while
Table 2a (results for LncATLAS dataset) shows better accuracy.
As a second experiment on the Ribosome lncRNA dataset, we used the entire dataset as a test set for a model trained using the lncATLAS dataset with a log CN-RCI threshold of 0. (This is similar to what we did for the APEX-Seq dataset (
Table 4a).) To avoid possible data leakage between training and test sets, for this experiment, we removed the lncRNAs that appeared in the APEX-Seq dataset or in the Ribosome dataset from the LncATLAS dataset before training. The results are shown in
Table 5b. Once again, the results with the Ribosome dataset (
Table 5b) showed a better AUC, while results on the APEX-Seq dataset showed better accuracy. As was noted in previous experiments, the best results are usually observed using inexact
q-mers (i.e., with
k > 0), showing that inexact
q-mers produce improved results over exact
q-mer profiles, even with cross-dataset training and testing.
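The leakage-avoidance step described above amounts to removing from the training set every gene that also appears in an external test set. A minimal sketch follows; the function name and the gene identifiers are placeholders invented for this example.

```python
def remove_overlap(train_ids, *test_id_sets):
    """Drop training genes that occur in any of the external test sets."""
    held_out = set().union(*test_id_sets)
    return [gene for gene in train_ids if gene not in held_out]


# Placeholder identifiers, not real genes.
train = ["G1", "G2", "G3", "G4", "G5"]
apex_like = {"G2"}
ribosome_like = {"G4", "G9"}
print(remove_overlap(train, apex_like, ribosome_like))  # ['G1', 'G3', 'G5']
```

Filtering on a stable gene identifier, rather than on sequence or gene name, keeps the step simple, although it does not catch near-identical transcripts listed under different identifiers.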
Overall, the results on the APEX-Seq and Ribosome lncRNA datasets are similar, though generally lower than what we obtained when we trained and tested on the LncATLAS dataset (
Table 2a,b), except for the 1DCNN model. Although the accuracy was comparable for the LncATLAS (
Table 2a) and APEX-Seq dataset (
Table 4a), the AUC values in
Table 4a (for APEX-Seq) were relatively lower. Similarly, the AUC was comparable between the LncATLAS and Ribosome datasets, while the accuracy with the Ribosome dataset (
Table 5a) was lower. We suspect that this disparity in performance may be due to the potential difference in the meaning of the two thresholds used for the two datasets. For instance, it is possible that the log fold change threshold (log FC ≥ 0.75) used for APEX-Seq may not necessarily correspond to the CN-RCI threshold (log CN-RCI ≥ 0) used in LncATLAS to define the two classes.
Subsequently, we trained a separate (6,2) inexact q-mer model for each of the 15 cell lines and tested each model on test data from the same cell line. Based on observations from the earlier experiments with four cell lines, we considered only two classifiers in this more expansive evaluation, namely (a) 1D-CNN with 512 latent features and (b) RF with all 4096 features.
Figure 2 shows the overall accuracy for the two classification models (1DCNN and RF) across the 15 cell lines, using the longest transcript for each gene. The average overall accuracy across the 15 cell lines was 68.14% for 1D-CNN (with 512 latent features) and 68.45% for RF (with all 4096 features).
3.3. Is LncRNA Subcellular Localization Cell-Specific?
Using the CN-RCI values per gene per cell line provided by lncATLAS, we also explored the question of whether lncRNA subcellular localization is cell-type-specific or independent of the given cell type. To address this question, we considered three approaches: (1) correlation-based analysis, (2) cross-cell validation using our machine learning model, and (3) analysis of switching lncRNAs (also called shuttling lncRNAs).
First, we considered the possible correlation between the cell lines. We observed significant correlation between some pairs of cell lines, with some pairs having a correlation coefficient of over 0.8, for instance, 0.88 (HUVEC, IMR.90), 0.83 (SK.N.SH, HT1080), and 0.81 (MCF.7, HepG2). The cell lines IMR.90, SK.N.SH, and HUVEC were, overall, the most correlated with other cell lines, while the cell line H1.hESC (for human embryonic stem cell) appeared to be an outlier, with relatively low correlation with other cell lines (e.g., 0.3 (SK.MEL.5, H1.hESC)). A similar observation on H1.hESC was also made in [
25].
Supplementary Figure S1 shows the detailed Pearson correlation coefficient between the 15 cell lines using the CN-RCI values. The significant correlation between certain pairs of cell lines implies that the CN-RCI values from one cell line could provide us with some information about some other cell lines, indicating that the CN-RCI values are not completely independent.
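The pairwise correlation between two cell lines can be sketched as below. Since not every lncRNA has a CN-RCI value in every cell line, this example assumes pairwise-complete handling: only genes measured in both cell lines contribute. That choice, the function name, and the toy values are assumptions for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation over positions where both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return None  # not enough shared genes
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined for a constant vector
    return sxy / sqrt(sxx * syy)


# Toy CN-RCI vectors (made-up values); None marks a missing measurement.
line_a = [1.0, 2.0, 3.0, None]
line_b = [2.0, 4.0, 6.0, 1.0]
print(pearson(line_a, line_b))
```

Applying this function to every pair of cell lines yields a symmetric correlation matrix of the kind summarized in the supplementary figure.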
We then investigated potential similarities or differences between the different cell lines using results from the machine learning models. Using the same general setting as in the previous experiments (threshold 0, (6,2)-inexact q-mers, MLP autoencoder (AE), and 1D-CNN for classification using 512 autoencoder latent features), we trained an lncRNA localization model on each cell line and then tested the trained model on every cell line. The expectation is that, if lncRNA subcellular localization is cell-line-specific, the highest accuracy will be observed along the main diagonal; otherwise, some off-diagonal elements will be significantly higher for some cell lines.
Figure 3 shows the results, indicating performance for training on one cell line (the row) and testing on each of the 15 cell lines (the columns). Interestingly, we found that, in almost every case, a model trained on a given cell line predicted some other cell lines better than the cell line it was trained on (that is, many off-diagonal elements were higher than the diagonal elements). For example, the NHEK-trained model scored 67.36% when tested on NHEK, but 74.95% on HeLa.S3 and 73.79% on SK.MEL.5. Similarly, a model trained on SK.N.DZ predicted most other cell lines more accurately (on average) than it predicted SK.N.DZ itself.
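The diagonal versus off-diagonal comparison can be expressed as a small check on one row of the train/test accuracy matrix. The sketch below uses only the three NHEK-row values quoted in the text; the function name and the dictionary layout are assumptions for this example.

```python
def better_than_self(row, train_line):
    """List test cell lines scored higher than the training cell line, best first."""
    own = row[train_line]
    better = [(acc, name) for name, acc in row.items()
              if name != train_line and acc > own]
    return [name for acc, name in sorted(better, reverse=True)]


# Accuracies (%) of the NHEK-trained model, as quoted in the text.
nhek_row = {"NHEK": 67.36, "HeLa.S3": 74.95, "SK.MEL.5": 73.79}
print(better_than_self(nhek_row, "NHEK"))  # ['HeLa.S3', 'SK.MEL.5']
```

A non-empty result for a row means that the diagonal entry is not the row maximum, which is the pattern that argues against strict cell-line specificity.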
Expectedly, training with the less-correlated cell lines, such as H1.hESC, led to reduced performance. Similar results were obtained using MLP and RF models. Overall, the results indicate that there are some clusters of cell lines wherein lncRNA localization in one is predictive for the others. This has a significant implication, as it suggests that it might be possible to develop a generalized model that can work well on most cell lines using data from only a few cell lines. This is more in line with the observation in [
60], where the authors suggested that lncRNA localization may not be cell-specific.
To further investigate this issue of cell line specificity, we analyzed the distribution of lncRNA localization across cell lines using our training dataset. (Given that the H1.hESC cell line is an outlier, we did not include it in this analysis.) For each cell line, we used a class threshold of log(CN-RCI) = 0; that is, an lncRNA is classified as nuclear if log(CN-RCI) < 0 and cytoplasmic if log(CN-RCI) ≥ 0. We set the label for nuclear lncRNA to 0 and for cytoplasmic lncRNA to 1. For each lncRNA, we counted the number of cell lines in which it fell into each of the two classes. We denote these counts as
C and N for cytoplasmic and nuclear, respectively. We then define a switching lncRNA (also called a switching gene) as one with C ≥ 2, N ≥ 2, and C approximately equal to N. Thus, a switching lncRNA occurs in each of the nuclear and cytoplasmic compartments in at least two cell lines, and the numbers of cell lines for the two compartments are about the same.
Table 7 shows some examples of switching genes in our dataset. Columns “A549” through “SK.N.SH” correspond to the 14 cell lines used for this analysis. Each row corresponds to an lncRNA gene. Each element in the table denotes the observed localization (label) of the lncRNA in the given cell line. An empty cell indicates that no CN-RCI value is available for the given lncRNA in the corresponding cell line. Columns “Cyto (C)” and “Nuclear (N)” record the number of cell lines where the lncRNA had cytoplasmic or nuclear localization, respectively. The column “Cell_count” = C + N is the total number of cell lines that have a CN-RCI value for the lncRNA. “C-N” denotes the difference between the number of cytoplasmic and nuclear localizations. As
Table 7 shows, genes ENSG00000264207, ENSG00000248049, and ENSG00000117242 each appear in 13 cell lines and exhibit varying localization patterns. Using this method, we identified 185 switching genes. More detailed information on switching genes for our entire dataset can be found in
Supplementary Table S2.
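The per-gene summary columns described for Table 7 (C, N, Cell_count, and C-N) can be derived from the CN-RCI values as sketched below. The labeling rule and the requirement of at least two cell lines per compartment follow the text. The tolerance `max_diff` on |C - N| is a hypothetical parameter introduced only for this example, as are the function names and the toy values.

```python
def label(log_cn_rci):
    """1 = cytoplasmic if log(CN-RCI) >= 0, otherwise 0 = nuclear."""
    return 1 if log_cn_rci >= 0 else 0


def localization_counts(values):
    """Summarize one lncRNA; `values` maps cell line -> log(CN-RCI) or None."""
    labels = [label(v) for v in values.values() if v is not None]
    c = sum(labels)
    n = len(labels) - c
    return {"C": c, "N": n, "Cell_count": c + n, "C-N": c - n}


def is_switching(counts, min_each=2, max_diff=2):
    """Both compartments seen in >= min_each cell lines and roughly balanced.

    `max_diff` is a hypothetical tolerance, not a value taken from the text.
    """
    return (counts["C"] >= min_each and counts["N"] >= min_each
            and abs(counts["C-N"]) <= max_diff)


# Toy gene (made-up values); None marks a cell line without a CN-RCI value.
gene = {"A549": 0.5, "HeLa.S3": -1.2, "HepG2": None, "K562": 0.0, "MCF.7": -0.3}
counts = localization_counts(gene)
print(counts, is_switching(counts))
```

Keeping the counting step separate from the switching predicate makes it easy to try different balance tolerances without recomputing the labels.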
We conducted an in-depth analysis of these identified switching genes to gain further insights into this specific group of lncRNAs. We used three bioinformatics resources: DAVID [64,65] (https://david.ncifcrf.gov/tools.jsp, accessed on 14 October 2024), a functional annotation tool; GeneCards (https://www.genecards.org/, accessed on 14 October 2024), a human gene database; and cncRNAdb [66], a manually curated resource of bifunctional RNAs. We queried cncRNAdb and found that 12 of the 185 switching genes were identified as bifunctional lncRNAs (see
Table 8). Bifunctional lncRNAs tend to appear in multiple localizations in a cell. We found that DAVID annotated 11 of the 185 switching genes. Five of these eleven, namely, CTBP1-DT, GNAS-AS1, OIP5-AS1, RHPN1-AS1, and SNHG7, were marked by GeneCards as being localized in multiple subcellular regions, such as nucleus and cytoskeleton or cytosol (see
Table 9).
Overall, while lncRNA localization may not be cell-line-specific in general, the notion of switching genes and the demonstration of specific examples highlight the challenge of accurately predicting lncRNA localization via computational methods.