mintRULS: Prediction of miRNA–mRNA Target Site Interactions Using Regularized Least Square Method

Identification of miRNA–mRNA interactions is critical to understand the new paradigms in gene regulation. Existing methods show suboptimal performance owing to inappropriate feature selection and limited integration of intuitive biological features of both miRNAs and mRNAs. The present regularized least square-based method, mintRULS, employs features of miRNAs and their target sites using pairwise similarity metrics based on free energy, sequence and repeat identities, and target site accessibility to predict miRNA-target site interactions. We hypothesized that miRNAs sharing similar structural and functional features are more likely to target the same mRNA, and conversely, mRNAs with similar features can be targeted by the same miRNA. Our prediction model achieved an impressive AUC of 0.93 and 0.92 in LOOCV and LmiTOCV settings, respectively. In comparison, other popular tools such as miRDB, TargetScan, MBSTAR, RPmirDIP, and STarMir scored AUCs at 0.73, 0.77, 0.55, 0.84, and 0.67, respectively, in LOOCV setting. Similarly, mintRULS outperformed other methods using metrics such as accuracy, sensitivity, specificity, and MCC. Our method also demonstrated high accuracy when validated against experimentally derived data from condition- and cell-specific studies and expression studies of miRNAs and target genes, both in human and mouse.


Introduction
The process of microRNA (miRNA)-directed silencing of messenger RNA (mRNA) has been described as another layer of gene regulatory mechanism in many organisms including animals and plants. By means of regulating gene expression at the post-transcriptional level, miRNA are involved in a wide range of biological processes such as cell development and maintenance [1], cell-to-cell interactions [2], and cancer growth and progression [3]. Around 90% of human genes are governed and regulated by one or more miRNAs at the post-transcriptional level [4].
Factually, single miRNA can interact with multiple mRNAs and individual mRNA can also be targeted by several miRNAs, forming a far more complex network of gene regulation [5,6], which is challenging to study and understand. The interaction between miRNA (average~22-nt) and its target mRNA involve a seed region (~2-8 nucleotide long) on the miRNA, which seeks a complementary site mostly in the 3 untranslated region (UTR) of mRNA to bind with; however, perfect seed pairing (canonical interaction) is not required to form a miRNA-mRNA complex in a so-called non-canonical interaction [7,8]. In previous studies, miRNA binding sites have also been identified in the 5 UTR and coding regions [9,10]. These interactions have shown silencing effects on gene expression [11]. Recent studies also suggested that flanking regions (other than seed binding regions) at

Kernel Similarity Scores for miRNA
We developed a comprehensive scoring scheme by using relevant features that are more likely to discriminate between the binding and non-binding MTAs. The rationale for including each feature is provided below.

Free Energy (FE)-Based Similarity
Free energy of RNA molecules (miRNAs and mRNAs) is a very important property that facilitates their interactions because the energy is involved in unfolding the interaction sites to allow pairing of nucleotides between miRNAs and mRNAs. Therefore, lower overall free energy means higher stability of the miRNA-mRNA complex, which can be interpreted as higher possibility of the real interactions. Long et al., 2007 also found a correlation between the folded structure of mRNA and efficacy of miRNAs-driven repression [37]. This concept has also been previously used for the development of various miRNA-mRNA interaction prediction tools such as MiRNATIP [38], Avishkar [39], RNAhybrid [40], and other algorithms [41]. In the current work, Python package, seqfold (https://pypi.org/project/seqfold/, accessed on 28 March 2022) was used to calculate the minimum free energy of each miRNA. This program takes the nucleotide sequence of a given miRNA as input to calculate free energy (also referred as folding energy) based on the thermodynamic principles. The FE-based pairwise similarity between two miRNAs m i and m j is calculated as Euclidean distance (Appendix A, Equation (A1)) and is denoted as FE m m i , m j . The pairwise matrix representing FE-based similarity between all miRNAs is denoted as FE m .

Gaussian Interaction Profile (GP) Kernel Similarity (Based on Known Associations)
The application of GP-based similarity has been successfully implemented in predicting drug-target interactions [42,43], drug-drug interactions [44], and miRNA-disease associations [45]. Here, GP kernel similarity between two miRNAs, m i and m j , is defined as GP m m i , m j .
IP(m i ) is the binary vector representing the interaction profile of miRNA, m i . ϕ m is selected to adjust the kernel width and can be calculated as: nm equals the total number of selected miRNAs. Based on previous studies [46], ϕ m is set to 1. As defined above, pairwise matrix of GP-based similarities of selected miRNAs is denoted as GP m .

Needleman's Sequence Similarity
As evident from experimentally verified miRNA-target pairs, miRNA with similar seed sequences are more likely to regulate a similar set of genes [47]. Based on this line of thought, the sequence-based pairwise similarity score was calculated using Needleman-Wunsch algorithms [48]. The similarity score between two miRNAs, m i and m j is denoted as NS m m i , m j , and the whole pairwise matrix is represented by NS m .

Simple Sequence Repeats (SSRs)-Based Similarity
SSRs are repetitive nucleotide sequences and are considered as important binding signatures embedded at the genetic level. Previous study found that miRNAs binding to complementary regions with SSRs showed perturbation in the RNA cross-talks in case of myotonic dystrophy type 1 (DM1) and type 2 (DM2) [49]. Considering the significance of SSRs in mRNA binding, we extracted repeat motifs (RF) from each miRNA using ssrtool (https://archive.gramene.org/db/markers/ssrtool, accessed on 20 November 2021). With the filtering criteria of minimum 3 repeats, we found 12 di-, 51 tri-, and 32 tetramers in all miRNAs. Considering the repeat counts in each miRNA, the Gaussian profile based pairwise similarity SR m m i , m j between miRNAs, m i and m j are calculated as follows: where RF(m i ) and RF m j are binary vectors representing all RFs in miRNAs m i and m j . Again, ϕ m is selected to adjust the kernel width and can be calculated as: As explained above, ϕ m is set to 1 in this case. nm is the total number of selected miRNA and SR m represents the corresponding pairwise matrix of SR-based similarities.

Integration of miRNA Similarity Scores
All four types of feature scores were combined by employing a weighted combination approach to obtain an integrated similarity matrix, S m , as defined below: where α i represents weights given to the different similarities.

Kernel Similarity Scores for miTS
Similar to the scores for miRNAs, we employed a set of discriminatory features for miTS as follows.

FE-Based Similarity between miTS
The seqfold tool was used in similar manner to calculate the minimum free energy of each miTS, followed by calculation of FE-based similarity between two miRNA binding Genes 2022, 13, 1528 5 of 25 sites, t i and t j , as denoted by FE t t i , t j . The final symmetrical matrix of pairwise FE-based similarities is termed as, FE t .

Target Site Accessibility (TA)-Based Similarity
Accessibility of the miRNA target site is responsible for easing miRNA binding and subsequent miRNA-driven regulation [6,15]. We calculated accessibility of miTS using RNAplfold module of ViennaRNA package (http://www.tbi.univie.ac.at/RNA/, accessed on 20 November 2021). The pairwise similarity between TAs of two miTS t i and t j is calculated based on Euclidean distance and is denoted as TA t t i , t j . The matrix representing score for chosen miTS is termed as TA t .
2.3.3. AU Content (AU)-Based Similarity mRNA can be folded to form a secondary structure which might hinder the repression potency of miRNA by lowering the site accessibility [50]. A previous study suggested that lowering the GC content (or high local AU content) near the target sites and also in the 3 UTR region of mRNA increases accessibility to interact with miRNA [6,51]. Therefore, the GC content on each miTS was calculated separately, followed by calculation of pairwise AU-based similarity between two miTS, t i and t j based on Euclidean distance (Appendix A, Equation (A2)), and is dented by AU t t i , t j . The final similarity matrix of AU-based similarities between different miTS is represented by AU t .

Simple Sequence Repeats (SSRs)-Based Similarity
Similar to miRNAs, SSR motifs were extracted from each miTS with the same filtering criteria, and Gaussian profile-based pairwise similarity SR t t i , t j , between miTSs, t i and t j were calculated. Here, we denote the whole pairwise matrix of all miTS as SR t .

Integration of miTS's Pairwise Similarities
Similar to the miRNAs analysis, different similarity matrices were combined with providing specific weightage β i to each one, as described below, to get final matrix S t .
β i provides weights given to a particular feature.

mintRULS
We developed a computational model, mintRULS, which utilizes known MTAs to predict possible interactions while incorporating multiple similarity-based kernels of miRNA and miTS. The relevance score is calculated based on Kronecker product and the regularized least square (RLS) method. The adjacency matrix, A nm×nt was generated to describe the known and unknown associations between nm miRNAs and nt miTS. For known associations between miRNA m i and miTS t j , the association value A m i ×t j was assigned 1, else 0.
As illustrated in Figure 1, out of the whole interaction data a random dataset with k number of miRNAs M = {m 1 , m 2 , . . . m k }, and l number of target sites T = {t 1 , t 2 , . . . t l } is selected to form random adjacency matrix A k×l ⊂ A nm×nt . The samples for training can be prepared as S = {(x 1 , y 1 ) , (x 2 , y 2 ), . . . (x n , y n )}, where x i and y i represent miRNA-miTS pair and corresponding binary level in the adjacency matrix, respectively with n = k × l. Further, as explained in [52], using the labeled training samples S, the following objective function is minimized with the goal of learning a function to generalize it on new miRNA-miTS samples.
|| || is the norm of function measured in Hilbert space with kernel function .
The regularization parameter > 0 is adjusted for balancing prediction error and model complexity.
According to Representer Theorem [53], the function in the above equation can be expressed in the following form to get minimizer of the objective function . Further, as explained in [52], using the labeled training samples S, the following objective function J is minimized with the goal of learning a function f to generalize it on new miRNA-miTS samples.
f k is the norm of function f measured in Hilbert space with kernel function K. The regularization parameter λ > 0 is adjusted for balancing prediction error and model complexity.
According to Representer Theorem [53], the function f in the above equation can be expressed in the following form to get minimizer of the objective function J.
As calculated in [54], || f || 2 K = α T Kα, the function can be represented as follows: As previously mentioned in [55], α in the above equation can be calculated by solving following linear equation: where K is the Kronecker product of two kernel similarities functions, K = S m ⊗ S t , with S m and S t as integrated similarity matrix of chosen miRNA and miTS. I is the identity matrix. As referred in the previous studies [56,57], the eigen decomposition of the kernel matrices S m and S t are performed as follows: In the above eigen decomposition, Q m and Q T m represent eigenvalue vector and its transpose, respectively for miRNAs. Similar notations stand for miTS. Λ m and Λ t are the diagonal matrices. α in Equation (9) can be calculated as follows: The performance of mintRULS model was evaluated by conducting cross-validation (CV) mainly in two ways: (1) Leave-One-Out-CV (LOOCV) and (2) Leave-miTS-Out-CV (LmiTOCV), using human and mouse datasets, separately. LOOCV refers to the condition when one MTA is considered as a test sample while the remaining ones in the adjacency matrix A k×l are considered as training samples. In LmiTOCV, 10% of all miTS and their associations with miRNA are considered as test data while remaining MTAs in A k×l are kept for training the model. To make the simulation process computationally inexpensive, the random k miRNA and l miTS are chosen from the original adjacency matrix A nm×nt to form a sample adjacency matrix A k×l , with k = nm and l = 0.1 × nt. This randomization is iterated over 100 times to reduce impacts of data overfitting, and the model is simulated each time in both the environments, LOOCV and LmiTOCV.

Score Normalization and Performance Evaluation
Actual and predicted miRNA-miTS interactions were used to calculate true positive rate (TPR), and false-positive rate (FPR). Receiver operating characteristics (ROC) curve was drawn to determine the area under ROC curve (AUC) for estimating the performance of the models. Additionally, other parameters such as accuracy, sensitivity, specificity, and MCC were also calculated for human and mouse datasets, separately. Minimum miTS sequence length as 40 and 30 nucleotides were considered to perform simulations in case of human and mouse, respectively. In the present analysis, AUC with values 0.5 meant the model can predict randomly, while AUC = 1 indicated the best performance of the model. Further, mintRULS-predicted scores were normalized using unity-based methods to classify the miRNA-miTS pairs, as explained below: where a = 0, and b = 1 was set in current model. X is the derived normalized score of predicted score X for an interacting miRNA-miTS pair. X min and X max are minimum and maximum mintRULS score obtained for that miRNA across all miTS. The normalized score will provide space to define the strengths of the predicted interactions rather than classifying them in binary (on/off) relationships. All the pairs were divided into three categories based on quantile normalization of the score. The lower and upper quartile lines are considered as boundaries between each category, as defined below: Weak Targets: <lower quartile (25th quartile). Moderate Targets: between lower quartile (25th quartile) and upper quartile (75th quartile). Strong Targets: >upper quartile (75th quartile).

Comparison with Previous Methods
We also compared mintRULS predictions with the previous popular tools and databases which include miRDB, TargetScan, MBSTAR, RPmirDIP, and STarMir [24]. To make the comparison methodologically relevant and effective, we also included the tools whose working strategies directly or indirectly focus on features of miRNAs and their target sites. More specially, the objective here is to compare prediction power of mintRULS with other tools, which will subsequently help to understand importance of inclusion of multiple features (in pairwise manner) over single features. The interacting pairs predicted by these resources were obtained as of 20 March 2021.
MBSTAR is a machine learning program that extracts features from validated potential binding sites in the mRNA and use them to train the classifier and predict target and nontarget mRNAs. Further, by using random forest classifier, the algorithm predicts functional binding sites in the mRNA. To choose a dataset of highly interacting miRNA-mRNA target pairs, all human sequence pairs with scores higher than 0.5 were considered as positive pairs and included in the present comparative analysis. miRDB database contains miRNA-target pairs predicted by MirTarget, which is an algorithm trained by using crosslinking immunoprecipitation (CLIP)-based binding and miRNA expression data using the SVM machine learning framework. The algorithm looks for the common features which are associated with both miRNA and downregulation of the target. As a prediction score, the algorithm generates a probability score between 0 and 100 for each target site. In case of multiple target sites on mRNA, the individual score is combined to calculate final score. miRDB provides only interacting pairs with score > 50. Here, we downloaded all human interacting pairs and compared with mintRULS's predictions.
STarMir, a web server, was developed on a logistic modeling framework and trained using CLIP data. The method incorporates a variety of thermodynamic, structural, and sequence-based features for seed and non-seed regions as well as different regions (e.g., (3 UTR, CDS and 5 UTR)) on mRNA. In terms of the prediction score, the model outputs the probability score representing miRNA-target site interactions. As discussed in the article, predictions with the probability score of 0.75 or higher give highly likely interacting pairs. Therefore, only highly interacting pairs were considered in this analysis for comparison.
TargetScan predicts miRNA-target interactions by matching conserved 8-mer, 7-mer, and 6-mer sites in the seed region. TargetScanHuman (v 7.2) (https://www.targetscan.org/vert_80/, accessed on 20 March 2021) utilizes various binding sites related characteristics and 14 features to predict interactions between miRNA and its targets. From the database, interacting pairs with weighted context++ score percentile higher than 50 were considered as positive pairs in the comparative analysis.
RPmirDIP provides interacting pairs predicted by mirDIP (microRNA Data Integration Portal) [58] which uses a semi-supervised machine learning method "Reciprocal Perspective (RP)". In the present analysis, all the pairs with the recommended Difference of Scores (DoS) of higher than 0.5 were considered.
The separate data matrix representing interactions between miRNA and targets were prepared for each database discussed above. The interacting and non-interacting pairs in the test dataset were searched in each data matrix, and confusion matrix was built to calculate AUC values in each case.

Using Literature-Based Data
The top predictions by mintRULS were compared with the information in literature and databases including miRDB and TargetScan.

Using Expression Data of miRNA and mRNA in Gastrointestinal (GI) Cancer
TCGA level 3 gene/mature miRNA expression data for pan-GI cancers (stomach adenocarcinoma, STAD; cholangiocarcinoma, CHOL; pancreatic adenocarcinoma, PAAD; esophageal carcinoma, ESCA; and liver hepatocellular carcinoma, LIHC) were collected and analyzed using QIAGEN Ingenuity Pathway Analysis (IPA) (please refer to Supplementary Document for the methodology of IPA) to identify negative expression correlations of top predicted miRNA-mRNA pairs from mintRULS.

Using Expression Data of miRNA and mRNA in Normal and Septic Mice
The expression data of miRNAs (GSE74952 study) and genes (GSE55238 study) in control and septic mice, respectively, were downloaded from Gene Expression Omnibus (GEO) database and analyzed using GEO2R. The mintRULS predicted pairs that showed negative expression correlations were identified.
More methodological description of (c) and (d) are provided in Appendix A (method section).

Performance Evaluation of mintRULS
mintRULS achieved an average AUC of 0.93 and 0.92 on the human dataset, while it scored AUC of 0.861 and 0.865 on the mouse dataset in LOOCV and LmiTOCV simulation environments, respectively ( Table 1). The ROC profile indicating AUC measurements in both the cases are shown in Figure 2A,B. The model also recorded high accuracy at 90.8% and 91% in LOOCV and LmiTOCV simulations, respectively, using human data, supporting its strong prediction ability. In the case of mouse also, the achieved accuracies were 84.6% and 84.4% in LOOCV and LmiTOCV settings (Table 1). For more intuitive evaluations, high measurements of the other parameters including MCC, specificity, and sensitivity (Table 1) indicated high performance of the model on human as well mouse datasets. In case of mouse, the prediction performance of the model has been observed to be comparatively similar in both the simulation environments. In addition, the high specificity indicates the better ability for identifying specific interactions between miRNA and miTS in the mouse. We therefore interpreted that the model has the ability to predict miRNA-target site interactions. supporting its strong prediction ability. In the case of mouse also, the achieved accuracies were 84.6% and 84.4% in LOOCV and LmiTOCV settings ( Table 1). For more intuitive evaluations, high measurements of the other parameters including MCC, specificity, and sensitivity (Table 1) indicated high performance of the model on human as well mouse datasets. In case of mouse, the prediction performance of the model has been observed to be comparatively similar in both the simulation environments. In addition, the high specificity indicates the better ability for identifying specific interactions between miRNA and miTS in the mouse. We therefore interpreted that the model has the ability to predict miRNA-target site interactions.   Figure 3).

Evaluation of Regularization Parameter (λ)
As defined in the method section, tuning the regularization parameter (λ) is important to reduce the overfitting which might decrease the variance of estimated regression parameters by adjusting the bias. Herein, we evaluated λ over different datasets in both LOOCV and LmiTOCV settings. Using the adjacency matrix A nm×nt , five different random data matrices, i.e., A 845×1000 , A 845×2000 , A 845×3000 , A 845×4000 , and A 845×5000 comprise of all 845 miRNAs and different numbers of random miTS, as shown in the subscript, were prepared. Figure A1 (Appendix A), indicated that a higher miTS number tends to provide better AUC in both LOOCV and LmiTOCV. However, it is not advisable to choose a larger number of miTS as it creates a very high number of empty cells in the adjacency matrix which eventually could lead to the underperformance of the model. Based on these results, we selected the dataset A 845×3000 as optimal for further analyses.

Evaluation of Regularization Parameter (λ)
As defined in the method section, tuning the regularization parameter (λ) is important to reduce the overfitting which might decrease the variance of estimated regression parameters by adjusting the bias. Herein, we evaluated λ over different datasets in both LOOCV and LmiTOCV settings. Using the adjacency matrix × , five different random data matrices, i.e.,  Figure A1 (Appendix A), indicated that a higher miTS number tends to provide better AUC in both LOOCV and LmiTOCV. However, it is not advisable to choose a larger number of miTS as it creates a very high number of empty cells in the adjacency matrix which eventually could lead to the underperformance of the model. Based on these results, we selected the dataset × as optimal for further analyses. Next, using the data matrix × , AUC was measured for different values of regularization parameter λ. Interestingly, as shown in Figure 4A, λ > 35 obtained the highest values of AUC corresponding to 0.931 and 0.925 in the case of LOOCV and LmiTOCV, respectively, which we interpreted as optimal in our case. With the chosen λ = 35, the model extracts favorable features from miRNA and miTS sequence with adding some obvious biases to predict miRNA-miTS interactions. Next, using the data matrix A 845×3000 , AUC was measured for different values of regularization parameter λ. Interestingly, as shown in Figure 4A, λ > 35 obtained the highest values of AUC corresponding to 0.931 and 0.925 in the case of LOOCV and LmiTOCV, respectively, which we interpreted as optimal in our case. With the chosen λ = 35, the model extracts favorable features from miRNA and miTS sequence with adding some obvious biases to predict miRNA-miTS interactions.

Evaluation of Regularization Parameter (λ)
As defined in the method section, tuning the regularization parameter (λ) is important to reduce the overfitting which might decrease the variance of estimated regression parameters by adjusting the bias. Herein, we evaluated λ over different datasets in both LOOCV and LmiTOCV settings. Using the adjacency matrix × , five different random data matrices, i.e.,  Figure A1 (Appendix A), indicated that a higher miTS number tends to provide better AUC in both LOOCV and LmiTOCV. However, it is not advisable to choose a larger number of miTS as it creates a very high number of empty cells in the adjacency matrix which eventually could lead to the underperformance of the model. Based on these results, we selected the dataset × as optimal for further analyses. Next, using the data matrix

Effect of Longer Sequence Length
The computational models have fully or partially utilized features associated with miTS sequences to predict interactions with miRNAs. As introduced earlier, GC content, accessibility, seed pairing, and flanking sequences are some of the widely used features in these models [15]; however, lack of emphasis has been given on consideration of the length of binding sites in most of the models. This is important mainly in the sense that an optimized length of miTS (including seed regions and flanking regions on both sides) can provide the best and effective features to predict more accurate interactions with miRNAs.
On this note, we performed systematic comparisons between different sequence lengths (=10, 20, 30, 40, and 50 nucleotides) of miTS to observe its impact on the model's performance. As shown in Figure 4B, the higher sequence length corresponds to better AUC, suggesting more powerful and effective features. The shorter length of miTS may possibly cause high noises in the simulation, as also stated in [61]. However, for obvious reasons, too lengthy sequences might side pass any mutational effect on miTS, and are thus not recommended. Therefore, a sequence length of 40 nucleotides was considered as the most optimal in the current analysis.

Feature Selection and Feature Contribution
The model is generalized over different weight combinations used for prioritizing features of miRNA and miTS, separately. In this simulation process, the weights associated with mRNA features were kept constantly distributed to determine individual effect by miRNA's features on model performance, as shown in Figure 5. In this case, Needleman sequence similarity and GP-based similarity showed higher contributions towards better performance of the model. Similarly, the effect of mRNA features was observed individually with no significant differences in the measured AUC values ( Figure 5). Considering these findings, we simulated feature formulations giving more weightage to the features with more individual contributions and achieved significant improvements in AUC up to 0.93 ( Figure 5). The model achieved higher AUCs of 0.81 and 0.80 for miRNA's features, Needleman Sequence (K mi2 )-, and Gaussian profile (K mi3 )-based similarities, respectively, as compared to the other two features, free energy (K mi1 ) and SSRs Gaussian-based similarity (K mi4 ). The GP-based calculations, as their intrinsic characteristic, are done with the assumption that similar miRNAs can interact with the same targets, and vice versa, which is the base hypothesis of this study. It can also cover nonlinear relationship of known miRNA-target interactions. Previous successful applications of GP kernels include development of feature-based models for predicting drugtarget interactions, miRNA-disease associations, circRNA-disease association, drug-disease associations, and drug-drug interactions [42][43][44][45]62]. Likewise, we also interpret that similaritybased models, including the current mintRULS, have the potential to predict miRNA-target interactions. On the other hand, SSR-based features, both from miRNA or mRNA, were not so predictive, perhaps because of the non-specificity of SSRs (i.e., n = 3 or 4 or 5) considered in the present study. As there are a handful of studies showing significance of SSRs in miRNA-target binding [49,63,64], further investigation on feature manipulation is required to better incorporate these features in the similarity-based modeling. From the different features considered for mRNA, free energy, AU content, and accessibility were among the top predictors in case of mintRULS. These many features and their roles in miRNA binding have been previously discussed in the literature [14,32,65], with raising questions on their systematic integration and incorporation to predictive modeling which is still a challenge to the model developers.

Validation
Interacting pairs between miRNA hsa-miR-548ba and three genes which include IFR, PTEN, and NEO1, were classified as "Strong Target", and showed consistency with the results in [59] (Table 2). Similarly, from the study [60], interacting pairs between miRNA hsa-miR-34a-5p and genes including SMAD7, SMAD2, CREB1, and CLOCK, were predicted as "Strong Target", while binding of hsa-miR-34a-5p with GRIA4 was predicted as "Weak Target". It is interesting to notice that most predicted results are consistent with the outcomes of the experimental studies ( Table 2). The interaction between these many pairs were also confirmed by performing protein level analysis in SH-SY5Y cells in the same study. Other interactions such as hsa-miR-22 with BMP-7/6, hsa-miR-146a-3p with TRAF6 and RIPK2, and hsa-miR-125b with PARP1, p53, Beta-actin, and CPSF6 from different studies were also verified and found consistent with the experimental outcomes ( Table 2).
Genes 2022, 13, x FOR PEER REVIEW 13 of 28 different features considered for mRNA, free energy, AU content, and accessibility were among the top predictors in case of mintRULS. These many features and their roles in miRNA binding have been previously discussed in the literature [14,32,65], with raising questions on their systematic integration and incorporation to predictive modeling which is still a challenge to the model developers.

Validation
Interacting pairs between miRNA hsa-miR-548ba and three genes which include IFR, PTEN, and NEO1, were classified as "Strong Target", and showed consistency with the results in [59] (Table 2). Similarly, from the study [60], interacting pairs between miRNA hsa-miR-34a-5p and genes including SMAD7, SMAD2, CREB1, and CLOCK, were predicted as "Strong Target", while binding of hsa-miR-34a-5p with GRIA4 was predicted as "Weak Target". It is interesting to notice that most predicted results are consistent with the outcomes of the experimental studies ( Table 2). The interaction between these many pairs were also confirmed by performing protein level analysis in SH-SY5Y cells in the same study. Other interactions such as hsa-miR-22 with BMP-7/6, hsa-miR-146a-3p with TRAF6 and RIPK2, and hsa-miR-125b with PARP1, p53, Beta-actin, and CPSF6 from different studies were also verified and found consistent with the experimental outcomes ( Table 2). The experimentally validated negative interactions between hsa-miR-125b and Beta-actin, and 18S RNA with gld-1:gfp mRNA were also predicted correctly as 'Weak Targets' (below 25th percentile) by mintRULS ( Table 2).
We also checked the performance of mintRULS for predicting interactions when mutation(s) in the seed region of miRNA occur. To perform this experiment, mutation information of a few randomly selected miRNAs in human (e.g., hsa-miR-124-3p, hsa-miR-662, hsa-miR-125a-5p, etc.) and mouse (e.g., mmu-miR-342-5p, mmu-miR-690, and mmu-miR-743a-3p) along with the effects on the interactions with their target genes were downloaded from the PolymiRTS database [66]. In total, 40 pairs comprising 20 wild-type (WT) and 20 mutated (mut) miRNAs with target genes were included for this experiment. The mutation-driven changes in the interactions are described by context+ score difference (∆S), as mentioned in Table 3. Interestingly, all the WT pairs (WT miRNAs and their target genes) were predicted as "Strong Targets", while 16 (out of 20) of their mutated We also checked the performance of mintRULS for predicting interactions when mutation(s) in the seed region of miRNA occur. To perform this experiment, mutation information of a few randomly selected miRNAs in human (e.g., hsa-miR-124-3p, hsa-miR-662, hsa-miR-125a-5p, etc.) and mouse (e.g., mmu-miR-342-5p, mmu-miR-690, and mmu-miR-743a-3p) along with the effects on the interactions with their target genes were downloaded from the PolymiRTS database [66]. In total, 40 pairs comprising 20 wild-type (WT) and 20 mutated (mut) miRNAs with target genes were included for this experiment. The mutation-driven changes in the interactions are described by context+ score difference (∆S), as mentioned in Table 3. Interestingly, all the WT pairs (WT miRNAs and their target genes) were predicted as "Strong Targets", while 16 (out of 20) of their mutated counterparts were predicted as "Weak Targets", showing good consistency with the information (∆S, representing disruption in the interaction) in the PolymiRTS database. It is noteworthy that even the other four pairs (i.e., hsa-miR-125a-5p with ZMYM3, hsa-miR-645 with COL4A4, mmu-miR-342-5p with RASL10B, and mmu-miR-690 with RBBP5) involving the mutated miRNAs were predicted as "Moderate Targets" but not as "Strong Targets", showing that the predictions are somewhat consistent with the ∆S (Table 3). We also considered a special case study by Dash et al., 2020, where interactions of hsa-miR-124-3p with WT PARP-1 and its mutant were observed. In this case, mintRULS performed very well by correctly classifying interactions of the miRNA with WT PARP-1 and with four of its variants (Mut1, Mut2, Mut3, and Mut4) ( Table 3).   Higher value of the context+ score difference (∆S) indicates an increased likelihood disruption of interactions between miRNA and target gene. * Entries for mutation in miRNAs. The values without * represents WT cases.
Other than the aforementioned case specific validation, we also compared mintRULS predictions with the information in literature and databases. Table 4 listed a few of such miRNA and their target genes which are also mentioned in literature and databases, along with the mintRULS's classifications. In most cases, the model's classifications corroborate with the information in literature and databases, with identifying few novel interactions.

Supporting Predictions by Expression of miRNA and mRNA in Human and Mouse
Comparison between differentially expressed miRNA and genes, IPA results ("High predicted" or "Experimentally observed pairs only), showed that that most of the IPA filtered pairs were predicted either as "Strong Target" or "Moderate Target", with only a few as "Weak Target" by our model (Table 5). In case of ESCA, 7 downregulated miRNAs were found associated with 26 upregulated target genes, while 10 upregulated miRNAs showed opposite expression correlation with 13 target genes ( Figure 6A). Similarly, in LIHC, 3 upregulated miRNAs were associated with 2 downregulated genes; and conversely, 7 downregulated miRNAs showed associations with 20 upregulated target genes. We also identified 28 miRNA-gene pairs with 18 upregulated miRNAs and 24 downregulated genes in STAD. In case of CHOL, 27 downregulated miRNAs with 97 upregulated target genes, and 17 upregulated miRNAs with 58 downregulated target genes associations were identified ( Figure A2, Appendix A). Not enough interacting pairs were identified in PAAD to carry forward in further analysis. Interestingly, the interacting pairs which showed experimental evidence in IPA analysis were all predicted as "Strong Target" by our method, indicating the strong predictability of the model. The detail of the interacting pairs with the FC values, IPA results, and mintRULS classifications are provided in Table S1 (Supplementary Material).  Table 5. The summary of miRNA-target gene pairs with opposite expression correlation of associated miRNA and target genes. The only pairs which showed "Experimental evidence" or "High prediction" in IPA analysis were selected. The corresponding columns also list pairs which were predicted as "Strong Target", "Moderate Target", and "Weak Target". * All the miRNA-gene pairs which showed "Experimental evidence" in IPA were predicted as "Strong Target" in mintRULS. For detailed information, Supplementary target genes, and 17 upregulated miRNAs with 58 downregulated target genes associations were identified ( Figure A2, Appendix A). Not enough interacting pairs were identified in PAAD to carry forward in further analysis. Interestingly, the interacting pairs which showed experimental evidence in IPA analysis were all predicted as "Strong Target" by our method, indicating the strong predictability of the model. The detail of the interacting pairs with the FC values, IPA results, and mintRULS classifications are provided in Table S1 (Supplementary Material). Table 5. The summary of miRNA-target gene pairs with opposite expression correlation of associated miRNA and target genes. The only pairs which showed "Experimental evidence" or "High prediction" in IPA analysis were selected. The corresponding columns also list pairs which were predicted as "Strong Target", "Moderate Target", and "Weak Target". * All the miRNA-gene pairs which showed "Experimental evidence" in IPA were predicted as "Strong Target" in mintRULS. For detailed information, Supplementary In case of mouse, analysis by GEO2R filtered in 11 differentially expressed miRNAs between normal and septic mice, while 5715 mRNA transcripts were differentially expressed. The integration of mintRULS predictions for all 11 miRNAs and the differentially expressed mRNAs identified 15 miRNA-mRNA pairs between 4 miRNAs and 10 mRNAs which also have a negative expression correlation between them ( Figure 6B). The normalized predicted mintRULS score, classification, and other related information for each pair are provided in Table S2 (Supplementary Material).

Discussion
The increasing importance of miRNAs in regulating many biological processes in cells and the overall human physiology is evident from several studies. One of the major challenges in this field is the identification of functional interactions between miRNAs and target genes. The advances in sequencing technologies and the growing volume of reliable data on miRNAs and their target sites on genes have greatly facilitated studies to predict the unknown and biologically relevant interactions. Bioinformatics solutions in this realm are very diverse and inconsistent in the sense that they incorporate unique characteristics in their algorithms and provide contradictory results [77]. Several machine learning models have utilized learning features for predicting miRNA-miTS interactions but could not achieve optimal performance due to the limitations in feature selection and lack of systematic integration of multiple features. In case of mouse, analysis by GEO2R filtered in 11 differentially expressed miRNAs between normal and septic mice, while 5715 mRNA transcripts were differentially expressed. The integration of mintRULS predictions for all 11 miRNAs and the differentially expressed mRNAs identified 15 miRNA-mRNA pairs between 4 miRNAs and 10 mRNAs which also have a negative expression correlation between them ( Figure 6B). The normalized predicted mintRULS score, classification, and other related information for each pair are provided in Table S2 (Supplementary Material).

Discussion
The increasing importance of miRNAs in regulating many biological processes in cells and the overall human physiology is evident from several studies. One of the major challenges in this field is the identification of functional interactions between miRNAs and target genes. The advances in sequencing technologies and the growing volume of reliable data on miRNAs and their target sites on genes have greatly facilitated studies to predict the unknown and biologically relevant interactions. Bioinformatics solutions in this realm are very diverse and inconsistent in the sense that they incorporate unique characteristics in their algorithms and provide contradictory results [77]. Several machine learning models have utilized learning features for predicting miRNA-miTS interactions but could not achieve optimal performance due to the limitations in feature selection and lack of systematic integration of multiple features.
To address some of these limitations, we employed a comprehensive list of learning features and trained them on a large experimental dataset to predict target sites with high accuracy. A special aspect of the current method includes the incorporation of pairwise similarities between various features of miRNA and miTS to improve the performance of the prediction model. The strategy for integrating pairwise correlation between miRNAs and miTS is useful for proving our hypothesis that similar miRNAs are more likely to target the same target site; and similar miTS tend to be targeted by the same miRNA. The real conditions for miRNA-miTS interactions depend on several factors such as target site accessibility [78] and complex stability [79]. mintRULS employed several of such features including binding free energy, the abundance of SSRs, and target site accessibility in the training process to develop an integrated objective scoring system. The working postulate of our method is different from those of the existing methods as evidenced by its superior prediction performance (with an AUC of 0.93) over miRDB, TargetScan, MBSTAR, RPmirDIP, and STarMir using human dataset. We attribute the performance advantage of mintRULS to its discrete feature selection and the integrated scoring function. As shown in Figure 5, the kernels built from individual features of miRNAs and miTS fairly performed with the highest AUC of 0.82, but the integrated kernel comparatively achieved higher AUC of 0.93, showing the successful integration of different sequence-derived features of miRNAs and mRNAs in a similarity-based fashion to train the model for predicting interaction pairs. The 100-fold randomization of the training dataset to train the model is extremely powerful to avoid prediction overfitting. Further, validation of predicted interacting pairs using different datasets, i.e., previous gene expression studies, literaturebased findings, IPA knowledgebase with experimental and predicted interactions, and the expression data of miRNA and the target genes in four type of GI cancers (Table 5  and Table S1) showed the potential of the current model to make biologically relevant predictions. Moreover, the capability of mintRULS to predict interactions between gene and miRNAs in WT as well as mutated cases is extremely promising (Table 3).
We also demonstrated that mintRULS program can be used to predict miRNA-miTS interactions in mouse with a reasonable AUC of 0.86. The interacting miRNA-mRNA pairs that show opposite expression correlation between normal and septic mice are in support of the predictions. Negative expression correlation between miRNA and target mRNA is not a clear indication of interactions between them, but throws the high possibility, which can be confirmed by further experiments.
Overall, validation of our top predictions in human and mouse shows the robustness and superior ability of mintRULS to predict miRNA and their target site interactions. Despite obtaining high performing and reliable prediction, mintRULS have worth-noticing limitations, which mainly include lack of an experimentally validated negative dataset, and exclusion of miRNA or target abundance information. The miRNA-gene interactions are surrounded by many of the complex networks such as protein-protein interactions and gene-gene interactions, which along with the other reliable biological information could be incorporated in the future to further improve the prediction accuracy and to extend this method to predict miRNA-gene interactions in other species as well.

Conclusions
We developed a regularized least square (RLS)-based method, mintRULS, which uniquely utilizes multiple feature similarity-based metrics of miRNA and target sites to predict their interactions in human and mouse. mintRULS achieved the highest AUC of 0.93 and 0.86 in case of human and mouse, respectively. The multiple iteration and randomization strategy has helped reduce data overfitting while improving generalization and prediction performance. In comparison to other methods that include miRDB, TargetScan, MBSTAR, RPmirDIP, and STarMir, mintRULS demonstrated superior prediction ability. The model successfully utilized the existing knowledgebase as well as its unique design for pairwise incorporation of different features of miRNAs and mRNAs to predict interactions between them. Further, rigorous validation of the top predictions using multiple data sources showed outstanding capability and reliability of the model. Our method also identified new miRNA-mRNA interacting pairs such as hsa-let-7d-5p and TIMP3, hsa-let-7e-5p and ZBTB7A, and hsa-miR-106b-5p and ATAT1, which needs to be validated by further experimental studies.
We anticipate that the current method could be easily adopted to predict miRNA-gene interactions in other species as well to improve our knowledge of miRNA-regulated gene expression at the post-transcriptional level in different species.   RNAseq Data Processing TCGA level-3 gene expression data for pan-GI cancers (ESCA-esophageal carcinoma, STAD-stomach adenocarcinoma, COAD-colon adenocarcinoma, READ-rectum adenocarcinoma, PAAD-pancreatic adenocarcinoma, CHOL-cholangiocarcinoma, LIHC-liver hepatocellular carcinoma) containing fragments per kilobase of transcript per million mapped reads upper quartile (FPKM-UQ) data were downloaded using a R Bioconductor tool, TCGAbiolinks [80]. Differential gene expression analysis was performed using Bioconductor tool, limma. The genes were considered differentially expressed at a false discovery rate (FDR) < 0.05 and abs (log 2 FC ≥ 1) as a cut-off.
miRNAseq Data Processing TCGA level-3 miRNASeq data for Pan-GI cancers (ESCA, STAD, READ, CHOL, PAAD, LIHC) containing reads per million (RPM) counts for each mature miRNA were downloaded from TCGA GDAC Firehose. The IDs were mapped to miRbase mature miRNA name and accession ID. We first removed all miRNA with missing expression values (in at least 25% of the samples) and also miRNA which had CPM (count per million) numbers less than one (in at least 25% of the samples). Differential miRNA expression analysis was performed using limma [81]. Benjamini-Hochberg (BH) adjusted p-value cut-off of 0.05, and an absolute log 2 fold change (FC) of 1 was used to obtain the list of differentially expressed miRNAs. Since mature miRNA counts for normal samples were not available for READ and COAD, these cancers were not considered for further processing.
miRNA Target Identification Using QIAGEN Ingenuity Pathway Analysis (IPA) Target genes of all differentially expressed miRNAs were identified using IPA Target filter (QIAGEN Inc., https://www.qiagenbioinformatics.com/products/ingenuitypathwayanalysis), accessed on 20 July 2021. Further, differentially expressed miRNAs were paired to differentially expressed mRNA targets to prioritize the identified miRNA-mRNA relationship, especially the ones which have negative expression correlation.
The workflow for integrating IPA results with the mintRULS predictions are illustrated in Figure A3.
GEO2R analyzer was used to find differentially expressed miRNAs and genes. Further, a python script was developed to map mintRULS predictions and differentially expressed miR-NAs/genes to identify interacting miRNA-gene pairs which have negative expression correlation.

Appendix A.2. Calculation of Euclidean Distance Using Features
To calculate pairwise similarity between either two miRNAs or two mRNAs, the Euclidean distance (ED) was calculated by taking miRNA/mRNA's signatures into account, as described below.
In case of miRNAs, ED mi i , mi j is the ED between miRNAs mi i and mi j . Fmi i and Fmi j are the signatures (e.g., Free energy) of miRNAs mi i and mi j , respectively.
In case of mRNAs, ED m i , m j is the ED between mRNAs m i and m j . Similar to the illustration in case of miRNAs, Fm i and Fm j are the signatures of mRNAs m i and m j , respectively. Here, n is equal to 1 for both miRNAs and mRNAs.