Construction and Comprehensive Analysis of a Molecular Association Network via lncRNA–miRNA–Disease–Drug–Protein Graph

One key issue in the post-genomic era is how to systematically describe the associations between small molecule transcripts or translations inside cells. With the rapid development of high-throughput “omics” technologies, the achieved ability to detect and characterize molecules with other molecule targets opens the possibility of investigating the relationships between different molecules from a global perspective. In this article, a molecular association network (MAN) is constructed and comprehensively analyzed by integrating the associations among miRNA, lncRNA, protein, drug, and disease, in which any kind of potential associations can be predicted. More specifically, each node in MAN can be represented as a vector by combining two kinds of information including the attribute of the node itself (e.g., sequences of ncRNAs and proteins, semantics of diseases and molecular fingerprints of drugs) and the behavior of the node in the complex network (associations with other nodes). A random forest classifier is trained to classify and predict new interactions or associations between biomolecules. In the experiment, the proposed method achieved a superb performance with an area under curve (AUC) of 0.9735 under a five-fold cross-validation, which showed that the proposed method could provide new insight for exploration of the molecular mechanisms of disease and valuable clues for disease treatment.


Introduction
There are many types of biomolecules inside living cells that form multiple associated regulatory networks as pathways or direct participants to maintain a wide variety of life activities and key functions [1][2][3]. For instance, protein-protein interactions play a key role in numerous life processes and maintain many functions of normal cells. There is also growing evidence that ncRNAs are involved in cell growth and apoptosis, leading to numerous diseases. Therefore, predicting the potential associations between small molecule transcripts and compounds not only helps people to understand important cell activities at the molecular level, but is also significant for prevention, diagnosis, and treatment of disease, as well as genomic drug discovery [4,5]. In fact, it is unrealistic to verify the existence of association between such large-scale nodes one by one through biological experiments under the constraints of time and cost. In addition, the results of the experimental methods will be accompanied with higher false positives and false negatives due to various external factors [6]. Benefiting from the development of high-throughput technologies such as microarray, q-PCR, and yeast two-hybrid screens (Y2H) [7,8], construction of association prediction framework that provide a   In this study, we present a model to predict the relationship between any small molecules in a cell based on the node attribute and node behavior through a more systematic and comprehensive view. The complex associations network of biomolecules (as shown in Figure 1 and Figure 2) consists of two parts: Nodes (miRNA, long noncoding RNA (lncRNA), protein (target), drug, and disease) and edges (the relationships between nodes). Determining the edges between any two nodes in the whole complex network helps people to have a deep and comprehensive understanding of various life activities in living organisms from another micro perspective [21,22].
Firstly, nine kinds of molecular associations, such as miRNA-disease association, proteinprotein interaction, lncRNA-disease associations, and drug-target interaction were collected to consider the relationship between each node and any other kind of node in a global way. After de-redundancy and repetition, five research objects such as miRNA, protein, and drug were obtained and co-combined to construct a complex heterogeneous network in an entire view at the cellular level. Secondly, each node can be represented in two ways. One is the node intrinsic attributes such as the sequence of ncRNA and protein, the molecular fingerprint of drug, and the semantics of disease. The other way is to represent nodes as vectors through network embedding based on the relationship between nodes. Thirdly, all known associations are treated as positive samples, and an equal number of unknown associations are randomly selected as negative samples, which together serve as a training set. Random forest is selected as a classifier for training verification and testing. The five-fold cross validation was adopted to evaluate the proposed method, and we also compared the performance of different types of features, classifiers, and previous methods. The results indicate that our method combined intrinsic attribute feature and behavior feature could achieve an effective and robust prediction performance. The construction of systematic and complex molecular association network offers a new view, which can help us better understand biology and disease pathologies. To the best of our knowledge, we are the first to construct a molecular association network (MAN) using associations between lncRNA, miRNA, disease, drug, and protein. We hope that this work will inspire more research on representational learning on biological networks.

Construction of the Molecular Association Network
In order to systematically and holistically establish the molecular association network, known associations between biological molecule transcripts (miRNA, lncRNA, and protein), diseases and drugs were downloaded from multiple databases. The details of the final experimental data obtained after performing the inclusion of identifier unification, de-redundancy, simplification, and deletion of the irrelevant items are shown in the following Table 1. After aggregating the above database, we separately classified the different nodes to get the final statistics as shown in the following Table 2.

NcRNA and Protein Sequence
The sequences of miRNA, lncRNA, and protein are downloaded from miRbase [29], NONCODE [30], and STRING [11], respectively, to subsequently represent the attribute of the node. For the sake of simplicity, we chose to encode ncRNA sequences using a 64 (4 × 4 × 4) dimensional vector, in which each feature represents the normalized frequency of the corresponding 3-mer appearing in the RNA sequence (e.g., ACG, CAU, UUG). Inspired by the article of Shen et al. [31], in the process of protein sequence encoding, 20 amino acids are classified into four classes according to the polarity of the side chain including Ala, Val, Leu, Ile, Met, Phe, Trp, and Pro; Gly, Ser, Thr, Cys, Asn, Gln, and Tyr; Arg, Lys, and His; and Asp and Glu. Thus, each protein sequence can be represented by a 3-mer that is a 64 (4 × 4 × 4) dimensional vector and each dimension of the vector representing the normalized frequency of the corresponding 3-mer in the sequence.

Disease MeSH Descriptors and Directed Acyclic Graph
The Medical Subject Headings (MeSH) is a comprehensive searchable control vocabulary, which is organized by the National Library of Medicine furnished a rigorous index for journal articles and books in the life sciences. The top-level categories in the MeSH descriptor hierarchy are: Anatomy , and so on. In this system, each disease can be represented by a directed acyclic graph (DAG) generated through its MeSH, accurately and objectively describe its own characteristics. The details to describe the disease with DAG is as follows: DAG(D) = (D, N(D), E(D)), N(D) is the set of points that contains all the diseases in the DAG(D). E(D) is the set of edges that contains all relationships between nodes in the DAG(D). An example of the disease Astrocytoma's DAG is as follows in Figure 3. , and so on. In this system, each disease can be represented by a directed acyclic graph (DAG) generated through its MeSH, accurately and objectively describe its own characteristics. The details to describe the disease with DAG is as follows: DAG(D) = (D, N(D), E(D)), N(D) is the set of points that contains all the diseases in the DAG(D). E(D) is the set of edges that contains all relationships between nodes in the DAG(D). An example of the disease Astrocytoma's DAG is as follows in Figure 3. For the diseases that are included in MeSH, the semantic similarity that is calculated by means of DAG can be chosen to represent the disease according to the previous literature [32]. The semantic similarity between different diseases can be defined as follows. In DAG of disease D, the contribution of any ancestral disease t to disease D is as the formula: ∆ is the semantic contribution factor and equals to 0.5 according to the literature mentioned above. The contribution of disease D to itself is 1 and the contribution of other nodes to D will be attenuated due to ∆. Based on Equation (1), we can obtain the sum of the contributions of all diseases in DAG to D: Similar to the Jaccard similarity coefficient, the semantic similarity between the diseases i and j can be calculated by the following formula: For the diseases that are included in MeSH, the semantic similarity that is calculated by means of DAG can be chosen to represent the disease according to the previous literature [32]. The semantic similarity between different diseases can be defined as follows. In DAG of disease D, the contribution of any ancestral disease t to disease D is as the formula: ∆ is the semantic contribution factor and equals to 0.5 according to the literature mentioned above. The contribution of disease D to itself is 1 and the contribution of other nodes to D will be attenuated due to ∆. Based on Equation (1), we can obtain the sum of the contributions of all diseases in DAG to D: Similar to the Jaccard similarity coefficient, the semantic similarity between the diseases i and j can be calculated by the following formula:

Drug Molecular Fingerprint
The smiles of drugs were downloaded from DrugBank and then transformed into the corresponding Morgan molecular fingerprint by the python package.

Stacked Autoencoder
In order to reduce noise and normalize attribute information to a uniform dimension, a stacked autoencoder was employed to obtain a suitable subspace from the original feature space. Stacked Autoencoder (SAE) can be divided into two parts: The encoder that encodes the input data into corresponding representation h and the decoder that reconstructs an approximationx from the hidden representation h.
Here, we choose the ReLU function as the activation function:

Node Representation
In the entire biomolecular network, each node can be represented by the node intrinsic attributes and behavior. The attributes of the node itself can be the sequence of ncRNA, protein, the semantics of the disease, and the molecular fingerprint of the drug. The node behavior can be considered as a representation of the relationship between nodes. More specifically, in this work, we chose a method of network embedding called LINE [33] to globally represent the behavior of nodes in the entire network and the flow of information directly or latently with other nodes.
For large-scale networks, some existing network representation learning algorithms require complex computational complexity. Recently, some methods of large-scale networks either use indirect methods to reduce computational complexity or lack an explicit objective function (DeepWalk). LINE defines two similarity relationships, including the first-order proximity and the second-order proximity. The first-order similarity is defined as the node connection relationship (local feature) in the network, and the second-order similarity is defined as the common neighbor node (global feature) of the nodes that are not directly connected as a supplement to the first-order similarity. In this work, we used the network representation model LINE to learn how to represent the relationships between each node and other nodes in the entire network. In this way, an undirected edge can be considered as two directed edges with opposite directions and equal weights. The second-order proximity assumes that vertices sharing many connections to other vertices are similar to each other. For each directed edge (i, j), the probability of "context" v j generated by vertex v i is defined as: Therefore, we minimize the following objective functions: The empirical distributionP 2 (·|v i ) is defined as: For the sake of simplicity, λ i is set to the degree of the vertex i, i.e., λ i = d i . Here Kullback-Leibler (KL) divergence is used as the function of distance. After some constants are omitted, the loss function can be simplified as the following form:

Evaluate the Five-Fold Cross Validation Performance of Our Method
For five-fold cross validation, the entire data set was randomly divided into five subsets of equal size, one subset was treated as the test set in turn, and the remaining four subsets were used as the training sets to construct the classifier. Note that at the time of each cross validation, only the currently training set, i.e., 80% of the total edges, would be embedding as the behavior of the node, which avoids the leakage of test information. Although the above operations may cause some of the nodes originally in the network to become isolated, i.e., degree with 0 and these nodes may also lack attribute information at the same time. This situation can better simulate the real environment to provide support and assistance for the exploration of unknown fields by researchers through manual experiments.
A range of broader evaluation criteria including accuracy (Acc.), sensitivity (Sen.), specificity (Spec.), precision (Prec.), and Matthews Correlation Coefficient (MCC) were utilized to evaluate the proposed model more comprehensively and fairly. As shown in Table 3 and Figure 4, the results of average Acc., Sen., Spec., Prec., MCC, and area under curve (AUC) of 92.38%, 92.61%, 92.14%, 92.18%, 84.76%, and 97.35% when the proposed framework was applied to predict arbitrary associations in the whole network. Receiver operating characteristic curve (ROC) is a commonly used standard for evaluating models. Area under curve (AUC) is the area of graph that is surround by the ROC, the abscissa false positive rate (FPR), and the ordinate true positive rate (TPR). We also draw the ROC and calculated AUC to visually evaluate our proposed model at the same time under five-fold cross validation. Precision-Recall (PR) curve, whose abscissa is the recall and ordinate is the precision, was applied to evaluate the model from another angle. In conclusion, our method obtained an AUC of 0.9735 and AUPR (area under PR) that indicated that the proposed method combined two types of feature had an excellent ability to identify positive and negative samples. The higher AUC and AUPR indicated our method had a strong predictive performance and the lower variance of the results showed the proposed model was stable and robust.

Comparison of Different Feature Extraction Methods
As mentioned above, each node in the biomolecular network within the cell can be represented by two kinds of information including attribute information and behavior information. In order to evaluate the impact of each type of information on the final classification effect, we respectively utilized the information of attribute, the information of behavior and the combination of the above two to represent the node under the extensive evaluation criteria in five-fold cross validation. As shown in Table 4 and Figure 5, the average of ROC and PR under five-fold cross validation is reported. A variety of evaluation criteria as shown in the table below indicated that the node representation vectors combined with the two kinds of information has more outstanding expressiveness.

Comparison of Different Feature Extraction Methods
As mentioned above, each node in the biomolecular network within the cell can be represented by two kinds of information including attribute information and behavior information. In order to evaluate the impact of each type of information on the final classification effect, we respectively utilized the information of attribute, the information of behavior and the combination of the above two to represent the node under the extensive evaluation criteria in five-fold cross validation. As shown in Table 4 and Figure 5, the average of ROC and PR under five-fold cross validation is reported. A variety of evaluation criteria as shown in the table below indicated that the node representation vectors combined with the two kinds of information has more outstanding expressiveness.

Comparison of Different Feature Extraction Methods
As mentioned above, each node in the biomolecular network within the cell can be represented by two kinds of information including attribute information and behavior information. In order to evaluate the impact of each type of information on the final classification effect, we respectively utilized the information of attribute, the information of behavior and the combination of the above two to represent the node under the extensive evaluation criteria in five-fold cross validation. As shown in Table 4 and Figure 5, the average of ROC and PR under five-fold cross validation is reported. A variety of evaluation criteria as shown in the table below indicated that the node representation vectors combined with the two kinds of information has more outstanding expressiveness.

Comparison of Different Classifiers
In order to evaluate the performance of the classifier, we compared random forest with Adaboost, logistic regression, naïve Bayes, and XGBoost under five cross-validation in various evaluation criteria. Under the control variable method, the various settings of the experiment are the

Comparison of Different Classifiers
In order to evaluate the performance of the classifier, we compared random forest with Adaboost, logistic regression, naïve Bayes, and XGBoost under five cross-validation in various evaluation criteria. Under the control variable method, the various settings of the experiment are the same except the classifier, which makes the comparison of experimental results fairer and more credible. The results are shown in Table 5 and Figure 6. The difference in the effect of different classifiers may be caused by the following factors: (1) Naive Bayes can get better results when the properties of the sample are independent of each other. In this experiment, there are cases where the attributes are not independent and cross-joining together affects the final classification effect. (2) Logistic regression is essentially a linear classifier whose performance is limited by the distribution of the data and did not perform well in this article. (3) The parameters of all classifiers are default values, which may cause Adaboost and XGBoost to have under-fitting or over-fitting on this task. same except the classifier, which makes the comparison of experimental results fairer and more credible. The results are shown in Table 5 and Figure 6. The difference in the effect of different classifiers may be caused by the following factors: (1) Naive Bayes can get better results when the properties of the sample are independent of each other. In this experiment, there are cases where the attributes are not independent and cross-joining together affects the final classification effect. (2) Logistic regression is essentially a linear classifier whose performance is limited by the distribution of the data and did not perform well in this article. (3) The parameters of all classifiers are default values, which may cause Adaboost and XGBoost to have under-fitting or over-fitting on this task.

Additional Comparison Experiment for lncRNA-Disease Association Prediction
In order to compare with traditional methods that focus on single or isolated objects, the lncRNA-disease association prediction was chosen to perform this comparison experiment because of the serious lack of node attribute information. After processing the data, 1263 independent lncRNA-disease association pairs were obtained including 345 different lncRNAs and 295 different diseases. The sequence of each lncRNA was determined when the experimental materials were collected. However, among all diseases associated with lncRNA, only 76 of 295 diseases were able to obtain attribute information by constructing DAG to produce similarity with other diseases. The pairs, which include both two kinds of information only took possession of 259 in 1263 associations.

Additional Comparison Experiment for lncRNA-Disease Association Prediction
In order to compare with traditional methods that focus on single or isolated objects, the lncRNA-disease association prediction was chosen to perform this comparison experiment because of the serious lack of node attribute information. After processing the data, 1263 independent lncRNA-disease association pairs were obtained including 345 different lncRNAs and 295 different diseases. The sequence of each lncRNA was determined when the experimental materials were collected. However, among all diseases associated with lncRNA, only 76 of 295 diseases were able to obtain attribute information by constructing DAG to produce similarity with other diseases. The pairs, which include both two kinds of information only took possession of 259 in 1263 associations. Figure 7a shows results of the link prediction under five-fold cross validation with pure attribute information as the characteristics of the node. It can be treated as a baseline. Figure 7b shows the results of link prediction based on the feature combined attribute feature with previous isolated embedding (behavior) feature. That is, 80% lncRNA-disease associations were utilized to construct the adjacency matrix for generating Gaussian profile kernel similarity in each fold cross validation [34]. Figure 7c show the results of global embedding, that is, 80% of the lncRNA-disease associations and all the other eight kinds of associations were processed by LINE in each cross-validation. Obviously, after combining the global behavior information, the performance of prediction in lncRNA-disease association can be greatly improved. It also proves that the cell is a complete unit of life, and the interaction of biomolecules in the cell together maintains the normal conduct of life activities. Figure 7d show the results of a special graph embedding strategy inspired by Chen et al. [35]. Eight other associations in the whole network besides lncRNA-disease pairs were processed by LINE in each fold cross-validation. LncRNA-disease pair vectors combined attribute feature and behavior feature were sent into random forest classifiers for prediction and evaluation. Inspired by experiment 7d, we did a case study based on lncRNA NONHSAT017462.2. The lncRNADisease database (remove all lncRNA NONHSAT017462.2 association pairs) was used as the training data set to construct the model. The pairs between lncRNA NONHSAT017462.2 and all diseases were predicted by the model. Furthermore, we converted lncRNA NONHSAT017462.2 to its alias h19 and did cross-dataset validation in another dataset MNDR 2.0 [36]. The verification of Top-20 prediction is as follows in Table 6.  Figure 7a shows results of the link prediction under five-fold cross validation with pure attribute information as the characteristics of the node. It can be treated as a baseline. Figure 7b shows the results of link prediction based on the feature combined attribute feature with previous isolated embedding (behavior) feature. That is, 80% lncRNA-disease associations were utilized to construct the adjacency matrix for generating Gaussian profile kernel similarity in each fold cross validation [34]. Figure 7c show the results of global embedding, that is, 80% of the lncRNA-disease associations and all the other eight kinds of associations were processed by LINE in each cross-validation. Obviously, after combining the global behavior information, the performance of prediction in lncRNA-disease association can be greatly improved. It also proves that the cell is a complete unit of life, and the interaction of biomolecules in the cell together maintains the normal conduct of life activities. Figure 7d show the results of a special graph embedding strategy inspired by Chen et al. [35]. Eight other associations in the whole network besides lncRNA-disease pairs were processed by LINE in each fold cross-validation. LncRNA-disease pair vectors combined attribute feature and behavior feature were sent into random forest classifiers for prediction and evaluation. Inspired by experiment 7d, we did a case study based on lncRNA NONHSAT017462.2. The lncRNADisease database (remove all lncRNA NONHSAT017462.2 association pairs) was used as the training data set to construct the model. The pairs between lncRNA NONHSAT017462.2 and all diseases were predicted by the model. Furthermore, we converted lncRNA NONHSAT017462.2 to its alias h19 and did cross-dataset validation in another dataset MNDR 2.0 [36]. The verification of Top-20 prediction is as follows in Table 6.  We tried to analyze the flow of information in the molecular association network (MAN) based on a specific miRNA, lncRNA, and protein, which makes the model more interpretable. Two pairs of lncRNA-miRNA interactions were selected to perform this experiment. Firstly, each lncRNA and miRNA-associated protein was queried and aggregated. Then the obtained proteins were constructed into a PPI network. LncRNA and miRNA were directly connected and therefore were each other's first-order neighbors. The process of a message transmission through the PPI network as an intermediary makes lncRNA and miRNA each other's high-order neighbors, although the intersection of the second-order neighbors was not large, but the third-order neighbors and their associations were very rich. The specific lncRNA-miRNA interactions and associated proteins are shown in the following Table 7 and Figure 8.

Conclusions
Accumulating evidence demonstrated the superiority of link prediction based on massive data through machine learning models, which not only served as an addition to manual experiments, but also provided researchers with an overall and macro insight into the interactions between intracellular molecules. In this article, we proposed a novel framework based on five different kinds of nodes and nine different kinds of relationships to detect any potential associations between arbitrary research objects in the whole network. Each node could be represented as a vector by two kinds of information including node attribute and node behavior. For attribute information, ncRNA and protein could be encoded into 64-dimensional vectors by the method of k-mer, in which each feature represents the normalized frequency of the corresponding 3-mer appearing in the RNA or protein sequence. The characteristics of the disease and the drug could be represented by their own semantic and molecular structure and transformed into 64-dimensional vectors through the function of feature selection and transformation in SAE. For behavior information, the relationships of each node with others could be abstracted by the network embedding method LINE. Combined with the above two kinds of feature, each node could be represented by a 128-dimensional vector and random forest classifier was chosen to carry out prediction tasks. The experimental results showed that our method could achieve outstanding performance. The construction of the molecular association network (MAN) in human cells, offers a new systematic view on understanding complex life activities and diseases.

Conclusions
Accumulating evidence demonstrated the superiority of link prediction based on massive data through machine learning models, which not only served as an addition to manual experiments, but also provided researchers with an overall and macro insight into the interactions between intracellular molecules. In this article, we proposed a novel framework based on five different kinds of nodes and nine different kinds of relationships to detect any potential associations between arbitrary research objects in the whole network. Each node could be represented as a vector by two kinds of information including node attribute and node behavior. For attribute information, ncRNA and protein could be encoded into 64-dimensional vectors by the method of k-mer, in which each feature represents the normalized frequency of the corresponding 3-mer appearing in the RNA or protein sequence. The characteristics of the disease and the drug could be represented by their own semantic and molecular structure and transformed into 64-dimensional vectors through the function of feature selection and transformation in SAE. For behavior information, the relationships of each node with others could be abstracted by the network embedding method LINE. Combined with the above two kinds of feature, each node could be represented by a 128-dimensional vector and random forest classifier was chosen to carry out prediction tasks. The experimental results showed that our method could achieve outstanding performance. The construction of the molecular association network (MAN) in human cells, offers a new systematic view on understanding complex life activities and diseases.

Conflicts of Interest:
The authors declare no conflict of interest.