GCGACNN: A Graph Neural Network and Random Forest for Predicting Microbe–Drug Associations
Biomolecules 2024, 14, 946

The interaction between microbes and drugs encompasses the sourcing of pharmaceutical compounds, microbial drug degradation, the development of drug resistance genes, and the impact of microbial communities on host drug metabolism and immune modulation. These interactions significantly impact drug efficacy and the evolution of drug resistance. In this study, we propose a novel predictive model, termed GCGACNN. We first collected microbe, disease, and drug association data from multiple databases and the relevant literature to construct three association matrices and generate similarity feature matrices using Gaussian similarity functions. These association and similarity feature matrices were then input into a multi-layer Graph Neural Network for feature extraction, followed by a two-dimensional Convolutional Neural Network for feature fusion, ultimately establishing an effective predictive framework. Experimental results demonstrate that GCGACNN outperforms existing methods in predictive performance.


Introduction
Microbes exist either as single-celled organisms or as colonies of cells and comprise bacteria, archaea, fungi, viruses, and protozoa [1]. The constituents of the microbiota (bacteria, viruses, and eukaryotes) have been shown to interact with one another and with the host immune system in ways that influence the development of disease [2]. A variety of microbes exist throughout the human body and play a fundamental role in human health [3]. For example, the microbiota is an essential component of immunity and a functional entity that influences metabolism and modulates drug interactions. Furthermore, ecological dysbiosis, or an imbalance of microbes, may also lead to other diseases in the human host. Microbes are thus important to human health, and many microbes present in the human body can regulate host physiology and disease development [4,5].
A variety of organisms, such as bacteria, fungi, and plants, produce secondary metabolites, also known as natural products. Natural products have been a prolific source of and inspiration for numerous medical agents with widely divergent chemical structures and biological activities, including antimicrobial, immunosuppressive, anticancer, and anti-inflammatory activities, many of which have been developed as treatments and have potential therapeutic applications for human diseases [6]. In recent years, as the variety of drugs investigated by the medical field has increased, microbial drug resistance has become increasingly severe [7]. Previous research in the pharmaceutical industry has involved culturing some microbe species under greenhouse conditions and subsequently using them in drugs [8]. However, traditional wet-lab experiments are costly and time-consuming, creating an urgent need for novel computational approaches to uncover potential relationships between microbes and drugs, thereby contributing to drug development analysis and human disease diagnosis.
In recent years, owing to the rapid advancement of bioinformatics, researchers have constructed a series of databases concerning the associations between microbes and diseases, as well as between microbes and drugs, greatly facilitating the computational analysis of potential relationships between microbes and drugs. For instance, Sun et al. established MDAD [9], a database consisting of 5505 associations between 180 microbes and 1388 drugs. Rajput et al. [10] developed the aBiofilm database, which records microbial resistance to drugs and includes biological, chemical, and structural details of 5027 antimicrobial agents. Andersen et al. [11] curated a dataset named DrugVirus, which includes 1281 associations between 118 compounds and 83 human viruses.
Based on these datasets, the application and development of various learning methods in bioinformatics have progressed rapidly, leading to the emergence of several computational models aimed at inferring potential microbe-drug associations. For instance, Anahtar et al. [12] explored the application of machine learning to the problem of antimicrobial resistance. However, the lack of high-quality training datasets resulted in suboptimal machine learning performance. Zhu et al. [13] proposed the HMDAKATZ method, which constructs a heterogeneous microbe-drug network and then employs the Katz measure to calculate the correlation of nodes in this network. However, this method relies on a simple measure that fails to fully reflect similarity representations, which affects the accuracy of microbe-drug association (MDA) prediction. To include more node and edge information, Long et al. [14,15] introduced computational methods named GCNMDA and EGATMDA. GCNMDA is based on graph convolutional networks and conditional random fields with an attention mechanism to detect potential microbe-drug associations. The model primarily focuses on the first-order neighbors of nodes and neglects the importance of higher-order neighbors, which limits its performance. Deng et al. [16] designed a method called Graph2MDA, which predicts potential microbe-drug associations by constructing multimodal attribute graphs as input to a variational graph autoencoder to learn information from each node and the entire graph. Ma et al. [17] proposed a computational method named GACNNMDA for predicting associations between microbes and drugs. This method constructs two feature matrices and two heterogeneous microbe-drug networks, combining graph attention networks (GAT) and a convolutional neural network (CNN)-based classifier to predict potential microbe-drug associations. However, both of these methods have limitations in integrating multimodal attribute features.
Although previous methods have made progress in predicting microbe-drug associations, they still have limitations, such as neglecting higher-order neighbor information, inadequate handling of outlier nodes, and insufficient integration of multimodal features. To improve prediction accuracy and model performance, this study proposes GCGACNN, a model combining deep learning and machine learning. This model integrates deep learning techniques, including Graph Convolutional Networks (GCN), Graph Attention Networks (GAT), and two-dimensional Convolutional Neural Networks (CNN), with the Random Forest algorithm. It considers microbes, drugs, and diseases jointly to enhance the prediction of microbe-drug resistance associations. Specifically, we construct association and similarity matrices, extract features using deep graph convolutional networks, embed these features through graph attention mechanisms, integrate low-dimensional features using two-dimensional convolutional neural networks, and finally apply the Random Forest algorithm for prediction. This approach combines the feature extraction capabilities of deep learning with the classification power of machine learning to improve prediction performance.

Data Sources
MDAD is a comprehensive database that integrates associations between microorganisms and drugs, encompassing 5505 validated records of interactions between 180 microorganisms and 1388 drugs. The microorganisms in this database include a variety of entities, such as bacteria and fungi, with primary information concentrated at the species level while also providing detailed information on specific strains. The drug-related information includes various chemical compounds and their effects on microorganisms, covering both direct antimicrobial agents and their targets. The research data in MDAD cover both acute and chronic bacterial infections, including classic antibiotic treatments as well as non-antibiotic drug applications.
In addition, we retrieved known associations among microorganisms, drugs, and diseases from the dataset compiled by Wang et al. [18]. This dataset includes 70,315 drug-disease associations and 15,633 microbe-disease associations. By filtering the disease data related to the drugs and microorganisms in MDAD, we ultimately obtained 1121 unique drug-disease associations involving 233 drugs and 109 diseases, and 402 distinct microbe-disease associations involving 73 microorganisms and 109 diseases.
Finally, we collected 2470 validated microbe-drug association records from a dataset compiled by Ma et al. [17], which includes 1373 drugs and 173 microorganisms. In constructing our dataset, we focused on the species level of microorganisms. Additionally, the drugs we studied include not only single chemical substances but also their potential mechanisms of action or drug combinations. The data encompass both acute and chronic forms of bacterial infections, thus covering the complex interactions between microorganisms, drugs, and diseases. Our model provides predictions for microbe-drug associations at the species level. Future work will incorporate more detailed genetic information to further enhance the model's predictive capabilities. Detailed information about these data is provided in Table 1.

Overview
As illustrated in Figure 1, the GCGACNN model primarily comprises three components. Part A constructs the microbe-drug, microbe-disease, and drug-disease association matrices from the downloaded data on microbes, drugs, and diseases. Three feature matrices are then built from these three association matrices.
Part B uses the feature matrices obtained in Part A as inputs to the network layers. Graph convolutional layers and graph attention layers are employed to extract feature representations from the various modalities. A two-dimensional convolutional neural network is then used to fuse the features and obtain effective representations.
Part C introduces a random forest-based classifier that uses the learned embeddings to predict scores for microbe-drug associations.

Construct Association Matrices
Given a downloaded dataset encompassing $m$ microbes, $n$ drugs, $d$ diseases, and their interconnections, our objective is to predict novel microbe-drug resistance associations by leveraging known associations among microbes, drugs, and diseases together with their respective similarity characteristics. First, based on the known microbe-drug interactions, we construct a microbe-drug association matrix $A_1 \in \mathbb{R}^{m \times n}$. The construction rule is as follows: for any microbe $m_i$ and drug $n_j$, $A_1(i, j) = 1$ if a known interaction exists between them, and $A_1(i, j) = 0$ otherwise. Next, based on the known microbe-disease interactions, we construct a microbe-disease association matrix $A_2 \in \mathbb{R}^{m \times d}$ using the same rule, where $A_2(i, j) = 1$ if there is a known interaction and $A_2(i, j) = 0$ otherwise. Finally, based on the known drug-disease interactions, we construct a drug-disease association matrix $A_3 \in \mathbb{R}^{n \times d}$ in the same way, where $A_3(i, j) = 1$ if there is a known interaction and $A_3(i, j) = 0$ otherwise.
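The construction rule can be sketched in a few lines of Python. The function name `build_association_matrix` and the toy index pairs are illustrative assumptions and are not taken from the original pipeline.

```python
def build_association_matrix(pairs, n_rows, n_cols):
    """Return a binary n_rows x n_cols matrix with 1 at every known (i, j) pair."""
    matrix = [[0] * n_cols for _ in range(n_rows)]
    for i, j in pairs:
        matrix[i][j] = 1  # known association; every other entry stays 0
    return matrix


# Toy example with hypothetical indices: 3 microbes, 4 drugs, 3 known associations.
known_microbe_drug = [(0, 1), (1, 3), (2, 0)]
A1 = build_association_matrix(known_microbe_drug, n_rows=3, n_cols=4)
```

The same helper would serve for the microbe-disease and drug-disease matrices by changing the pair list and the dimensions.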

Similarity Calculation
We derive features for microbes, drugs, and diseases from the association matrices among these entities. Given the high sparsity of microbiome data, we employ the Gaussian Interaction Profile (GIP) kernel function [19] to calculate Gaussian kernel similarities and uncover more valuable similarity information. The GIP kernel has been successfully applied to compute topological similarity between nodes; its core idea is that similar microbes (or drugs) interact in similar ways and therefore have comparable interaction profiles. Specifically, in the microbe-drug association matrix, we posit that functionally similar microbes exhibit analogous patterns of drug resistance. For two microbes $m_i$ and $m_j$, two Gaussian similarity matrices, $G^1_m$ and $G^2_m$, are calculated from the $i$-th and $j$-th rows of the association matrices $A_1$ and $A_2$:

$$G^1_m(i, j) = \exp\left(-\alpha \left\| A_1(i, :) - A_1(j, :) \right\|^2\right), \qquad \alpha = \left( \frac{1}{m} \sum_{k=1}^{m} \left\| A_1(k, :) \right\|^2 \right)^{-1}$$

$$G^2_m(i, j) = \exp\left(-\alpha \left\| A_2(i, :) - A_2(j, :) \right\|^2\right), \qquad \alpha = \left( \frac{1}{m} \sum_{k=1}^{m} \left\| A_2(k, :) \right\|^2 \right)^{-1}$$

Here, $A_1 \in \mathbb{R}^{m \times n}$ denotes the known microbe-drug associations and $A_2 \in \mathbb{R}^{m \times d}$ the known microbe-disease associations. $\|X\|$ is the Euclidean norm of $X$, and $m$, $n$, and $d$ denote the numbers of microbes, drugs, and diseases in the network, respectively. $G^1_m(i, j)$ is the similarity between two microbes based on their drug associations, and $G^2_m(i, j)$ is their similarity based on disease associations; $\alpha$ is the kernel bandwidth parameter, computed separately for each kernel. The interaction profile vectors $A_1(k, :)$ and $A_2(k, :)$ are the $k$-th rows of $A_1$ and $A_2$, that is, the drug and disease association profiles of the $k$-th microbe.
Similarly, the similarity between two drugs $n_i$ and $n_j$ is computed from the columns of matrix $A_1$ and the rows of matrix $A_3$:

$$G^1_n(i, j) = \exp\left(-\alpha \left\| A_1(:, i) - A_1(:, j) \right\|^2\right), \qquad G^2_n(i, j) = \exp\left(-\alpha \left\| A_3(i, :) - A_3(j, :) \right\|^2\right)$$

where $A_3 \in \mathbb{R}^{n \times d}$ denotes the known drug-disease associations. $G^1_n$ is the similarity between two drugs based on their microbial associations, computed with a GIP kernel with bandwidth parameter $\alpha$, and $G^2_n$ is the similarity between two drugs based on their associations with diseases. The interaction profile vectors $A_1(:, k)$ and $A_3(k, :)$ over which the bandwidth is normalized are the $k$-th column of $A_1$ and the $k$-th row of $A_3$, that is, the microbe and disease association profiles of the $k$-th drug.
Subsequently, the similarity between two diseases $d_i$ and $d_j$ is computed from the columns of matrix $A_2$ and the columns of matrix $A_3$:

$$G^1_d(i, j) = \exp\left(-\alpha \left\| A_2(:, i) - A_2(:, j) \right\|^2\right), \qquad G^2_d(i, j) = \exp\left(-\alpha \left\| A_3(:, i) - A_3(:, j) \right\|^2\right)$$

where $G^1_d(i, j)$ is the GIP similarity between the two diseases based on their microbe associations and $G^2_d(i, j)$ is their similarity based on drug associations. The corresponding interaction profile vectors $A_2(:, k)$ and $A_3(:, k)$ are the $k$-th columns of $A_2$ and $A_3$, that is, the microbe and drug association profiles of the $k$-th disease.
Finally, we obtained similarity features $G_m \in \mathbb{R}^{173 \times 173}$ for microbes, $G_n \in \mathbb{R}^{1373 \times 1373}$ for drugs, and $G_d \in \mathbb{R}^{109 \times 109}$ for diseases. In calculating these similarity features, we addressed the high sparsity and compositional nature of microbiome data. Specifically, the Gaussian similarity matrix captures nonlinear relationships among microbes, drugs, and diseases, extracting latent similarity information. The Gaussian kernel function measures distances between entities and maps them into a higher-dimensional space, overcoming challenges posed by data sparsity and revealing meaningful similarity patterns. Our approach takes into account the compositional nature of microbiome data, ensuring the effectiveness of the proposed Gaussian similarity matrices in this complex data context.
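As a minimal sketch of the GIP kernel described in this section, the following Python function computes pairwise similarities between interaction profiles. It assumes the common normalization in which the bandwidth $\alpha$ is the inverse of the mean squared norm of the profiles; the function names and the toy matrix are illustrative and do not come from the original implementation.

```python
import math


def gip_similarity(profiles):
    """Gaussian Interaction Profile kernel between the rows of `profiles`."""
    n = len(profiles)
    # Bandwidth: inverse of the mean squared profile norm (assumed normalization).
    mean_sq_norm = sum(sum(x * x for x in row) for row in profiles) / n
    if mean_sq_norm == 0:
        raise ValueError("all interaction profiles are empty")
    alpha = 1.0 / mean_sq_norm
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            sim[i][j] = math.exp(-alpha * sq_dist)
    return sim


def transpose(matrix):
    """Swap rows and columns so that column profiles can be passed as rows."""
    return [list(col) for col in zip(*matrix)]


# Toy microbe-drug association matrix (3 microbes x 3 drugs, hypothetical values).
A1 = [[1, 0, 1],
      [1, 0, 0],
      [0, 1, 0]]
G1_m = gip_similarity(A1)             # microbe similarity from the rows of A1
G1_n = gip_similarity(transpose(A1))  # drug similarity from the columns of A1
```

Row profiles give the microbe kernel and column profiles give the drug kernel, so one function covers all six similarity matrices once the appropriate association matrix (or its transpose) is supplied.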

Embedding Learning
We designed a deep learning module that combines Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT) to learn embeddings for microbes, drugs, and diseases. GCN performs convolution operations on the features of nodes and their neighbors to aggregate local information and generate context-rich node embeddings. It uses normalized adjacency matrices to propagate graph structural information and progressively extracts global features from local ones, providing comprehensive feature representations for subsequent tasks. Formally, given an undirected graph $G$ with a node feature matrix $X$ and an adjacency matrix $S$, the graph convolutional layer updates the node embeddings according to the following rule:

$$H^1 = \sigma\left( \tilde{A}^{-1/2} \, \tilde{S} \, \tilde{A}^{-1/2} X W \right)$$

where $\tilde{S} = I + S$, $I$ represents the identity matrix, $\tilde{A}$ represents the degree matrix of $\tilde{S}$, $W$ represents a trainable weight matrix, and $\sigma$ is a nonlinear activation function. Processing through this layer yields $H^1_m$, $H^1_n$, and $H^1_d$.
To better extract features for microbes, drugs, and diseases, we incorporate a Graph Attention Network layer into each module. GAT mitigates feature over-smoothing by applying attention weights to neighbor node features, and alleviates information over-squashing through its self-attention mechanism. First, for any node $i$ in $H^1_v$ ($v = m, n, d$) with feature vector $h_i$, the similarity coefficient with each adjacent node $j$ is computed as

$$e_{ij} = a\left(W h_i, W h_j\right)$$

where $W$ is a trainable weight matrix and $a$ is a projection. The attention score $\alpha_{ij}$ between node $i$ and node $j$ is then obtained by normalizing $e_{ij}$ over the neighborhood $N_i$ of node $i$:

$$\alpha_{ij} = \mathrm{softmax}_j\left(e_{ij}\right) = \frac{\exp\left(e_{ij}\right)}{\sum_{k \in N_i} \exp\left(e_{ik}\right)}$$

In the fully expanded form of $\alpha_{ij}$, $a^{T}$ is a weight vector, $\|$ is the concatenation operation, and $W'_v$ represents the weight of the edge $S_{ij}$. Based on these scores, the output feature $H'$ is obtained by aggregating the attention-weighted features of neighboring nodes, and $H'$ is further computed with a multi-head attention mechanism. This layer produces three distinct feature embeddings, namely $H^2_m$, $H^2_n$, and $H^2_d$.
Next, we apply a further GCN layer for more advanced extraction of node features. At this stage, the node features have already been processed by the GAT layer and incorporate rich local interaction information. This stage aims to further enhance the aggregation of node features and integrate more comprehensive neighborhood information.
Finally, a two-dimensional convolutional neural network fuses the features of the above feature matrices to obtain the final effective representation.
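To make the two layer types concrete, the following pure-Python sketch implements one graph convolution with symmetric normalization and one single-head attention step. It follows the standard GCN and GAT formulations rather than the exact GCGACNN layers: the ReLU activation, the LeakyReLU slope of 0.2, the inclusion of each node in its own neighborhood, and all names and toy values are assumptions, and the edge weights and multi-head mechanism mentioned in the text are not modelled.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(S, X, W):
    """One graph convolution: ReLU(D^-1/2 (I + S) D^-1/2 X W)."""
    n = len(S)
    S_tilde = [[S[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in S_tilde]  # degrees of the graph with self-loops
    norm = [[S_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    out = matmul(matmul(norm, X), W)
    return [[max(0.0, v) for v in row] for row in out]


def gat_attention(S, H, W, a, slope=0.2):
    """Single-head attention scores alpha[i][j] over the neighbours of each node."""
    n = len(S)
    Z = matmul(H, W)  # projected node features W h_i
    alpha = [[0.0] * n for _ in range(n)]
    for i in range(n):
        nbrs = [j for j in range(n) if S[i][j] > 0 or j == i]
        scores = []
        for j in nbrs:
            e = sum(w * z for w, z in zip(a, Z[i] + Z[j]))  # a^T [z_i || z_j]
            scores.append(e if e > 0 else slope * e)        # LeakyReLU
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]          # numerically stable softmax
        total = sum(exps)
        for j, ex in zip(nbrs, exps):
            alpha[i][j] = ex / total
    return alpha


# Toy path graph with 3 nodes and 2-dimensional features (hypothetical values).
S = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[0.5, -0.5], [0.25, 0.75]]
H1 = gcn_layer(S, X, W)
alpha = gat_attention(S, H1, W=[[1.0, 0.0], [0.0, 1.0]], a=[0.1, 0.2, 0.3, 0.4])
```

Each row of `alpha` sums to one over the node's neighborhood, so non-neighbors contribute nothing when the projected features are aggregated.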

Predicting Microbe-Drug Associations
After obtaining the feature embeddings $H_m$ for microbes and $H_n$ for drugs, these embeddings are used to generate the predicted score matrix, whose values represent the likelihood of the corresponding microbe-drug associations. We trained the model using the binary cross-entropy (BCE) loss function. Random forests have demonstrated strong performance in binary relationship prediction, so we use a random forest as the classifier to predict associations between microbes and drugs. For a new fused feature $X$, each decision tree $T_i$ provides a classification prediction $H_i(X)$.
The final classification result of the random forest is determined by majority voting over all decision trees. Assuming there are $K$ decision trees, the final predicted category $\hat{y}$ is given by

$$\hat{y} = \arg\max_{c} \sum_{i=1}^{K} \mathbb{1}\left[ H_i(X) = c \right]$$

where $\mathbb{1}[\cdot]$ denotes the indicator function, which equals 1 if the condition is satisfied and 0 otherwise, and $\arg\max_c$ selects the class $c$ that maximizes the sum of the indicator functions.
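The majority-voting rule can be illustrated with a short Python function. The function name, the tie-breaking rule, and the example votes are assumptions made for this sketch.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    counts = Counter(tree_predictions)
    best = max(counts.values())
    # Ties are broken in favour of the smallest class label (an assumption).
    return min(label for label, count in counts.items() if count == best)


# Hypothetical votes H_i(X) from K = 5 decision trees for one microbe-drug pair.
votes = [1, 0, 1, 1, 0]
y_hat = majority_vote(votes)
```

Note that some library implementations, such as scikit-learn's `RandomForestClassifier`, average the per-tree class probabilities instead of counting hard votes, so results may differ slightly from this rule.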

Model Training and Validation
To demonstrate the practical application of GCGACNN in predicting microbe-drug relationships, we trained the model using real-world data. The training process involved the following steps:
1. Data Preprocessing: We preprocessed the data by normalizing feature values and handling missing data through imputation techniques.
2. Feature Extraction: Using Gaussian kernel functions, we calculated similarity matrices for microbes, drugs, and diseases.
3. Model Training: We trained the GCGACNN model using the preprocessed data and extracted features. The model parameters were optimized using a grid search approach to find the best hyperparameters.
4. Validation: The model was validated using a five-fold cross-validation approach, and performance metrics such as AUC, AUPR, Accuracy, and F1-score were calculated to evaluate the model's predictive performance.
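The five-fold validation step could be organized as in the sketch below. Stratifying the folds by label, so that each fold holds roughly one fifth of the positive and one fifth of the negative pairs, is an assumption of this example, as are the function name, the fixed seed, and the toy labels.

```python
import random


def stratified_five_fold(labels, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) so each fold holds about 1/n_folds of each class."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)  # deal samples round-robin into folds
    for k in range(n_folds):
        test = sorted(folds[k])
        train = sorted(i for f, fold in enumerate(folds) if f != k for i in fold)
        yield train, test


# Toy labels: 10 known associations (1) and 20 unvalidated pairs (0).
labels = [1] * 10 + [0] * 20
splits = list(stratified_five_fold(labels))
```

Each pair appears in exactly one test fold, so averaging a metric over the five folds uses every labelled pair once for evaluation.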

Comparison with State-of-the-Art Methods
To validate the prediction performance of GCGACNN, we compared it with five existing microbe-drug association prediction methods: HMDAKATZ, GCNMDA, EGATMDA, Graph2MDA, and GACNNMDA. During the experiments, we employed a control methodology similar to that of GACNNMDA, in which the original parameters of all methods were kept fixed and all methods were run on MDAD. These methods were evaluated using a five-fold cross-validation framework. Specifically, 20% of the known associations and 20% of the unvalidated potential associations were randomly selected as the test set, while the remaining 80% of the known and unvalidated potential associations constituted the training set. Performance assessment was based on metrics including AUC, AUPR, Accuracy, and F1-score.
TP represents the number of samples correctly predicted as positive by the model, TN represents the number of samples correctly predicted as negative by the model, FP represents the number of negative class samples mistakenly predicted as positive by the model, and FN represents the number of positive class samples mistakenly predicted as negative by the model. TPR, the True Positive Rate (also called Recall), denotes the proportion of positive class samples correctly predicted as positive by the model. FPR represents the proportion of negative class samples mistakenly predicted as positive. Precision signifies the proportion of samples predicted to be positive by the model that are actually positive. Accuracy indicates the proportion of samples correctly classified among the total samples. The F1-score is an indicator that jointly considers Precision and Recall.
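The threshold-based metrics defined above follow directly from the four confusion-matrix counts, as the following sketch shows. The function name and the example counts are hypothetical, the F1-score is taken as the usual harmonic mean of Precision and Recall, and all denominators are assumed to be non-zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute TPR, FPR, Precision, Accuracy, and F1 from confusion-matrix counts."""
    tpr = tp / (tp + fn)                      # recall / true positive rate
    fpr = fp / (fp + tn)                      # false positive rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * tpr / (precision + tpr)
    return {"TPR": tpr, "FPR": fpr, "Precision": precision,
            "Accuracy": accuracy, "F1": f1}


# Hypothetical counts for one test fold.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

AUC and AUPR are not single-threshold quantities: they summarize the (FPR, TPR) and (Recall, Precision) curves obtained by sweeping the decision threshold over the predicted scores.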
According to Table 2, the GCGACNN model performs strongly, with an AUC of 0.9853 ± 0.0026, exceeding the second-highest AUC of 0.9777 ± 0.0109 (GACNNMDA) by 0.76 percentage points. Its AUPR is 0.9860 ± 0.0028, which is 4.8 percentage points higher than the second-highest AUPR of 0.9380 ± 0.0098 (Graph2MDA). Its F1-score is 0.9385, higher than the F1 values of the other models. In terms of accuracy, however, GCGACNN does not surpass these models. Nevertheless, GCGACNN can be considered a potential tool for predicting microbe-drug resistance associations.

Hyperparameter Sensitivity Analysis
In this section, we investigated the sensitivity of GCGACNN to its hyperparameters, including the learning rate (lr) used during model training, the number of layers in GCN, the number of attention heads in GAT, the output channel sizes in the convolutional layers, and the feature dimensions for embedding. The overall results are illustrated in Figure 2. During tuning, we tested the learning rate within the range {0.0001, 0.001, 0.005, 0.01}. The model achieved optimal performance when lr was set to 0.0001, as shown in Table 3. For the convolutional layer's output channel size, we conducted experiments within the range {64, 128, 256, 512}. The model exhibited optimal performance when the channel size was set to 128, as shown in Table 4. To explore the impact of the feature embedding dimension, we varied it within the range {64, 128, 256, 512}. The model achieved its best performance when the feature embedding dimension was set to 256, as shown in Table 5. Finally, we examined the effect of the number of attention heads within the range {2, 3, 4, 5}. The model performed optimally when the number of attention heads was set to 4, as shown in Table 6.
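The ranges reported in this section can be enumerated with a small exhaustive search, in line with the grid search mentioned for model training. In this sketch the `evaluate` callable is a hypothetical stand-in for whatever routine trains the model with the given settings and returns a validation score (for example, mean AUC over the five folds); the dictionary keys are illustrative names.

```python
from itertools import product


def grid_search(grid, evaluate):
    """Return (best_params, best_score) over every combination in `grid`."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)  # user-supplied scoring routine (assumed)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Candidate values taken from the ranges reported in this section.
GRID = {
    "lr": [0.0001, 0.001, 0.005, 0.01],
    "out_channels": [64, 128, 256, 512],
    "embed_dim": [64, 128, 256, 512],
    "heads": [2, 3, 4, 5],
}
```

The full grid contains 4 × 4 × 4 × 4 = 256 combinations, so varying one hyperparameter at a time while fixing the others, as in the sensitivity analysis, is considerably cheaper than an exhaustive search.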

Ablation Study
In this section, we explore several variants of GCGACNN to assess the significance of different components within the model. The detailed results are shown in Figure 3.
GCGACNN with single-layer GCN and GAT: We employ a single layer of GCN along with GAT for feature extraction, then concatenate the two features and pass them into convolutional layers for processing.
GCGACNN without GAT: In this experiment, we omitted the GAT and used two layers of GCN for feature extraction. The two extracted features were then concatenated and fed into convolutional layers for processing.
GCGACNN without GCN: We use GAT alone for feature extraction and pass its output directly into convolutional layers for processing, without concatenation.
The results indicate that using GCN alone (the variant without GAT) outperforms the single-layer combination of GCN and GAT, while the full GCGACNN achieves the best results. This suggests that a reasonable combination of GCN and GAT can achieve better performance. Specifically, GCN aggregates local information from neighboring nodes and captures global structural features between nodes. In contrast, GAT assigns different weights to neighboring nodes through a self-attention mechanism, thereby more accurately capturing complex interactions between nodes.

Discussion
This study further investigated the profound potential impact of microbe-drug interactions on human health. By combining deep learning and machine learning techniques, the GCGACNN model significantly outperformed some existing methods in terms of predictive performance. These findings are consistent with previous studies, highlighting the importance of advanced computational methods for understanding complex biological interactions. However, it is worth noting that although we used published databases of microbial, drug, and disease associations, these databases may not fully reflect the biological context. Therefore, the predictions made by the proposed deep learning method should be interpreted with caution. Future research should aim to further validate and expand these datasets to ensure the reliability and validity of the predictions.
Moreover, while the GCGACNN model demonstrates impressive predictive capabilities, there are still certain limitations in practical applications. Future work should focus on developing specific application guidelines and tools to help users effectively utilize this method in real-world scenarios. Through these efforts, we can better advance the field and provide stronger support for personalized medicine.
To date, numerous studies have provided substantial evidence for the profound potential impact of microbe-drug interactions on human health. Traditional culture-based methodologies have inherent limitations, particularly when confronted with intricate microbial assemblages. The advent of diverse computational methodologies has provided a more comprehensive means of understanding these interactions.
Although numerous challenges remain, development in this domain is accelerating rapidly. By combining experimentally validated microbe-drug associations with advanced computational techniques, we expect to expedite the drug development process while providing improved support for personalized medicine.

Conclusions
In this study, we proposed a computational model named GCGACNN for predicting potential associations between microbes and drugs. The GCGACNN model uses graph convolutional networks to learn latent representations of microbes and drugs and then obtains attention representations through graph attention networks. These integrated representations are processed by a two-dimensional convolutional neural network, and the final predictions are made using a random forest classifier. The main contributions of this study include:
1. Introducing known microbe-disease-drug interaction relationships in experiments and calculating their Gaussian similarity matrices.
2. Employing a stacked structure composed of multiple network layers to extract effective representations from the input similarity matrices.
3. Demonstrating the superior performance of the GCGACNN model compared to existing advanced methods in predicting potential microbe-drug associations.
Despite achieving satisfactory predictive performance, the model has limitations due to the sparse data structure. Future research should consider incorporating additional biological data to enrich the input features and construct multidimensional network data, aiming to enhance the predictive performance and generalizability of computational models.

Figure 1. Flowchart of the GCGACNN. Part A shows data retrieval from MDAD and aBiofilm databases, constructing graph structures based on known microbe, drug, and disease associations, and evaluating node similarity. Part B demonstrates using GCN and GAT to extract multi-level node features and generate representations. Part C depicts integrating features via a 2D-CNN and predicting associations with a random forest algorithm.

Figure 2. Influence of Different Hyperparameters on Model.

Figure 3. Results of the ablation experiments.

Table 1. Details of our downloaded data.

Table 2. Results of various models. Bold indicates the optimal results.

Table 3. Metrics obtained with different learning rates. Bold indicates the optimal results.

Table 4. Metrics obtained with different output channel numbers. Bold indicates the optimal results.

Table 5. Metrics derived from different feature embedding dimensions. Bold indicates the optimal results.

Table 6. Metrics derived from different numbers of attention heads. Bold indicates the optimal results.