Drug–Target Affinity Prediction Based on Cross-Modal Fusion of Text and Graph

Yang, Jucheng; Ren, Fushun

doi:10.3390/app15062901

Open AccessArticle

Drug–Target Affinity Prediction Based on Cross-Modal Fusion of Text and Graph

by

Jucheng Yang

^*

and

Fushun Ren

College of Artificial Intelligence, Tianjin University of Science and Technology, Tianjin 300453, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(6), 2901; https://doi.org/10.3390/app15062901

Submission received: 20 January 2025 / Revised: 1 March 2025 / Accepted: 6 March 2025 / Published: 7 March 2025

Download

Browse Figures

Versions Notes

Abstract

Drug–target affinity (DTA) prediction is a critical step in virtual screening and significantly accelerates drug development. However, existing deep learning-based methods relying on single-modal representations (e.g., text or graphs) struggle to fully capture the complex interactions between drugs and targets. This study proposes CM-DTA, a cross-modal feature fusion model that integrates drug textual representations and molecular graphs with target protein amino acid sequences and structural graphs, enhancing feature diversity and expressiveness. The model employs the multi-perceptive neighborhood self-attention aggregation strategy to capture first- and second-order neighborhood information, overcoming limitations in graph isomorphism networks (GIN) for structural representation. The experimental results on the Davis and KIBA datasets show that CM-DTA significantly improves the performance of drug–target affinity prediction, achieving higher accuracy and better prediction metrics compared to state-of-the-art (SOTA) models.

Keywords:

drug–target affinity; deep learning; cross-modal; neural network

1. Introduction

Currently, the development of a new drug takes nearly 20 years to obtain approval from the U.S. Food and Drug Administration (FDA) and costs approximately USD 260 million. In the field of drug–target affinity prediction, various wet lab methods, including Surface Plasmon Resonance (SPR [1]), Isothermal Titration Calorimetry (ITC [2]), and Enzyme-Linked Immunosorbent Assay (ELISA [3]), have been widely used. However, the high initial investment costs and complex technical requirements have been persistent barriers to technological advancement in this field.

Drug–target affinity prediction methods can currently be categorized into three types: those based on traditional experimental methods, those based on computational methods, and those based on deep learning. Figure 1 provides an overview of the representative methods. With the development of computer technology, data-driven in silico methods have achieved significant success across various drug discovery fields and have clearly become an important tool [4]. For instance, In silico Medicine, a leading biotech company, has successfully applied deep learning models in the field of drug discovery. Its designed drug, INS018_055, advanced to clinical trials in just 18 months, significantly reducing the timeline by nearly three to four years compared to the traditional drug discovery process, which typically takes about five years. Against the backdrop of deep learning and big data, computer-based drug–target affinity prediction methods have effectively improved the efficiency of financial and human resource utilization.

Since the introduction of the SimBoost model by He et al. [5] in 2017, the concept of drug–target affinity (DTA) prediction has garnered significant attention. SimBoost demonstrated outstanding performance in computing prediction intervals and assessing affinity confidence. However, its limitation lies in the underutilization of deep learning’s potential. With technological advancements, Öztürk et al. [6] introduced DeepDTA in 2018, the first deep learning model combining convolutional neural networks (CNNs) and fully connected neural networks for DTA prediction. Although DeepDTA exhibited high configurability and predictive performance across multiple datasets, it lacked a deep exploration of structural features. To address this issue, researchers proposed WideDTA [7], which significantly enriched the original input data by incorporating protein domain features, motifs, and maximum common substructure word features. However, WideDTA’s design focused solely on intra-textual modality fusion, employing a simple weighted summation of extracted textual features. This approach failed to explore or effectively capture the mutual guidance between drug and target textual features and other representational modalities, limiting its performance in DTA prediction. In 2019, Zhao et al. [8] introduced AttentionDTA, pioneering the use of attention mechanisms for efficient feature interaction. While this model improved prediction accuracy by emphasizing relationships between important features, its computational complexity posed a significant challenge when handling large-scale datasets. In 2020, Nguyen et al. [9] advanced the field with GraphDTA, which shifted from traditional drug textual sequence modeling to extracting information from molecular graphs. This innovation reduced the mean squared error (MSE) from 0.282 to 0.254. Despite its success in improving prediction performance, GraphDTA’s aggregation strategy remained limited to first-order neighborhood information, failing to fully exploit higher-order neighborhood features. Consequently, the model struggled to capture complex molecular structural relationships. Building on the success of GraphDTA, Jiang et al. [10] proposed DGraphDTA, which replaced amino acid sequences with protein contact graphs and developed a dual-branch graph-based model, further enhancing representational capabilities. Despite the significant progress achieved by DGraphDTA and subsequent models like SAM-DTA [11], particularly with SAM-DTA reducing the MSE to 0.229 on the same dataset, these models still predominantly relied on single-modality modeling. They failed to effectively integrate the advantages of textual and graphical modalities, limiting their potential in DTA prediction.

In summary, previous deep learning-based methods for drug–target affinity prediction have relied either on textual sequences of drugs and proteins (such as the Simplified Molecular Input Line Entry System (SMILES [12]) for drugs and amino acid sequences for proteins) or on simple graph structures (where nodes represent atoms and amino acid residues, and edges represent chemical bonds and spatial proximities). However, these methods are limited in their ability to capture the intricate molecular interactions and complex structural dependencies that influence drug–target binding. Therefore, integrating various data types through cross-modal fusion methods to capture multidimensional information and provide a more comprehensive perspective will offer new possibilities for drug–target affinity prediction research.

To address the above issues, this study proposes a deep learning model, CM-DTA, based on the cross-modal fusion of text and graphs for drug–target affinity prediction. The model consists of five key components: the drug and target representation module, the text modality processing module, the graph modality processing module, the feature fusion module, and the prediction module. First, in the drug and target representation module, the input drug and target data are preprocessed to generate four types of input data: the drug’s SMILES expressions, molecular graphs (generated using the open-source software RDKit (2024.09.6) [13]), amino acid sequences of proteins (as the specific components of targets in this study), and protein graphs (predicted using the Pconsc4 program [14]). By introducing these four types of input data, this study addresses the issue of traditional methods focusing on a single modality and neglecting the interactions between multiple modalities. Secondly, two feature extraction modules are constructed, one for text data and one for graph data. The text modality is processed using a module based on Gated Recurrent Units (GRU), while the graph modality is handled using a Graph Isomorphism Network (GIN) [15] with a multi-perceptive neighborhood self-attention aggregation strategy. This allows for a deeper extraction of features from both the text and graph modalities, solving the issue of traditional methods’ inability to adequately capture graph modality feature information. The features extracted by these two modules better support the subsequent fusion of drug and target features. In the feature fusion stage, a cross-modal bidirectional adaptive guided fusion module is designed. This module integrates the drug’s text feature vector with the graph feature vector, as well as the target’s text feature vector with the graph feature vector, combining them pairwise to generate two fused feature vectors, each weighted and guided by different modalities. This strategy addresses the problem of insufficient information transfer in cross-modal feature fusion, allowing the model to more accurately capture key information between modalities. Finally, in the prediction module, an explicit prediction strategy based on multi-head collaborative attention is applied. This module performs interactions between the two fused feature vectors of the drug and the target, concatenates them, and inputs the result into a multi-layer, fully connected neural network to complete the affinity prediction. This method effectively addresses the limitation of traditional methods that rely only on simple feature concatenation, thereby enhancing the modeling of interactions between drug and target features, significantly improving prediction performance and biological interpretability.The main contributions of this study are as follows:

Design of a cross-modal fusion model for drug-target affinity prediction: A novel prediction model is developed based on the fusion of textual and graphical features. This study is the first in the field of affinity prediction to integrate textual and graph modalities for enhanced input feature representation.
Proposed multi-perceptive neighborhood self-attention aggregation strategy: Applied to the GIN, this strategy extends the traditional fixed weight allocation for first-order neighborhood aggregation to simultaneously capture both first-order and second-order neighborhood information. It dynamically adjusts aggregation weights based on neighboring node features, thereby enhancing the model’s structural perception capabilities.
Proposed cross-modal bidirectional adaptive guided fusion strategy: This strategy is implemented in the feature fusion module to calculate interaction weights by guiding attention between the two modalities. This enables textual and graphical features to focus on each other’s relevant information, achieving efficient cross-modal information fusion.
Proposed explicit prediction strategy based on multi-head collaborative attention: This strategy is applied in the prediction module, leveraging similarity matrices and parallel subspaces of multi-head attention to enable deep interaction between drug and protein features. This approach not only enhances predictive performance but also improves biological interpretability by making the modeling process more intuitive.

2. Methods

This section presents the five key components of CM-DTA: the drug and target representation module, the text modal processing module, the graph modal processing module, the feature fusion module, and the prediction module. The overall structure of the model is shown in Figure 2. The drug and target representation module corresponds to data preprocessing, which is detailed in the experimental section. For the text modal processing modules of drugs and proteins, this study uses the classical GRU for processing, with more details available in the literature [16], so they will not be elaborated here. The graph modal processing module, feature fusion module, and prediction module have been innovatively improved in this study, and the details will be described sequentially below.

2.1. Multi-Perceptive Neighborhood Self-Attention Aggregation Strategy

2.1.1. Method Overview

Considering the complex chemical properties and molecular interactions in drug molecular graphs and protein graphs, this study proposes an improved GIN model that employs the multi-perceptive neighborhood self-attention aggregation strategy, as shown in Figure 3. Compared to traditional GIN, this approach introduces two key innovative optimizations in the aggregation and update mechanism, aiming to enhance the model’s ability to represent molecular structures and capture complex relationships.

First, building on the original principle of fixed weight allocation for neighboring node information, a self-attention aggregation strategy is employed. The core of this strategy lies in dynamically adjusting the aggregation weights of neighboring nodes. Specifically, during each information propagation step, the similarity between the feature vectors of the neighboring nodes is first computed to generate an attention weight matrix. This matrix is based on the feature vectors of each node and is normalized through a dot product operation followed by the softmax function to obtain the relative importance of neighboring nodes. This means that during the aggregation process, the model is able to more flexibly focus on those neighboring nodes that contribute more to the information update of the current node, thereby enhancing the effectiveness of information fusion.

Next, considering that a larger range of atoms and chemical bonds is particularly important for the representation of functional groups and substructures within a molecule, this study introduces second-order neighborhood information in addition to the first-order neighborhood to capture more complex relationships within a broader neighborhood. This process is implemented through the following steps: during each iteration, the model first aggregates information from the first-order neighborhood and then combines these aggregated features with those from the second-order neighborhood. Specifically, the introduction of second-order neighborhood features not only enhances the model’s understanding of the overall graph structure but also enables it to capture potential molecular interactions and subtle chemical properties. By constructing composite feature vectors that incorporate both first- and second-order neighborhood information, the model can consider richer contextual information when calculating node representations, thereby improving prediction accuracy.

2.1.2. Aggregation and Update Rules

In practice, this optimization process is carried out through iterative updates. Each round of aggregation is based on the current node features and its neighborhood features, and the weighted sum computed through the self-attention mechanism forms the new node feature representation. This approach ensures that, during information propagation, each node not only relies on the information from its direct neighbors but can also indirectly obtain valuable features from more distant neighborhoods, thus comprehensively enhancing the model’s ability to represent and understand the molecular structures. The specific update rule for the improved GIN is as follows:

First-Order Neighborhood Aggregation: For each node, the features are first aggregated from their first-order neighborhood, denoted as. This aggregation process uses a self-attention mechanism to dynamically assign appropriate weights based on the features of neighboring nodes. Let

h_{v}^{(k - 1)}

represent the features of a first-order neighboring node

u \in N_{1} (v)

at the

(k - 1)

-th layer. The first-order neighborhood aggregation

{AGG}_{1}^{(k - 1)} (v)

is defined as follows:

{AGG}_{1}^{(k - 1)} (v) = \sum_{u \in N_{1} (v)} α_{vu}^{(k - 1)} h_{u}^{(k - 1)}

(1)

where

α_{v u}^{(k - 1)}

is the self-attention weight, calculated based on the feature similarity between u and v, used to measure the relative importance of node u to node.

Second-Order Neighborhood Aggregation: To strengthen the model’s ability to capture structural information, second-order neighborhood information is introduced. Similar to first-order neighborhoods, the features of the second-order neighbors

w \in N_{2} (v)

are aggregated. Let

h_{w}^{(k - 1)}

represent the features of a second-order neighbor, and the second-order neighborhood aggregation

{AGG}_{2}^{(k - 1)} (v)

is defined as follows:

{AGG}_{2}^{(k - 1)} (v) = \sum_{w \in N_{2} (v)} β_{vw}^{(k - 1)} h_{w}^{(k - 1)}

(2)

where

β_{v w}^{(k - 1)}

is the self-attention weight assigned to second-order neighbors, enabling the model to capture broader contextual relationships.

Combining First- and Second-Order Neighborhoods: After completing the aggregation of first- and second-order neighbors, the features are combined to obtain enriched node representations. To balance the contributions of first- and second-order neighbors, we introduce a learnable parameter

γ^{(k - 1)}

, which adjusts adaptively during each training iteration based on the task requirements, optimizing the relative importance of first- and second-order neighbors in the final aggregation. Through this adaptive training mechanism, the model dynamically adjusts the value of

γ^{(k - 1)}

according to the characteristics of the data, ensuring that the weight distribution between first- and second-order neighbors aligns precisely with the task goal, thereby enhancing model performance. The final aggregation representation

m_{v}^{(k)}

for node v is expressed as follows:

m_{v}^{(k)} = (1 + γ^{(k - 1)}) \cdot A G G_{1}^{(k - 1)} (v) + γ^{(k - 1)} \cdot A G G_{2}^{(k - 1)} (v)

(3)

where

γ^{(k - 1)}

is a learnable parameter used to balance the contributions of first- and second-order neighbors. The learning process of this parameter is achieved through backpropagation and gradient descent, where the model adjusts the value of

γ^{(k - 1)}

based on the feedback from the loss function, ensuring that it adaptively optimizes the weights of first- and second-order neighbors. Specifically, during training, the model minimizes prediction errors to adjust

γ^{(k - 1)}

, calculating gradients and updating the parameter at each iteration based on the current node and neighbor features. In this way,

γ^{(k - 1)}

can dynamically adjust according to different training data. For example, when the local information from first-order neighbors is sufficiently rich,

γ^{(k - 1)}

may approach smaller values, emphasizing the role of first-order neighbors; when first-order neighbor information is insufficient, the model will automatically increase

γ^{(k - 1)}

to introduce more second-order neighbor information. This dynamic adjustment process allows the model to flexibly capture multi-level structural information, thereby improving the accuracy of tasks such as drug–target affinity prediction.

Feature Update: Finally, the node representations are updated using a multi-layer perceptron (MLP) to integrate the aggregated features. The updated representation

h_{v}^{(k)}

for node v is calculated as follows:

h_{v}^{(k)} = MLP (m_{v}^{(k)})

(4)

where

m_{v}^{(k - 1)}

represents the combined feature vector after first- and second-order neighborhood aggregation. This hierarchical feature aggregation and update strategy significantly enhances the model’s capacity to represent molecular structures and capture intricate chemical properties, allowing for a deeper understanding of molecular interactions and improved predictive performance.

2.1.3. Additional Enhancements

In this module, we also introduced an inter-layer feature fusion and reuse mechanism by incorporating residual connection structures between multiple layers of the GIN network. This structure allows the model to comprehensively consider low-level detailed features (atomic-level attributes) and high-level abstract features (molecular-level attributes), enabling a multi-level understanding of the data. Furthermore, the residual connection structure effectively alleviates the vanishing gradient problem in deep networks and improves the model’s generalization ability.

2.2. Cross-Modal Bidirectional Adaptive Guided Fusion Strategy

2.2.1. Method Overview

To better capture the complex relationships between the textual and graphical modalities of drugs and proteins, this study proposes a cross-modal bidirectional adaptive guided fusion strategy, which is applied to the feature fusion module. Inspired by the Transformer model, this module constructs Self-Attention (SA) and Guided-Attention (GA) units and organizes them in a cascading design to achieve the cross-modal fusion of drug and protein features, as illustrated in Figure 4.

2.2.2. Feature Refinement

Although the feature extraction for each modality has been initially refined using a dual-layer GRU network for text features and an improved GIN network with a multi-perceptive neighborhood self-attention aggregation strategy for graph features, there remain limitations in capturing fine-grained dependencies within each modality. Specifically, the GRU network primarily focuses on sequence information but struggles to model long-range dependencies between words, limiting its ability to fully capture the global semantic relationships within the textual modality. Similarly, while the GIN network improves structural information extraction for the graph modality through its enhanced aggregation strategy, it’s still constrained by predefined neighborhood scopes, making it difficult to dynamically adjust dependencies between nodes at the feature level.

To address these limitations, this study incorporates an SA unit for further feature refinement within each modality before proceeding to cross-modal fusion. The SA unit leverages a multi-head attention mechanism to adaptively capture dependencies within each modality at the feature level, dynamically adjusting weight allocation based on feature importance. This process allows the model to focus on key features within a single modality, yielding more precise and semantically enriched representations. The introduction of the SA unit complements the GRU and the improved GIN by enhancing the modeling of intra-modal relationships, thereby providing more effective inputs for subsequent cross-modal fusion.

During the feature fusion stage, the GA unit facilitates bidirectional feature guidance between the textual and graphical modalities. Using a multi-head attention mechanism, the GA unit computes interaction weights between the two modalities, enabling each modality to update its feature representation under the guidance of the other. Specifically, the GA unit allows the textual features in the drug branch to guide the focus areas of graphical features, and vice versa, ensuring that each modality can adaptively capture the most relevant aspects of the other. For instance, the textual features of drugs guide the weighting of molecular structure features generated by the graph modality, highlighting graph regions pertinent to the text. Similarly, graphical features emphasize critical words in the text modality, aiding the model in comprehending cross-modal information more effectively.

Through a series of ablation experiments, it has been demonstrated that, compared to traditional weighted averaging fusion and the single guided-attention model shown in Figure 5, the dual guided-attention fusion module not only establishes bidirectional connections between different modalities, enabling more flexible and adaptive feature capture, but also allows for the design of cascading structures based on task requirements, achieving the layer-by-layer refinement of features. Through multi-level deep aggregation, this method fully explores the fine-grained interactions between modalities, significantly improving the model’s ability to represent and predict the complex relationships between drugs and proteins.

2.3. Explicit Prediction Strategy Based on Multi-Head Collaborative Attention

2.3.1. Method Overview

In practical drug development, the reliability of model prediction results directly impacts the efficiency of drug research and development. Traditional methods typically adopt the direct concatenation of feature vectors extracted from drug and protein branches. However, such implicit modeling approaches fail to adequately capture complex interactions between features, potentially leading to significant discrepancies between predicted results and wet lab experimental validation. To address this issue, this study proposes an explicit modeling approach based on a multi-head collaborative attention mechanism, which introduces two key innovations to improve drug–protein affinity prediction.

2.3.2. Functional Mechanism

First, this study introduces a similarity matrix-based interaction modeling mechanism to explicitly quantify the relationships between drug and protein features. Compared to traditional direct concatenation methods, this mechanism captures both local and global interaction patterns between the two types of features by computing their similarities in a multi-dimensional space. Specifically, let the drug features be denoted as

D \in R^{n_{d} \times d}

and the protein features as, where

n_{p}

and

n_{d}

represent the feature dimensions of the drug and protein, respectively, and is the embedding dimension of each feature. By mapping both features into the same embedding space, a similarity matrix is constructed, defined as follows:

S = softmax ({DP}^{⊤})

(5)

where

D P^{⊤} \in R^{n_{d} \times n_{p}}

represents the inner product similarity between drug and protein features, and the softmax(·) function ensures row-wise normalization, allowing each row to reflect the interaction weight distribution between a drug and all protein features. This explicit interaction modeling approach enhances the model’s ability to capture potential biochemical associations between drug and protein features, providing a solid foundation for subsequent feature fusion.

Second, the study employs a multi-head collaborative attention mechanism to achieve efficient feature fusion. Unlike traditional direct concatenation methods, the multi-head collaborative attention mechanism extracts deep interaction relationships between features in parallel across multiple subspaces. Specifically, given the similarity matrix, the multi-head attention mechanism performs weighted aggregation of drug and protein features, expressed as follows:

H^{k} = softmax (S) {PW}^{k}

(6)

where

H^{k} \in R^{n_{d} \times d_{k}}

is the representation of drug features generated by the k-th attention head,

W^{k} \in R^{d \times d_{k}}

is the learnable parameter matrix for the k-th attention head, and

d_{k}

is the output dimension of each head. The outputs of all attention heads are concatenated to form the fused feature representation:

H = [H^{1}; H^{2}; \dots; H^{K}]

(7)

where k is the number of attention heads. This mechanism extracts deep interaction information from multiple subspaces, mitigating the risk of missing key information in a single attention head and enhancing the model’s representational capacity.

Based on the above mechanisms, the final fused representation is passed to a fully connected layer for drug–protein affinity prediction. This method effectively addresses the limitations of insufficient feature interaction modeling, achieving fine-grained interaction modeling at the feature level while providing robust theoretical support and biological interpretability for prediction.

3. Experiments

3.1. Datasets

We conducted experiments using two benchmark datasets, Davis [17] and KIBA [18], listed in Table 1, to train our model and evaluate its performance. In the field of drug–target affinity prediction, these two datasets are widely used. The Davis dataset compiles specific kinase proteins and their corresponding inhibitors, containing 68 drugs, 442 proteins, and 30,056 drug–target interactions. The average length of drug SMILES strings is 64, and the average length of protein amino acid sequences is 788. This dataset represents the affinity between drugs and targets using the kinase dissociation constant. Furthermore, the

K_{d}

values are transformed into

p K_{d}

values with the calculation formula as follows:

{pK}_{d} = - {log}_{10} \frac{K_{d}}{10^{9}}

(8)

The KIBA dataset gathers the bioactivity data of kinase inhibitors, including inhibition constant, dissociation constant, and half-maximal inhibitory concentration. It contains 2111 drugs, 229 proteins, and 118,254 drug–target interactions. The average length of drug SMILES strings is 58, and the average length of protein amino acid sequences is 728.

Additionally, we introduce a multi-fold cross-validation mechanism. The dataset is divided into five equal-sized subsets. Whenever one subset is selected as the validation set, the remaining four subsets serve as the training set. This mechanism effectively enhances data utilization and, by reducing the evaluation bias caused by data partitioning, helps to provide a more accurate estimate of the model’s generalization performance.

3.2. Data Preprocessing

In the drug representation section, this study conducts modeling and analysis based on the textual representation and molecular graph of drug molecules, as shown in Figure 6. The SMILES expression is used as the textual representation of drug molecules in this study. SMILES expressions represent complex chemical structures, such as heavy atoms or valence electrons, using strings composed of letters or numbers, making it suitable for feature extraction by deep learning models such as convolutional neural networks. To restore and extract the true structural information of drug molecules, the open-source cheminformatics software RDKit is used to convert SMILES expressions into corresponding molecular graphs. In these graphs, each node is represented by a multi-dimensional feature vector that includes five types of information: atomic symbol, the number of adjacent atoms, the number of adjacent hydrogen atoms, the atom’s implicit valence, and whether the atom is part of an aromatic structure. This intuitive topological information can be effectively extracted by deep learning models such as graph convolutional networks.

In the target representation section, this study conducts modeling and analysis based on protein amino acid textual sequences and protein graphs, as shown in Figure 4. Each target protein is expressed through its corresponding amino acid sequence via the translation process. ASCII characters are used as the textual expression for amino acid sequences, where each type of amino acid is encoded by its corresponding English letter through a unique integer. Similar to drug SMILES expressions, these amino acid textual sequences can be effectively extracted by deep learning models such as convolutional neural networks. To acquire protein structural information, such as the angles and distances between different residue pairs, the prediction tool Pconsc4 is used to predict residue contact maps from amino acid sequences. After filtering to select contact information with higher confidence, the adjacency matrix of the protein graph is ultimately constructed.

The contact map is a matrix form expression of the residue contact graph, not the final protein graph itself. The matrix size is

L \times L

, where L is the length of the protein sequence. The element

m_{i j}

of the matrix indicates whether the i-th and j-th residues in the sequence are in contact. The definition of contact is typically understood as follows: if the Euclidean distance between the

C_{β}

atoms of two residues (

C_{α}

for glycine, as it does not have a

C_{β}

) is less than a specified threshold, they are considered to be in contact.

3.3. Evaluation Metrics

Drug–target affinity prediction is considered a regression task. To ensure a fair performance comparison with previous models, we selected three identical evaluation metrics to assess our model, including Mean Squared Error (MSE), Concordance Index (CI), and Regression towards the Mean Index (

r_{m}^{2}

).

MSE is a metric used to measure the deviation between predicted values and actual values, calculated using the squared loss function. The specific calculation formula is as follows:

MSE = \frac{1}{n} \sum_{i = 1}^{n} {({\hat{y}}_{i} - y_{i})}^{2}

(9)

where

{\hat{y}}_{i}

represents the predicted value,

y_{i}

is the actual value, and n is the total number of drug–target pairs in the dataset.

CI is used to evaluate whether the ranking of predicted values for two randomly selected drug–target pairs is consistent with the ranking of their actual values. The specific calculation formula is as follows:

C I = \frac{1}{Z} \sum_{d_{x} > d_{y}} h (b_{x} - b_{y})

(10)

h (x) = \{\begin{matrix} 1, & if x > 0 \\ 0.5, & if x = 0 \\ 0, & if x < 0 \end{matrix}

(11)

where

h (x)

is a step function,

b_{x}

represents the predicted value corresponding to the higher affinity, and

b_{y}

represents the predicted value corresponding to the lower affinity. Z is a normalization constant representing the total number of drug–target pairs involved in the analysis.

The

r_{m}^{2}

metric is used to assess the external predictive capability of the model. The specific calculation formula is as follows:

r_{m}^{2} = r^{2} \times (1 - \sqrt{r^{2} - r_{0}^{2}})

(12)

where

r^{2}

is the squared correlation coefficient between the actual values and the predicted values including the intercept, and

r_{0}^{2}

is the squared correlation coefficient between the actual values and the predicted values without including the intercept.

3.4. Hyperparameter Settings and Tuning Experiments

Detailed information on the model’s hyperparameter settings is provided in Table 2.

This study conducted tuning experiments on the batch size, learning rate,

γ

, and the drug–target interaction modeling approach (mentioned in the previous chapter).To optimize the model’s performance, we conducted hyperparameter tuning experiments on batch size and learning rate, with the detailed results shown in Table 3. Through these experiments, we confirmed the significant impact of batch size and learning rate on model performance and selected the optimal configuration (batch size of 512 and learning rate of 0.001) for further model training.

The study conducted experiments with different values of the learnable parameter

γ

, as shown in Table 4. When

γ

= 0.5, the model exhibited the best performance, indicating that smaller values of

γ

help to enhance the capture of drug–target interaction features and further optimize the model’s prediction accuracy.

This study conducted experiments using both softmax inner product similarity and cosine similarity, as shown in Table 5. The results indicate that the softmax method performs better in drug–target interaction modeling. These findings validate the choice of inner product similarity over cosine similarity for this task, as inner product similarity more effectively captures the interactions between the drug and target.

To ensure a fair comparison, all network models in the experiment are implemented and executed in the Pytorch framework. The hardware configuration used in this study is as follows: the operating system is Ubuntu 20.04, the CPU model is Intel(R) Xeon(R) Platinum 8352V CPU @ 2.10 GHz, the GPU model is NVIDIA GeForce RTX 4090, the GPU memory is 24 GIB, the RAM is 64 GB DDR5 (32 × 2 GB), and the solid-state drive capacityis 4 TB. The deep learning framework used is Pytorch 2.2.2, the CUDA version is 12.2, the cuDNN version is 8.7, and the Python version is 3.9.19. The details of the experimental environment settings are shown in Table 6.

3.5. Performance Comparison

To verify the performance of CM-DTA in the DTA domain, we conducted a detailed comparison with existing models, such as DeepDTA, GraphDTA, DGraphDTA, ProLiGraph [19], TransVAEDTA [20], and Tag-DTA [21] based on experiments on the Davis and KIBA datasets using three identical evaluation methods: MSE, CI, and

r_{m}^{2}

. The specific performance metrics of the model are reproduced from previously published papers, and the final experimental results are shown in Figure 7.

Based on the Davis dataset, our model significantly outperforms other models on this dataset, with an MSE of 0.182, CI of 0.929, and

r_{m}^{2}

of 0.786. Compared to the SOTA, the MSE decreased by 1.6%, CI increased by 0.6%, and

r_{m}^{2}

increased by 0.2%. Based on the KIBA dataset, our model significantly outperforms other models on this dataset as well, with an MSE of 0.114, CI of 0.917, and

r_{m}^{2}

of 0.806. Compared to SOTA, the MSE decreased by 0.7%, CI increased by 1.5%, and

r_{m}^{2}

increased by 0.8%.

Figure 8 illustrates the comparison between true affinity and predicted values for both the Davis and KIBA datasets. The x-axis represents the ground truth, and the y-axis represents the predictions. The variance between the expected affinity and the actual value is shown by the vertical distance

|Δ y|

between each point and

y = x

. The distribution of the expected and actual affinities is shown by the histograms at the edges. The results show that the data points in both datasets tend to be symmetric around

y = x

, with a denser distribution around

y = x

in the KIBA dataset. The y-axis displays the predicted value for each data point, and the x-axis displays the actual value for each data point. Each sample’s vertical distance

|Δ y|

from

y = x

shows the difference between the predicted and actual values of its affinity.

The experimental results demonstrate that by leveraging the characteristics of different data types, adopting suitable network modeling strategies, and introducing appropriate cross-modal feature fusion and final prediction strategies, the overall performance of the model is significantly improved. Specifically, the multi-perceptive neighborhood self-attention aggregation strategy enables the graph modality to capture the structural information of drugs and targets with greater precision, thereby enhancing the model’s structure-awareness capabilities. Meanwhile, the cross-modal bidirectional adaptive guided fusion strategy achieves bidirectional information guidance between the text and graph modalities, allowing features from both modalities to focus on key information. This improves the complementarity between modalities and enriches the expressiveness of feature representations. In addition, the explicit prediction strategy based on multi-head collaborative attention facilitates deep interaction between drug and protein features through a similarity matrix and parallel subspaces of multi-head collaboration, enhancing both prediction accuracy and biological interpretability. Compared to existing models, CM-DTA demonstrates superior performance on the Davis and KIBA datasets. These results indicate that the advantages of CM-DTA in feature representation and information complementarity make it a more competitive approach for drug–target affinity prediction.

3.6. Ablation Study

To evaluate the impact of each module on model performance, this study conducted a systematic ablation experiment on the Davis and KIBA datasets, with the results presented in Table 7. In the experimental setup, MA refers to the GIN model enhanced by the multi-perceptive neighborhood self-attention aggregation strategy, MB represents the cross-modal bidirectional adaptive guided fusion module, and SMB denotes the cross-modal single-guided adaptive fusion module (including SMB1, where text guides the graph, and SMB2, where the graph guides the text). MC stands for the Multi-Head Collaborative Attention Explicit Prediction Strategy, while RE represents the introduction of residual connections, applied to the GIN, MA, and MB modules. The key findings are summarized as follows:

Single-Modality Baselines (Models 1 and 2): Using GRU for text modality (Model 1) and GIN for graph modality (Model 2) independently achieves limited performance, with higher MSE and lower and CI values. This highlights that single-modality features cannot fully capture the complex interactions between drugs and targets.

Effect of MA (Model 5): Incorporating the MA in GIN significantly improves performance over the basic GIN (Model 2), reducing MSE and enhancing CI values. This demonstrates the importance of capturing both first-order and second-order neighborhood information for better structural representation.

Basic Cross-Modal Fusion (Model 6): Combining text and graph features using GRU and GIN improves performance compared to single modalities (Models 1 and 2). However, this basic concatenation approach lacks deep interaction modeling, leading to limited gains compared to advanced fusion strategies.

Advanced Cross-Modal Fusion (Models 7–9): Introducing MB (Model 8) outperforms SMA (Model 9) and basic fusion (Model 6). The bidirectional mechanism enables more effective interaction between modalities, enhancing complementary information exchange. Effect of Explicit Prediction (Model 10): Adding the MC further improves performance, as it deepens the interaction between drug and protein features while increasing biological interpretability.

Final Model (Model 13): The integration of all modules (GRU+MA+MB+RE+MC) achieves the best results on both datasets, with the lowest MSE and highest CI values. The RE effectively stabilizes training and enhances generalization, while the combination of MA, MB, and MC captures fine-grained interactions and improves predictive performance. In summary, each module contributes to model performance, with the full integration of MA, MB, MC, and RE achieving the strongest results by capturing both intra- and inter-modality relationships and effectively leveraging complementary information.

4. Discussion

4.1. Comparison Between Existing Methods and Cross-Modal Fusion of Text and Graph Methods

Cross-modal methods offer significant advantages over traditional experimental methods and computational methods. Traditional experimental methods rely on experimental data, have low computational efficiency, are suitable for small-scale datasets, and face high costs and long timeframes. While computational methods improve computational efficiency, they depend on manual annotation and feature extraction, which limits their applicability. In contrast, cross-modal deep learning methods can handle large-scale datasets, and, by integrating text and graph features, capture complex structural relationships and interactions. These methods offer greater flexibility and prediction accuracy, achieving better performance in a wider range of tasks, as shown in Table 8.

4.2. Strengths and Limitations

This study proposes an efficient cross-modal feature fusion model, CM-DTA, for drug–target affinity prediction. CM-DTA is based on the multi-perceptive neighborhood self-attention aggregation strategy, which captures both first-order and second-order neighborhood information to enhance the structural perception of the graph modality. It also incorporates the cross-modal bidirectional adaptive guided fusion strategy, establishing effective interactions between the text and graph modalities, enabling features to focus on each other’s key information for efficient cross-modal information fusion. Furthermore, the model leverages the explicit prediction strategy based on multi-head collaborative attention, which deeply explores the complex relationships between drug and target features, improving predictive performance and biological interpretability. Experimental results on the Davis and KIBA datasets demonstrate that CM-DTA significantly outperforms the current SOTA models. However, the model’s performance may be influenced by the quality and size of the training data, as it relies heavily on large, well-labeled datasets for accurate predictions. Additionally, while the model effectively captures cross-modal interactions, it may still struggle with extremely noisy or incomplete data, potentially affecting its robustness.

4.3. Potential Application Areas

The cross-modal feature fusion approach proposed in CM-DTA has significant potential in various domains of drug discovery and bioinformatics. Primarily, it can be applied to drug–target affinity prediction, where its ability to integrate textual and graphical information from drug molecules and protein sequences enables more accurate and interpretable predictions compared to traditional methods. Beyond drug–target interactions, the methodology can extend to other areas such as drug repurposing, where identifying similarities between existing drugs and new disease targets is crucial. Furthermore, the model’s capacity to handle multi-modal data makes it applicable to complex biomedical tasks, such as protein–protein interaction prediction, biomarker discovery, and personalized medicine, where integrating diverse data sources is essential for making informed decisions. Given its robustness in capturing intricate relationships across different data types, CM-DTA offers a versatile tool for advancing precision medicine and accelerating the drug development process.

4.4. Future Directions

Although the proposed CM-DTA model demonstrates promising results, there are several areas for further improvement and exploration. First, the model’s reliance on large-scale, well-labeled datasets may limit its performance in scenarios where data are sparse or noisy. Future research could focus on developing more robust methods for training with limited or noisy data, possibly through semi-supervised learning or transfer learning techniques, to enhance its generalization ability. Secondly, the current model primarily focuses on drug–target affinity prediction, but its applicability could be extended to other biomedical tasks, such as drug combination prediction, adverse drug reaction prediction, and patient-specific treatment recommendations. Integrating additional biological data, such as genomic and proteomic information, could further improve the model’s performance and make it more versatile. Furthermore, improving the interpretability of the model is a key area for future development. Developing methods that allow users to better understand and visualize the decision-making process of the model would help to foster trust in its predictions, especially in clinical settings. To this end, developing an intuitive visualization interface as soon as possible could be considered to help users better understand and use the model by displaying the prediction process and results. Finally, considering the computational complexity of the model and optimizing its efficiency and scalability will be an important direction for the future. This will not only improve the model’s ability to work with larger datasets but also ensure its efficiency and stability in real-world applications.

Author Contributions

Investigation, data curation, supervision and funding acquisition, conceptualization, methodology and writing—review and editing, methodology, software and resources, J.Y.; methodology, software, formal analysis, data curation and writing—original draft, software, validation, resources and project administration, F.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

CI	Concordance Index
CNN	Convolutional Neural Network
DTA	Drug–Target Affinity
ELISA	Enzyme-Linked Immunosorbent Assay
FDA	Food and Drug Administration
GA	Guided-Attention
GIN	Graph Isomorphism Network
GRU	Gated Recurrent Unit
MSE	Mean Squared Error
MA	Multi-Perceptive Neighborhood Self-Attention Aggregation Strategy
MB	Cross-Modal Bidirectional Adaptive Guided Fusion Module
MC	Multi-Head Collaborative Attention Explicit Prediction Strategy
RE	Residual Connection
SMB1	Cross-Modal Single-Guided (text) Adaptive Fusion Module
SMB2	Cross-Modal Single-Guided (graph) Adaptive Fusion Module
SMILES	Simplified Molecular Input Line Entry System
SPR	Surface Plasmon Resonance
SOTA	State of the Art
ITC	Isothermal Titration Calorimetry

References

Bellassai, N.; D’Agata, R.; Giordani, E.; Ziccheddu, G.; Corradini, R.; Spoto, G. A novel method for detecting genetic biomarkers in blood-based liquid biopsies using surface plasmon resonance imaging and magnetic beads shows promise in cancer diagnosis and monitoring. Talanta 2025, 286, 127543. [Google Scholar] [CrossRef] [PubMed]
Bastos, M.; Abian, O.; Johnson, C.M.; Ferreira-da-Silva, F.; Vega, S.; Jimenez-Alesanco, A.; Ortega-Alarcon, D.; Velazquez-Campoy, A. Isothermal titration calorimetry. Nat. Rev. Methods Prim. 2023, 3, 17. [Google Scholar] [CrossRef]
Liu, Y.; Jin, Z.; Sun, D.; Xu, B.; Lan, T.; Zhao, Q.; He, Y.; Li, J.; Cui, Y.; Zhang, Y. Preparation of hapten and monoclonal antibody of hesperetin and establishment of enzyme-linked immunosorbent assay. Talanta 2025, 281, 126912. [Google Scholar] [CrossRef] [PubMed]
D’Agata, R.; Bellassai, N.; Spoto, G. Exploiting the design of surface plasmon resonance interfaces for better diagnostics: A perspective review. Talanta 2024, 266, 125033. [Google Scholar] [CrossRef] [PubMed]
He, T.; Heidemeyer, M.; Ban, F.; Cherkasov, A.; Ester, M. SimBoost: A read-across approach for predicting drug–target binding affinities using gradient boosting machines. J. Cheminform. 2017, 9, 1–14. [Google Scholar] [CrossRef] [PubMed]
Öztürk, H.; Özgür, A.; Ozkirimli, E. DeepDTA: Deep drug–target binding affinity prediction. Bioinformatics 2018, 34, i821–i829. [Google Scholar] [CrossRef] [PubMed]
Allenspach, S.; Hiss, J.A.; Schneider, G. Neural multi-task learning in drug design. Nat. Mach. Intell. 2024, 6, 124–137. [Google Scholar] [CrossRef]
Zhao, Q.; Xiao, F.; Yang, M.; Li, Y.; Wang, J. AttentionDTA: Prediction of drug–target binding affinity using attention model. In Proceedings of the 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), San Diego, CA, USA, 18–21 November 2019; pp. 64–69. [Google Scholar] [CrossRef]
Nguyen, T.; Le, H.; Quinn, T.P.; Nguyen, T.; Le, T.D.; Venkatesh, S. GraphDTA: Predicting drug–target binding affinity with graph neural networks. Bioinformatics 2021, 37, 1140–1147. [Google Scholar] [CrossRef] [PubMed]
Jiang, M.; Wang, S.; Zhang, S.; Zhou, W.; Zhang, Y.; Li, Z. Sequence-based drug-target affinity prediction using weighted graph neural networks. BMC Genom. 2022, 23, 449. [Google Scholar] [CrossRef] [PubMed]
Liu, Y.; Zhou, Z.; Cao, X.; Cao, D.; Zeng, X. Effective drug-target affinity prediction via generative active learning. Inf. Sci. 2024, 679, 121135. [Google Scholar] [CrossRef]
Feng, W.; Wang, L.; Lin, Z.; Zhu, Y.; Wang, H.; Dong, J.; Bai, R.; Wang, H.; Zhou, J.; Peng, W.; et al. Generation of 3D molecules in pockets via a language model. Nat. Mach. Intell. 2024, 6, 62–73. [Google Scholar] [CrossRef]
Sieg, J.; Feldmann, C.W.; Hemmerich, J.; Stork, C.; Sandfort, F.; Eiden, P.; Mathea, M. MolPipeline: A python package for processing molecules with RDKit in scikit-learn. J. Chem. Inf. Model. 2024, 64, 9027–9033. [Google Scholar] [CrossRef] [PubMed]
Zhao, L.; Li, Z.; Chen, G.; Yin, Y.; Chen, C.Y.C. Multi-perspective neural network for dual drug repurposing in Alzheimer’s disease. Knowl.-Based Syst. 2024, 283, 111195. [Google Scholar] [CrossRef]
Peng, Y.; Lin, Y.; Jing, X.Y.; Zhang, H.; Huang, Y.; Luo, G.S. Enhanced graph isomorphism network for molecular admet properties prediction. IEEE Access 2020, 8, 168344–168360. [Google Scholar] [CrossRef]
Fairuzabadi, M.; Kusrini, K.; Utami, E.; Setyanto, A. Advancements and Challenges in Gated Recurrent Units (GRU) for Text Classification: A Systematic Literature Review. In Proceedings of the 2024 7th International Conference of Computer and Informatics Engineering (IC2IE), Bali, Indonesia, 12–13 September 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–7. [Google Scholar] [CrossRef]
Pasquale, E.B. Eph receptors and ephrins in cancer progression. Nat. Rev. Cancer 2024, 24, 5–27. [Google Scholar] [CrossRef] [PubMed]
Bernett, J.; Blumenthal, D.B.; Grimm, D.G.; Haselbeck, F.; Joeres, R.; Kalinina, O.V.; List, M. Guiding questions to avoid data leakage in biological machine learning applications. Nat. Methods 2024, 21, 1444–1453. [Google Scholar] [CrossRef] [PubMed]
Paendong, G.G.; Ngnamsie Njimbouom, S.; Zonyfar, C.; Kim, J.D. ERL-ProLiGraph: Enhanced representation learning on protein-ligand graph structured data for binding affinity prediction. Mol. Inform. 2024, 43, e202400044. [Google Scholar] [CrossRef] [PubMed]
Zhou, C.; Li, Z.; Song, J.; Xiang, W. TransVAE-DTA: Transformer and variational autoencoder network for drug-target binding affinity prediction. Comput. Methods Programs Biomed. 2024, 244, 108003. [Google Scholar] [CrossRef] [PubMed]
Monteiro, N.R.; Oliveira, J.L.; Arrais, J.P. TAG-DTA: Binding-region-guided strategy to predict drug-target affinity using transformers. Expert Syst. Appl. 2024, 238, 122334. [Google Scholar] [CrossRef]

Figure 1. A overview of representative methods for DTA prediction.

Figure 2. The overall architecture of CM-DTA.

Figure 3. Multi-perceptive neighborhood self-attention aggregation strategy.

Figure 4. Cross-modal bidirectional adaptive guided fusion strategy.

Figure 5. Single guided-attention cross-modal fusion module.

Figure 6. Data preprocessing.

Figure 7. Prediction performance on the Davis and KIBA dataset.

Figure 8. The real affinity against the predicted value on Davis (a) and KIBA (b) datasets.

Table 1. Summary of the datasets.

Dataset	Drug	Target	Interactions	Active Data	Inactive Data	Drug Length	Protein Length
Davis	68	442	30,056	2457	27,599	64	788
KIBA	2111	229	118,254	22,729	95,525	58	728

Table 2. Hyperparameters used in our experiments.

Hyperparameter	Setting Range
Epoch	1000
Batch size	(128, 256, 512, 1024)
Learning rate	(0.0001, 0.0005, 0.001, 0.005)
Optimizer	Adam
Dropout rate	0.1

Table 3. Tuning experiments of batch size and learning rate.

Hyperparameters		MSE	CI	$r_{m}^{2}$
Batch size	128	0.220	0.901	0.723
	256	0.205	0.905	0.738
	512	0.182	0.929	0.786
	1024	0.190	0.914	0.751
Learning rate	0.0001	0.245	0.887	0.709
	0.0005	0.190	0.916	0.779
	0.001	0.182	0.929	0.786
	0.005	0.231	0.899	0.710

Table 4. Tuning experiments of the learnable parameter

γ

.

Table 4. Tuning experiments of the learnable parameter

γ

.

	MSE	CI	$r_{m}^{2}$
$γ$ = 0.5	0.182	0.929	0.786
$γ$ = 1	0.217	0.893	0.755

Table 5. Tuning experiments of the drug–target interaction modeling approach.

	MSE	CI	$r_{m}^{2}$
Softmax	0.182	0.929	0.786
Cosine Similarity	0.215	0.901	0.763

Table 6. Environment configuration.

Configuration Item	Value
Operating System	ubuntu20.04
CPU	Intel(R) Xeon(R) Platinum 8352V
RAM	64 GB
SSD	4 TB
GPU	GeForce RTX 4090
Deep Learning Framework	Pytorch 2.2.2 + cu118
CUDA Version	12.2
cuDNN Version	8.7
Python Version	3.9.19

Table 7. Ablation study of each module on Davis and KIBA datasets.

	Model	Drug	Target	Davis			KIBA
	Model	Drug	Target	MSE	CI	$r_{m}^{2}$	MSE	CI	$r_{m}^{2}$
1	GRU	Text	Text	0.279	0.848	0.659	0.231	0.843	0.663
2	GIN	Graph	Graph	0.241	0.880	0.681	0.185	0.860	0.684
3	GRU+MC	Text	Text	0.244	0.867	0.660	0.216	0.849	0.691
4	GIN+MC	Graph	Graph	0.233	0.889	0.702	0.174	0.871	0.684
5	MA	Graph	Graph	0.227	0.898	0.712	0.166	0.875	0.705
6	GRU+GIN	T+G	T+G	0.235	0.899	0.684	0.173	0.863	0.688
7	GRU+MA	T+G	T+G	0.221	0.905	0.759	0.148	0.890	0.774
8	GRU+GIN+MB	T+G	T+G	0.223	0.902	0.736	0.159	0.875	0.702
9	GRU+MA+SMB1	T+G	T+G	0.224	0.901	0.744	0.151	0.885	0.772
10	GRU+MA+SMB2	T+G	T+G	0.221	0.901	0.746	0.148	0.889	0.782
11	GRU+MA+MB	T+G	T+G	0.219	0.912	0.767	0.144	0.897	0.785
12	GRU+MA+MB+RE	T+G	T+G	0.198	0.923	0.784	0.121	0.902	0.798
13	GRU+MA+MB+RE+MC	T+G	T+G	0.182	0.929	0.786	0.114	0.917	0.806

Table 8. Comparison of existing research methods and cross-modal fusion of text and graph methods.

Category	Traditional Experimental	Computational	Deep Learning-Based
Category	SPR, ITC, ELISA, etc.	Ligand-Based or Structure-Based	Single-Modal-Based	Cross-Modal-Based
Data Requirements	None	Long-term reliance on manually labeled and extracted data by experts	Initial training requires labeled data, after which features can be automatically extracted
Application Scope	Only a few drugs or proteins	Small-scale datasets	Large-scale datasets	Very large-scale datasets
Computational Efficiency	Up to ten years	Three to five years	A few months to five years
Interpretability	High, experimental results are easy to understand	Moderate, limited by feature extraction methods	High, limited by feature extraction methods
Significant Advantages	High experimental accuracy and interpretability	Efficient computation	Can automatically extract features	Comprehensive analysis of multiple modalities, able to leverage complex structural relationships and interactions
Significant Limitations	High cost, time-consuming, and small-scale	Long-term reliance on manual labeling, small-scale	Single modal information leads to low utilization	Relies on high-performance hardware devices

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Yang, J.; Ren, F. Drug–Target Affinity Prediction Based on Cross-Modal Fusion of Text and Graph. Appl. Sci. 2025, 15, 2901. https://doi.org/10.3390/app15062901

AMA Style

Yang J, Ren F. Drug–Target Affinity Prediction Based on Cross-Modal Fusion of Text and Graph. Applied Sciences. 2025; 15(6):2901. https://doi.org/10.3390/app15062901

Chicago/Turabian Style

Yang, Jucheng, and Fushun Ren. 2025. "Drug–Target Affinity Prediction Based on Cross-Modal Fusion of Text and Graph" Applied Sciences 15, no. 6: 2901. https://doi.org/10.3390/app15062901

APA Style

Yang, J., & Ren, F. (2025). Drug–Target Affinity Prediction Based on Cross-Modal Fusion of Text and Graph. Applied Sciences, 15(6), 2901. https://doi.org/10.3390/app15062901

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Drug–Target Affinity Prediction Based on Cross-Modal Fusion of Text and Graph

Abstract

1. Introduction

2. Methods

2.1. Multi-Perceptive Neighborhood Self-Attention Aggregation Strategy

2.1.1. Method Overview

2.1.2. Aggregation and Update Rules

2.1.3. Additional Enhancements

2.2. Cross-Modal Bidirectional Adaptive Guided Fusion Strategy

2.2.1. Method Overview

2.2.2. Feature Refinement

2.3. Explicit Prediction Strategy Based on Multi-Head Collaborative Attention

2.3.1. Method Overview

2.3.2. Functional Mechanism

3. Experiments

3.1. Datasets

3.2. Data Preprocessing

3.3. Evaluation Metrics

3.4. Hyperparameter Settings and Tuning Experiments

3.5. Performance Comparison

3.6. Ablation Study

4. Discussion

4.1. Comparison Between Existing Methods and Cross-Modal Fusion of Text and Graph Methods

4.2. Strengths and Limitations

4.3. Potential Application Areas

4.4. Future Directions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI