1. Introduction
Currently, the development of a new drug takes nearly 20 years to obtain approval from the U.S. Food and Drug Administration (FDA) and costs approximately USD 260 million. In the field of drug–target affinity prediction, various wet lab methods, including Surface Plasmon Resonance (SPR [1]), Isothermal Titration Calorimetry (ITC [2]), and Enzyme-Linked Immunosorbent Assay (ELISA [3]), have been widely used. However, the high initial investment costs and complex technical requirements have been persistent barriers to technological advancement in this field.
Drug–target affinity prediction methods can currently be categorized into three types: those based on traditional experimental methods, those based on computational methods, and those based on deep learning. Figure 1 provides an overview of the representative methods. With the development of computer technology, data-driven in silico methods have achieved significant success across various drug discovery fields and have clearly become an important tool [4]. For instance, Insilico Medicine, a leading biotech company, has successfully applied deep learning models in the field of drug discovery. Its designed drug, INS018_055, advanced to clinical trials in just 18 months, shortening the timeline by roughly three to four years compared to the traditional drug discovery process, which typically takes about five years. Against the backdrop of deep learning and big data, computer-based drug–target affinity prediction methods have effectively improved the efficiency of financial and human resource utilization.
Since the introduction of the SimBoost model by He et al. [5] in 2017, the concept of drug–target affinity (DTA) prediction has garnered significant attention. SimBoost demonstrated outstanding performance in computing prediction intervals and assessing affinity confidence. However, its limitation lies in the underutilization of deep learning’s potential. With technological advancements, Öztürk et al. [6] introduced DeepDTA in 2018, the first deep learning model combining convolutional neural networks (CNNs) and fully connected neural networks for DTA prediction. Although DeepDTA exhibited high configurability and predictive performance across multiple datasets, it lacked a deep exploration of structural features. To address this issue, researchers proposed WideDTA [7], which significantly enriched the original input data by incorporating protein domain features, motifs, and maximum common substructure word features. However, WideDTA’s design focused solely on intra-textual modality fusion, employing a simple weighted summation of extracted textual features. This approach failed to explore or effectively capture the mutual guidance between drug and target textual features and other representational modalities, limiting its performance in DTA prediction. In 2019, Zhao et al. [8] introduced AttentionDTA, pioneering the use of attention mechanisms for efficient feature interaction. While this model improved prediction accuracy by emphasizing relationships between important features, its computational complexity posed a significant challenge when handling large-scale datasets. In 2020, Nguyen et al. [9] advanced the field with GraphDTA, which shifted from traditional drug textual sequence modeling to extracting information from molecular graphs. This innovation reduced the mean squared error (MSE) from 0.282 to 0.254. Despite its success in improving prediction performance, GraphDTA’s aggregation strategy remained limited to first-order neighborhood information, failing to fully exploit higher-order neighborhood features. Consequently, the model struggled to capture complex molecular structural relationships. Building on the success of GraphDTA, Jiang et al. [10] proposed DGraphDTA, which replaced amino acid sequences with protein contact graphs and developed a dual-branch graph-based model, further enhancing representational capabilities. Despite the significant progress achieved by DGraphDTA and subsequent models like SAM-DTA [11], particularly with SAM-DTA reducing the MSE to 0.229 on the same dataset, these models still predominantly relied on single-modality modeling. They failed to effectively integrate the advantages of textual and graphical modalities, limiting their potential in DTA prediction.
In summary, previous deep learning-based methods for drug–target affinity prediction have relied either on textual sequences of drugs and proteins (such as the Simplified Molecular Input Line Entry System (SMILES [12]) for drugs and amino acid sequences for proteins) or on simple graph structures (where nodes represent atoms and amino acid residues, and edges represent chemical bonds and spatial proximities). However, these methods are limited in their ability to capture the intricate molecular interactions and complex structural dependencies that influence drug–target binding. Therefore, integrating various data types through cross-modal fusion methods to capture multidimensional information and provide a more comprehensive perspective will offer new possibilities for drug–target affinity prediction research.
To address the above issues, this study proposes a deep learning model, CM-DTA, based on the cross-modal fusion of text and graphs for drug–target affinity prediction. The model consists of five key components: the drug and target representation module, the text modality processing module, the graph modality processing module, the feature fusion module, and the prediction module. First, in the drug and target representation module, the input drug and target data are preprocessed to generate four types of input data: the drug’s SMILES expressions, molecular graphs (generated using the open-source software RDKit (2024.09.6) [13]), amino acid sequences of proteins (as the specific components of targets in this study), and protein graphs (predicted using the Pconsc4 program [14]). By introducing these four types of input data, this study addresses the issue of traditional methods focusing on a single modality and neglecting the interactions between multiple modalities. Secondly, two feature extraction modules are constructed, one for text data and one for graph data. The text modality is processed using a module based on Gated Recurrent Units (GRU), while the graph modality is handled using a Graph Isomorphism Network (GIN) [15] with a multi-perceptive neighborhood self-attention aggregation strategy. This allows for a deeper extraction of features from both the text and graph modalities, solving the issue of traditional methods’ inability to adequately capture graph modality feature information. The features extracted by these two modules better support the subsequent fusion of drug and target features. In the feature fusion stage, a cross-modal bidirectional adaptive guided fusion module is designed. This module integrates the drug’s text feature vector with the graph feature vector, as well as the target’s text feature vector with the graph feature vector, combining them pairwise to generate two fused feature vectors, each weighted and guided by different modalities. This strategy addresses the problem of insufficient information transfer in cross-modal feature fusion, allowing the model to more accurately capture key information between modalities. Finally, in the prediction module, an explicit prediction strategy based on multi-head collaborative attention is applied. This module performs interactions between the two fused feature vectors of the drug and the target, concatenates them, and inputs the result into a multi-layer, fully connected neural network to complete the affinity prediction. This method effectively addresses the limitation of traditional methods that rely only on simple feature concatenation, thereby enhancing the modeling of interactions between drug and target features, significantly improving prediction performance and biological interpretability.
The main contributions of this study are as follows:
- Design of a cross-modal fusion model for drug-target affinity prediction: A novel prediction model is developed based on the fusion of textual and graphical features. This study is the first in the field of affinity prediction to integrate textual and graph modalities for enhanced input feature representation. 
- Proposed multi-perceptive neighborhood self-attention aggregation strategy: Applied to the GIN, this strategy extends the traditional fixed weight allocation for first-order neighborhood aggregation to simultaneously capture both first-order and second-order neighborhood information. It dynamically adjusts aggregation weights based on neighboring node features, thereby enhancing the model’s structural perception capabilities. 
- Proposed cross-modal bidirectional adaptive guided fusion strategy: This strategy is implemented in the feature fusion module to calculate interaction weights by guiding attention between the two modalities. This enables textual and graphical features to focus on each other’s relevant information, achieving efficient cross-modal information fusion. 
- Proposed explicit prediction strategy based on multi-head collaborative attention: This strategy is applied in the prediction module, leveraging similarity matrices and parallel subspaces of multi-head attention to enable deep interaction between drug and protein features. This approach not only enhances predictive performance but also improves biological interpretability by making the modeling process more intuitive. 
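The idea behind the aggregation strategy in the list above, attending over first-order and second-order neighbors with feature-dependent weights, can be illustrated with a minimal sketch. It is not the CM-DTA implementation: the dot-product scoring, the single shared softmax over both neighborhoods, and the names `neighbors_within_two_hops` and `attention_aggregate` are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def neighbors_within_two_hops(adj, node):
    """Return (first_order, second_order) neighbor sets of `node`."""
    first = set(adj[node])
    second = set()
    for nb in first:
        second.update(adj[nb])
    second -= first          # keep only nodes that are exactly two hops away
    second.discard(node)
    return first, second


def attention_aggregate(features, adj, node):
    """Weight 1-hop and 2-hop neighbor features by a softmax over
    dot-product scores (an assumed scoring function) and sum them."""
    first, second = neighbors_within_two_hops(adj, node)
    candidates = sorted(first | second)
    if not candidates:
        return list(features[node])
    weights = softmax([dot(features[node], features[c]) for c in candidates])
    out = [0.0] * len(features[node])
    for w, c in zip(weights, candidates):
        for k, value in enumerate(features[c]):
            out[k] += w * value
    return out


# Hypothetical path graph 0-1-2-3 with 2-dimensional node features.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0], 3: [0.5, 0.5]}
first, second = neighbors_within_two_hops(adj, 0)
aggregated = attention_aggregate(features, adj, 0)
```

For node 0 the weights depend on how similar each neighbor's features are to its own, in contrast to the fixed weights of plain first-order aggregation.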
  3. Experiments
  3.1. Datasets
We conducted experiments using two benchmark datasets, Davis [17] and KIBA [18], listed in Table 1, to train our model and evaluate its performance. In the field of drug–target affinity prediction, these two datasets are widely used. The Davis dataset compiles specific kinase proteins and their corresponding inhibitors, containing 68 drugs, 442 proteins, and 30,056 drug–target interactions. The average length of drug SMILES strings is 64, and the average length of protein amino acid sequences is 788. This dataset represents the affinity between drugs and targets using the kinase dissociation constant. Furthermore, the $K_d$ values are transformed into $pK_d$ values with the calculation formula as follows:
$$pK_d = -\log_{10}\left(\frac{K_d}{10^{9}}\right)$$
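As a small worked example, the log transform commonly applied to the Davis dataset, $pK_d = -\log_{10}(K_d / 10^{9})$ with $K_d$ in nanomolar, can be sketched as follows. The function name `kd_to_pkd` and the sample values are illustrative assumptions, not taken from the paper.

```python
import math


def kd_to_pkd(kd_nanomolar):
    """Convert a dissociation constant in nM to pKd = -log10(Kd / 1e9)."""
    if kd_nanomolar <= 0:
        raise ValueError("Kd must be positive")
    return -math.log10(kd_nanomolar / 1e9)


# Hypothetical values: a weak binder (10,000 nM) and a strong binder (1 nM).
pkd_weak = kd_to_pkd(10000.0)
pkd_strong = kd_to_pkd(1.0)
```

A smaller $K_d$ (tighter binding) maps to a larger $pK_d$, which compresses the wide dynamic range of raw $K_d$ values into a scale better suited to regression.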
The KIBA dataset gathers the bioactivity data of kinase inhibitors, including inhibition constant, dissociation constant, and half-maximal inhibitory concentration. It contains 2111 drugs, 229 proteins, and 118,254 drug–target interactions. The average length of drug SMILES strings is 58, and the average length of protein amino acid sequences is 728.
Additionally, we introduce a multi-fold cross-validation mechanism. The dataset is divided into five equal-sized subsets. Whenever one subset is selected as the validation set, the remaining four subsets serve as the training set. This mechanism effectively enhances data utilization and, by reducing the evaluation bias caused by data partitioning, helps to provide a more accurate estimate of the model’s generalization performance.
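The five-fold scheme described above can be sketched in plain Python. This is a minimal illustration under assumptions: the shuffling seed, the interleaved fold assignment, and the name `five_fold_splits` are not from the paper, which does not state how samples are assigned to folds.

```python
import random


def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, validation_indices) for each fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducibility
    # Interleaved slicing gives folds whose sizes differ by at most one.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        validation = folds[k]
        train = [idx for j, fold in enumerate(folds) if j != k for idx in fold]
        yield train, validation


# 30,056 is the number of Davis drug-target interactions quoted above.
splits = list(five_fold_splits(30056))
```

Each sample appears in exactly one validation fold, so every interaction contributes to both training and evaluation across the five runs.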
  3.2. Data Preprocessing
In the drug representation section, this study conducts modeling and analysis based on the textual representation and molecular graph of drug molecules, as shown in Figure 6. The SMILES expression is used as the textual representation of drug molecules in this study. SMILES expressions represent complex chemical structures, such as heavy atoms or valence electrons, using strings composed of letters or numbers, making them suitable for feature extraction by deep learning models such as convolutional neural networks. To restore and extract the true structural information of drug molecules, the open-source cheminformatics software RDKit is used to convert SMILES expressions into corresponding molecular graphs. In these graphs, each node is represented by a multi-dimensional feature vector that includes five types of information: atomic symbol, the number of adjacent atoms, the number of adjacent hydrogen atoms, the atom’s implicit valence, and whether the atom is part of an aromatic structure. This intuitive topological information can be effectively extracted by deep learning models such as graph convolutional networks.
In the target representation section, this study conducts modeling and analysis based on protein amino acid textual sequences and protein graphs, as shown in Figure 4. Each target protein is expressed through its corresponding amino acid sequence via the translation process. ASCII characters are used as the textual expression for amino acid sequences, where each amino acid type is denoted by its one-letter code and mapped to a unique integer. Similar to drug SMILES expressions, features of these amino acid textual sequences can be effectively extracted by deep learning models such as convolutional neural networks. To acquire protein structural information, such as the angles and distances between different residue pairs, the prediction tool Pconsc4 is used to predict residue contact maps from amino acid sequences. After filtering to select contact information with higher confidence, the adjacency matrix of the protein graph is ultimately constructed.
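The integer encoding of amino acid sequences mentioned above can be sketched as a simple lookup. The alphabet ordering, the use of 0 for padding and unknown residues, and the name `encode_sequence` are assumptions for illustration; the paper does not specify its exact mapping.

```python
# The 20 standard amino acids by one-letter code (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Index 0 is reserved for padding and unknown residues (an assumption).
AA_TO_INT = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def encode_sequence(sequence, max_len):
    """Map a protein sequence to a fixed-length list of integers."""
    codes = [AA_TO_INT.get(aa, 0) for aa in sequence[:max_len]]
    return codes + [0] * (max_len - len(codes))  # right-pad with zeros


encoded = encode_sequence("MKV", 5)
```

Truncating and padding to a fixed length lets sequences of different sizes be batched together before they are passed to a sequence model.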
The contact map is a matrix form expression of the residue contact graph, not the final protein graph itself. The matrix size is $L \times L$, where $L$ is the length of the protein sequence. The element in row $i$ and column $j$ of the matrix indicates whether the $i$-th and $j$-th residues in the sequence are in contact. The definition of contact is typically understood as follows: if the Euclidean distance between the $C_\beta$ atoms of two residues ($C_\alpha$ for glycine, as it does not have a $C_\beta$) is less than a specified threshold, they are considered to be in contact.
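The distance-based definition of contact can be illustrated with a short sketch that turns per-residue coordinates into a binary contact matrix. The 8 Å threshold is a commonly used value and an assumption here, as are the function name `contact_map` and the toy coordinates. In the pipeline described above, the map is predicted by Pconsc4 rather than computed from known coordinates.

```python
import math


def contact_map(coords, threshold=8.0):
    """Build a symmetric L x L binary contact matrix from 3D coordinates.

    coords holds one (x, y, z) tuple per residue, taken at the C-beta atom
    (C-alpha for glycine). The threshold is in angstroms (assumed value).
    """
    n = len(coords)
    contacts = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < threshold:
                contacts[i][j] = 1
                contacts[j][i] = 1
    return contacts


# Hypothetical three-residue example: residues 0 and 1 are 5 angstroms apart.
toy_coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
toy_contacts = contact_map(toy_coords)
```

The diagonal is left at zero in this sketch; whether self-loops are added when the matrix is used as a graph adjacency matrix is a separate modeling choice.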
  3.3. Evaluation Metrics
Drug–target affinity prediction is considered a regression task. To ensure a fair performance comparison with previous models, we selected three identical evaluation metrics to assess our model, including Mean Squared Error (MSE), Concordance Index (CI), and Regression towards the Mean Index ($r_m^2$).
MSE is a metric used to measure the deviation between predicted values and actual values, calculated using the squared loss function. The specific calculation formula is as follows:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(p_i - y_i\right)^2$$
where $p_i$ represents the predicted value, $y_i$ is the actual value, and $n$ is the total number of drug–target pairs in the dataset.
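As a quick numerical check, MSE is the mean of the squared differences between predicted and actual values. The sketch below uses made-up affinities, and the function name `mean_squared_error` is chosen for illustration.

```python
def mean_squared_error(predicted, actual):
    """Average of squared differences between predictions and labels."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(predicted, actual)) / len(predicted)


# Hypothetical affinities: squared errors are 0, 1 and 4, so the mean is 5/3.
mse = mean_squared_error([5.0, 6.0, 7.0], [5.0, 7.0, 9.0])
```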
CI is used to evaluate whether the ranking of predicted values for two randomly selected drug–target pairs is consistent with the ranking of their actual values. The specific calculation formula is as follows:
$$\mathrm{CI} = \frac{1}{Z}\sum_{\delta_i > \delta_j} h\left(b_i - b_j\right), \qquad h(x) = \begin{cases} 1, & x > 0 \\ 0.5, & x = 0 \\ 0, & x < 0 \end{cases}$$
where $h(x)$ is a step function, $b_i$ represents the predicted value corresponding to the higher affinity $\delta_i$, and $b_j$ represents the predicted value corresponding to the lower affinity $\delta_j$. $Z$ is a normalization constant representing the total number of drug–target pairs involved in the analysis.
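The concordance index can be computed directly from its definition by looping over all pairs whose actual affinities differ. The sketch below assumes the usual step function (1 for a correctly ordered pair, 0.5 for tied predictions, 0 otherwise) and takes the normalization constant as the number of such comparable pairs. The names and sample values are illustrative.

```python
def step(x):
    """Step function: 1 if x > 0, 0.5 if x == 0, 0 otherwise."""
    if x > 0:
        return 1.0
    if x == 0:
        return 0.5
    return 0.0


def concordance_index(actual, predicted):
    """Fraction of comparable pairs whose predictions are correctly ordered."""
    total = 0.0
    comparable = 0
    for i in range(len(actual)):
        for j in range(len(actual)):
            if actual[i] > actual[j]:  # pair with a defined true ordering
                comparable += 1
                total += step(predicted[i] - predicted[j])
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return total / comparable


# Hypothetical example: one of the six comparable pairs is mis-ordered.
ci = concordance_index([5.0, 6.0, 7.0, 8.0], [5.1, 6.2, 6.0, 8.3])
```

This quadratic loop is fine for illustration. On datasets the size of KIBA, a vectorized or sort-based implementation would be preferable.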
The $r_m^2$ metric is used to assess the external predictive capability of the model. The specific calculation formula is as follows:
$$r_m^2 = r^2\left(1 - \sqrt{r^2 - r_0^2}\right)$$
where $r^2$ is the squared correlation coefficient between the actual values and the predicted values including the intercept, and $r_0^2$ is the squared correlation coefficient between the actual values and the predicted values without including the intercept.
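One common way to compute this metric is sketched below, assuming the form $r_m^2 = r^2\,(1 - \sqrt{|r^2 - r_0^2|})$, where $r^2$ is the squared Pearson correlation and $r_0^2$ comes from a least-squares fit through the origin. The absolute value guards against a slightly negative argument and is an assumption. Published implementations differ in small details, so this should not be read as the exact procedure used in the paper.

```python
import math


def squared_pearson(actual, predicted):
    """Squared Pearson correlation (regression with intercept)."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_a * var_p)


def squared_through_origin(actual, predicted):
    """Coefficient of determination for a fit forced through the origin."""
    slope = sum(a * p for a, p in zip(actual, predicted)) / sum(p * p for p in predicted)
    mean_a = sum(actual) / len(actual)
    residual = sum((a - slope * p) ** 2 for a, p in zip(actual, predicted))
    total = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - residual / total


def rm2(actual, predicted):
    r2 = squared_pearson(actual, predicted)
    r02 = squared_through_origin(actual, predicted)
    return r2 * (1.0 - math.sqrt(abs(r2 - r02)))


# Hypothetical values: a perfect prediction and a noisy one.
rm2_perfect = rm2([5.0, 6.0, 7.0, 8.0], [5.0, 6.0, 7.0, 8.0])
rm2_noisy = rm2([5.0, 6.0, 7.0, 8.0], [5.1, 6.2, 6.0, 8.3])
```

The metric penalizes models whose correlation depends on a non-zero intercept, so it drops when predictions are systematically offset from the actual values.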
  3.4. Hyperparameter Settings and Tuning Experiments
Detailed information on the model’s hyperparameter settings is provided in Table 2.
This study conducted tuning experiments on the batch size, the learning rate, the learnable parameter, and the drug–target interaction modeling approach (mentioned in the previous chapter). The detailed results for batch size and learning rate are shown in Table 3. These experiments confirmed the significant impact of batch size and learning rate on model performance, and the optimal configuration (batch size of 512 and learning rate of 0.001) was selected for further model training.
The study conducted experiments with different values of the learnable parameter, as shown in Table 4. The model exhibited the best performance when the parameter was set to 0.5, indicating that smaller values help to enhance the capture of drug–target interaction features and further optimize the model’s prediction accuracy.
This study conducted experiments using both softmax inner product similarity and cosine similarity, as shown in Table 5. The results indicate that the softmax method performs better in drug–target interaction modeling. These findings validate the choice of inner product similarity over cosine similarity for this task, as inner product similarity more effectively captures the interactions between the drug and target.
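To make the difference between the two similarity measures concrete, the sketch below computes both for a pair of toy feature sets. It assumes that "softmax inner product similarity" means a row-wise softmax over drug–target inner products, as in standard attention; the paper may define it differently, and all names and values are illustrative.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_inner_product(drug_vecs, target_vecs):
    """Row-wise softmax over inner products (assumed attention-style form)."""
    return [softmax([dot(d, t) for t in target_vecs]) for d in drug_vecs]


def cosine_similarity(drug_vecs, target_vecs):
    """Cosine similarity: inner product divided by both vector norms."""
    return [
        [dot(d, t) / (math.sqrt(dot(d, d)) * math.sqrt(dot(t, t))) for t in target_vecs]
        for d in drug_vecs
    ]


# Hypothetical 2-dimensional drug and target feature vectors.
drug_vecs = [[1.0, 0.0], [2.0, 2.0]]
target_vecs = [[1.0, 0.0], [0.0, 3.0]]
soft = softmax_inner_product(drug_vecs, target_vecs)
cos = cosine_similarity(drug_vecs, target_vecs)
```

For the second drug vector, cosine similarity scores both target vectors equally because it discards magnitude, while the softmax inner product favors the larger-norm target. This sensitivity to feature magnitude is one plausible reason the two measures can behave differently in practice.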
To ensure a fair comparison, all network models in the experiment are implemented and executed in the PyTorch framework. The hardware configuration used in this study is as follows: the operating system is Ubuntu 20.04, the CPU model is Intel(R) Xeon(R) Platinum 8352V CPU @ 2.10 GHz, the GPU model is NVIDIA GeForce RTX 4090, the GPU memory is 24 GB, the RAM is 64 GB DDR5 (32 × 2 GB), and the solid-state drive capacity is 4 TB. The deep learning framework used is PyTorch 2.2.2, the CUDA version is 12.2, the cuDNN version is 8.7, and the Python version is 3.9.19. The details of the experimental environment settings are shown in Table 6.
  3.5. Performance Comparison
To verify the performance of CM-DTA in the DTA domain, we conducted a detailed comparison with existing models, such as DeepDTA, GraphDTA, DGraphDTA, ProLiGraph [19], TransVAEDTA [20], and Tag-DTA [21], based on experiments on the Davis and KIBA datasets using the same three evaluation metrics: MSE, CI, and $r_m^2$. The performance metrics of these models are reproduced from previously published papers, and the final experimental results are shown in Figure 7.
On the Davis dataset, our model significantly outperforms the other models, with an MSE of 0.182, a CI of 0.929, and an $r_m^2$ of 0.786. Compared to the SOTA, the MSE decreased by 1.6%, the CI increased by 0.6%, and $r_m^2$ increased by 0.2%. On the KIBA dataset, our model also significantly outperforms the other models, with an MSE of 0.114, a CI of 0.917, and an $r_m^2$ of 0.806. Compared to the SOTA, the MSE decreased by 0.7%, the CI increased by 1.5%, and $r_m^2$ increased by 0.8%.
Figure 8 illustrates the comparison between true affinity and predicted values for both the Davis and KIBA datasets. The x-axis represents the ground truth, and the y-axis represents the predictions. The difference between the predicted and actual affinity of each sample is shown by the vertical distance between its point and the line $y = x$. The distributions of the predicted and actual affinities are shown by the histograms at the edges. The results show that the data points in both datasets tend to be symmetric around $y = x$, with a denser distribution around $y = x$ in the KIBA dataset.
 The experimental results demonstrate that by leveraging the characteristics of different data types, adopting suitable network modeling strategies, and introducing appropriate cross-modal feature fusion and final prediction strategies, the overall performance of the model is significantly improved. Specifically, the multi-perceptive neighborhood self-attention aggregation strategy enables the graph modality to capture the structural information of drugs and targets with greater precision, thereby enhancing the model’s structure-awareness capabilities. Meanwhile, the cross-modal bidirectional adaptive guided fusion strategy achieves bidirectional information guidance between the text and graph modalities, allowing features from both modalities to focus on key information. This improves the complementarity between modalities and enriches the expressiveness of feature representations. In addition, the explicit prediction strategy based on multi-head collaborative attention facilitates deep interaction between drug and protein features through a similarity matrix and parallel subspaces of multi-head collaboration, enhancing both prediction accuracy and biological interpretability. Compared to existing models, CM-DTA demonstrates superior performance on the Davis and KIBA datasets. These results indicate that the advantages of CM-DTA in feature representation and information complementarity make it a more competitive approach for drug–target affinity prediction.
  3.6. Ablation Study
To evaluate the impact of each module on model performance, this study conducted a systematic ablation experiment on the Davis and KIBA datasets, with the results presented in Table 7. In the experimental setup, MA refers to the GIN model enhanced by the multi-perceptive neighborhood self-attention aggregation strategy, MB represents the cross-modal bidirectional adaptive guided fusion module, and SMB denotes the cross-modal single-guided adaptive fusion module (including SMB1, where text guides the graph, and SMB2, where the graph guides the text). MC stands for the Multi-Head Collaborative Attention Explicit Prediction Strategy, while RE represents the introduction of residual connections, applied to the GIN, MA, and MB modules. The key findings are summarized as follows:
Single-Modality Baselines (Models 1 and 2): Using GRU for text modality (Model 1) and GIN for graph modality (Model 2) independently achieves limited performance, with higher MSE and lower $r_m^2$ and CI values. This highlights that single-modality features cannot fully capture the complex interactions between drugs and targets.
Effect of MA (Model 5): Incorporating the MA in GIN significantly improves performance over the basic GIN (Model 2), reducing MSE and enhancing CI values. This demonstrates the importance of capturing both first-order and second-order neighborhood information for better structural representation.
Basic Cross-Modal Fusion (Model 6): Combining text and graph features using GRU and GIN improves performance compared to single modalities (Models 1 and 2). However, this basic concatenation approach lacks deep interaction modeling, leading to limited gains compared to advanced fusion strategies.
Advanced Cross-Modal Fusion (Models 7–9): Introducing MB (Model 8) yields better results than SMB (Model 9) and basic fusion (Model 6). The bidirectional mechanism enables more effective interaction between modalities, enhancing complementary information exchange.
Effect of Explicit Prediction (Model 10): Adding the MC further improves performance, as it deepens the interaction between drug and protein features while increasing biological interpretability.
Final Model (Model 13): The integration of all modules (GRU+MA+MB+RE+MC) achieves the best results on both datasets, with the lowest MSE and highest CI values. The RE effectively stabilizes training and enhances generalization, while the combination of MA, MB, and MC captures fine-grained interactions and improves predictive performance. In summary, each module contributes to model performance, with the full integration of MA, MB, MC, and RE achieving the strongest results by capturing both intra- and inter-modality relationships and effectively leveraging complementary information.
  4. Discussion
  4.1. Comparison Between Existing Methods and Cross-Modal Fusion of Text and Graph Methods
Cross-modal methods offer significant advantages over traditional experimental methods and computational methods. Traditional experimental methods rely on experimental data, have low computational efficiency, are suitable for small-scale datasets, and face high costs and long timeframes. While computational methods improve computational efficiency, they depend on manual annotation and feature extraction, which limits their applicability. In contrast, cross-modal deep learning methods can handle large-scale datasets, and, by integrating text and graph features, capture complex structural relationships and interactions. These methods offer greater flexibility and prediction accuracy, achieving better performance in a wider range of tasks, as shown in Table 8.
  4.2. Strengths and Limitations
This study proposes an efficient cross-modal feature fusion model, CM-DTA, for drug–target affinity prediction. CM-DTA is based on the multi-perceptive neighborhood self-attention aggregation strategy, which captures both first-order and second-order neighborhood information to enhance the structural perception of the graph modality. It also incorporates the cross-modal bidirectional adaptive guided fusion strategy, establishing effective interactions between the text and graph modalities, enabling features to focus on each other’s key information for efficient cross-modal information fusion. Furthermore, the model leverages the explicit prediction strategy based on multi-head collaborative attention, which deeply explores the complex relationships between drug and target features, improving predictive performance and biological interpretability. Experimental results on the Davis and KIBA datasets demonstrate that CM-DTA significantly outperforms the current SOTA models. However, the model’s performance may be influenced by the quality and size of the training data, as it relies heavily on large, well-labeled datasets for accurate predictions. Additionally, while the model effectively captures cross-modal interactions, it may still struggle with extremely noisy or incomplete data, potentially affecting its robustness.
  4.3. Potential Application Areas
The cross-modal feature fusion approach proposed in CM-DTA has significant potential in various domains of drug discovery and bioinformatics. Primarily, it can be applied to drug–target affinity prediction, where its ability to integrate textual and graphical information from drug molecules and protein sequences enables more accurate and interpretable predictions compared to traditional methods. Beyond drug–target interactions, the methodology can extend to other areas such as drug repurposing, where identifying similarities between existing drugs and new disease targets is crucial. Furthermore, the model’s capacity to handle multi-modal data makes it applicable to complex biomedical tasks, such as protein–protein interaction prediction, biomarker discovery, and personalized medicine, where integrating diverse data sources is essential for making informed decisions. Given its robustness in capturing intricate relationships across different data types, CM-DTA offers a versatile tool for advancing precision medicine and accelerating the drug development process.
  4.4. Future Directions
Although the proposed CM-DTA model demonstrates promising results, there are several areas for further improvement and exploration. First, the model’s reliance on large-scale, well-labeled datasets may limit its performance in scenarios where data are sparse or noisy. Future research could focus on developing more robust methods for training with limited or noisy data, possibly through semi-supervised learning or transfer learning techniques, to enhance its generalization ability. Secondly, the current model primarily focuses on drug–target affinity prediction, but its applicability could be extended to other biomedical tasks, such as drug combination prediction, adverse drug reaction prediction, and patient-specific treatment recommendations. Integrating additional biological data, such as genomic and proteomic information, could further improve the model’s performance and make it more versatile. Furthermore, improving the interpretability of the model is a key area for future development. Methods that allow users to better understand and visualize the decision-making process of the model would help to foster trust in its predictions, especially in clinical settings. To this end, an intuitive visualization interface that displays the prediction process and results could help users better understand and use the model. Finally, reducing the computational complexity of the model and optimizing its efficiency and scalability will be an important direction for the future. This will not only improve the model’s ability to work with larger datasets but also ensure its efficiency and stability in real-world applications.