1. Introduction
Bringing a novel therapy to market remains expensive and risky: median capitalized R&D costs are often estimated at roughly USD 1 billion per approval once failures and the cost of capital are accounted for [1,2]. Oncology case studies, in particular, continue to highlight very high single-asset expenditures [3]. In this context, computational models that can prioritize promising drug–protein pairs before wet-lab screening are an attractive way to reduce costs and accelerate iteration cycles. Regression-based drug–target binding affinity (DTBA) modeling is a cornerstone of this field, offering a continuous measure of interaction strength that integrates seamlessly into workflows for hit ranking, lead optimization, and virtual screening.
Over the past ten years, a handful of kinase-focused benchmarks have become the go-to standards for assessing DTBA algorithms. The Davis dataset [4] includes 68 inhibitors and 442 kinases, covering 30,056 experimentally determined Kd values, and is typically converted to pKd (−log10 Kd [M]) for regression purposes. The KIBA benchmark [5] spans 2111 drugs and 229 kinases across 118,254 interactions, with the KIBA score consolidating Ki, Kd, and IC50 measurements into a unified, monotonic affinity metric. In both datasets, small-molecule structures are generally drawn from PubChem [6] while protein sequences are sourced from UniProt [7], ensuring high-quality identifiers and sequences well-suited for large-scale featurization. While these datasets provide a controlled setting for comparing methods, they also highlight recurring challenges in DTBA research: data scarcity, heterogeneous input modalities, and strong structural similarities among compounds and targets.
From a methodological point of view, DTBA modeling spans several families. Similarity- and kernel-based methods, such as KronRLS, operate on precomputed drug and target similarity matrices and use regularized kernel ridge regression to learn from sparse interaction tables [8]. Feature-based boosting methods such as SimBoost construct rich libraries of entity and network descriptors and train gradient boosting machines on top of them [9]. Deep sequence models encode SMILES and FASTA directly via convolutional, recurrent, or Transformer architectures (e.g., DeepDTA, MolTrans, and TransformerCPI), while graph neural networks treat ligands as atom–bond graphs and learn structure-aware representations (e.g., GraphDTA) [10,11,12,13]. More recently, multi-modal pipelines have begun to integrate hand-crafted descriptors, similarity features, and pretrained embeddings within a unified learning framework, often relying on tree ensembles or shallow neural networks to handle the resulting tabular space.
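As a brief illustration of the pKd transform mentioned for the Davis dataset, the sketch below converts a dissociation constant to pKd = −log10(Kd [M]). It assumes the raw Kd values are given in nanomolar; the function name `kd_nm_to_pkd` is hypothetical and not taken from any released code.

```python
import math


def kd_nm_to_pkd(kd_nm: float) -> float:
    """Convert a dissociation constant in nanomolar to pKd = -log10(Kd [M]).

    Assumption: the input unit is nM, so it is rescaled to molar first.
    """
    if kd_nm <= 0:
        raise ValueError("Kd must be strictly positive")
    kd_molar = kd_nm * 1e-9  # nM -> M
    return -math.log10(kd_molar)


if __name__ == "__main__":
    # A tighter binder (smaller Kd) receives a larger pKd.
    for kd_nm in (10.0, 10000.0):
        print(kd_nm, round(kd_nm_to_pkd(kd_nm), 3))
```

Because the transform is monotone decreasing in Kd, ranking pairs by pKd reverses the ranking by Kd, which is why larger regression targets correspond to stronger binding.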
At the same time, advances in protein language modeling have provided powerful new representations for targets. Large-scale models such as ESM-1b and ESM-2 are trained on hundreds of millions of sequences and produce contextual embeddings that implicitly capture structural and functional information [14,15]. When combined with traditional sequence descriptors (e.g., composition, autocorrelations, physicochemical profiles) and rich ligand fingerprints and descriptors, these embeddings offer a promising basis for multi-view DTBA modeling. However, naively concatenating all available features leads to extremely high-dimensional input spaces, which can exacerbate overfitting, slow training, and make it difficult to interpret model behavior, especially on comparatively small benchmarks such as Davis and KIBA.
DMFF-DTA [16] is a dual-modality deep learning framework that integrates sequence-based representations with binding-site-aware structural graph features derived from AlphaFold2 to improve drug–target affinity prediction on benchmark datasets such as Davis and KIBA, and it reports competitive performance and enhanced interpretability. However, the authors evaluate their model with a five-fold cross-validation protocol in which the test set is drawn at random from the entire dataset in each fold, so test instances are repeatedly resampled.
This evaluation approach differs from the one used in DeepDTA-based studies and in our own work, where a fixed, predefined test split is applied consistently across all methods to ensure a standardized and fair benchmark. Including this model in direct comparisons with ours would therefore be inappropriate, given the variability introduced by its randomly sampled test sets and its reliance on protein structural information.
DeepDTAGen [17] is a multitask deep learning framework built to jointly predict drug–target binding affinity and generate target-aware drug molecules within a single unified architecture. The framework draws on shared feature representations extracted from drug molecular graphs and protein sequences. For evaluation, the authors follow a cross-validation protocol in which each dataset is partitioned into folds, with one fold used as the test set and the remaining folds forming the training set, so that test samples are separated from the training data but still derived from the overall dataset distribution. This evaluation strategy contrasts with DeepDTA-based studies and our approach, where a predefined independent test split is maintained across methods, enabling direct comparability under a fixed benchmark rather than repeated dataset-derived splits. In addition, this model is difficult to interpret because it is a multitask deep learning framework. Still, when we compare our results with this method, we find that our results on the Davis dataset are much better, while the results on the KIBA dataset are comparable.
In this work, we introduce DTBAffinity, a multi-modal regression framework that operates squarely in this “small-n, large-p” regime. DTBAffinity integrates (i) chemically meaningful ligand descriptors (RDKit and Mordred) [18,19], structural keys (MACCS and FP4), circular fingerprints (ECFP/Morgan) [20,21,22], and SMILES-derived [23] features from iFeatureOmega-Drug [24], with (ii) thousands of sequence-derived protein [25,26,27] descriptors from iFeatureOmega-Protein and contextual embeddings from ESM-1b/ESM-2. The resulting modality-specific feature matrices are sanitized, variance-filtered, z-score scaled, and subjected to univariate screening before concatenation and modeling with regularized XGBoost ensembles [28,29,30,31]. We evaluate DTBAffinity on Davis and KIBA under standard splits and metrics, and show that it achieves state-of-the-art or competitive performance against widely used baselines such as KronRLS, SimBoost, DeepDTA, and GraphDTA, while retaining a relatively simple and interpretable learning core. To make the role of DTBAffinity within the broader DTBA landscape explicit, we next formalize the problem setting and summarize the main contributions of this study.
Problem Definition and Contributions
Let $\mathcal{D} = \{(d_i, p_i, y_i)\}_{i=1}^{N}$ denote a set of drug–protein pairs, where $d_i$ is a small-molecule ligand, $p_i$ is a protein (typically a kinase) and $y_i \in \mathbb{R}$ is a continuous affinity label (pKd in Davis or KIBA score in KIBA). The aim of DTBA prediction is to learn a function $f_\theta$ parameterized by $\theta$ that minimizes a regression loss over observed interactions and generalizes to unseen drug–target combinations. In practice, $f_\theta$ is trained on a finite subset of labeled pairs and evaluated on held-out pairs using metrics such as mean squared error (MSE), coefficient of determination ($R^2$), Pearson and Spearman correlations, Concordance Index (CI), and $r_m^2$ [32,33,34].
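To make two of the evaluation metrics concrete, the following minimal sketch computes MSE and the Concordance Index in plain Python. It follows the common pairwise definition of CI (the fraction of pairs with different true affinities whose predictions are ordered the same way, with tied predictions counted as 0.5); the function names are illustrative, and the quadratic pair loop is meant for clarity rather than for benchmark-scale data.

```python
from itertools import combinations


def mean_squared_error(y_true, y_pred):
    """Average squared difference between labels and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def concordance_index(y_true, y_pred):
    """Pairwise CI: correctly ordered pairs / comparable pairs.

    A pair is comparable when its two true labels differ.
    Tied predictions on a comparable pair contribute 0.5.
    """
    concordant = 0.0
    comparable = 0
    for (t_i, p_i), (t_j, p_j) in combinations(zip(y_true, y_pred), 2):
        if t_i == t_j:
            continue  # equal labels carry no ranking information
        comparable += 1
        if t_i < t_j:
            # Orient the pair so that p_i belongs to the larger true label.
            p_i, p_j = p_j, p_i
        if p_i > p_j:
            concordant += 1.0
        elif p_i == p_j:
            concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs: all true labels are equal")
    return concordant / comparable


if __name__ == "__main__":
    y_true = [5.0, 6.0, 7.0, 8.0]
    y_pred = [5.1, 6.2, 6.1, 8.3]
    print(mean_squared_error(y_true, y_pred))
    print(concordance_index(y_true, y_pred))
```

MSE measures absolute calibration of the predicted affinities, whereas CI only depends on their ordering, so the two can disagree: a model with a constant offset has a poor MSE but an unchanged CI.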
Within this setting, our work makes four main contributions: (i) a multi-modal, chemically grounded featurization scheme in which we construct a comprehensive feature library that combines ligand fingerprints (MACCS, FP4, and ECFP/Morgan), 2D physicochemical and topological descriptors (RDKit and Mordred) and SMILES-derived iFeatureOmega-Drug descriptors with protein sequence descriptors from iFeatureOmega-Protein and contextual embeddings from ESM-1b/ESM-2, yielding a unified multi-view representation of both interaction partners; (ii) a scalable feature-selection and boosting pipeline that performs modality-specific sanitization, variance filtering, z-score scaling and univariate SelectKBest screening, followed by optional global capping and regularized XGBoost regression, explicitly targeting the high-dimensional “small-n, large-p” nature of Davis and KIBA while remaining easy to implement with open-source tools; (iii) an extensive evaluation and ablation study on the standard Davis and KIBA benchmarks, including intra-model ablations over feature combinations and comparisons against widely cited baselines (KronRLS, SimBoost, DeepDTA, GraphDTA, and others) under a common evaluation protocol, where DTBAffinity achieves state-of-the-art or competitive performance across MSE, CI, $r_m^2$ and AUPR; and (iv) the public release of all code for data preprocessing, feature generation, model training and evaluation, together with pre-computed feature matrices and train/validation/test splits for both datasets, enabling future work to benchmark new models against DTBAffinity using exactly the same inputs and splits.
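The per-modality screening steps named in contribution (ii) can be sketched in a few lines of dependency-free Python. This is an illustration under stated assumptions, not the released pipeline: the helper names are hypothetical, the variance threshold and k are placeholders, and features are ranked by absolute Pearson correlation with the label as a stand-in for the SelectKBest score function, which the text does not specify. For a single feature, ranking by |r| gives the same order as an F-test on a univariate linear fit.

```python
import math


def column(X, j):
    return [row[j] for row in X]


def mean(v):
    return sum(v) / len(v)


def variance(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / len(v)


def zscore(v):
    m, s = mean(v), math.sqrt(variance(v))
    return [(x - m) / s for x in v]


def abs_pearson(u, v):
    mu, mv = mean(u), mean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    norm = math.sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
    return abs(cov / norm) if norm > 0 else 0.0


def screen_modality(X, y, k, var_threshold=0.0):
    """Variance filter -> z-score scaling -> keep the k best-scoring features.

    Returns the kept column indices (ascending) and the scaled, reduced matrix.
    """
    n_features = len(X[0])
    # 1) Drop (near-)constant columns; survivors have variance > 0, so scaling is safe.
    survivors = [j for j in range(n_features) if variance(column(X, j)) > var_threshold]
    # 2) Z-score scale each surviving column.
    scaled = {j: zscore(column(X, j)) for j in survivors}
    # 3) Univariate screening: rank by |Pearson r| with the label (assumed score).
    ranked = sorted(survivors, key=lambda j: abs_pearson(scaled[j], y), reverse=True)
    kept = sorted(ranked[:k])
    X_selected = [[scaled[j][i] for j in kept] for i in range(len(X))]
    return kept, X_selected


if __name__ == "__main__":
    # Toy modality: column 0 is constant, columns 1 and 3 track the label.
    X = [[1, 2, 3, 5], [1, 4, 1, 4], [1, 6, 4, 3], [1, 8, 1, 2], [1, 10, 5, 1]]
    y = [1, 2, 3, 4, 5]
    kept, X_selected = screen_modality(X, y, k=2)
    print(kept)
```

In a real experiment the means, standard deviations, and selected indices would be computed on the training split only and then applied unchanged to validation and test data; the screened matrices of the individual modalities can then be concatenated row by row before model fitting.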
4. Results and Discussion
DTBAffinity was evaluated on the Davis and KIBA datasets using the same standard test set as prior work, and performance is reported across multiple regression and ranking metrics. Table 5 and Table 6 summarize the results for each feature combination on the KIBA and Davis datasets, respectively, while Figure 2 and Figure 3 provide visualizations of predicted vs. actual affinities and residuals for both datasets. Unless otherwise noted, the baseline results reported in Table 6 and Table 7 were obtained from the respective original publications and were not re-run within our experimental pipeline. The comparison remains consistent because DeepDTA provides predefined dataset indices, one set for training and one for testing, so the same test set applies to every method. All methods in our comparison used this independent test set; the SimBoost and KronRLS results are those obtained under this setup for comparison with DeepDTA. The comparison on this test set is therefore valid: it is a standard benchmark adopted by prior studies and supports fair and consistent evaluation.
4.1. KIBA Dataset
On the KIBA dataset, DTBAffinity matches or surpasses baseline methods in both regression accuracy and ranking performance. As shown in Table 5, the best-performing configuration, which combines iFeatureOmega-Protein, ESM embeddings, and a comprehensive ligand feature set encompassing RDKit, Mordred, and ECFP/Morgan descriptors, achieves MSE = 0.1540, Pearson’s r = 0.8799, Spearman’s ρ = 0.8648, CI = 0.8686, and AUPR = 0.8361. This configuration is competitive with or superior to other state-of-the-art models, including KronRLS, SimBoost, and DeepDTA, across all evaluated metrics. These findings indicate that the multi-view feature combination paired with a regularized XGBoost ensemble is effective for DTBA prediction on large-scale datasets.
4.2. Davis Dataset
DTBAffinity also performs strongly on the Davis dataset. As shown in Table 6, the best configuration, which combines iFeatureOmega-Protein and ESM-1b/ESM-2 embeddings with a rich set of ligand descriptors, achieves MSE = 0.1885, CI = 0.9102, and AUPR = 0.8112, with Pearson’s r and Spearman’s ρ as listed in Table 6. These results outperform prior methods, including DeepDTA, SimBoost, and GraphDTA, particularly in terms of MSE and AUPR. The improvement in CI and AUPR indicates that DTBAffinity not only predicts binding affinities with high accuracy but also ranks active and inactive drug–target pairs more effectively than the baselines. Figure 4 and Figure 5 further illustrate the prediction quality of DTBAffinity: the predicted values closely match the actual affinity values (predicted vs. actual) and the residuals are small (residuals vs. actual), with only a slight increase in variance at higher affinities (see Table 6).
4.3. Results of Ablation Studies
We also perform an ablation study to analyze the contribution of each feature family. As shown in Table 5 and Table 6, the performance consistently improves as additional feature families (e.g., ESM embeddings and iFeatureOmega-Drug) are added to the model. The most notable performance gains are observed when iFeatureOmega-Protein is combined with ESM embeddings, as this pairing yields a more comprehensive representation of protein sequences. Ligand features also play a meaningful role in boosting performance, with the incorporation of circular Morgan fingerprints and RDKit descriptors contributing the most significant improvements (see Table 4 and Table 5).
We begin by examining calibration and residual behavior (Figure 2 and Figure 3), followed by a comparative analysis against baseline methods across MSE, CI, $r_m^2$, and AUPR metrics (Figure 4 and Figure 5). The corresponding numerical results are detailed in Table 4 through Table 7. As illustrated in Figure 5, DTBAffinity’s predictions on the KIBA dataset align closely with the identity line (see Table 8), with residuals close to zero and only a slight increase in variance at higher affinity levels, a pattern that is consistent with the trends reported in Table 5.
4.4. Performance Assessment in Cold-Start Scenarios
Table 9 shows the performance of DTBAffinity on the KIBA dataset under the fixed split used by DeepDTA, the cold-drug split, and the cold-target split.
Table 10 shows the performance of DTBAffinity on the Davis dataset under the fixed split, the cold-drug split, and the cold-target split.
To more thoroughly evaluate the generalization capability of the proposed DTBAffinity model beyond the conventional fixed split setting, we carried out additional experiments under two cold-start scenarios: cold-drug and cold-target evaluation. In the cold-drug setting, 10% of unique drugs were exclusively reserved for testing, ensuring that no interactions involving these compounds were seen during training. Likewise, in the cold-target setting, 10% of unique targets were held out as an independent test set, representing protein targets not encountered during model development. These configurations simulate realistic drug discovery conditions in which models must predict binding affinities for entirely novel compounds or previously uncharacterized protein targets.
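The cold-start protocol described above can be sketched as follows. The snippet is a minimal illustration rather than the authors' splitting code: it assumes interactions are stored as (drug_id, target_id, affinity) tuples, and the function name `cold_split`, the seed, and the rounding rule for the 10% hold-out are all hypothetical choices.

```python
import random


def cold_split(pairs, entity_index, holdout_fraction=0.10, seed=0):
    """Hold out a fraction of unique entities together with ALL their interactions.

    pairs: list of (drug_id, target_id, affinity) tuples (assumed layout).
    entity_index: 0 for a cold-drug split, 1 for a cold-target split.
    """
    entities = sorted({pair[entity_index] for pair in pairs})
    rng = random.Random(seed)  # local generator keeps the split reproducible
    rng.shuffle(entities)
    n_holdout = max(1, round(holdout_fraction * len(entities)))
    held_out = set(entities[:n_holdout])
    train = [p for p in pairs if p[entity_index] not in held_out]
    test = [p for p in pairs if p[entity_index] in held_out]
    return train, test, held_out


if __name__ == "__main__":
    # Toy interaction table: 20 drugs x 5 targets with dummy affinities.
    pairs = [(f"D{d}", f"T{t}", 5.0 + 0.1 * d + 0.01 * t)
             for d in range(20) for t in range(5)]
    train, test, held_out = cold_split(pairs, entity_index=0)
    train_drugs = {p[0] for p in train}
    print(len(train), len(test), held_out.isdisjoint(train_drugs))
```

Splitting on entity identifiers rather than on rows is what guarantees that no interaction of a held-out drug (or target) leaks into training; a row-level random split would not provide that guarantee.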
Table 9 and Table 10 summarize DTBAffinity’s performance across the fixed split (DeepDTA protocol), cold-drug, and cold-target settings on the KIBA and Davis datasets, respectively. As anticipated, the fixed split consistently yields the best performance across all metrics, reflecting the overlapping chemical and biological distributions shared between the training and test data. In contrast, both cold-start scenarios result in a noticeable decline in performance, confirming the heightened difficulty of predicting interactions that involve entirely unseen entities. Notably, the cold-target setting achieves comparatively stronger results than the cold-drug setting across both datasets, suggesting that the model generalizes more effectively across protein space than chemical space under the current feature representation. Despite this performance gap, DTBAffinity retains meaningful predictive correlations and concordance indices under cold-start conditions, demonstrating solid extrapolation capability and supporting its applicability to real-world drug discovery workflows involving novel compounds and targets.
4.5. Feature Importance Analysis and Model Interpretability
To improve model interpretability and gain a clearer understanding of how drug and enzyme features each contribute to prediction performance, we carried out a feature importance analysis using the XGBoost model. Unlike many deep learning architectures that function as black-box predictors, XGBoost offers built-in interpretability through gain-based feature importance scores, allowing for the systematic identification of the most influential descriptors driving model predictions.
Figure 6 and Figure 7 display the top ten most important features for the KIBA and Davis datasets under the fixed split setting, respectively. The results reveal that both enzyme-derived representations (such as ESM embeddings and TPC descriptors) and ligand-related descriptors make substantial contributions to predictive performance, underscoring the complementary value of integrating multimodal features. This analysis indicates that XGBoost not only delivers competitive predictive accuracy but also provides a transparent framework for uncovering biologically meaningful patterns and identifying the key molecular and protein features most relevant to drug–target affinity prediction.
4.6. Discussion
The results obtained on both the Davis and KIBA datasets indicate that DTBAffinity provides a competitive and interpretable solution for drug–target binding affinity prediction. The integration of multi-modal feature engineering—combining chemically meaningful ligand descriptors with sequence-derived protein features and contextual embeddings from protein language models—has proven highly effective. This hybrid strategy, which draws on both handcrafted features and learned representations, is particularly advantageous in scenarios where data is scarce or inconsistent across modalities.
A key strength of DTBAffinity lies in its interpretability. Through the use of well-established descriptors such as RDKit, Mordred, and MACCS, alongside contextual embeddings from ESM-1b and ESM-2, the model offers meaningful insights into the molecular features most influential in predicting drug–target interactions. Moreover, the regularized gradient boosting framework at the core of DTBAffinity provides a robust mechanism for managing high-dimensional data and mitigating overfitting, particularly in the “small-n, large-p” settings that characterize both the Davis and KIBA datasets.
Nevertheless, certain challenges persist. While DTBAffinity performs well under standard fixed split conditions, further testing under more demanding scenarios would be valuable—such as cold-start problems involving novel scaffolds or kinase families, or kinase-family-specific splits that require the model to generalize to previously unseen targets. Additionally, although the incorporation of ESM-1b and ESM-2 embeddings meaningfully enhances performance, further gains could potentially be achieved by integrating three-dimensional protein structure features or more sophisticated sequence–structure relationship models. Future work will also investigate multi-task learning strategies capable of simultaneously leveraging data from multiple datasets, with the aim of improving overall generalizability and robustness.
5. Conclusions and Limitations
In this study, we introduced DTBAffinity, a robust multi-modal framework for drug–target binding affinity prediction that brings together chemically meaningful ligand descriptors, sequence-level protein features, and contextual embeddings derived from protein language models. By drawing on a diverse range of feature families, including physicochemical, topological, and structural fingerprints as well as protein sequence descriptors and embeddings, DTBAffinity offers a comprehensive, interpretable, and scalable approach to DTBA prediction. The model matches or outperforms widely used baselines on both the Davis and KIBA datasets across MSE, $R^2$, Concordance Index (CI), $r_m^2$, and AUPR, while providing meaningful insights into the underlying molecular features that govern binding affinity predictions. Our findings suggest that combining handcrafted features with learned embeddings within a regularized boosting framework can yield highly effective models, even under the “small-n, large-p” conditions that are characteristic of DTBA tasks.
That said, several limitations and opportunities for future improvement remain. The current work focuses primarily on 2D ligand descriptors and sequence-level protein features. Incorporating more sophisticated, 3D-aware descriptors—such as binding pocket features, docking scores, and 3D graph or structure encoders—could further enhance model performance, particularly for interactions involving conformational changes or allosteric binding sites. Furthermore, while DTBAffinity performs well on the Davis and KIBA benchmarks, its generalization to cold-start scenarios—such as novel scaffolds or kinase families—warrants additional validation through scaffold- or family-specific splits to assess how effectively the model extrapolates to unseen chemical space. Another promising direction involves exploring multi-task learning across multiple datasets, which could bolster generalizability and overall model robustness. Lastly, integrating uncertainty calibration into the framework would offer valuable information regarding prediction confidence, especially when handling novel or ambiguous drug–target pairs. Collectively, these avenues for improvement present exciting opportunities to advance the field of DTBA prediction and broaden its applicability to more complex biological systems.