DTBAffinity: A Multi-Modal Feature Engineering and Gradient-Boosting Framework for Drug–Target Binding Affinity on Davis and KIBA Benchmarks

Alazmi, Meshari

doi:10.3390/computers15030182


by

Meshari Alazmi

College of Computer Science and Engineering, University of Ha’il, Ha’il 81411, Saudi Arabia

Computers 2026, 15(3), 182; https://doi.org/10.3390/computers15030182

Submission received: 17 January 2026 / Revised: 24 February 2026 / Accepted: 2 March 2026 / Published: 10 March 2026

(This article belongs to the Special Issue AI in Bioinformatics)


Abstract

Accurately predicting how strongly a drug binds to its target is central to drug discovery: it helps prioritize the most promising compounds and reduces cost by limiting the number of experiments required. We present DTBAffinity, a multi-modal regression framework that integrates chemically meaningful ligand descriptors with diverse protein sequence features in a unified gradient-boosting model. Ligands are represented by physicochemical and topological descriptors (RDKit and Mordred), structural keys (MACCS and FP4), circular fingerprints (ECFP/Morgan), and SMILES-derived features from iFeatureOmega. Proteins are represented by thousands of sequence-derived descriptors (composition, autocorrelations, physicochemical profiles, and evolutionary indices) from iFeatureOmega, together with contextual embeddings from large protein language models (ESM-1b, ESM-2). The feature matrices are sanitized, variance-filtered, z-score scaled, and screened by univariate feature selection before being concatenated and modeled with regularized XGBoost ensembles. We evaluate DTBAffinity on two widely used kinase-centric datasets: Davis (30,056 interactions with pKd values) and KIBA (118,254 interactions with integrated affinity scores). Performance is measured with MSE, R², Pearson/Spearman correlations, Concordance Index (CI), r_m², and AUPR. On Davis, DTBAffinity yields MSE = 0.1885, CI = 0.9102, and AUPR = 0.8112; on KIBA, it yields MSE = 0.1540, CI = 0.8686, and AUPR = 0.8361, outperforming state-of-the-art baselines such as KronRLS, SimBoost, DeepDTA, and GraphDTA. These findings suggest that combining interpretable descriptors with contextual embeddings in a robust boosting framework is an effective route to accurate, interpretable, and generalizable DTBA prediction.

Keywords:

drug–target binding affinity; DTBAffinity; XGBoost; iFeatureOmega; ESM embeddings; Davis dataset; KIBA dataset

1. Introduction

Bringing a novel therapy to market remains expensive and risky, with median capitalized R&D costs often estimated at approximately USD 1 billion per approval once failures and the cost of capital are accounted for [1,2]. Oncology case studies, in particular, continue to highlight very high single-asset expenditures [3]. In this context, computational models that can prioritize promising drug–protein pairs before wet-lab screening are an attractive way to reduce costs and accelerate iteration cycles. Regression-based drug–target binding affinity (DTBA) modeling is a cornerstone of this field, offering a continuous measure of interaction strength that integrates seamlessly into workflows for hit ranking, lead optimization, and virtual screening.

Over the past ten years, a handful of kinase-focused benchmarks have become the go-to standards for assessing DTBA algorithms. The Davis dataset [4] includes 68 inhibitors and 442 kinases, covering 30,056 experimentally determined Kd values, and is typically converted to pKd (−log10 Kd [M]) for regression purposes. The KIBA benchmark [5] spans 2111 drugs and 229 kinases across 118,254 interactions, with the KIBA score consolidating Ki, Kd, and IC50 measurements into a unified, monotonic affinity metric. In both datasets, small-molecule structures are generally drawn from PubChem [6] while protein sequences are sourced from UniProt [7], ensuring high-quality identifiers and sequences well-suited for large-scale featurization. While these datasets provide a controlled setting for comparing methods, they also highlight recurring challenges in DTBA research: data scarcity, heterogeneous input modalities, and strong structural similarities among compounds and targets.

From a methodological point of view, DTBA modeling spans several families. Similarity- and kernel-based methods, such as KronRLS, operate on precomputed drug and target similarity matrices and use regularized kernel ridge regression to learn from sparse interaction tables [8]. Feature-based boosting methods such as SimBoost construct rich libraries of entity and network descriptors and train gradient boosting machines on top of them [9]. Deep sequence models encode SMILES and FASTA directly via convolutional, recurrent, or Transformer architectures (e.g., DeepDTA, MolTrans, and TransformerCPI), while graph neural networks treat ligands as atom–bond graphs and learn structure-aware representations (e.g., GraphDTA) [10,11,12,13]. More recently, multi-modal pipelines have begun to integrate hand-crafted descriptors, similarity features, and pretrained embeddings within a unified learning framework, often relying on tree ensembles or shallow neural networks to handle the resulting tabular space.

At the same time, advances in protein language modeling have provided powerful new representations for targets. Large-scale models such as ESM-1b and ESM-2 are trained on hundreds of millions of sequences and produce contextual embeddings that implicitly capture structural and functional information [14,15]. When combined with traditional sequence descriptors (e.g., composition, autocorrelations, physicochemical profiles) and rich ligand fingerprints and descriptors, these embeddings offer a promising basis for multi-view DTBA modeling. However, naively concatenating all available features leads to extremely high-dimensional input spaces, which can exacerbate overfitting, slow training, and make it difficult to interpret model behavior, especially on comparatively small benchmarks such as Davis and KIBA.

DMFF-DTA [16] is a dual-modality deep learning framework that integrates sequence-based representations with binding-site-aware structural graph features derived from AlphaFold2 to improve drug–target affinity prediction on benchmark datasets such as Davis and KIBA, and it reports competitive performance and enhanced interpretability. However, its authors evaluate the model using a five-fold cross-validation protocol in which the test set is randomly drawn from the entire dataset in each fold, resulting in repeated random sampling of test instances.

This evaluation approach differs from the one used in DeepDTA-based studies and our own work, where a fixed, predefined test split is applied consistently across all methods to ensure a standardized and fair benchmark. Including this model in direct comparisons with ours would therefore be inappropriate, given the variability introduced by its randomly sampled test sets and its reliance on the structural information of enzymes.

DeepDTAGen [17] is a multitask deep learning framework built to jointly predict drug–target binding affinity and generate target-aware drug molecules within a single unified architecture. The framework draws on shared feature representations extracted from drug molecular graphs and protein sequences. For evaluation, the authors follow a cross-validation protocol in which each dataset is partitioned into folds, with one fold used as the test set and the remaining folds forming the training set, so that test samples are separated from the training data but still drawn from the overall dataset distribution. This evaluation strategy contrasts with DeepDTA-based studies and our approach, where a predefined independent test split is consistently maintained across methods, enabling direct comparability under a fixed testing benchmark rather than repeated dataset-derived splits. The model is also difficult to interpret because it is a multitask deep learning framework. Still, when we compare our results with this method, we find that our results on the Davis dataset are much better, while comparable results are obtained on the KIBA dataset.

In this work, we introduce DTBAffinity, a multi-modal regression framework that operates squarely in this “small-n, large-p” regime. DTBAffinity integrates (i) chemically meaningful ligand descriptors (RDKit and Mordred) [18,19], structural keys (MACCS and FP4), circular fingerprints (ECFP/Morgan) [20,21,22], and SMILES-derived [23] features from iFeatureOmega-Drug [24], with (ii) thousands of sequence-derived protein [25,26,27] descriptors from iFeatureOmega-Protein and contextual embeddings from ESM-1b/ESM-2. The resulting modality-specific feature matrices are sanitized, variance-filtered, z-score scaled, and subjected to univariate screening before concatenation and modeling with regularized XGBoost ensembles [28,29,30,31]. We evaluate DTBAffinity on Davis and KIBA under standard splits and metrics, and show that it achieves state-of-the-art or competitive performance against widely used baselines such as KronRLS, SimBoost, DeepDTA, and GraphDTA, while retaining a relatively simple and interpretable learning core. To make the role of DTBAffinity within the broader DTBA landscape explicit, we next formalize the problem setting and summarize the main contributions of this study.

Problem Definition and Contributions

Let $D = \{(d_i, p_j, y_{ij})\}$ denote a set of drug–protein pairs, where $d_i$ is a small-molecule ligand, $p_j$ is a protein (typically a kinase), and $y_{ij}$ is a continuous affinity label (pKd in Davis or KIBA score in KIBA). The aim of DTBA prediction is to learn a function

$$f_{\theta} : (d_i, p_j) \mapsto \hat{y}_{ij} \qquad (1)$$

parameterized by $\theta$ that minimizes a regression loss over observed interactions and generalizes to unseen drug–target combinations. In practice, $f_{\theta}$ is trained on a finite subset of labeled pairs and evaluated on held-out pairs using metrics such as mean squared error (MSE), coefficient of determination ($R^2$), Pearson and Spearman correlations, Concordance Index (CI), and $r_m^2$ [32,33,34].
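As an illustration of the learning problem above, when the regression loss is the mean squared error, the objective can be written as follows. The symbol $D_{\text{train}}$ (the labeled training pairs) and the generic regularization term $\Omega(\theta)$ are notation introduced here for illustration and do not appear in the original formulation.

```latex
\hat{\theta} \;=\; \arg\min_{\theta} \;
\frac{1}{|D_{\text{train}}|}
\sum_{(d_i,\, p_j,\, y_{ij}) \in D_{\text{train}}}
\bigl( y_{ij} - f_{\theta}(d_i, p_j) \bigr)^{2}
\;+\; \Omega(\theta)
```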

Within this setting, our work makes four main contributions: (i) a multi-modal, chemically grounded featurization scheme in which we construct a comprehensive feature library that combines ligand fingerprints (MACCS, FP4, and ECFP/Morgan), 2D physicochemical and topological descriptors (RDKit and Mordred) and SMILES-derived iFeatureOmega-Drug descriptors with protein sequence descriptors from iFeatureOmega-Protein and contextual embeddings from ESM-1b/ESM-2, yielding a unified multi-view representation of both interaction partners; (ii) a scalable feature-selection and boosting pipeline that performs modality-specific sanitization, variance filtering, z-score scaling and univariate SelectKBest screening, followed by optional global capping and regularized XGBoost regression, explicitly targeting the high-dimensional “small-n, large-p” nature of Davis and KIBA while remaining easy to implement with open-source tools; (iii) an extensive evaluation and ablation study on the standard Davis and KIBA benchmarks, including intra-model ablations over feature combinations and comparisons against widely cited baselines (KronRLS, SimBoost, DeepDTA, GraphDTA, and others) under a common evaluation protocol, where DTBAffinity achieves state-of-the-art or competitive performance across MSE, CI, r_m² and AUPR; and (iv) the public release of all code for data preprocessing, feature generation, model training and evaluation, together with pre-computed feature matrices and train/validation/test splits for both datasets, enabling future work to benchmark new models against DTBAffinity using exactly the same inputs and splits.

2. Background and Related Work

Drug–target binding affinity (DTBA) prediction has been studied using a variety of machine learning paradigms that differ in how they represent molecules and proteins and how they learn from the resulting features. Broadly, existing approaches can be grouped into similarity/kernel methods, feature-based ensembles, and deep sequence/graph models.

2.1. Similarity- and Kernel-Based Approaches

Early work on DTBA and drug–target interaction (DTI) prediction relied heavily on similarity and kernel methods that operate on precomputed drug and target similarity matrices. KronRLS is a representative example: it constructs Kronecker products of drug and protein similarity kernels and applies regularized least squares (kernel ridge regression) to learn from sparse interaction matrices [8]. This approach enables prior knowledge embedded in similarity measures to be directly leveraged, resulting in models that are comparatively straightforward to train and interpret. Expanding on this concept, SimBoost [9] introduced a feature-driven gradient boosting framework that retains the use of similarity information while shifting to a tabular data representation. SimBoost constructs a rich collection of entity- and network-level features (such as similarity profiles, interaction counts, and graph-based statistics) and applies gradient boosting machines to predict continuous affinity scores on the KIBA dataset. While these methods benefit from being relatively data-efficient and computationally lightweight, they rely heavily on carefully crafted similarity functions and feature libraries, and may fall short when it comes to capturing fine-grained structural details of ligands and binding sites.
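The Kronecker-kernel idea behind KronRLS can be sketched in a few lines of NumPy. This is a toy illustration, not the published implementation: it assumes a fully observed interaction matrix, a dense direct solve (practical only at small sizes), and random positive semi-definite matrices standing in for real drug and target similarity kernels. The function names and the penalty `lam` are hypothetical.

```python
import numpy as np


def kron_rls_fit(K_d, K_t, Y, lam=1.0):
    """Solve (K_d (x) K_t + lam * I) a = vec(Y) for a fully observed matrix Y."""
    n_d, n_t = Y.shape
    K = np.kron(K_d, K_t)  # kernel between (drug, target) pairs
    a = np.linalg.solve(K + lam * np.eye(n_d * n_t), Y.ravel())
    return a.reshape(n_d, n_t)  # one dual coefficient per training pair


def kron_rls_predict(K_d, K_t, A):
    """Fitted values for the training pairs; equals (K_d (x) K_t) vec(A)."""
    return K_d @ A @ K_t.T


# Toy kernels built from random features so that they are symmetric PSD.
rng = np.random.default_rng(0)
F_d, F_t = rng.normal(size=(3, 5)), rng.normal(size=(4, 5))
K_d, K_t = F_d @ F_d.T, F_t @ F_t.T
Y = rng.normal(loc=5.0, size=(3, 4))  # stand-in affinity matrix (drugs x targets)

lam = 0.5
A = kron_rls_fit(K_d, K_t, Y, lam=lam)
Y_fit = kron_rls_predict(K_d, K_t, A)
```

The prediction step never forms the Kronecker product explicitly, which is the property that makes this family of methods tractable for larger drug and target sets.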

2.2. Deep Sequence- and Graph-Based Models

The rise of deep learning marked a significant shift away from similarity matrices and manually engineered features toward direct modeling of raw sequences and molecular graphs. Among the earliest deep architectures for drug-target affinity prediction is DeepDTA, which employs one-dimensional convolutional neural networks (CNNs) to encode SMILES strings for compounds and FASTA sequences for proteins, merges the resulting latent vectors, and predicts affinity through fully connected layers. Later models built upon this framework by introducing alternative sequence encoders and interaction mechanisms. GraphDTA advanced ligand modeling by moving beyond linear SMILES strings to molecular graphs, representing small molecules as atom–bond graphs and using graph convolution or message passing to extract structural features, while continuing to encode proteins via CNNs applied to amino acid sequences. Concurrently, MolTrans and TransformerCPI harnessed Transformer architectures to capture high-order interactions between substructures or tokens derived from drugs and proteins, enabling these models to learn more complex dependencies than traditional CNN or RNN-based encoders. While these deep learning approaches can acquire task-specific representations end-to-end and have delivered strong results on established benchmarks like Davis and KIBA, they typically demand considerable computational resources, meticulous hyperparameter tuning, and significant implementation effort.

2.3. Multi-Modal Feature-Based Ensembles and Protein Language Models

Beyond purely end-to-end deep learning architectures, a number of studies have investigated hybrid pipelines that bring together complementary feature families—such as fingerprints, physicochemical descriptors, similarity profiles, and pretrained embeddings—and subsequently train relatively shallow learners, including gradient boosting models, on the resulting tabular representations. SimBoost is an early example of this feature-driven strategy, demonstrating that a well-constructed feature library paired with gradient boosting can match or even outperform kernel-based methods on the KIBA benchmark. More recent studies have extended this idea by incorporating richer descriptors for ligands and proteins, and by focusing on time-efficient feature selection and model training. Parallel advances in protein representation learning have introduced powerful pretrained embeddings for targets. Large protein language models such as ESM-1b and ESM-2 are trained on hundreds of millions of sequences and produce contextual embeddings that implicitly encode aspects of structure and function [14,15]. These embeddings have been shown to transfer well to downstream protein-related tasks and provide a complementary alternative to classical sequence-derived descriptors (e.g., compositions, autocorrelations, physicochemical profiles). Platforms such as iFeatureOmega further facilitate systematic construction of large feature sets for both proteins and compounds, including descriptors, fingerprints, and profiles derived from sequences and SMILES [24].

However, integrating all of these feature families naively leads to extremely high-dimensional input spaces, particularly when thousands of iFeatureOmega descriptors are combined with protein language-model embeddings and multiple ligand fingerprints and descriptor sets. This “small-n, large-p” setting is characteristic of Davis and KIBA and poses challenges for both deep and shallow learners, including overfitting, increased training time, and reduced interpretability. Against this backdrop, DTBAffinity positions itself as a multi-modal, feature-based ensemble: it systematically integrates chemically meaningful ligand descriptors, iFeatureOmega-derived sequence features, and ESM embeddings, while explicitly controlling dimensionality via modality-specific filtering and univariate feature selection, and then uses regularized XGBoost regression as a robust learner on the resulting tabular space.

It is worth comparing DTBAffinity directly with SimBoost [9], the closest precedent in the gradient-boosted DTBA literature. SimBoost derives its features from pairwise drug–drug and target–target similarity matrices, meaning its feature space scales with the size of the training set and necessitates a complete recomputation of these matrices whenever new entities are introduced. By contrast, DTBAffinity builds per-entity feature vectors independently of other drugs or targets, using cheminformatics libraries and protein language models. This decoupled design scales to large compound/target libraries without recomputing pairwise similarities, and the integration of ESM-1b and ESM-2 embeddings captures evolutionarily informed contextual representations unavailable to SimBoost. In addition to these embeddings, DTBAffinity incorporates many further descriptor families and then applies feature selection to the combined set.

3. Datasets and Methods

3.1. Data Overview and Composition

We evaluate on two canonical kinase-focused DTBA benchmarks. The Davis dataset comprises 68 drugs × 442 kinases with 30,056 measured Kd values; we adopt pKd (= −log10 Kd [M]) as the regression target [4]. The second dataset, KIBA, comprises 2111 drugs × 229 proteins with 118,254 interactions; the KIBA score integrates Ki/Kd/IC50 into a monotonic, unitless affinity metric [5] (Table 1).

Small-molecule ligands are encoded using both fingerprint and descriptor views. The substructure fingerprints span PubChem-like RDKit fingerprints (path-based), MACCS structural keys (166 bits) [18], and FP4 keys (307 SMARTS patterns in Open Babel) [20]. Circular Morgan/ECFP fingerprints characterize atom-centered neighborhoods and are therefore well suited as general-purpose encodings for bioactivity modeling [19]. In parallel, we compute 2D physicochemical and topological descriptors with RDKit (e.g., logP, TPSA, EState/partitioned surface areas, and topological indices) and supplement them with non-redundant 2D families from Mordred (information-theoretic, connectivity, and autocorrelation indices). The resulting multi-view representation forms a single unified tabular schema that captures local fragment motifs, neighborhood topology, and global molecular properties [21,22].

3.2. Molecule and Protein Sources

For both benchmarks, small-molecule ligands are associated with chemical structures retrievable from PubChem using compound identifiers provided in the original studies [6]. We standardize these structures using RDKit to ensure consistent treatment prior to featurization: salts and small inorganic fragments are stripped where appropriate, aromaticity and bond types are normalized, hydrogens are handled consistently, and canonical SMILES strings are generated. Protein targets are mapped to UniProt accessions using the identifiers supplied in the Davis and KIBA releases [4,5,7]. UniProt provides curated protein sequences along with cross-references to other resources, making it an appropriate backbone for sequence-based feature generation. We use the canonical UniProt sequences for all kinases and verify that they are compatible with both iFeatureOmega and the ESM models.

Starting from the official Davis and KIBA files, we apply a series of preprocessing steps to obtain machine-learning-ready interaction tables. First, interactions with missing, non-numeric, or obviously erroneous affinity values are discarded; for Davis, only entries with valid Kd measurements in molar units are retained, whereas for KIBA, we use the pre-computed KIBA scores provided by the original authors without further modification [5]. Second, in Davis, Kd values are converted to pKd via $-\log_{10}(\mathrm{Kd}\,[\mathrm{M}])$, which compresses the dynamic range, reduces skewness, and aligns the regression targets with those used in prior DTBA work on this dataset [4,8,9,10,35,36,37]. Third, for each dataset, we construct an interaction list of triples $(d_i, p_j, y_{ij})$, where $d_i$ and $p_j$ index unique ligands and kinases, respectively, and $y_{ij}$ is the associated affinity (pKd or KIBA score); only entries with valid molecular structures and protein sequences are kept, ensuring that every interaction can be featurized on both the drug and the protein side.
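The pKd conversion described above is a one-line transform; a minimal sketch follows. The function name and the example Kd values are hypothetical and chosen only to show the scale of the resulting labels.

```python
import math


def kd_to_pkd(kd_molar: float) -> float:
    """Convert a dissociation constant in mol/L to pKd = -log10(Kd)."""
    if not (kd_molar > 0 and math.isfinite(kd_molar)):
        raise ValueError("Kd must be a positive, finite value in molar units")
    return -math.log10(kd_molar)


# Hypothetical Kd values: 10 uM, 100 nM, and 1 nM, expressed in molar units.
example_kd = [1e-5, 1e-7, 1e-9]
example_pkd = [kd_to_pkd(kd) for kd in example_kd]  # approximately 5, 7, 9
```

Values reported in nanomolar units would need to be divided by 1e9 before this conversion. Each tenfold decrease in Kd raises pKd by one unit, which is why the transform compresses the dynamic range.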

3.3. Activity Thresholds, Splitting Strategy, and Evaluation Setup

Although DTBAffinity is trained in a regression setting, it is often useful to view performance through a classification-style lens, particularly when comparing with methods that report ROC-AUC or AUPR. To this end, we adopt standard activity thresholds for post hoc analysis: on Davis, interactions with pKd ≥ 7 are treated as active (high affinity), while those with pKd < 7 are considered inactive; on KIBA, interactions with a KIBA score ≥ 12.1 are treated as active, following prior work on this benchmark. These thresholds are used only for computing classification metrics such as ROC-AUC and AUPR, and the models themselves are always trained on continuous labels. The resulting label distributions are moderately imbalanced under these cutoffs, with high-affinity interactions forming the minority class, which motivates reporting both regression metrics (MSE, $R^2$, CI, $r_m^2$) and AUPR in later sections.

To enable direct comparison with existing methods, we use the same test dataset utilized in the literature for both Davis and KIBA: each dataset has been partitioned into training, validation and test sets using a fixed random seed, where the training set is used to fit the XGBoost models, the validation set supports hyperparameter tuning and early design decisions (e.g., selecting feature combinations and regularization strength), and the test set is held out completely for final evaluation and baseline comparison. All feature-scaling and feature-selection steps are fit exclusively on the training data and then applied to validation and test data using the same parameters in order to avoid information leakage. While more challenging cold-start splits (e.g., drug-scaffold or kinase-family splits) are an important direction for future work, we focus here on the same test dataset to maintain comparability with widely reported baselines.
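The post hoc classification view can be sketched as follows. This is a minimal illustration under stated assumptions: the cutoff is applied to the measured affinities to define active and inactive classes, the continuous predictions serve as ranking scores, and AUPR is estimated by average precision. The function names and toy values are hypothetical, and the exact AUPR estimator used in the study is not specified in the text.

```python
DAVIS_THRESHOLD = 7.0   # pKd cutoff stated in the text
KIBA_THRESHOLD = 12.1   # KIBA score cutoff stated in the text


def binarize(affinities, threshold):
    """Label an interaction active (1) if its measured affinity meets the cutoff."""
    return [1 if y >= threshold else 0 for y in affinities]


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each active interaction."""
    n_active = sum(labels)
    if n_active == 0:
        raise ValueError("no active interactions at this threshold")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision among the top `rank` predictions
    return total / n_active


# Toy Davis-style example: measured pKd values and model predictions.
y_true = [5.0, 7.2, 6.1, 8.0, 5.0]
y_pred = [5.3, 6.4, 6.5, 7.8, 5.1]
labels = binarize(y_true, DAVIS_THRESHOLD)  # [0, 1, 0, 1, 0]
ap = average_precision(labels, y_pred)      # (1/1 + 2/3) / 2
```

Tied prediction scores are ordered arbitrarily in this sketch; a production implementation would typically group ties into a single threshold.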

3.4. Input Representations

3.4.1. Ligands (Drugs)

Small-molecule ligands are first standardized and encoded as canonical SMILES strings [23]. From these standardized structures, DTBAffinity derives a multi-view 2D representation that combines fingerprints and descriptors. Physicochemical and topological descriptors are computed with RDKit (e.g., MolWt, TPSA, cLogP, hydrogen-bond donor/acceptor counts, rotatable bonds, ring/system counts, topological indices) and extended with Mordred’s richer library of information-theoretic, connectivity, and autocorrelation indices [18,19]. In parallel, we generate key-based structural fingerprints including MACCS-166 keys, FP4 fingerprints from Open Babel, and a 2048-bit RDKit path-based fingerprint, which encode the presence or absence of curated substructures and functional groups [20,21]. Circular Morgan/ECFP fingerprints are also computed to capture local atomic neighborhoods and substructural environments that are known to be predictive in QSAR and bioactivity modeling [22]. Finally, we compute SMILES-derived descriptors and fingerprints using iFeatureOmega-Drug, providing an additional sequence-like view of the ligands [23,24]. All ligand feature families (binary fingerprints and continuous descriptors) are concatenated into a single compound-level feature vector that jointly captures substructure motifs, local neighborhoods, and global physicochemical properties.

3.4.2. Proteins (Targets)

Protein targets are represented by combining large collections of sequence-derived descriptors from iFeatureOmega-Protein with contextual embeddings from the ESM-1b and ESM-2 protein language models [14,15,24]. iFeatureOmega-Protein provides thousands of handcrafted features that encode amino-acid, dipeptide, and tripeptide compositions (AAC, DPC, TPC), composition–transition–distribution (CTD) patterns, physicochemical property summaries, autocorrelation descriptors (Moran, Geary, Moreau–Broto), and sequence-order or pseudo-amino-acid composition (PseAAC) variants based on substitution matrices and AAindex-derived properties [25,26,27,28]. These descriptors capture global composition, local trends, and long-range dependencies in an interpretable manner. To complement them with learned representations, we obtain 1280-dimensional embeddings from ESM-1b and ESM-2 by feeding each kinase sequence to the corresponding Transformer model and mean-pooling residue-level hidden states. The final protein feature vector is formed by concatenating the iFeatureOmega-Protein descriptors with both ESM embeddings, yielding a continuous multi-modal representation that integrates handcrafted sequence statistics with context-rich, language-model-derived features and is later aligned with the ligand features in the DTBAffinity pipeline (Table 2).
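The mean-pooling and concatenation steps described above can be sketched with NumPy. Random arrays stand in for the residue-level hidden states that an ESM model would produce and for the iFeatureOmega-Protein descriptors; the assumption that one start token and one end token surround the residues, and that they are excluded from the average, follows common practice for these models and is not stated in the text. All names and sizes other than the 1280-dimensional embedding are hypothetical.

```python
import numpy as np


def mean_pool(hidden_states, n_residues):
    """Average residue-level vectors, skipping special tokens.

    hidden_states has shape (n_tokens, dim); row 0 is assumed to be a start
    token and rows after n_residues are assumed to be end or padding tokens.
    """
    residues = hidden_states[1 : n_residues + 1]
    return residues.mean(axis=0)


def protein_vector(descriptors, esm1b_embedding, esm2_embedding):
    """Concatenate handcrafted descriptors with both pooled embeddings."""
    return np.concatenate([descriptors, esm1b_embedding, esm2_embedding])


rng = np.random.default_rng(0)
n_residues, dim = 300, 1280
esm1b_states = rng.normal(size=(n_residues + 2, dim))  # placeholder hidden states
esm2_states = rng.normal(size=(n_residues + 2, dim))   # placeholder hidden states
descriptors = rng.normal(size=2000)                    # placeholder descriptor vector

vec = protein_vector(
    descriptors,
    mean_pool(esm1b_states, n_residues),
    mean_pool(esm2_states, n_residues),
)
```

Mean pooling yields a fixed-length vector regardless of sequence length, which is what allows kinases of different sizes to share one tabular feature space.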

3.5. Preprocessing

For each modality (ligands and proteins), DTBAffinity first constructs dense feature matrices by stacking the corresponding descriptor and fingerprint vectors for all unique entities. These matrices are converted to numeric type, and any non-finite entries (Inf, −Inf, NaN) are mapped to 0.0. Columns that are identically zero or exhibit zero variance across all entities are removed, as they do not contribute any discriminative information and can adversely affect numerical stability. The remaining features are then standardized by z-score scaling (subtracting the mean and dividing by the standard deviation), so that heterogeneous descriptors with different original ranges become comparable prior to feature selection and boosting [29]. All scaling and selection parameters, including the per-entity mean affinity used as the proxy target in SelectKBest, are fit exclusively on the training data and applied to the validation and test sets without refitting, so that no information from the held-out sets influences feature selection. The per-entity mean affinity for a given drug (or protein) is defined as the average of all affinity labels associated with that entity in the training fold only.

To address the extremely high dimensionality of the concatenated feature space, we apply univariate feature selection separately to the drug and protein matrices before pairing. Specifically, we use SelectKBest (f_regression) with the per-entity mean affinity as target and retain up to 10,000 features per side, thereby discarding descriptors that show little linear association with affinity and reducing noise in each modality. For every observed drug–protein pair $(i, j)$, the selected drug and protein vectors are then concatenated to form a single feature vector $x_{ij}$ with the corresponding label $y_{ij}$ (pKd for Davis or KIBA score for KIBA). Optionally, we apply an additional global SelectKBest (with $k = 10{,}000$) on the concatenated space to impose an overall cap on dimensionality before feeding the features into XGBoost. This two-stage screening strategy (modality-specific selection followed by an optional global cap) helps to control the “small-$n$, large-$p$” regime typical of Davis and KIBA while preserving the most predictive components of the multi-modal feature library.
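The per-modality screening and pairing steps can be sketched with scikit-learn as follows. This is an illustrative reading of the procedure rather than the released code: the helper names, the toy matrix sizes, and the choice to fit each step only on entities that occur in at least one training pair are assumptions made for the example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_regression
from sklearn.preprocessing import StandardScaler


def entity_mean_affinity(entity_idx, y, n_entities):
    """Mean training affinity per entity; NaN for entities with no training pair."""
    sums = np.bincount(entity_idx, weights=y, minlength=n_entities)
    counts = np.bincount(entity_idx, minlength=n_entities)
    return np.where(counts > 0, sums / np.maximum(counts, 1), np.nan)


def screen_modality(X, proxy, k=10_000):
    """Sanitize, variance-filter, z-score, and univariately screen one modality."""
    X = np.nan_to_num(np.asarray(X, dtype=float), nan=0.0, posinf=0.0, neginf=0.0)
    fit_rows = ~np.isnan(proxy)  # entities seen in the training pairs
    vt = VarianceThreshold(threshold=0.0).fit(X[fit_rows])
    scaler = StandardScaler().fit(vt.transform(X[fit_rows]))
    Z = scaler.transform(vt.transform(X))
    selector = SelectKBest(f_regression, k=min(k, Z.shape[1]))
    selector.fit(Z[fit_rows], proxy[fit_rows])
    return selector.transform(Z)


def build_pairs(Z_drug, Z_prot, drug_idx, prot_idx):
    """Concatenate the selected drug and protein vectors for each observed pair."""
    return np.hstack([Z_drug[drug_idx], Z_prot[prot_idx]])


# Toy data: 6 drugs x 12 features and 8 proteins x 20 features (hypothetical sizes).
rng = np.random.default_rng(0)
X_drug = rng.normal(size=(6, 12))
X_drug[:, 0] = 0.0       # an all-zero column that the variance filter removes
X_drug[1, 3] = np.inf    # a non-finite entry that is mapped to 0.0
X_prot = rng.normal(size=(8, 20))
X_prot[2, 5] = np.nan

# Training pairs: here every drug-protein combination, with random stand-in labels.
drug_idx = np.repeat(np.arange(6), 8)
prot_idx = np.tile(np.arange(8), 6)
y_train = rng.normal(loc=5.0, size=drug_idx.size)

Z_drug = screen_modality(X_drug, entity_mean_affinity(drug_idx, y_train, 6), k=4)
Z_prot = screen_modality(X_prot, entity_mean_affinity(prot_idx, y_train, 8), k=5)
X_pairs = build_pairs(Z_drug, Z_prot, drug_idx, prot_idx)
```

The optional global cap would apply the same `SelectKBest` call to `X_pairs` with the pair-level labels as target. Fitting every transformer on training entities only mirrors the leakage safeguard described above.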

3.6. Proposed Methodology

3.6.1. Learning Algorithm

DTBAffinity employs a regularized gradient boosting model—specifically XGBoost—for regression. XGBoost has demonstrated strong performance in handling heterogeneous, high-dimensional data, making it well-suited for the multi-modal feature sets generated for both ligands and proteins (see Figure 1).

Hyperparameter tuning for the XGBoost model was carried out using a predefined search space, with the goal of identifying configurations that strike an optimal balance between predictive accuracy and generalization capability. Specifically, the L1 regularization coefficient (reg_alpha) was explored over the range {0, 0.1, 1, 5, 10}, while the L2 regularization coefficient (reg_lambda) was evaluated across {1, 10, 50, 100}. The number of boosting rounds was explored across a range of 100 to 10,000 to identify an appropriate level of model capacity. The learning rate was jointly tuned alongside tree depth, with the final configuration settling on a learning rate of 0.1 and a maximum tree depth of 8, determined based on validation performance. Default sampling parameters—subsample and colsample_bytree, both set to 1.0—were kept unchanged to ensure the full dataset and feature space were utilized during training. This search strategy facilitated robust parameter selection while guarding against excessive model complexity.
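The regularization grid described above contains 5 × 4 = 20 candidate combinations, which can be enumerated as in the following sketch. The `evaluate` callable is hypothetical: it stands for training a model with the given penalties and returning its validation MSE. The search ranges for learning rate, depth, and boosting rounds are not enumerated in the text and are therefore left out.

```python
from itertools import product

REG_ALPHA_GRID = [0, 0.1, 1, 5, 10]    # L1 penalties listed in the text
REG_LAMBDA_GRID = [1, 10, 50, 100]     # L2 penalties listed in the text


def grid_search(evaluate):
    """Return the (reg_alpha, reg_lambda) pair with the lowest validation error."""
    candidates = list(product(REG_ALPHA_GRID, REG_LAMBDA_GRID))
    return min(candidates, key=lambda pair: evaluate(*pair))


def toy_validation_error(reg_alpha, reg_lambda):
    """Stand-in objective with its minimum at reg_alpha=1, reg_lambda=10."""
    return (reg_alpha - 1) ** 2 + (reg_lambda - 10) ** 2


best = grid_search(toy_validation_error)
```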

The XGBoost model was configured with the following hyperparameters: 2500 estimators, a learning rate of 0.1, and a tree depth ranging from 6 to 8, along with regularization parameters reg_alpha ∈ {0.1, 1.0} and reg_lambda = 10. The tree construction method was set to “hist,” which speeds up training through histogram-based decision tree building, with GPU acceleration enabled where available. Additional parameters, including subsample and colsample_bytree, were left at their default values of 1.0. All experiments were run on Google Colab high-RAM environments with GPU support (NVIDIA A100 or T4, depending on availability) and approximately 25–50 GB of system memory. Training DTBAffinity took between 15 and 30 min across datasets and evaluation splits, while inference on the complete test set completed in just a few seconds, highlighting the computational efficiency of the proposed approach.

Mean squared error (MSE) was adopted as the loss function, consistent with its widespread use in regression tasks involving continuous outputs such as drug–target binding affinity. XGBoost’s built-in regularization (L1 and L2) helps mitigate overfitting, especially in high-dimensional spaces. The learning process incorporates shrinkage, which updates the model in small steps to reduce variance and improve stability. The final model is an ensemble of decision trees, with each tree trained to correct the errors of the previous one. We select hyperparameters using grid search on the validation set to achieve the best trade-off between bias and variance.
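The reported configuration can be collected into a small helper, sketched below. The per-dataset depths (6 for Davis, 8 for KIBA) and the seed of 42 are taken from the description of Table 3. The assignment of reg_alpha = 1.0 to Davis and 0.1 to KIBA is inferred from the statement that the L1 penalty is stronger for Davis. The function name is hypothetical.

```python
def xgb_params(dataset: str) -> dict:
    """Keyword arguments for a gradient-boosting regressor, per dataset."""
    per_dataset = {
        "davis": {"max_depth": 6, "reg_alpha": 1.0},  # reg_alpha inferred, see lead-in
        "kiba": {"max_depth": 8, "reg_alpha": 0.1},   # reg_alpha inferred, see lead-in
    }
    params = {
        "n_estimators": 2500,
        "learning_rate": 0.1,
        "reg_lambda": 10,
        "subsample": 1.0,
        "colsample_bytree": 1.0,
        "tree_method": "hist",
        "objective": "reg:squarederror",  # squared-error (MSE) loss
        "random_state": 42,
    }
    params.update(per_dataset[dataset.lower()])
    return params


davis_params = xgb_params("Davis")
kiba_params = xgb_params("KIBA")
```

These dictionaries are intended to be passed as `xgboost.XGBRegressor(**davis_params)`. How GPU acceleration is requested depends on the installed XGBoost version, so it is left out of the sketch.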

3.6.2. Model Training, Splits, and Metrics

We use the standard DeepDTA splits, in which the interactions of each dataset are divided by predefined indices into independent training and test sets, so that results are directly comparable with prior work. The training set is used for model fitting, a validation set for hyperparameter tuning, and the test set is held out for final evaluation. The same fixed test set is used across all models, following the standard evaluation protocol; otherwise, comparisons would be inconsistent. Model performance is evaluated using several regression and ranking metrics: mean squared error (MSE) and R² for overall prediction accuracy and goodness of fit; Pearson’s r and Spearman’s ρ for linear and rank-based association between predicted and actual affinities; the Concordance Index (CI) for the model’s ability to rank drug–target pairs by binding affinity; r_m² (RM2) for predictive stability and robustness; and the Area Under the Precision–Recall Curve (AUPR) for classification-style evaluation, where active and inactive interactions are defined using standard cutoffs (pKd ≥ 7 for Davis and KIBA score ≥ 12.1 for KIBA). These metrics are reported for each dataset under fixed validation and test splits to allow direct comparison with baseline methods. To complement the regression evaluation, classification-oriented metrics such as ROC-AUC and AUPR are derived by applying the same activity cutoffs to the predicted affinity values.

Table 3 presents the final XGBoost hyperparameter settings used for the Davis and KIBA datasets, selected through systematic tuning to balance predictive performance, generalization, and computational efficiency. A fixed random seed of 42 was applied throughout to ensure reproducibility across all experimental runs. The number of boosting rounds was set to 2500 for both datasets to provide sufficient model capacity, while histogram-based tree construction combined with CUDA acceleration was used to improve training speed and scalability on high-dimensional inputs. A moderate learning rate of 0.1 was selected to provide stable gradient updates, with dataset-specific tree depths (6 for Davis and 8 for KIBA) reflecting differences in dataset complexity. Default sampling parameters were retained (subsample and colsample_bytree = 1.0), allowing the model to use all available samples and features. Regularization was introduced through dataset-specific L1 penalties (reg_alpha = 1.0 for Davis and 0.1 for KIBA) and a common L2 penalty (reg_lambda = 10) to control model complexity and mitigate overfitting. Collectively, these settings yielded a robust and reproducible configuration suitable for multimodal drug–target affinity prediction.
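To make two of the metrics above concrete, here is a small, dependency-free sketch of the Concordance Index and of the active/inactive labelling used for AUPR. It is an illustration rather than the evaluation code used in the study: the function names are hypothetical, the quadratic pair loop is only practical for small inputs, and we assume that the active/inactive labels are taken from the measured affinities with the cutoffs quoted above.

```python
def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs ranked in the correct order.

    A pair is comparable when its true affinities differ; a tie in the
    predictions counts as half-correct.
    """
    concordant = 0.0
    comparable = 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue  # equal true affinities carry no ranking information
            comparable += 1
            diff_true = y_true[i] - y_true[j]
            diff_pred = y_pred[i] - y_pred[j]
            if diff_pred == 0:
                concordant += 0.5
            elif (diff_true > 0) == (diff_pred > 0):
                concordant += 1.0
    return concordant / comparable if comparable else float("nan")


# Activity cutoffs quoted in the text (pKd for Davis, KIBA score for KIBA).
ACTIVE_CUTOFF = {"davis": 7.0, "kiba": 12.1}


def binarize(values, dataset):
    """Label each affinity as active (1) or inactive (0)."""
    cutoff = ACTIVE_CUTOFF[dataset]
    return [1 if v >= cutoff else 0 for v in values]


print(concordance_index([5.0, 6.0, 7.0], [5.1, 6.2, 6.9]))
print(binarize([6.9, 7.0, 7.5], "davis"))
```

A CI of 1.0 means every comparable pair is ordered correctly, while 0.5 corresponds to random ordering.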

3.6.3. Cross-Validation and Robustness

To further validate robustness, we performed k-fold cross-validation on both the Davis and KIBA datasets using k = 5 splits, ensuring that the model’s performance is not dependent on any particular train/test partition. For each fold, the training data was divided into training and validation subsets, and final performance scores were averaged across all five folds. This cross-validation procedure helps evaluate the stability and generalizability of the model across varying data partitions while reducing potential bias introduced by random splitting. Table 4 presents the cross-validation results on both the validation and test sets.
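The fold construction and the mean ± SD aggregation can be sketched as follows. This is a generic 5-fold scheme written with the standard library only; the exact partitioning used in the study may differ, the helper names are hypothetical, and the sample standard deviation is an assumption, since the text does not state which SD estimator was used.

```python
import random
from statistics import mean, stdev


def kfold_indices(n_samples, k=5, seed=42):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        val = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, val


def summarize(scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    return mean(scores), stdev(scores)


for train_idx, val_idx in kfold_indices(10, k=5):
    print(len(train_idx), len(val_idx))
```

In use, a model would be fitted on each `train_idx`, scored on the matching `val_idx`, and the five scores passed to `summarize`.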

On the Davis dataset, DTBAffinity exhibits strong and consistent predictive performance. Across the 5-fold cross-validation, the model attains a mean MSE of 0.2140 (±0.0103) and an R² of 0.7325 (±0.0034), with Pearson and Spearman correlation coefficients of 0.8571 (±0.0017) and 0.7057 (±0.0040), respectively. The consistently low standard deviations across all metrics reflect the model’s stability and reliability across different data partitions. On the held-out test set, performance improves further, yielding an MSE of 0.1885, an R² of 0.7649, a Pearson correlation of 0.8761, and a Spearman correlation of 0.7297—indicating that the model generalizes effectively beyond the training distribution without signs of overfitting.

On the larger and more diverse KIBA benchmark, DTBAffinity continues to produce reliable and consistent results. Cross-validation yields a mean MSE of 0.1662 (±0.0051) and an R² of 0.7641 (±0.0071), with strong linear and rank-based agreement reflected in Pearson and Spearman correlation coefficients of 0.8749 (±0.0043) and 0.8561 (±0.0022), respectively. The narrow standard deviations across all metrics again indicate that performance is stable across the cross-validation folds. On the test set, DTBAffinity achieves an MSE of 0.1540, R² of 0.7734, Pearson of 0.8799, and Spearman of 0.8648. All four values match or exceed the cross-validation estimates, which suggests that the model does not overfit to any particular data split.

Overall, the close agreement between cross-validation and test set results across both benchmarks demonstrates that DTBAffinity generalizes reliably under the standard random-split protocol, with test performance consistently matching or surpassing validation estimates—a strong indicator of model stability.

4. Results and Discussion

DTBAffinity was evaluated on the Davis and KIBA datasets using the same standard test set as prior work, and performance is reported across multiple regression and ranking metrics. Table 5 and Table 6 summarize the results for each feature combination on the KIBA and Davis datasets, respectively, while Figure 2 and Figure 3 show predicted vs. actual affinities and residuals for both datasets. Unless otherwise noted, the baseline results reported in Table 7 and Table 8 were taken from the respective original publications and were not re-run within our experimental pipeline. The comparison nevertheless rests on a common test set, because DeepDTA defines its split through predefined dataset indices, with one index set for training and another for testing, and the methods included in our comparison used this independent test set. SimBoost and KronRLS are a special case: they predate DeepDTA, and their results under this setup come from the evaluation carried out for comparison with DeepDTA. Our comparison on this test set is therefore valid, as it follows a standard benchmark adopted by prior studies and ensures fair and consistent evaluation.

4.1. KIBA Dataset

On the KIBA dataset, DTBAffinity matches or surpasses the baseline methods in both regression accuracy and ranking performance. As shown in Table 5, the full configuration, which combines iFeatureOmega-Protein, ESM embeddings, and a comprehensive ligand feature set encompassing RDKit, Mordred, and ECFP/Morgan descriptors, achieves MSE = 0.1540, Pearson’s r = 0.8799, Spearman’s ρ = 0.8648, and CI = 0.8686, with AUPR = 0.8361 (Table 8). Compared with the baselines in Table 8, including KronRLS, SimBoost, and DeepDTA, this configuration attains the lowest MSE and the highest r_m² and AUPR, while its CI is essentially level with the strongest baseline (0.8686 vs. 0.868 for TeM-DTBA). These findings indicate that the multi-view feature combination paired with a regularized XGBoost ensemble is effective for DTBA prediction on large-scale datasets.

4.2. Davis Dataset

DTBAffinity also performs strongly on the Davis dataset. As shown in Table 6, the best configuration, which combines iFeatureOmega-Protein and ESM-1b/ESM-2 embeddings with a rich set of ligand descriptors, achieves MSE = 0.1885, Pearson’s r = 0.8761, Spearman’s ρ = 0.7297, and CI = 0.9102, with AUPR = 0.8112 (Table 7). These results outperform the baselines listed in Table 7, including KronRLS, SimBoost, and DeepDTA, particularly in terms of MSE and AUPR. The improvement in CI and AUPR indicates that DTBAffinity not only predicts binding affinities accurately but also ranks active and inactive drug–target pairs more effectively than the baselines. Figure 3 further illustrates the prediction quality: predicted values closely follow the actual affinity values (predicted vs. actual) and residuals are small (residuals vs. actual), with only a slight increase in variance at higher affinities. Figure 4 shows the corresponding comparison against the baselines.
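The size of the MSE gain over the strongest baseline can be checked with a short calculation. The sketch below uses the TeM-DTBA and DTBAffinity MSE values from Table 7 and Table 8; the function name is ours.

```python
def relative_reduction(baseline, ours):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - ours) / baseline


# MSE of the strongest baseline (TeM-DTBA) vs. DTBAffinity.
davis = relative_reduction(0.2319, 0.1885)  # Table 7
kiba = relative_reduction(0.188, 0.1540)    # Table 8

print(f"Davis: {davis:.1%}, KIBA: {kiba:.1%}")  # Davis: 18.7%, KIBA: 18.1%
```

On both benchmarks the reduction in MSE relative to TeM-DTBA is therefore a little under one fifth.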

4.3. Results of Ablation Studies

We also perform an ablation study to analyze the contribution of each feature family. As shown in Table 5 and Table 6, performance generally improves as additional feature families (e.g., ESM embeddings and further ligand descriptors) are added to the model. On Davis, every added family lowers the MSE (from 0.2022 to 0.1885). On KIBA, the final addition of Morgan/ECFP and Mordred features leaves the MSE essentially unchanged (0.1538 vs. 0.1540) while slightly improving Spearman’s ρ and CI. The most notable gains are observed when iFeatureOmega-Protein is combined with ESM embeddings, as this pairing yields a more comprehensive representation of protein sequences. Ligand features also contribute, with circular Morgan fingerprints and additional descriptors giving the clearest improvement on Davis (see Table 5 and Table 6).
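The stepwise effect of each added feature family can be read off the MSE columns of Table 5 and Table 6. The following sketch computes the change in MSE between consecutive rows; the list layout and function name are illustrative only.

```python
# Test-set MSE per configuration, in the row order of Table 6 and Table 5.
DAVIS_MSE = [0.2022, 0.1978, 0.1932, 0.1885]
KIBA_MSE = [0.1641, 0.1592, 0.1538, 0.1540]


def stepwise_deltas(values):
    """Change in MSE from each configuration to the next (negative = better)."""
    return [round(b - a, 4) for a, b in zip(values, values[1:])]


print("Davis:", stepwise_deltas(DAVIS_MSE))
print("KIBA: ", stepwise_deltas(KIBA_MSE))
```

Negative values indicate an improvement; the last KIBA step is the only one with a (very small) positive change.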

We begin by examining calibration and residual behavior (Figure 2 and Figure 3), followed by a comparative analysis against baseline methods across MSE, CI, RM², and AUPR (Figure 4 and Figure 5). The corresponding numerical results are given in Table 4 through Table 8. As illustrated in Figure 2, DTBAffinity’s predictions on the KIBA dataset align closely with the identity line, with residuals concentrated near zero and only a slight increase in variance at higher affinity levels, a pattern consistent with the results reported in Table 5.

4.4. Performance Assessment in Cold-Start Scenarios

Table 9 reports the performance of DTBAffinity on the KIBA dataset under the fixed split (as in DeepDTA), the cold-drug split, and the cold-target split.

Table 10 reports the same comparison for the Davis dataset.

To more thoroughly evaluate the generalization capability of the proposed DTBAffinity model beyond the conventional fixed split setting, we carried out additional experiments under two cold-start scenarios: cold-drug and cold-target evaluation. In the cold-drug setting, 10% of unique drugs were exclusively reserved for testing, ensuring that no interactions involving these compounds were seen during training. Likewise, in the cold-target setting, 10% of unique targets were held out as an independent test set, representing protein targets not encountered during model development. These configurations simulate realistic drug discovery conditions in which models must predict binding affinities for entirely novel compounds or previously uncharacterized protein targets.
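A cold-start split of this kind can be sketched in a few lines. The code below holds out 10% of the unique drugs (or targets) and assigns every interaction involving a held-out entity to the test set. It is a minimal illustration under our own assumptions: interactions are given as hypothetical `(drug_id, target_id, affinity)` tuples, and the held-out entities are drawn at random with a fixed seed, which the text does not specify.

```python
import random


def cold_split(pairs, entity="drug", frac=0.10, seed=42):
    """Split interactions so that held-out entities never appear in training.

    pairs: list of (drug_id, target_id, affinity) tuples.
    entity: "drug" for a cold-drug split, "target" for a cold-target split.
    """
    col = 0 if entity == "drug" else 1
    unique = sorted({p[col] for p in pairs})
    n_hold = max(1, round(frac * len(unique)))
    held = set(random.Random(seed).sample(unique, n_hold))
    train = [p for p in pairs if p[col] not in held]
    test = [p for p in pairs if p[col] in held]
    return train, test, held


# Toy data: 20 drugs x 5 targets with placeholder affinities.
toy = [(d, t, 0.0) for d in range(20) for t in range(5)]
train, test, held = cold_split(toy, entity="drug")
print(len(train), len(test), sorted(held))
```

Splitting on entities rather than on individual interactions is what guarantees that no interaction of a test drug (or target) leaks into training.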

Table 9 and Table 10 summarize DTBAffinity’s performance across the fixed split (DeepDTA protocol), cold-drug, and cold-target settings on the KIBA and Davis datasets, respectively. As anticipated, the fixed split yields the best performance on every metric, reflecting the overlapping chemical and biological distributions shared between the training and test data. Both cold-start scenarios produce a marked decline in performance, confirming the difficulty of predicting interactions that involve entirely unseen entities. The cold-target setting is stronger than the cold-drug setting on every metric in both datasets, although on KIBA the two MSE values are almost identical (0.3603 vs. 0.361), suggesting that the model generalizes more effectively across protein space than chemical space under the current feature representation. The gap is largest for cold-drug prediction on Davis, where R² falls to 0.0805 and r_m² to 0.2133; this is plausibly related to the small number of drugs in that dataset (68; Table 1). Even so, DTBAffinity retains useful ranking ability under cold-start conditions (CI between 0.7657 and 0.8782), which supports its use for prioritizing novel compounds and targets, with the caveat that absolute affinity estimates for unseen drugs are considerably less reliable.

4.5. Feature Importance Analysis and Model Interpretability

To improve model interpretability and gain a clearer understanding of how drug and enzyme features each contribute to prediction performance, we carried out a feature importance analysis using the XGBoost model. Unlike many deep learning architectures that function as black-box predictors, XGBoost offers built-in interpretability through gain-based feature importance scores, allowing for the systematic identification of the most influential descriptors driving model predictions. Figure 6 and Figure 7 display the top ten most important features for the KIBA and Davis datasets under the fixed split setting, respectively. The results reveal that both enzyme-derived representations—such as ESM embeddings and TPC descriptors—and ligand-related descriptors make substantial contributions to predictive performance, underscoring the complementary value of integrating multimodal features. This analysis demonstrates that XGBoost not only delivers competitive predictive accuracy but also provides a transparent framework for uncovering biologically meaningful patterns and identifying the key molecular and protein features most relevant to drug–target affinity prediction.
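The ranking step behind such a top-ten plot is simple to express. The sketch below takes a mapping from feature name to gain, of the shape returned by XGBoost’s `Booster.get_score(importance_type="gain")`, and returns the top-k features together with each modality’s share of the total gain. The `modality__feature` naming scheme and all numeric values are invented for illustration and do not come from the study.

```python
def top_features(gain_by_feature, k=10):
    """Return the k features with the highest gain, best first."""
    ranked = sorted(gain_by_feature.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]


def gain_share_by_modality(gain_by_feature, sep="__"):
    """Fraction of total gain per modality, using the prefix before `sep`."""
    totals = {}
    for name, gain in gain_by_feature.items():
        modality = name.split(sep, 1)[0]
        totals[modality] = totals.get(modality, 0.0) + gain
    total = sum(totals.values())
    return {m: g / total for m, g in totals.items()}


# Made-up gains, for illustration only.
toy_gain = {
    "esm__d12": 4.0,
    "tpc__AKL": 3.0,
    "morgan__bit7": 2.0,
    "rdkit__MolLogP": 1.0,
}
print(top_features(toy_gain, k=2))
print(gain_share_by_modality(toy_gain))
```

Aggregating gain by modality gives a quick view of how much protein-derived and ligand-derived features each contribute overall, complementing the per-feature ranking.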

4.6. Discussion

The results obtained on both the Davis and KIBA datasets indicate that DTBAffinity provides a competitive and interpretable solution for drug–target binding affinity prediction. The integration of multi-modal feature engineering—combining chemically meaningful ligand descriptors with sequence-derived protein features and contextual embeddings from protein language models—has proven highly effective. This hybrid strategy, which draws on both handcrafted features and learned representations, is particularly advantageous in scenarios where data is scarce or inconsistent across modalities.

A key strength of DTBAffinity lies in its interpretability. Through the use of well-established descriptors such as RDKit, Mordred, and MACCS, alongside contextual embeddings from ESM-1b and ESM-2, the model offers meaningful insights into the molecular features most influential in predicting drug–target interactions. Moreover, the regularized gradient boosting framework at the core of DTBAffinity provides a robust mechanism for managing high-dimensional data and mitigating overfitting, particularly in the “small-n, large-p” settings that characterize both the Davis and KIBA datasets.

Nevertheless, certain challenges persist. DTBAffinity performs well under standard fixed split conditions, and Section 4.4 examines cold-drug and cold-target splits, but stricter scenarios remain to be tested, such as splits by novel scaffolds or kinase-family-specific splits that require the model to generalize to previously unseen chemotypes and target families. Additionally, although the incorporation of ESM-1b and ESM-2 embeddings meaningfully enhances performance, further gains could potentially be achieved by integrating three-dimensional protein structure features or more sophisticated sequence–structure relationship models. Future work will also investigate multi-task learning strategies capable of simultaneously leveraging data from multiple datasets, with the aim of improving overall generalizability and robustness.

5. Conclusions and Limitations

In this study, we introduced DTBAffinity, a robust multi-modal framework for drug–target binding affinity prediction that brings together chemically meaningful ligand descriptors, sequence-level protein features, and contextual embeddings derived from protein language models. By drawing on a diverse range of feature families—including physicochemical, topological, and structural fingerprints, as well as protein sequence descriptors and embeddings—DTBAffinity offers a comprehensive, interpretable, and scalable approach to DTBA prediction. The model achieves competitive performance on both the Davis and KIBA datasets, outperforming state-of-the-art baselines across MSE, R², Concordance Index (CI), r_m², and AUPR, while providing meaningful insights into the underlying molecular features that govern binding affinity predictions. Our findings suggest that combining handcrafted features with learned embeddings within a regularized boosting framework can yield highly effective models, even under the “small-n, large-p” conditions that are characteristic of DTBA tasks.

That said, several limitations and opportunities for future improvement remain. The current work focuses primarily on 2D ligand descriptors and sequence-level protein features. Incorporating more sophisticated, 3D-aware descriptors—such as binding pocket features, docking scores, and 3D graph or structure encoders—could further enhance model performance, particularly for interactions involving conformational changes or allosteric binding sites. Furthermore, while DTBAffinity performs well on the Davis and KIBA benchmarks, its generalization to cold-start scenarios—such as novel scaffolds or kinase families—warrants additional validation through scaffold- or family-specific splits to assess how effectively the model extrapolates to unseen chemical space. Another promising direction involves exploring multi-task learning across multiple datasets, which could bolster generalizability and overall model robustness. Lastly, integrating uncertainty calibration into the framework would offer valuable information regarding prediction confidence, especially when handling novel or ambiguous drug–target pairs. Collectively, these avenues for improvement present exciting opportunities to advance the field of DTBA prediction and broaden its applicability to more complex biological systems.

Funding

This research received no external funding.

Data Availability Statement

This study is based on the Davis and KIBA benchmark datasets for drug–target binding affinity modeling. All code used for data preprocessing, feature generation, model training, and result reproduction is publicly available at https://github.com/misharisaud/DTBAffinity (accessed on 27 August 2025). The processed data needed to reproduce the analyses and figures are publicly available at https://figshare.com/articles/dataset/DTBAffinity_data/29997982 (accessed on 27 August 2025). The original benchmark datasets are openly distributed by their respective maintainers and can be accessed via the references in the manuscript. There are no restrictions on the availability of the code or the processed data.

Reproducibility Notes

Every step of the pipeline, from sanitization, scaling, per-modality and global univariate selection, and concatenation through XGBoost configuration, metric computation, and plotting, is performed by a script that uses a fixed seed and logs array shapes verbosely. Figures are produced at high DPI. AUPR is additionally computed with the reference Java tool so that the values are as directly comparable as possible with previous work.

Conflicts of Interest

The author declares no conflicts of interest.

References

1. Wouters, O.J.; McKee, M.; Luyten, J. Estimated research and development investment needed to bring a new medicine to market. JAMA 2020, 323, 844–853.
2. DiMasi, J.A.; Grabowski, H.G.; Hansen, R.W. Innovation in the pharmaceutical industry: New estimates of R&D costs. J. Health Econ. 2016, 47, 20–33.
3. Prasad, V.; Mailankody, S. Research and development spending to bring a single cancer drug to market. JAMA Intern. Med. 2017, 177, 1569–1575.
4. Davis, M.I.; Hunt, J.P.; Herrgard, S.; Ciceri, P.; Wodicka, L.M.; Pallares, G.; Hocker, M.; Treiber, D.K.; Zarrinkar, P.P. Comprehensive analysis of kinase inhibitor selectivity. Nat. Biotechnol. 2011, 29, 1046–1051.
5. Tang, J.; Szwajda, A.; Shakyawar, S.; Xu, T.; Hintsanen, P.; Wennerberg, K.; Aittokallio, T. Making sense of large-scale kinase inhibitor bioactivity data sets. J. Chem. Inf. Model. 2014, 54, 735–743.
6. Wang, Y.; Xiao, J.; Suzek, T.O.; Zhang, J.; Wang, J.; Bryant, S.H. PubChem: A public information system for analyzing bioactivities of small molecules. Nucleic Acids Res. 2009, 37, W623–W633.
7. The UniProt Consortium. UniProt: The Universal Protein Knowledgebase in 2025. Nucleic Acids Res. 2025, 53, D609–D619.
8. Pahikkala, T.; Airola, A.; Pietilä, S.; Shakyawar, S.; Szwajda, A.; Tang, J.; Aittokallio, T. Toward more realistic drug–target interaction predictions with lasso-regularized kernel ridge regression (KronRLS). Brief. Bioinform. 2015, 16, 325–337.
9. He, T.; Heidemeyer, M.; Ban, F.; Cherkasov, A.; Ester, M. SimBoost: A read-across approach for predicting drug–target binding affinities using gradient boosting machines. J. Cheminform. 2017, 9, 24.
10. Öztürk, H.; Özgür, A.; Ozkirimli, E. DeepDTA: Deep drug–target binding affinity prediction. Bioinformatics 2018, 34, i821–i829.
11. Nguyen, T.; Le, H.; Quinn, T.P.; Nguyen, T.; Le, T.D.; Venkatesh, S. GraphDTA: Predicting drug–target binding affinity with graph neural networks. Bioinformatics 2021, 37, 1140–1147.
12. Huang, K.; Xiao, C.; Glass, L.M.; Sun, J. MolTrans: Molecular interaction transformer for drug–target interaction prediction. Bioinformatics 2021, 37, 830–836.
13. Chen, L.; Tan, X.; Wang, D.; Zhong, F.; Liu, X.; Yang, T.; Luo, X.; Chen, K.; Jiang, H.; Zheng, M. TransformerCPI: Improving compound–protein interaction prediction by sequence-based deep learning with self-attention mechanism and label reversal experiments. Bioinformatics 2021, 36, 4406–4414.
14. Rives, A.; Meier, J.; Sercu, T.; Goyal, S.; Lin, Z.; Liu, J.; Guo, D.; Ott, M.; Zitnick, C.L.; Ma, J.; et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. USA 2021, 118, e2016239118.
15. Lin, Z.; Akin, H.; Rao, R.; Hie, B.; Zhu, Z.; Lu, W.; Smetanin, N.; Verkuil, R.; Kabeli, O.; Shmueli, Y.; et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023, 379, 1123–1130.
16. He, H.; Chen, G.; Tang, Z.; Chen, C.Y.C. Dual modality feature fused neural network integrating binding site information for drug target affinity prediction. NPJ Digit. Med. 2025, 8, 67.
17. Shah, P.M.; Zhu, H.; Lu, Z.; Wang, K.; Tang, J.; Li, M. DeepDTAGen: A multitask deep learning framework for drug-target affinity prediction and target-aware drugs generation. Nat. Commun. 2025, 16, 5021.
18. RDKit. RDKit: Open-Source Cheminformatics. 2025. Available online: https://www.rdkit.org/ (accessed on 12 March 2024).
19. Moriwaki, H.; Tian, Y.S.; Kawashita, N.; Takagi, T. Mordred: A molecular descriptor calculator. J. Cheminform. 2018, 10, 4.
20. O’Boyle, N.M.; Banck, M.; James, C.A.; Morley, C.; Vandermeersch, T.; Hutchison, G.R. Open Babel: An open chemical toolbox. J. Cheminform. 2011, 3, 33.
21. Durant, J.L.; Leland, B.A.; Henry, D.R.; Nourse, J.G. Reoptimization of MDL keys for use in drug discovery. J. Chem. Inf. Comput. Sci. 2002, 42, 1273–1280.
22. Rogers, D.; Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754.
23. Weininger, D. SMILES, a chemical language and information system. J. Chem. Inf. Comput. Sci. 1988, 28, 31–36.
24. Chen, Z.; Liu, X.; Zhao, P.; Li, C.; Wang, Y.; Li, F.; Akutsu, T.; Bain, C.; Gasser, R.B.; Li, J.; et al. iFeatureOmega: An integrative platform for feature engineering, visualization and analysis of molecular data. Nucleic Acids Res. 2022, 50, W434–W446.
25. Henikoff, S.; Henikoff, J.G. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 1992, 89, 10915–10919.
26. Kawashima, S.; Kanehisa, M. AAindex: Amino acid index database. Nucleic Acids Res. 2000, 28, 374.
27. Chou, K.C. Prediction of protein subcellular locations by pseudo amino acid composition. J. Theor. Biol. 2001, 214, 1–16.
28. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
29. Guyon, I.; Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res. 2003, 3, 1157–1182.
30. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
31. Harrell, F.E., Jr.; Lee, K.L.; Mark, D.B. Multivariable prognostic models: Issues and methods. Stat. Med. 1996, 15, 361–387.
32. Roy, K.; Chakraborty, P.; Mitra, I.; Ojha, P.K.; Kar, S.; Das, R.N. Some case studies on application of “r_m²” metrics for judging quality of quantitative structure–activity relationship predictions: Emphasis on scaling of response data. J. Comput. Chem. 2013, 34, 1071–1082.
33. Davis, J.; Goadrich, M. The relationship between Precision–Recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning (ICML), Pittsburgh, PA, USA, 25–29 June 2006; pp. 233–240.
34. Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 2006, 27, 861–874.
35. Zhao, L.; Wang, J.; Pang, L.; Liu, Y.; Zhang, J. GANsDTA: Predicting drug-target binding affinity using GANs. Front. Genet. 2020, 10, 1243.
36. Shim, J.; Hong, Z.Y.; Sohn, I.; Hwang, C. Prediction of drug–target binding affinity using similarity-based convolutional neural network. Sci. Rep. 2021, 11, 4416.
37. Liyaqat, T.; Ahmad, T.; Saxena, C. TeM-DTBA: Time-efficient drug target binding affinity prediction using multiple modalities with Lasso feature selection. J. Comput. Aided Mol. Des. 2023, 37, 573–584.

Figure 1. DTBAffinity pipeline (overview). Steps: (1) Inputs; (2) featurization (drug and protein); (3) preprocessing; (4) feature selection and pairing; (5) XGBoost regression; and (6) evaluation and outputs.

Figure 2. Evaluation of binding affinity prediction: predicted vs. actual and residuals—KIBA dataset.

Figure 3. Evaluation of binding affinity prediction: predicted vs. actual and residuals—Davis dataset.

Figure 4. Davis results across metrics: DTBAffinity attains the lowest MSE and the highest CI and AUPR, with RM2 competitive, in agreement with Table 7.

Figure 5. Performance of drug–target affinity models across MSE, CI, RM2, and AUPR—KIBA dataset.

Figure 6. Top 10 most important features—KIBA dataset.

Figure 7. Top 10 most important features—Davis dataset.

Table 1. Composition and size of the benchmark datasets.

Dataset	Drugs	Proteins	Interactions	Reference
KIBA	2111	229	118,254	[5]
Davis	68	442	30,056	[4]

Table 2. Feature families and dimensionalities used in DTBAffinity.

Category	Feature Family/Model	Dimensionality	Description
Proteins	iFeatureOmega descriptors [24]	23,743	Handcrafted sequence-based features (composition, autocorrelation, physicochemical indices, and evolutionary profiles).
	ESM-1b [14]	1280	Transformer embeddings trained on UniRef50.
	ESM-2 [15]	1280	Larger-scale transformer trained on UniRef50/UniRef90.
Compounds	PubChem-like RDKFingerprint	2048 (bits)	Path-based structural fingerprint approximating PubChem (implemented in RDKit).
	MACCS keys [18]	166 (bits)	Common substructure presence/absence patterns.
	FP4 fingerprints [20]	307 (bits)	Functional group-based OpenBabel fingerprints.
	Morgan/ECFP [21]	668 (bits)	Circular fingerprints capturing atomic neighborhoods.
	RDKit descriptors [18]	190	Physicochemical and topological molecular descriptors.
	Mordred descriptors [19]	672	Extended 2D descriptor library (information-theoretic, connectivity, and autocorrelation).
	iFeatureOmega descriptors [24]	1016	SMILES-derived descriptor/fingerprint set computed by iFeatureOmega.

Table 3. Chosen XGBoost parameters for Davis and KIBA.

Parameter	Value for Davis	Value for KIBA	Default	Notes
random_state	42	42	0	Fixed seed for reproducibility.
n_estimators	2500	2500	100	Number of boosting rounds.
tree_method	hist	hist	auto	Histogram-based method (fast and parallel).
device	cuda	cuda	—	GPU-enabled build when available.
learning_rate	0.1	0.1	0.3	Step size shrinkage.
max_depth	6	8	6	Maximum tree depth (default for Davis; deeper for KIBA).
subsample	1.0	1.0	1.0	Use all samples (default).
colsample_bytree	1.0	1.0	1.0	Use all features (default).
reg_alpha	1.0	0.1	0.0	L1 regularization.
reg_lambda	10	10	1.0	L2 regularization.
objective	reg:squarederror	reg:squarederror	reg:squarederror	Loss function.

Table 4. Average performance of DTBAffinity under 5-fold cross-validation (validation folds, mean ± SD) and on the held-out test set.

Dataset	MSE (±SD)	R² (±SD)	Pearson (±SD)	Spearman (±SD)
Davis’ validation set	0.2140 ± 0.0103	0.7325 ± 0.0034	0.8571 ± 0.0017	0.7057 ± 0.0040
Davis’ testing set	0.1885	0.7649	0.8761	0.7297
KIBA’s validation set	0.1662 ± 0.0051	0.7641 ± 0.0071	0.8749 ± 0.0043	0.8561 ± 0.0022
KIBA’s testing set	0.1540	0.7734	0.8799	0.8648

Table 5. KIBA: Performance for feature combinations (proteins/drugs).

Protein Features	Compound Features	MSE	Pearson	Spearman	C-Index	Rm2	R²
iFeatureOmega	iFeatureOmega	0.1641	0.8715	0.8577	0.8636	0.7295	0.7586
iFeatureOmega	iFeatureOmega + Babel_Chemicals	0.1592	0.8755	0.8615	0.8659	0.7398	0.7657
iFeatureOmega + ESM	iFeatureOmega + Babel_Chemicals	0.1538	0.8801	0.8638	0.8677	0.7473	0.7737
iFeatureOmega + ESM	iFeatureOmega + Babel + Morgan_ECFP + Mordred	0.1540	0.8799	0.8648	0.8686	0.7455	0.7734

Table 6. Davis: Performance for feature combinations (proteins/drugs).

Protein Features	Compound Features	MSE	Pearson	Spearman	C-Index	Rm2	R²
iFeatureOmega	iFeatureOmega	0.2022	0.8660	0.7235	0.9055	0.7067	0.7477
iFeatureOmega	iFeatureOmega + Babel_Chemicals	0.1978	0.8696	0.7233	0.9062	0.7073	0.7532
iFeatureOmega + ESM	iFeatureOmega + Babel_Chemicals	0.1932	0.8725	0.7209	0.9049	0.7167	0.7589
iFeatureOmega + ESM	iFeatureOmega + Babel + Morgan_ECFP + Mordred	0.1885	0.8761	0.7297	0.9102	0.7200	0.7649

Table 7. Davis: Baselines vs. our DTBAffinity. Bold values represent the current study.

Method	MSE	CI	Rm2	AUPR	Published Year
KronRLS	0.3790	0.8710	0.4070	0.6610	2015 [8]
SimBoost	0.2820	0.8720	0.6440	0.7090	2017 [9]
DeepDTA	0.2610	0.8780	0.6300	0.7140	2018 [10]
GANsDTA	0.2760	0.8810	0.6530	0.6910	2020 [35]
SimCNN-DTA	0.3059	0.8552	0.5952	0.6572	2021 [36]
TeM-DTBA	0.2319	0.8878	0.7160	0.7137	2023 [37]
DTBAffinity (ours)	0.1885	0.9102	0.7200	0.8112	Current study

Table 8. KIBA: Baselines vs. our DTBAffinity. Bold values represent the current study.

Method	MSE	CI	Rm2	AUPR	Published Year
KronRLS	0.411	0.782	0.342	0.635	2015 [8]
SimBoost	0.222	0.836	0.629	0.76	2017 [9]
DeepDTA	0.194	0.863	0.673	0.788	2018 [10]
GANsDTA	0.224	0.866	0.675	0.753	2020 [35]
SimCNN-DTA	0.2576	0.8216	0.5734	0.7213	2021 [36]
TeM-DTBA	0.188	0.868	0.736	0.771	2023 [37]
DTBAffinity (ours)	0.1540	0.8686	0.7455	0.8361	Current study

Table 9. Performance on the KIBA dataset under the fixed split vs. cold-start splits.

Metric	Fixed Split	Cold-Drug	Cold-Target
MSE	0.154	0.361	0.3603
R²	0.7734	0.4521	0.5529
Pearson	0.8799	0.6773	0.7489
Spearman	0.8648	0.6742	0.7098
CI	0.8686	0.7657	0.7821
Rm2	0.7455	0.4348	0.5297

Table 10. Davis: Performance under the fixed split vs. cold-start splits (cold-drug and cold-target).

Metric	Fixed Split	Cold-Drug	Cold-Target
MSE	0.1885	0.4639	0.2724
R²	0.7649	0.0805	0.677
Pearson	0.8761	0.5022	0.8241
Spearman	0.7297	0.4464	0.6755
CI	0.9102	0.7846	0.8782
Rm2	0.72	0.2133	0.6575
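
The cold-drug and cold-target columns in Tables 9 and 10 come from splits in which the held-out entities never appear during training. The following is a minimal sketch of such a split, not the study's actual procedure: the record layout `(drug_id, target_id, affinity)`, the 20% held-out fraction, the seed, and the function name `cold_split` are all assumptions made for illustration.

```python
import random


def cold_split(pairs, key, test_frac=0.2, seed=0):
    """Split interaction records so that no entity selected by `key`
    occurs in both the training and the test set.

    `test_frac` is the fraction of unique entities (not of pairs) held out;
    the default of 0.2 is an assumed value.
    """
    entities = sorted({key(p) for p in pairs})
    rng = random.Random(seed)
    rng.shuffle(entities)
    n_test = max(1, round(test_frac * len(entities)))
    held_out = set(entities[:n_test])
    train = [p for p in pairs if key(p) not in held_out]
    test = [p for p in pairs if key(p) in held_out]
    return train, test


# Toy interaction table: 5 hypothetical drugs x 3 hypothetical targets.
pairs = [(d, t, 0.0) for d in ["d1", "d2", "d3", "d4", "d5"]
         for t in ["t1", "t2", "t3"]]

# Cold-drug: group by drug ID. Cold-target: group by target ID.
drug_train, drug_test = cold_split(pairs, key=lambda p: p[0])
target_train, target_test = cold_split(pairs, key=lambda p: p[1])
```

Because the split is made over unique entities, the number of test pairs depends on how many interactions each held-out drug or target has, so the realized test fraction can differ from `test_frac`.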

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

