1. Introduction
Acute Myeloid Leukemia (AML) is a highly aggressive and heterogeneous hematological malignancy. In 2021, the global burden of acute myeloid leukemia included approximately 145,000 new cases and 130,000 deaths, highlighting its major public health impact [1]. In the United States alone, an estimated 20,050 new cases and 11,540 deaths were reported in the following year (2022) [2]. Its treatment remains a major clinical challenge due to the extensive variability in patient-specific responses to therapies [3,4,5,6]. Despite advances in precision oncology and the availability of ex vivo drug screening data, selecting the most effective therapy for each patient remains an unresolved and pressing problem.
Machine learning (ML) offers significant promise in addressing this challenge by helping identify effective treatments. However, the structure of precision oncology problems deviates from conventional ML tasks like classification or regression. Instead, it presents a complex assignment problem, where the goal is to recommend the most effective drug—or drug combination—for an individual based on high-dimensional molecular and clinical data.
From a supervised learning standpoint, this challenge goes beyond estimating general drug efficacy. It requires making a targeted recommendation, selecting the single most effective drug from a pool of candidates for each patient [7]. In principle, if drug efficacy could be modeled precisely through regression, the assignment problem would reduce to selecting the drug with the highest predicted response. However, this theoretical simplicity often breaks down in practice.
Most traditional approaches rely on regression-based models to estimate continuous drug response metrics such as IC50 or AUC for each drug–cell line pair [8,9,10,11,12,13,14,15,16,17]. While informative, these models do not directly optimize for the ultimate clinical goal: identifying the most effective drug for a patient [18]. Furthermore, regression models tend to perform best around average response values but struggle at the extremes [19], which is particularly problematic in clinical settings where the aim is to identify highly sensitive drugs.
To better align modeling objectives with clinical decision-making, some researchers have reframed the task as a binary classification problem, distinguishing between “sensitive” and “resistant” responses. Models like CDSML [20] and others [21,22,23] have shown strong performance using both traditional algorithms (e.g., Random Forest, KNN) and deep learning architectures (e.g., RefDNN). Yet, this binary classification approach remains overly simplistic: while identifying a drug as “sensitive” provides useful information, it does not prioritize among multiple effective options. As a result, it offers limited utility in guiding optimal treatment decisions where ranking efficacy is crucial.
Hybrid strategies offer more granularity. For example, SAURON-RF integrates classification and regression within a Random Forest framework to prioritize effective drugs while accounting for data imbalance [19]. Meanwhile, ranking-based approaches attempt to model the relative efficacy of drugs rather than absolute values. Kernelized Rank Learning (KRL), for instance, optimizes a ranking loss function to approximate drug sensitivities using kernelized linear regression [18]. Similarly, Ref. [14] proposed neural ranking strategies: Pair-PushC to prioritize effective drugs, List-One to identify the best drug, and List-All to rank all effective options. These methods aim to improve model alignment with the task of drug prioritization.
Reinforcement learning (RL) introduces another promising avenue. Rather than directly predicting response values, RL-based models treat drug assignment as a sequential decision-making problem, optimizing a policy that maximizes long-term outcomes. Methods like PPO-Rank [24] use a Markov Decision Process to rank drugs in a way that maximizes a clinical reward signal. A broader review of RL applications in oncology underscores its potential to develop adaptive treatment strategies focused on maximizing therapeutic benefit [25].
Unsupervised learning approaches, while less common, also show promise. When appropriate features are selected, patients who respond optimally to similar drugs may cluster, suggesting potential for precision drug matching based on intrinsic structure in the data [26,27]. Altogether, these diverse methodologies (classification, regression, ranking, RL, and clustering) highlight the multifaceted nature of precision medicine and its deep ties to multiple areas of machine learning.
In this study, we address the drug assignment problem by developing and evaluating tree-based models for multiclass drug assignment, where the target labels are the most effective drugs for a given tumor. This formulation shifts away from classifying drugs as merely sensitive or resistant and instead treats drug recommendation as a multiclass prediction task.
We leverage the well-known advantages of ensemble models like Random Forest and XGBoost, which offer robustness, interpretability, and strong performance with tabular multi-omics data. However, these off-the-shelf models are not inherently designed to distinguish the best drug among many—they typically predict outcomes per instance, rather than optimizing over a comparative set.
To overcome this limitation, we introduce SEATS (Systematic Efficacy Assignment with Treatment Seats), a novel method that repurposes standard classifiers for personalized drug recommendation. SEATS works by assigning a fixed number of “treatment seats” to each sample, distributed among candidate drugs in proportion to their measured or estimated efficacy. These weighted seat allocations guide training by adjusting the label structure, ensuring that both the top drug and other highly effective candidates contribute to the learning process. This framework enables tree-based models to better handle multiclass drug selection while preserving their interpretability and ease of use.
In addition to SEATS, we include an Optimal Decision Tree (ODT) model in our evaluation to represent a more transparent, rule-based approach to treatment assignment. While SEATS adapts existing models (RF, XGBoost) to solve the assignment problem, ODT is a model newly tailored to address this problem directly. We evaluate all models based on their ability to correctly identify the optimal drug, their generalizability to unseen patients, and their practical deployability in real-world clinical contexts. Validation is conducted on the BeatAML2 dataset (Waves 1 and 2 for training, Waves 3 and 4 for testing), and the models are externally tested on AML cell lines from GDSC.
Through this work, we demonstrate that enhanced tree-based models, guided by our SEATS framework, can bridge the gap between predictive modeling and actionable treatment recommendation—delivering a scalable, interpretable, and clinically relevant solution for precision oncology. In this work, we make two main contributions: (1) the introduction of SEATS (Systematic Efficacy Assignment with Treatment Seats), which adapts standard tree-based models such as Random Forest and XGBoost to the drug assignment problem; (2) the evaluation of an Optimal Decision Tree (ODT), a novel interpretable framework tailored to multiclass drug selection. Together, these contributions address both performance and interpretability, key challenges in precision oncology.
2. Materials and Methods
2.1. Datasets
The models were trained using the BeatAML2 dataset, a comprehensive resource encompassing AML patient specimens with integrated molecular, clinical, and drug response data. The dataset includes ex vivo drug sensitivity profiles for an extensive panel of compounds, detailed clinical annotations, and DNA and RNA sequencing data. Drug sensitivity was quantified by fitting dose–response curves using probit analysis across seven concentration points, measuring the relationship between drug concentration and cell viability. Quality control procedures were applied to ensure the reliability of measurements, particularly in cases with replicate data points [28].
The full dataset includes responses from 389 patient-derived cell lines treated with 166 experimental and approved oncology drugs. Genomic profiles consist of 2279 nonsynonymous mutations and 22,843 gene expression features. To focus on biologically relevant features and reduce noise, we retained only mutations present in at least 1% of patients and selected the top 5000 most variable genes by expression. We further excluded drugs that were either highly toxic or insufficiently tested (i.e., administered to fewer than 70% of patients), and removed cell lines treated with fewer than 80% of the available drugs.
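The filtering steps described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' pipeline: the function names, the dictionary-based data layout, and the order of the two coverage filters (drugs first, then samples) are assumptions, and the toxicity filter is not shown because its criterion is not given here.

```python
from statistics import pvariance


def filter_features(mutations, expression, min_mut_freq=0.01, top_genes=5000):
    """Keep recurrent mutations and the most variable genes.

    mutations:  dict mutation -> list of 0/1 flags, one per patient
    expression: dict gene -> list of expression values, one per patient
    """
    # Retain mutations present in at least `min_mut_freq` of patients.
    kept_mut = {m: flags for m, flags in mutations.items()
                if sum(flags) / len(flags) >= min_mut_freq}
    # Rank genes by variance across patients and keep the top ones.
    ranked = sorted(expression, key=lambda g: pvariance(expression[g]), reverse=True)
    kept_expr = {g: expression[g] for g in ranked[:top_genes]}
    return kept_mut, kept_expr


def filter_response(ic50, min_drug_cov=0.70, min_sample_cov=0.80):
    """Drop sparsely tested drugs, then sparsely profiled samples.

    ic50: dict sample -> dict drug -> IC50 value, or None when untested.
    Assumption: drugs are filtered before samples.
    """
    samples = list(ic50)
    drugs = sorted({d for row in ic50.values() for d in row})

    def tested(sample, drug):
        return ic50[sample].get(drug) is not None

    kept_drugs = [d for d in drugs
                  if sum(tested(s, d) for s in samples) / len(samples) >= min_drug_cov]
    if not kept_drugs:
        return {}
    kept_samples = [s for s in samples
                    if sum(tested(s, d) for d in kept_drugs) / len(kept_drugs) >= min_sample_cov]
    return {s: {d: ic50[s].get(d) for d in kept_drugs} for s in kept_samples}
```

Filtering drugs before samples means each sample's coverage is computed only over the drugs that survive, so a sample is not penalized for missing a drug that was dropped anyway.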
For model development and internal validation, we used Waves 1 and 2 of the BeatAML2 cohort, comprising drug response data from 257 AML cell lines. For external testing, we used Waves 3 and 4, which include 142 additional cell lines. After filtering, the resulting dataset included 119 drugs, 70 coding-region mutations, and 5000 gene expression features per cell line.
For external validation, we used the Genomics of Drug Sensitivity in Cancer (GDSC) dataset, a large-scale pharmacogenomics resource containing drug response profiles and genomic features across numerous cancer cell lines [29]. We identified 23 AML cell lines with available IC50 values for 53 drugs that overlapped with those in the BeatAML2 dataset. Due to the differences in experimental protocols, validation on GDSC was restricted to mutational features only. We chose to focus on mutations for this external test because they can be reliably derived from formalin-fixed paraffin-embedded samples. DNA is less prone to degradation than RNA, making mutation profiles more stable and transferable across datasets.
2.2. Modeling Drug Response: Computation of IC50*
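As we framed drug assignment as a multiclass classification problem, each patient or cell line must be assigned to a class corresponding to the most effective drug. Effectiveness was determined using the IC50* metric, a normalized version of the traditional IC50 (half-maximal inhibitory concentration) described in [30]. IC50* is computed by subtracting the mean IC50 across patients for each drug from the individual IC50 value:
$$\mathrm{IC50}^{*}_{p,d} = \mathrm{IC50}_{p,d} - \frac{1}{N_d}\sum_{q=1}^{N_d}\mathrm{IC50}_{q,d}$$
where $p$ indexes patients, $d$ indexes drugs, and $N_d$ is the number of patients tested with drug $d$.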
This transformation centers the IC50 values of each drug around zero, allowing for comparisons across drugs while reducing biases due to systematic potency differences. Drugs with high variability in IC50* across patients are more informative for identifying personalized responses. The drug with the lowest IC50* for each patient is designated as the oracle drug, representing the most effective treatment. These oracle assignments serve as the ground truth for supervised training but are later refined using the SEATS method.
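A small worked example may help show why the centering matters. The sketch below, with hypothetical function names and toy values, subtracts each drug's mean IC50 across patients and then picks the drug with the lowest IC50* as the oracle drug.

```python
from statistics import mean


def center_ic50(ic50):
    """Return IC50* values: each drug's mean across patients is removed.

    ic50: dict patient -> dict drug -> IC50 value.
    """
    drugs = {d for row in ic50.values() for d in row}
    drug_mean = {d: mean(row[d] for row in ic50.values() if d in row) for d in drugs}
    return {p: {d: v - drug_mean[d] for d, v in row.items()}
            for p, row in ic50.items()}


def oracle_drug(ic50_star_row):
    """The oracle drug is the one with the lowest IC50* for the patient."""
    return min(ic50_star_row, key=ic50_star_row.get)


# Toy data (illustrative values only).
raw = {"p1": {"A": 1.0, "B": 5.0},
       "p2": {"A": 3.0, "B": 6.0}}
star = center_ic50(raw)
oracles = {p: oracle_drug(row) for p, row in star.items()}
```

In this toy case drug A has the lower raw IC50 for both patients, yet after centering patient p2 is assigned drug B, because p2 responds better to B than the average patient does while responding worse than average to A.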
2.3. Machine Learning Models
2.3.1. Off-the-Shelf Models
Random Forest (RF) is a tree-based ensemble model that constructs multiple decision trees using bootstrap samples and random feature subsets at each node. This two-stage randomness reduces overfitting and enhances model generalization. RF models are well-suited for high-dimensional genomic data due to their ability to capture nonlinear feature interactions and handle multicollinearity [31].
XGBoost is a gradient-boosted decision tree algorithm that builds models sequentially, with each tree minimizing the errors of its predecessor. It introduces regularization to avoid overfitting, making it particularly robust in noisy or imbalanced settings [32]. While XGBoost has been widely applied in other domains, its application in drug response prediction has been limited, warranting further exploration.
Both RF and XGBoost models were first trained using the oracle labels and later retrained using SEATS-derived labels to compare performance.
2.3.2. The SEATS Method
To better reflect the complexity inherent in biological systems, we developed SEATS (Systematic Efficacy Assignment with Treatment Seats), a label generation method inspired by voting systems. Rather than assigning a single optimal drug per patient, SEATS allocates a fixed number of “seats” across multiple drugs based on their relative efficacy, allowing the model to learn from a probabilistic label distribution.
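The process begins by converting drug response scores (e.g., IC50*) into pseudoprobabilities using a softmax-like transformation:
$$P_{p,d} = \frac{\exp\left(-\gamma\,\mathrm{IC50}^{*}_{p,d}\right)}{\sum_{d'}\exp\left(-\gamma\,\mathrm{IC50}^{*}_{p,d'}\right)}$$
where $\mathrm{IC50}^{*}_{p,d}$ is the IC50* for drug $d$ in patient $p$, and γ is a tunable hyperparameter controlling the sharpness of the distribution. The negative sign reflects that lower IC50* values indicate greater efficacy. A high γ enforces a “winner-takes-all” dynamic, favoring the best-performing drug, while a low γ allows a more proportional distribution among several candidates (Figure 1).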
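The effect of γ can be illustrated with a short sketch. It assumes the softmax is taken over negated IC50* values (so that lower IC50* yields higher probability); the function name and the toy values are hypothetical, and the exact form used by the authors may differ.

```python
import math


def seat_probabilities(ic50_star_row, gamma):
    """Softmax over negated IC50* values for one patient.

    ic50_star_row: dict drug -> IC50* (lower means more effective).
    gamma: sharpness; larger values concentrate mass on the best drug.
    """
    scores = {d: -gamma * v for d, v in ic50_star_row.items()}
    top = max(scores.values())  # subtract the maximum for numerical stability
    weights = {d: math.exp(s - top) for d, s in scores.items()}
    total = sum(weights.values())
    return {d: w / total for d, w in weights.items()}


row = {"A": -1.0, "B": 0.0, "C": 1.0}       # toy IC50* values
soft = seat_probabilities(row, gamma=0.1)   # close to proportional
sharp = seat_probabilities(row, gamma=10)   # close to winner-takes-all
```

With these toy values, the low-γ setting spreads probability fairly evenly over the three drugs, while the high-γ setting puts nearly all of it on drug A, the drug with the lowest IC50*.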
Next, as shown in Figure 2, each patient is assigned a fixed number of seats (e.g., 3), which are proportionally distributed among drugs according to the pseudoprobabilities. These fractional values are rounded to integers, and any discrepancy from the desired total number of seats is corrected via an adjustment process: excess seats are removed from drugs with the highest over-assignment, and deficits are filled by those with the largest under-assignment.
The final allocation determines the expanded training labels. For instance, if a patient assigns two seats to Drug A and one to Drug B, the dataset is expanded with three rows—two labeled as Drug A and one as Drug B. This expanded dataset is then used to train the RF and XGBoost models, effectively emphasizing drugs with partial efficacy while preserving the best-performing drug as the dominant signal. Tuning γ and the number of seats allows SEATS to control the granularity of the drug selection process, enabling more personalized and robust drug recommendation strategies.
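The seat allocation and dataset expansion can be sketched as below. This is one possible reading of the procedure rather than the reference implementation: the function names are hypothetical, Python's built-in `round` (which rounds ties to the nearest even integer) stands in for the unspecified rounding rule, and ties in the correction step are broken by dictionary order.

```python
def allocate_seats(probs, n_seats=3):
    """Distribute `n_seats` among drugs in proportion to `probs`.

    probs: dict drug -> pseudoprobability (summing to 1).
    """
    ideal = {d: p * n_seats for d, p in probs.items()}
    seats = {d: round(x) for d, x in ideal.items()}
    # Too many seats: remove from the drug with the largest over-assignment.
    while sum(seats.values()) > n_seats:
        d = max((d for d in seats if seats[d] > 0),
                key=lambda d: seats[d] - ideal[d])
        seats[d] -= 1
    # Too few seats: add to the drug with the largest under-assignment.
    while sum(seats.values()) < n_seats:
        d = max(seats, key=lambda d: ideal[d] - seats[d])
        seats[d] += 1
    return seats


def expand_dataset(features, seats_by_patient):
    """Repeat each patient's feature vector once per allocated seat.

    Returns a list of (patient_id, feature_vector, drug_label) rows.
    """
    rows = []
    for patient, seats in seats_by_patient.items():
        for drug, n in seats.items():
            rows.extend((patient, features[patient], drug) for _ in range(n))
    return rows


seats = allocate_seats({"A": 0.6, "B": 0.3, "C": 0.1}, n_seats=3)
rows = expand_dataset({"p1": [0, 1, 1]}, {"p1": seats})
```

Here the patient gives two seats to drug A and one to drug B, so the expanded dataset contains three rows for that patient, mirroring the example in the text. Keeping the patient identifier on each row makes it possible to keep all rows from one patient in the same cross-validation fold.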
2.3.3. Optimal Decision Trees
We also included Optimal Decision Trees, a model introduced in our previous work [33], specifically tailored for the drug assignment task. Unlike standard tree-based models, which aim to optimize label prediction, ODT explicitly selects both a splitting variable and a treatment at each node to maximize drug efficacy. This makes the model uniquely suited for clinical decision-making where the goal is not just prediction, but actionable treatment assignment.
At each decision point, the algorithm evaluates all candidate biomarkers (e.g., mutations or gene expression levels) and assigns different drugs to the resulting branches, depending on the patient subgroup (e.g., presence or absence of a mutation). The model optimizes for the total drug sensitivity within each subgroup, recursively splitting until no further gain can be made or a minimum number of patients per group is reached. This process yields compact, interpretable trees that directly link patient features to optimal therapies.
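The splitting logic can be illustrated with a simplified greedy sketch. It is not the ODT implementation from [33]: it handles only binary markers (continuous expression features would need thresholds), it uses summed IC50* as the cost to minimize, and all names and toy values are hypothetical.

```python
def best_drug(patients, ic50_star):
    """Drug with the lowest summed IC50* over a patient subgroup."""
    drugs = ic50_star[patients[0]].keys()
    totals = {d: sum(ic50_star[p][d] for p in patients) for d in drugs}
    drug = min(totals, key=totals.get)
    return drug, totals[drug]


def grow_tree(patients, markers, ic50_star, min_leaf=2):
    """Recursively choose the marker whose split lowers total IC50* the most."""
    leaf_drug, leaf_cost = best_drug(patients, ic50_star)
    best = None
    for m in markers[patients[0]]:
        present = [p for p in patients if markers[p][m]]
        absent = [p for p in patients if not markers[p][m]]
        if len(present) < min_leaf or len(absent) < min_leaf:
            continue  # respect the minimum subgroup size
        cost = best_drug(present, ic50_star)[1] + best_drug(absent, ic50_star)[1]
        if cost < leaf_cost and (best is None or cost < best[0]):
            best = (cost, m, present, absent)
    if best is None:  # no split improves on a single drug for this group
        return {"drug": leaf_drug}
    _, m, present, absent = best
    return {"marker": m,
            "present": grow_tree(present, markers, ic50_star, min_leaf),
            "absent": grow_tree(absent, markers, ic50_star, min_leaf)}


def assign(tree, marker_row):
    """Follow the tree from the root to a leaf and return its drug."""
    while "drug" not in tree:
        tree = tree["present"] if marker_row[tree["marker"]] else tree["absent"]
    return tree["drug"]


# Toy data: two hypothetical binary markers and two drugs.
markers = {"p1": {"M1": 1, "M2": 1}, "p2": {"M1": 1, "M2": 0},
           "p3": {"M1": 0, "M2": 1}, "p4": {"M1": 0, "M2": 0}}
ic50_star = {"p1": {"A": -2.0, "B": 1.5}, "p2": {"A": -1.0, "B": 1.5},
             "p3": {"A": 1.0, "B": -2.0}, "p4": {"A": 2.0, "B": -1.0}}
tree = grow_tree(list(markers), markers, ic50_star)
```

On this toy data the tree splits on marker M1 and assigns drug A when it is present and drug B when it is absent, which is the kind of compact biomarker-to-treatment rule the paragraph describes.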
Figure 3 illustrates the framework of ODT, detailing how it assigns drugs based on biomarker profiles.
For this study, we built ODT models using the same BeatAML2 training and validation splits as for the other models.
Hyperparameter tuning was performed for all models using 5-fold cross-validation. Only the best-performing models are reported, with their corresponding hyperparameters detailed in Table A1, Table A2 and Table A3.
2.4. Evaluation
To systematically compare the proposed algorithms, we evaluated each model based on four key criteria: accuracy, multi-omics suitability, explainability, and implementability. Accuracy was assessed using five-fold cross-validation on the BeatAML2 dataset with Waves 1 and 2. The dataset was divided into five subsets, ensuring that no patient in the training set was also included in the test set. This process was repeated five times, allowing each subset to serve as the test fold once. By training on some patients and testing on others not included in the training, this approach effectively simulates conditions for extrapolating data to untested patients, which is essential for drug repositioning and precision medicine. After cross-validation, the models were tested on an independent set comprising Waves 3 and 4, using models trained on the entirety of Waves 1 and 2. This evaluates the model’s performance on unseen patients.
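Because SEATS can produce several training rows per patient, splitting folds by patient rather than by row is what keeps a patient from appearing in both the training and test sets. The sketch below shows one way to build such folds with the standard library; the function name is hypothetical, and whether the authors split before or after label expansion is not stated here.

```python
import random


def patient_folds(patient_ids, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs with no patient shared across them.

    patient_ids: one patient identifier per dataset row; a patient may
    appear on several rows (for example, after seat-based expansion).
    """
    unique = sorted(set(patient_ids))
    random.Random(seed).shuffle(unique)
    # Assign whole patients, not rows, to folds in round-robin order.
    fold_of = {p: i % k for i, p in enumerate(unique)}
    for fold in range(k):
        test = [i for i, p in enumerate(patient_ids) if fold_of[p] == fold]
        train = [i for i, p in enumerate(patient_ids) if fold_of[p] != fold]
        yield train, test


# Ten hypothetical patients with three rows each.
ids = ["p%d" % (i // 3) for i in range(30)]
folds = list(patient_folds(ids, k=5, seed=1))
```

Each row index lands in the test set of exactly one fold, and all rows of a given patient move together.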
External validation was performed using AML cell lines from the GDSC dataset, which provided an opportunity to test the generalizability of the models in an independent cohort subject to different experimental protocols, novel cell lines, and measurement platforms.
To assess multi-omics suitability, we examined the predictive performance of models trained separately with either gene expression or mutation data. This allowed us to evaluate the versatility of each algorithm in integrating different types of molecular information, which is critical for personalized treatment strategies.
Explainability was evaluated by analyzing the number of variables used by each model and determining the ease with which predictions could be interpreted. This involved assessing whether the model logic could be transparently visualized—as with decision trees—or if informative feature importance metrics could be extracted. Finally, implementability was assessed by measuring training times, considering prediction latency, and evaluating the potential to convert model outputs into practical clinical tools, such as visual decision aids or biomarker panels.
4. Discussion
Personalized treatment selection in AML and other cancers remains a major challenge due to disease complexity and heterogeneity. This study addresses that challenge by introducing SEATS (Systematic Efficacy Assignment with Treatment Seats), a novel strategy that improves drug recommendation performance by enhancing how machine learning models are trained. Our findings support the hypothesis that drug selection can be treated as a multiclass classification task, where the model learns to predict the single most effective therapy for each patient based on their molecular features.
SEATS redefines the learning signal by allocating multiple top-performing drugs (or “seats”) to each patient during training, rather than just the single best option. This allows models like Random Forest and XGBoost to capture richer, more generalizable relationships between molecular features and drug efficacy. Across both internal validation and external testing, SEATS consistently led to improvements in predictive performance, particularly when models were trained on mutation data.
Results emphasize the critical role of data modality. Models trained on gene expression data outperformed those using mutation data, reinforcing previous findings [28] that transcriptomic profiles capture essential biological signals such as differentiation state and lineage identity in AML. This highlights the importance of leveraging dynamic molecular information in predictive modeling for drug response.
Model generalizability was evaluated through external validation on AML cell lines from the GDSC dataset. Despite differences in biological context and experimental protocols, the models maintained reasonable predictive accuracy, with those incorporating the SEATS method consistently outperforming their counterparts.
Tree-based approaches offered a balance between accuracy and interpretability. SEATS is especially valuable in this context because it augments standard ensemble models without requiring complex modifications. However, ODT stands out for its simplicity and transparency. As shown, the decision path within the ODT clearly links biomarker status to treatment decisions, facilitating intuitive understanding for clinicians. Remarkably, ODTExp achieved strong predictive performance using as few as five genes, demonstrating its capacity to support cost-effective and actionable clinical workflows.
In terms of clinical applicability, both SEATS-enhanced models and ODT demonstrated practical advantages. SEATS-compatible methods can offer biological insight through feature importance metrics. Additionally, all models demonstrated efficient training and inference times, typically under a minute, reinforcing their suitability for rapid clinical decision support. SEATS did not substantially increase computational complexity, which makes it an appealing addition to traditional algorithms.
While this work is focused on AML, the underlying methodology, particularly the SEATS strategy and the ODT framework, is generalizable to other cancers and diseases where treatment choices can be driven by patient-level omics data. It is important to clarify that the claim of clinical feasibility in this study is based solely on computational performance and interpretability. No formal validation with oncologists or prospective clinical trials was performed; such efforts represent a necessary direction for future work. Future directions should also include extension to other cancer types and experimental validation of the proposed drug assignments. Integrating additional data types such as proteomics or single-cell transcriptomics may also enhance model precision.
In conclusion, this study introduces SEATS as a robust enhancement to traditional machine learning models for precision treatment assignment. Combined with evidence supporting the value of gene expression features, SEATS lays the groundwork for more effective, transparent, and clinically relevant predictive models.