1. Introduction
Drug repurposing, the identification of new therapeutic indications for existing compounds, offers a faster and often safer route to treatment development than de novo discovery [1].
Cancers are heterogeneous, and subtypes often respond differently to chemotherapy, requiring tailored treatment strategies [2]. Establishing robust links between genomic/epigenomic features and clinical response is essential for precision therapies that match drugs to a patient’s molecular context [2].
Nevertheless, predicting drug benefit from molecular and clinical profiles remains a major challenge.
Transcriptomic profiling provides a scalable way to characterize drug action at the systems level, enabling data-driven prioritization of candidates that mimic disease-relevant cellular states [3]. Among available resources, the LINCS L1000 compendium aggregates large-scale perturbational signatures across thousands of compounds and genes, creating a challenge for computational approaches [4].
A broad spectrum of methods has been explored to analyze these data, from signature identification and correlation-based methods to network propagation and supervised learning [5,6,7].
However, high dimensionality and heterogeneity of transcriptomic data across cell types can obscure relevant biological signals and complicate interpretability. Dimensionality-reduction methods such as Uniform Manifold Approximation and Projection (UMAP), alongside classical Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE), make these data more interpretable by projecting them into lower dimensions while preserving meaningful patterns [8].
For example, recent work [8] has shown that projecting high-dimensional transcriptomes into low-dimensional representations can offer practical advantages. In yeast deletion profiles, UMAP grouped genes into known protein complexes and pathways and outperformed correlation/PCA for recovering validated interactions, indicating that spatial proximity in the reduced space reflects shared biological function [9]. Similar advantages have been reported in bulk human transcriptomics, where UMAP improved batch separation and highlighted biologically coherent groups better than PCA/MDS [10]. Together, these studies motivate using local neighborhoods in transcriptomic representations.
However, applications of dimensionality reduction (DR) techniques to drug-induced transcriptomic data remain limited [11]. Kwon et al. compared DR methods on drug-induced transcriptomes and showed that t-SNE and UMAP better preserved local and global structure and captured biological signals such as dose effects [11].
Network-based studies have previously used neighborhoods of mutated proteins to identify new cancer-related genes [12,13].
However, to our knowledge, no work has specifically examined the first-neighbor compounds around anticancer drugs or evaluated these immediate neighbors as potential drugs in cancer. This approach is based on the principle that compounds close to known anticancer drugs could share mechanisms with them, narrowing the search for new drugs to a tractable set of candidates.
The k-nearest neighbors (k-NN) algorithm is one of the most common machine learning strategies; it finds the points in a dataset that are closest to a query point. Applied to drug signatures, it computes a distance between pairs of drugs and, for each anticancer drug, retrieves the k closest non-anticancer compounds based on their transcriptomic signatures [14,15].
Here, we propose an unsupervised strategy to prioritize repurposing candidates for oncology. Starting from LINCS L1000 consensus drug signatures (1170 compounds × 7467 genes), we project the data with UMAP and retrieve, for each known anticancer drug, the k-nearest non-anticancer neighbors in the 2D space. By counting how often a compound recurs as a neighbor across all anticancer queries, we derive a frequency-based ranking of candidate drugs. We then repeat the entire procedure with PCA and t-SNE to test the robustness of the results.
By using dimensionality reduction and local neighborhood analysis on transcriptomic perturbation data, we provide a practical strategy to support drug repurposing in oncology. Our study is not limited to methodological evaluation, but contributes to applied sciences by offering a data-driven tool to accelerate the identification of candidate anticancer compounds.
2. Materials and Methods
2.1. Drug-Induced Gene Expression Signatures
Consensus drug perturbation profiles were obtained from the LINCS L1000 dataset, specifically the consensi-drugbank.tsv.bz2 file [16]. This dataset, consisting of 1170 compounds across 7467 genes, contains gene expression z-scores summarizing the transcriptional effects of drug treatments across multiple cell lines. The file was downloaded and imported into R using the readr package. Each row represents a drug, and each column corresponds to a gene identified by Entrez Gene ID.
2.2. Annotation of Anticancer Drugs
To label anticancer compounds, we used the Comparative Toxicogenomics Database (CTD), specifically the CTD_chemicals_diseases.tsv.gz file, which contains curated chemical–disease associations [17]. The dataset was filtered for cancer-related diseases. Specifically, a drug was considered anticancer if it appeared in association with any cancer-related term.
2.3. Dimensionality Reduction and Visualization
We applied UMAP to reduce the 7467 gene features to two components using the umap package in R [8]. The 2D layout was plotted using ggplot2 to visualize clustering patterns among drugs.
UMAP is a manifold-learning method from topological data analysis that embeds high-dimensional profiles into a low-dimensional space while preserving local structure, often yielding well-delineated clusters. In this study, we used n_neighbors = 15, min_dist = 0.1, n_epochs = 200, metric = “euclidean”, and set_op_mix_ratio = 1.
2.4. Unsupervised Clustering
K-means clustering was applied to the 2D UMAP embedding to group drugs into 6 clusters (k = 6) [18]. The default R settings were used: Hartigan–Wong algorithm, iter.max = 10, nstart = 1 (random initialization from the data). In k-means, the number of clusters (k) must be specified in advance; the algorithm iteratively estimates k centroids and assigns each sample to the nearest centroid.
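The assign-and-update loop described above can be sketched in a few lines. This is a minimal Python illustration, not the R code used in the study: it implements Lloyd's algorithm rather than the Hartigan–Wong variant, runs on toy 2D points, and uses a deterministic initialization (the first k points) purely for reproducibility. The function name lloyd_kmeans and the toy data are hypothetical.

```python
import math


def lloyd_kmeans(points, k, max_iter=10):
    """Lloyd-style k-means on 2D points (illustrative; not Hartigan-Wong)."""
    # Assumption: initialize centroids with the first k points for determinism.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
        if new_labels == labels:
            break  # assignments are stable, so the loop has converged
        labels = new_labels
    return labels, centroids


# Toy 2D "embedding" with two well-separated groups (hypothetical values).
toy_embedding = [(0.0, 0.0), (5.0, 5.0), (0.2, 0.1), (5.1, 4.9), (0.1, 0.3)]
toy_labels, toy_centroids = lloyd_kmeans(toy_embedding, k=2)
```

With a single random start (nstart = 1), the resulting partition can depend on the initialization, which is one reason to read the cluster assignments as exploratory.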
2.5. Nearest Neighbor Analysis and Candidate Identification
Using the FNN package, we computed the k-nearest neighbors of each anticancer compound, restricting the candidate pool to non-anticancer drugs.
For each anticancer drug, we computed the Euclidean distance to all non-anticancer compounds and retrieved the k = 5 closest neighbors.
Candidate drugs were then ranked by recurrence frequency, defined as the number of times a compound appeared among the five nearest neighbors of different anticancer drugs. The frequency of appearance across neighbors was used as a prioritization score. The choice of k = 5 was motivated by conventions in prior literature, where small neighborhood sizes are typically employed to capture local transcriptomic structure without introducing excessive noise [4,11,15]. We therefore selected k = 5 as a balance between ensuring stable neighbor identification and maintaining specificity in the detected associations.
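The recurrence score can be illustrated with a short sketch. The Python code below stands in for the FNN-based R workflow and assumes a dictionary of 2D coordinates plus a boolean anticancer label per drug; the drug identifiers, coordinates, and the function name recurrence_ranking are hypothetical, and k = 2 is used only to keep the toy example small.

```python
import math
from collections import Counter


def recurrence_ranking(coords, is_anticancer, k=5):
    """Rank non-anticancer drugs by how often they fall among the k nearest
    non-anticancer neighbors (Euclidean distance) of an anticancer drug."""
    queries = [d for d in coords if is_anticancer[d]]
    pool = [d for d in coords if not is_anticancer[d]]
    counts = Counter()
    for q in queries:
        # Sort the non-anticancer pool by distance to the query and keep k.
        nearest = sorted(pool, key=lambda d: math.dist(coords[q], coords[d]))[:k]
        counts.update(nearest)
    return counts.most_common()


# Hypothetical 2D embedding: A* are anticancer queries, N* are candidates.
toy_coords = {
    "A1": (0.0, 0.0), "A2": (1.0, 0.0), "A3": (10.0, 10.0),
    "N1": (0.5, 0.0), "N2": (0.5, 1.0), "N3": (9.0, 10.0), "N4": (20.0, 20.0),
}
toy_labels_knn = {d: d.startswith("A") for d in toy_coords}
toy_ranking = recurrence_ranking(toy_coords, toy_labels_knn, k=2)
```

Because every anticancer query contributes exactly k neighbors, the counts sum to k times the number of anticancer drugs, so the score reflects how those slots are distributed rather than absolute proximity.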
2.6. Validation
To assess the robustness of the UMAP-based findings, we repeated the dimensionality reduction using PCA and t-SNE [19]. The same input matrix used throughout the study was analyzed to ensure comparability with the UMAP space described above. PCA was performed with prcomp (R base stats), using centering and scaling. That is, before decomposition, each feature is centered and scaled to unit variance, so the PCA is effectively run on z-scored variables.
t-SNE was computed with Rtsne, using a default perplexity of 30.
In both PCA and t-SNE two-dimensional spaces, we replicated the nearest-neighbor candidate definition used for UMAP. Specifically, for each anticancer drug, we retrieved the k = 5 nearest neighbors among non-anticancer drugs using the same procedure previously described for UMAP.
In addition, for each candidate anticancer drug, we computed the k-nearest neighbors (k = 50) using Euclidean distance on the two embedding coordinates to assess whether its local neighborhood was enriched for anticancer drugs across the three representations (UMAP, PCA, and t-SNE). For each neighborhood, we performed a right-tailed hypergeometric test to evaluate enrichment of anticancer labels relative to the global baseline prevalence (K/M, where K is the number of anticancer drugs and M = N − 1 is the population size excluding the candidate). The resulting p-values were adjusted for multiple testing using the Benjamini–Hochberg (BH) correction, applied per candidate across the three tests (UMAP, PCA, and t-SNE).
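The enrichment test and the correction can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' R code: x is the number of anticancer drugs observed among the n neighbors, K and M follow the notation above, and the function names and toy inputs are hypothetical.

```python
from math import comb


def hypergeom_right_tail(x, M, K, n):
    """P(X >= x) when n items are drawn without replacement from a
    population of M items, K of which carry the anticancer label."""
    upper = min(K, n)
    total = comb(M, n)
    hits = sum(comb(K, i) * comb(M - K, n - i) for i in range(x, upper + 1))
    return hits / total


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy inputs (hypothetical): all 5 draws labeled, from 10 items with 5 labeled.
toy_p = hypergeom_right_tail(5, M=10, K=5, n=5)
# Three raw p-values standing in for one candidate's three embeddings.
toy_adjusted = bh_adjust([0.01, 0.04, 0.03])
```

With only three tests per candidate, the BH adjustment is mild: the smallest p-value is multiplied by three and the largest is left unchanged.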
In addition, we extracted the set of the Top-25 nearest non-anticancer neighbors (Euclidean distance in 2D) for each candidate. We then computed the Jaccard index between pairs of methods (UMAP–PCA, UMAP–t-SNE, PCA–t-SNE). Higher values indicate greater stability/overlap of the non-anticancer neighborhood across dimensionality reductions.
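The pairwise overlap can be computed as in the sketch below. It is written in Python for illustration and assumes one set of neighbor identifiers per embedding; the identifiers d1 to d5 and the variable names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B| of two collections of identifiers."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical neighbor sets for one candidate in each embedding.
toy_neighbors = {
    "UMAP": {"d1", "d2", "d3"},
    "PCA": {"d2", "d3", "d4"},
    "t-SNE": {"d3", "d4", "d5"},
}
toy_pairwise = {
    (m1, m2): jaccard(toy_neighbors[m1], toy_neighbors[m2])
    for m1, m2 in combinations(toy_neighbors, 2)
}
```

When both sets contain exactly 25 drugs and share i of them, the index equals i/(50 − i), so a value of 0.471 corresponds to 16 shared neighbors and 0.562 to 18.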
2.7. PROGENy Pathway Activity Inference
We quantified signaling pathway activity per compound using PROGENy (R package), which infers pathway activity from gene expression “footprints” learned from perturbation experiments [20].
Pathway activities were computed with progeny, which standardizes genes internally and uses all available genes to estimate activities for 14 canonical signaling pathways (e.g., Androgen, EGFR, Estrogen, Hypoxia, JAK–STAT, MAPK, NF-κB, PI3K, p53, TGF-β, TNF-α). The function returns a continuous activity score for each pathway and drug, where higher values indicate higher inferred pathway activation.
For comparability across compounds, pathway scores were converted to z-scores per pathway (subtracting the mean and dividing by the standard deviation across all drugs). These standardized scores were used to summarize the top up-regulated (highest z) and down-regulated (lowest z) pathways for each of the drugs of interest and to identify pathways consistently modulated across drugs.
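The per-pathway standardization and the top up/down summary can be sketched as follows. This Python illustration assumes pathway scores stored as a nested dictionary and uses the sample standard deviation (n − 1 denominator), which matches the default of R's sd but is an assumption about the original analysis. The drug names, scores, and function names are hypothetical.

```python
from statistics import mean, stdev


def zscore_by_pathway(scores):
    """scores: {pathway: {drug: activity}} -> same shape, z-scored per pathway."""
    out = {}
    for pathway, by_drug in scores.items():
        values = list(by_drug.values())
        mu = mean(values)
        sd = stdev(values)  # sample SD (n - 1 denominator), assumed here
        out[pathway] = {drug: (v - mu) / sd for drug, v in by_drug.items()}
    return out


def top_pathways(z, drug, n=5):
    """Return (most up-regulated, most down-regulated) pathways for one drug."""
    ranked = sorted(z, key=lambda p: z[p][drug])  # ascending z-score
    return ranked[::-1][:n], ranked[:n]


# Hypothetical activity scores for two pathways and three drugs.
toy_scores = {
    "EGFR": {"drugA": 1.0, "drugB": 2.0, "drugC": 3.0},
    "NFkB": {"drugA": 4.0, "drugB": 0.0, "drugC": 2.0},
}
toy_z = zscore_by_pathway(toy_scores)
toy_up, toy_down = top_pathways(toy_z, "drugA", n=1)
```

Because each pathway is standardized across all drugs, a positive z-score means activity above the average drug in the compendium, not activation relative to an untreated control.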
3. Results
3.1. UMAP of Drug Perturbation Profiles
To visualize the structure of drug-induced gene expression perturbations, we applied UMAP to the high-dimensional expression data. The resulting two-dimensional embedding is shown in Figure 1.
The UMAP reveals that anticancer and non-anticancer compounds are largely intermixed in the reduced space. No clear clusters or separable regions corresponding to the two classes emerge from the embedding. This suggests that, in the reduced UMAP space, the transcriptional signatures of anticancer compounds are not trivially distinguishable from those of non-anticancer compounds.
3.2. Clustering of Drugs in UMAP Space
To explore potential substructure in the UMAP embedding, we applied k-means clustering to the two-dimensional UMAP, specifying six clusters. The resulting cluster assignments are shown in Figure 2, where each color represents a distinct cluster.
We then examined the distribution of anticancer and non-anticancer compounds within each cluster (see Table 1).
Clusters vary in size and class composition, but no single cluster is composed exclusively of anticancer compounds or non-anticancer drugs. Clusters 1 and 6 contain the largest number of anticancer compounds (216 and 242, respectively), but both also include a number of non-anticancer drugs.
These results suggest that the clusters are not class-pure and anticancer compounds remain dispersed across multiple clusters. This highlights the complexity of the underlying gene expression signatures and supports the need for more advanced classification or similarity-based methods.
3.3. Nearest Neighbor Analysis
To identify potential repositioning candidates, we applied a k-nearest neighbors search in the UMAP-reduced gene expression space. Each known anticancer drug was used as a query to retrieve its five nearest non-anticancer neighbors based on Euclidean distance.
The top-ranked compounds—such as Ergometrine (DB01253) and Mupirocin (DB00410)—were retrieved as neighbors 31 and 29 times, respectively, suggesting a potentially shared expression-based signature with several anticancer agents.
Table 2 lists the top 10 most frequently retrieved compounds and their associated DrugBank identifiers.
These candidates represent promising molecules for further experimental validation, considering their consistent proximity to multiple known anticancer compounds in the embedding space.
As shown in Figure 3, each of these candidates (blue nodes) forms a local hub surrounded by multiple red nodes (anticancer drugs), suggesting shared expression-based characteristics.
3.4. Concordance of Candidate Discovery
To validate the candidates obtained with UMAP, we performed PCA and t-SNE followed by the nearest-neighbor analysis. Despite only moderate global agreement between methods, several candidates recurred. Two drugs were shared by UMAP and PCA (Ergometrine and (S)-blebbistatin; UMAP frequencies 31/27; PCA 22/27), and one by UMAP and t-SNE (Mupirocin; 29 and 25). These results support Ergometrine, (S)-blebbistatin, and Mupirocin as potential anticancer drugs, each recovered by at least two of the three embedding methods.
In addition, we performed a local neighborhood enrichment analysis around each candidate. For Ergometrine (DB01253), 44/50 neighbors in UMAP were anticancer (88.0% vs. global baseline 70.5%; hypergeometric BH-adjusted p = 0.0039). In PCA this was 41/50 (82.0%; p_adj = 0.1282), and in t-SNE 38/50 (76.0%; p_adj = 0.2405). The Jaccard index on the Top-25 nearest non-anticancer neighbors was UMAP–PCA = 0.471, UMAP–t-SNE = 0.562, and PCA–t-SNE = 0.429.
Considering (S)-blebbistatin (DB01944) in UMAP, 41/50 neighbors were anticancer (82.0% vs. 70.5%; BH-adjusted p = 0.0435). PCA yielded 38/50 (76.0%; p_adj = 0.2405), t-SNE 40/50 (80.0%; p_adj = 0.1282). Jaccard index was UMAP–PCA = 0.562, UMAP–t-SNE = 0.515, and PCA–t-SNE = 0.515.
Mupirocin (DB00410) in UMAP had 44/50 neighbors as anticancer (88.0% vs. 70.5%; BH-adjusted p = 0.0039). PCA showed 40/50 (80.0%; p_adj = 0.1282), t-SNE 42/50 (84.0%; p_adj = 0.0589), and the gene-space cosine 40/50 (80.0%; p_adj = 0.2564). Jaccard index was UMAP–PCA = 0.471, UMAP–t-SNE = 0.429, and PCA–t-SNE = 0.471. Overall, all three candidates show significant anticancer enrichment in UMAP (BH-adjusted p ≤ 0.0435) and moderate cross-embedding neighborhood stability (Jaccard 0.43–0.56).
3.5. Pathway Activity
Using PROGENy, we derived per-drug pathway activity z-scores and summarized the top five up-regulated and top five down-regulated pathways for each compound.
Figure 4 shows the pathway activity for Ergometrine, (S)-blebbistatin, and Mupirocin. A clear, concordant pattern emerged across the three drugs: EGFR, PI3K, MAPK, WNT, and Estrogen signaling were consistently elevated (among the top up-regulated pathways in all three cases). In contrast, NF-κB and TNF-α signaling were consistently reduced across all drugs, with Hypoxia and JAK–STAT showing decreases in two of three compounds.
Drug-specific differences accompanied these shared trends. For Ergometrine, down-regulated pathways included Hypoxia, TRAIL, NF-κB, p53, and TNF-α. (S)-blebbistatin showed reduced NF-κB, TNF-α, TGF-β, Androgen, and JAK–STAT signaling. Mupirocin displayed decreases in NF-κB, Hypoxia, JAK–STAT, TNF-α, and TRAIL. Taken together, the profile indicates a shift toward mitogenic/growth-factor programs (EGFR/PI3K/MAPK/WNT/Estrogen) with concurrent attenuation of inflammatory and stress pathways (NF-κB/TNF-α, with context-dependent suppression of Hypoxia and JAK–STAT).
4. Discussion
In this study, we analyzed 1170 drug-induced transcriptomic signatures from LINCS L1000 and investigated local neighborhoods in low-dimensional representations to identify candidates for anticancer drug repurposing.
The underlying idea is that investigating the neighborhood of anticancer drugs in low-dimensional transcriptomic representations (e.g., UMAP/PCA/t-SNE) offers a practical strategy for drug discovery and repurposing. Compounds close to anticancer drugs may share mechanisms of action with them, making them promising candidates for further investigation.
In the UMAP plot, anticancer and non-anticancer drugs were mixed together, and k-means alone did not clearly separate the two groups (χ2 = 2.111, df = 5, p = 0.834). This suggests that global separation in 2D is not sufficient for discovery. Instead, we focused on local neighborhoods using a k-NN recurrence score: for each anticancer drug we retrieved the five (k = 5) closest non-anticancer neighbors and ranked candidates by how often they recurred across all anticancer queries. This yielded three drugs (Ergometrine, Mupirocin, and (S)-blebbistatin) among the most frequent neighbors.
Because dimensionality-reduction methods can differ in how they preserve structure, we repeated the k-NN analysis in PCA and t-SNE spaces. These analyses supported the UMAP findings: two drugs were shared by UMAP and PCA (Ergometrine and (S)-blebbistatin), and one by UMAP and t-SNE (Mupirocin).
Ergometrine showed significant enrichment for anticancer drugs in UMAP (44/50 anticancer, 88.0% vs. global baseline 70.5%; BH-adjusted p = 0.0039), with non-significant trends in PCA/t-SNE/cosine; (S)-blebbistatin was significant in UMAP (41/50; p = 0.0435); Mupirocin was significant in UMAP (44/50; p = 0.0039). Neighborhood stability for these drugs, measured as the Jaccard index on Top-25 non-anticancer neighbors, was consistently in the 0.43–0.56 range across pairs of dimensionality-reduction methods.
Pathway activity analysis using PROGENy indicated consistent suppression of NF-κB/TNF-α and context-dependent decreases in Hypoxia/JAK–STAT/TGF-β; however, concurrent activation of EGFR/PI3K/MAPK/WNT/Estrogen argues against a uniformly anti-proliferative state, suggesting compensatory signaling rather than a canonical anticancer transcriptional response. (S)-blebbistatin showed the broadest suppression of pro-survival and inflammatory signaling, with marked decreases in NF-κB and TNF-α, together with reductions in JAK–STAT, TGF-β, and Androgen pathways. This pattern is directionally consistent with anticancer activity. Thus, (S)-blebbistatin emerges as the most favorable of the three based on pathway activity, but further validation is needed.
The three compounds have distinct pharmacological backgrounds. Ergometrine is an ergot alkaloid widely used as a uterotonic to treat or prevent postpartum hemorrhage; it increases uterine smooth-muscle tone and produces sustained contractions [21]. Mechanistically, it acts as a partial agonist at 5-HT2 receptors and also engages α-adrenergic and dopaminergic receptors, consistent with its contractile and vasoconstrictive effects [22]. Clinically it is administered intramuscularly (250–500 µg) and is effective, but limited by adverse effects (e.g., hypertension/vasospasm) and stability constraints that preclude oral use [21].
(S)-Blebbistatin is a selective inhibitor of non-muscle myosin II ATPase activity and is broadly used as a research tool to suppress actomyosin contractility; the (S)-enantiomer is active while (R) is largely inactive. Although not a therapeutic agent, it has revealed key roles for myosin II in tumor cell invasion and mechanics; myosin II inhibition reduces 3D invasion and can suppress glioma invasion in preclinical models [23]. A practical limitation is phototoxicity/photodegradation under blue light, which has spurred the development of improved derivatives (e.g., 3′-hydroxy/3′-amino-blebbistatin).
Mupirocin (pseudomonic acid A) is a topical antibiotic produced by Pseudomonas fluorescens that inhibits isoleucyl-tRNA synthetase, blocking bacterial protein synthesis [24]. It is widely used for impetigo and nasal decolonization of Staphylococcus aureus (including MRSA), typically as a 5-day intranasal regimen; resistance can emerge via mupA/mupB determinants [25]. Systemic use is limited by rapid metabolism and toxicity, so current indications are topical/decolonization.
The three drugs have some links to cancer, although Ergometrine and Mupirocin are not anticancer drugs. The serotonergic and adrenergic receptors engaged by Ergometrine can modulate proliferation, survival, and cytoskeletal dynamics in tumor cells and stromal compartments, which offers a plausible mechanistic bridge to anticancer activity [26].
The connection of Mupirocin to oncology is likely indirect: (i) in vitro, strong inhibition of aminoacyl-tRNA synthetases can trigger integrated stress response/translation stress programs that sometimes resemble anticancer perturbations; (ii) in specific settings, targeting the tumor-associated microbiome is being explored, but mupirocin’s established role is decolonization, not tumor therapy [27].
(S)-Blebbistatin targets non-muscle myosin II, which is deeply involved in tumor cell migration and invasion, mechanotransduction, and tissue architecture. In diverse preclinical systems, myosin II inhibition reduces invasive behavior and can blunt dissemination. Limitations for translation include the phototoxicity/photoinstability of blebbistatin and the lack of a clinically suitable analog so far.
In our data, the k-NN frequency score in UMAP was the most informative for the three focal candidates, whereas PCA and t-SNE provided confirmatory evidence that helped filter out embedding-specific artifacts.
At the same time, comparative evaluations caution that dimensionality-reduction methods differ in how they preserve local vs. global structure and can be sensitive to parameter choices. Systematic benchmarks across PCA, t-SNE, UMAP, and related methods therefore recommend robustness checks rather than relying on a single projection, especially when downstream analyses depend on neighborhood structure. This informs our decision to repeat all neighborhood-based analyses across multiple embeddings and to quantify their agreement [10].
We proposed a computational pipeline based on dimensionality reduction combined with neighborhood-based analysis for prioritizing drug repurposing candidates in oncology. The pipeline can be directly applied to large-scale public pharmacogenomic resources, enabling faster hypothesis generation, cost reduction in early-stage discovery, and improved prioritization of compounds for in vitro or in vivo testing. In this way, our work bridges methodological innovation with translational applications.
5. Limitations and Future Directions
This study has several limitations that should be acknowledged. First, the findings are entirely based on in silico analyses. While the computational framework provides a systematic way to highlight repurposing candidates, it cannot by itself establish therapeutic efficacy or safety. Second, our analysis relies on a single resource—the LINCS L1000 dataset—which, although comprehensive, captures drug-induced perturbations only under selected cell line and experimental conditions. This may limit the generalizability of the results to other cellular or in vivo contexts. However, although here we focused on DrugBank, the pipeline is general and could be extended to other compound classes, including natural products, provided that transcriptomic perturbation data are available.
Third, pathway activity inference showed partially ambiguous results, with concurrent activation of proliferative signaling (e.g., EGFR/PI3K/MAPK) alongside suppression of stress and inflammatory pathways. This suggests compensatory signaling and highlights the need for cautious interpretation. Finally, dimensionality-reduction methods can vary in how they preserve local versus global structure, and results may be sensitive to parameter choices. Although we addressed this by comparing UMAP, PCA, and t-SNE, further benchmarking is warranted.
Future work should aim to validate the computationally identified candidates through experimental assays. A first step could be in vitro screening of Ergometrine, Mupirocin, and (S)-Blebbistatin across diverse cancer cell lines, followed by functional studies on invasion, proliferation, and survival. On the computational side, extending the approach to integrate multi-omics data (e.g., proteomics, epigenomics) or incorporating drug–target interaction networks may improve biological interpretability and robustness. Together, these steps will help translate our findings into actionable insights for oncology drug repurposing.