Computational Pipeline for Anticancer Drug Repurposing via Dimensionality Reduction

Cava, Claudia; Castiglioni, Isabella

doi:10.3390/app151910707

by Claudia Cava ^1,* and Isabella Castiglioni ^2,3

¹ Department of Science, Technology and Society, University School for Advanced Studies IUSS Pavia, 27100 Pavia, Italy

² Department of Physics “Giuseppe Occhialini”, University of Milan-Bicocca, 20126 Milan, Italy

³ CDI - Centro Diagnostico Italiano S.p.A, 20011 Milan, Italy

* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(19), 10707; https://doi.org/10.3390/app151910707

Submission received: 4 September 2025 / Revised: 30 September 2025 / Accepted: 2 October 2025 / Published: 3 October 2025

Abstract

Drug repurposing refers to the systematic identification of new therapeutic uses for existing drugs. Unlike traditional de novo drug discovery, which is expensive and time-consuming, repurposing leverages compounds with already established safety, pharmacokinetic, and pharmacodynamic profiles. In this study, we propose a drug repositioning model based on low-dimensional transcriptomic representations to investigate the relationship between known anticancer drugs and non-anticancer compounds. We analyzed LINCS L1000 data (1170 drugs; 824 anticancer, 346 non-anticancer). Data were projected with UMAP, PCA, and t-SNE. For each anticancer drug and for each method, we retrieved the k = 5 nearest non-anticancer neighbors and ranked candidates by recurrence frequency across all anticancer queries. We identified Ergometrine, Mupirocin, and (S)-blebbistatin among the most frequent non-anticancer drugs with a close association with drugs known to be anticancer. In addition, we performed a local neighborhood enrichment around the three candidates. Regarding Ergometrine (DB01253), in UMAP, 44/50 neighbors were anticancer (88.0% vs. global baseline 70.5%; hypergeometric BH-adjusted p = 0.0039). Considering (S)-blebbistatin (DB01944) in UMAP, 41/50 neighbors were anticancer (82.0% vs. 70.5%; BH-adjusted p = 0.0435). Mupirocin (DB00410) in UMAP had 44/50 neighbors as anticancer (88.0% vs. 70.5%; BH-adjusted p = 0.0039). Future research should explore the three drugs with in vivo models, investigating their possible synergies.

Keywords:

drug; prediction; machine learning

1. Introduction

Drug repurposing, the identification of new therapeutic indications for existing compounds, offers a faster and often safer route to treatment development than de novo discovery [1].

Cancers are heterogeneous, and subtypes often respond differently to chemotherapy, requiring tailored treatment strategies [2]. Establishing robust links between genomic/epigenomic features and clinical response is essential for precision therapies that match drugs to a patient’s molecular context [2].

Nevertheless, predicting drug benefit from molecular and clinical profiles remains a major challenge.

Transcriptomic profiling provides a scalable way to characterize drug action at the systems level, enabling data-driven prioritization of candidates that mimic disease-relevant cellular states [3]. Among available resources, the LINCS L1000 compendium aggregates large-scale perturbational signatures across thousands of compounds and genes, creating a challenge for computational approaches [4].

A broad spectrum of methods has been explored to analyze these data, from signature identification and correlation-based methods to network propagation and supervised learning [5,6,7].

However, high dimensionality and heterogeneity of transcriptomic data across cell types can obscure relevant biological signals and complicate interpretability. Dimensionality-reduction methods such as Uniform Manifold Approximation and Projection (UMAP), alongside classical Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE), make these data more interpretable by projecting them into lower dimensions while preserving meaningful patterns [8].

For example, recent work [8] has shown that projecting high-dimensional transcriptomes into low-dimensional representations can offer advantages. In yeast deletion profiles, UMAP grouped genes into known protein complexes and pathways and outperformed correlation/PCA for recovering validated interactions, indicating that spatial proximity in the reduced space reflects shared biological function [9]. Similar advantages have been reported in bulk human transcriptomics, where UMAP improved batch separation and highlighted biologically coherent groups more clearly than PCA/MDS [10]. Together, these studies motivate using local neighborhoods in transcriptomic representations.

However, applications of dimensionality reduction (DR) techniques to drug-induced transcriptomic data remain limited [11]. Kwon et al. compared DR methods on drug-induced transcriptomes and showed that t-SNE and UMAP better preserved local and global structure and captured biological signals such as dose effects [11].

Network-based studies have previously used neighborhoods of mutated proteins to identify new cancer-related genes [12,13].

However, to our knowledge, no work has specifically examined the first-neighbor compounds around anticancer drugs or evaluated these immediate neighbors as potential drugs in cancer. This approach rests on the principle that compounds close to known anticancer drugs could share mechanisms, narrowing the search for new drugs to a smaller set of candidates.

The k-nearest neighbors (k-NN) algorithm is one of the most common machine learning strategies for finding the closest points in a dataset. Here it is applied by computing a distance between pairs of drugs and, for each anticancer drug, retrieving the k closest non-anticancer compounds based on their transcriptomic signatures [14,15].

Here, we propose an unsupervised strategy to prioritize repurposing candidates for oncology. Starting from LINCS L1000 consensus drug signatures (1170 compounds × 7467 genes), we project the data with UMAP and retrieve, for each known anticancer drug, the k-nearest non-anticancer neighbors in the 2D space. By counting how often a compound recurs as a neighbor across all anticancer queries, we derive a frequency-based ranking of candidate drugs. We then repeat the entire procedure with PCA and t-SNE to test the robustness of the results.

By using dimensionality reduction and local neighborhood analysis on transcriptomic perturbation data, we provide a practical strategy to support drug repurposing in oncology. Our study is not limited to methodological evaluation, but contributes to applied sciences by offering a data-driven tool to accelerate the identification of candidate anticancer compounds.

2. Materials and Methods

2.1. Drug-Induced Gene Expression Signatures

Consensus drug perturbation profiles were obtained from the LINCS L1000 dataset, specifically the consensi-drugbank.tsv.bz2 file [16]. This dataset, consisting of 1170 compounds across 7467 genes, contains gene expression z-scores summarizing the transcriptional effects of drug treatments across multiple cell lines. The file was downloaded and imported into R using the readr package. Each row represents a drug, and each column corresponds to a gene identified by Entrez Gene ID.

2.2. Annotation of Anticancer Drugs

To label anticancer compounds, we used the Comparative Toxicogenomics Database (CTD), specifically the CTD_chemicals_diseases.tsv.gz file, which contains curated chemical–disease associations [17]. The dataset was filtered for cancer-related diseases, and a drug was considered anticancer if it was associated with any cancer-related term.
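The labeling rule can be illustrated with a minimal Python sketch (the study itself was carried out in R). The keyword list, the (chemical, disease) tuple layout, and the toy names below are assumptions made for illustration only; the text does not state which cancer-related terms were used or how compound names were matched.

```python
# Hypothetical keyword list: the paper does not specify the cancer-related terms used.
CANCER_KEYWORDS = ("neoplasm", "carcinoma", "leukemia", "lymphoma", "sarcoma", "tumor")


def is_cancer_term(disease_name):
    """Return True if the disease name contains any of the assumed cancer keywords."""
    name = disease_name.lower()
    return any(keyword in name for keyword in CANCER_KEYWORDS)


def label_anticancer(drug_names, associations):
    """Label each drug 1 if it has at least one cancer-related association, else 0.

    associations: iterable of (chemical_name, disease_name) pairs (assumed layout).
    """
    flagged = {chem.lower() for chem, disease in associations if is_cancer_term(disease)}
    return {drug: int(drug.lower() in flagged) for drug in drug_names}


# Toy records with made-up names, not real CTD entries.
toy_associations = [
    ("drug_a", "Breast Neoplasms"),
    ("drug_a", "Cardiomyopathies"),
    ("drug_b", "Staphylococcal Infections"),
]
toy_labels = label_anticancer(["drug_a", "drug_b"], toy_associations)
```

Under this rule a single cancer-related association is enough to flag a compound, which is consistent with the large anticancer fraction reported in the abstract (824 of 1170 compounds).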

2.3. Dimensionality Reduction and Visualization

We applied UMAP for dimensionality reduction from ~7000 genes to two components using the umap package in R [8]. The 2D layout was plotted using ggplot2 to visualize clustering patterns among drugs.

UMAP is a manifold-learning method from topological data analysis that embeds high-dimensional profiles into a low-dimensional space while preserving local structure, often yielding well-delineated clusters. In this study, we used n_neighbors = 15, min_dist = 0.1, n_epochs = 200, metric = “euclidean”, and set_op_mix_ratio = 1.

2.4. Unsupervised Clustering

K-means clustering was applied to the 2D UMAP embedding to group drugs into 6 clusters (k = 6) [18]. The default R settings were used: Hartigan–Wong algorithm, iter.max = 10, nstart = 1 (random initialization from the data). In k-means, the number of clusters (k) must be specified in advance; the algorithm iteratively estimates k centroids and assigns each sample to the nearest centroid.
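The assign-then-update loop described above can be sketched in a few lines of Python. Note that this sketch uses Lloyd's algorithm, not the Hartigan–Wong variant used in the study, and the function name, seed, and toy points are illustrative assumptions.

```python
import math
import random


def kmeans_lloyd(points, k, n_iter=10, seed=0):
    """Plain Lloyd's k-means on a list of coordinate tuples (not Hartigan-Wong)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initialization from the data
    assignment = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centroids


# Toy 2D points forming two well-separated groups.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_assignment, toy_centroids = kmeans_lloyd(toy_points, k=2)
```

With a single random start (nstart = 1), the final partition can depend on the initialization, so repeating the run with several seeds is a reasonable stability check.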

2.5. Nearest Neighbor Analysis and Candidate Identification

Using the FNN package, we computed the k-nearest neighbors for each anticancer compound; that is, we identified the non-anticancer drugs closest to each anticancer drug.

For each anticancer drug, we computed the Euclidean distance to all non-anticancer compounds and retrieved the k = 5 closest neighbors.

Candidate drugs were then ranked by recurrence frequency, defined as the number of times a compound appeared among the five nearest neighbors of different anticancer drugs. The frequency of appearance across neighbors was used as a prioritization score. The choice of k = 5 was motivated by conventions in prior literature, where small neighborhood sizes are typically employed to capture local transcriptomic structure without introducing excessive noise [4,11,15]. We therefore selected k = 5 as a balance between ensuring stable neighbor identification and maintaining specificity in the detected associations.
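The recurrence score can be sketched as follows in Python, assuming the 2D coordinates and the binary anticancer labels are already available. The function name, the dictionary layout, and the toy coordinates are illustrative assumptions, and the brute-force search stands in for the FNN routine used in the study.

```python
import math
from collections import Counter


def recurrence_ranking(coords, is_anticancer, k=5):
    """Rank non-anticancer drugs by how often they fall among the k nearest
    non-anticancer neighbors of an anticancer query (Euclidean distance)."""
    queries = [d for d, flag in is_anticancer.items() if flag]
    pool = [d for d, flag in is_anticancer.items() if not flag]
    counts = Counter()
    for q in queries:
        # Brute-force search over the non-anticancer pool only.
        nearest = sorted(pool, key=lambda d: math.dist(coords[q], coords[d]))[:k]
        counts.update(nearest)
    return counts.most_common()


# Toy embedding: three anticancer queries (A*) and three non-anticancer drugs (N*).
toy_coords = {
    "A1": (0.0, 0.0), "A2": (1.0, 0.0), "A3": (10.0, 10.0),
    "N1": (0.5, 0.0), "N2": (9.0, 10.0), "N3": (50.0, 50.0),
}
toy_flags = {"A1": 1, "A2": 1, "A3": 1, "N1": 0, "N2": 0, "N3": 0}
toy_ranking = recurrence_ranking(toy_coords, toy_flags, k=1)
```

In the toy example N1 is the nearest non-anticancer neighbor of two queries and N2 of one, so N1 ranks first; N3 is never retrieved and does not appear in the ranking.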

2.6. Validation

To assess the robustness of the UMAP-based findings, we repeated the dimensionality reduction using PCA and t-SNE [19]. The same input matrix used throughout the study was analyzed to ensure comparability with the UMAP space described above. PCA was performed with prcomp (R base stats), using centering and scaling: before decomposition, each feature is centered and scaled to unit variance, so the PCA is effectively run on z-scored variables.

t-SNE was computed with Rtsne, using a default perplexity of 30.

In both PCA and t-SNE two-dimensional spaces, we replicated the nearest-neighbor candidate definition used for UMAP. Specifically, for each anticancer drug, we retrieved the k = 5 nearest neighbors among non-anticancer drugs using the same procedure previously described for UMAP.

In addition, for each candidate anticancer drug, we computed the k-nearest neighbors (k = 50) using Euclidean distance on the two embedding coordinates to assess whether its local neighborhood was enriched for anticancer drugs across the multiple representations (UMAP, PCA, and t-SNE). For each neighborhood, we performed a right-tailed hypergeometric test to evaluate enrichment of anticancer labels relative to the global baseline prevalence (number of anticancer drugs K over the population size M = N − 1, excluding the candidate). The resulting p-values were adjusted for multiple testing using the Benjamini–Hochberg (BH) correction, applied per candidate across the three tests (UMAP, PCA, and t-SNE).
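A minimal Python sketch of the enrichment test and the BH adjustment is given below. It takes M = 1169 and K = 824 from the dataset sizes reported in this article (1170 drugs, 824 anticancer, one non-anticancer candidate excluded). The function names are illustrative, and the result is not guaranteed to reproduce the reported p-values exactly, since the study used R and its exact inputs are not shown here.

```python
from math import comb


def hypergeom_right_tail(x, M, K, n):
    """P(X >= x) when n items are drawn without replacement from a population
    of M items, K of which carry the label of interest."""
    total = comb(M, n)
    upper = min(K, n)
    favorable = sum(comb(K, i) * comb(M - K, n - i) for i in range(x, upper + 1))
    return favorable / total


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# 44 anticancer drugs among the 50 nearest neighbors of one candidate.
p_raw_44_of_50 = hypergeom_right_tail(44, M=1169, K=824, n=50)

# Toy raw p-values for three embeddings (illustrative numbers only).
toy_adjusted = benjamini_hochberg([0.01, 0.04, 0.03])
```

Because the correction is applied per candidate, each candidate's set of embedding-level p-values is adjusted separately rather than pooling all candidates into one family of tests.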

In addition, we extracted the set of the Top-25 nearest non-anticancer neighbors (Euclidean distance in 2D) for each candidate. We then computed the Jaccard index between pairs of methods (UMAP–PCA, UMAP–t-SNE, PCA–t-SNE). Higher values indicate greater stability/overlap of the non-anticancer neighborhood across dimensionality reductions.
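The overlap measure can be written as a short Python sketch; the function name and the toy sets are illustrative assumptions.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| for two collections of identifiers."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Two toy top-25 neighbor sets that share 16 members.
toy_top25_first = range(0, 25)
toy_top25_second = range(9, 34)
toy_jaccard = jaccard(toy_top25_first, toy_top25_second)
```

For two sets of 25 neighbors sharing s members, the index reduces to s / (50 − s); for example, 16 shared neighbors give 16/34 ≈ 0.471 and 18 shared neighbors give 18/32 ≈ 0.562.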

2.7. PROGENy Pathway Activity Inference

We quantified signaling pathway activity per compound using PROGENy (R package), which infers pathway activity from gene expression “footprints” learned from perturbation experiments [20].

Pathway activities were computed with progeny, which standardizes genes internally and uses all available genes to estimate activities for 14 canonical signaling pathways (e.g., Androgen, EGFR, Estrogen, Hypoxia, JAK–STAT, MAPK, NF-κB, PI3K, p53, TGF-β, TNF-α). The function returns a continuous activity score for each pathway and drug, where higher values indicate higher inferred pathway activation.

For comparability across compounds, pathway scores were converted to z-scores per pathway (subtracting the mean and dividing by the standard deviation across all drugs). These standardized scores were used to summarize the top up-regulated (highest z) and down-regulated (lowest z) pathways for each of the drugs of interest and to identify pathways consistently modulated across drugs.
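The per-pathway standardization and the selection of top pathways can be sketched in Python as follows. The nested-dictionary layout, the function names, and the toy scores are assumptions; the sample standard deviation (n − 1 denominator) is also an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def zscore_by_pathway(scores):
    """Standardize each pathway across all drugs.

    scores: dict mapping drug -> dict mapping pathway -> raw activity.
    Assumes every drug has every pathway and no pathway has zero variance.
    """
    drugs = list(scores)
    pathways = list(next(iter(scores.values())))
    z = {d: {} for d in drugs}
    for p in pathways:
        column = [scores[d][p] for d in drugs]
        mu, sd = mean(column), stdev(column)  # sample SD (n - 1), an assumption
        for d in drugs:
            z[d][p] = (scores[d][p] - mu) / sd
    return z


def top_pathways(z_row, n=5):
    """Return (n highest-z pathways, n lowest-z pathways) for one drug."""
    ranked = sorted(z_row, key=z_row.get, reverse=True)
    return ranked[:n], ranked[-n:][::-1]


# Toy activity scores with made-up values.
toy_scores = {
    "drug_a": {"EGFR": 1.0, "NFkB": -2.0},
    "drug_b": {"EGFR": 3.0, "NFkB": 0.0},
    "drug_c": {"EGFR": 2.0, "NFkB": 2.0},
}
toy_z = zscore_by_pathway(toy_scores)
toy_up, toy_down = top_pathways(toy_z["drug_b"], n=1)
```

Because the z-scores are computed across all drugs, a "top up-regulated" pathway here means high activity relative to the other compounds in the dataset, not relative to an untreated control.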

3. Results

3.1. UMAP of Drug Perturbation Profiles

To visualize the structure of drug-induced gene expression perturbations, we applied UMAP to the high-dimensional expression data. The resulting two-dimensional embedding is shown in Figure 1.

The UMAP reveals that anticancer and non-anticancer compounds are largely intermixed in the reduced space. No clear clusters or separable regions corresponding to the two classes emerge from the embedding. This suggests that, in the reduced UMAP space, the transcriptional signatures of anticancer compounds are not trivially distinguishable from those of non-anticancer compounds.

3.2. Clustering of Drugs in UMAP Space

To explore potential substructure in the UMAP embedding, we applied k-means clustering to the two-dimensional UMAP, specifying six clusters. The resulting cluster assignments are shown in Figure 2, where each color represents a distinct cluster.

We then examined the distribution of anticancer and non-anticancer compounds within each cluster (see Table 1).

Clusters vary in size and class composition, but no single cluster is composed exclusively of anticancer compounds or non-anticancer drugs. Clusters 1 and 6 contain the largest number of anticancer compounds (216 and 242, respectively), but both also include a number of non-anticancer drugs.

These results suggest that the clusters are not class-pure and anticancer compounds remain dispersed across multiple clusters. This highlights the complexity of the underlying gene expression signatures and supports the need for more advanced classification or similarity-based methods.

3.3. Nearest Neighbor Analysis

To identify potential repositioning candidates, we applied a k-nearest neighbors search in the UMAP-reduced gene expression space. Each known anticancer drug was used as a query to retrieve its five nearest non-anticancer neighbors based on Euclidean distance.

The top-ranked compounds—such as Ergometrine (DB01253) and Mupirocin (DB00410)—were retrieved as neighbors 31 and 29 times, respectively, suggesting a potentially shared expression-based signature with several anticancer agents. Table 2 lists the top 10 most frequently retrieved compounds and their associated DrugBank identifiers.

These candidates represent promising molecules for further experimental validation, considering their consistent proximity to multiple known anticancer compounds in the embedding space.

As shown in Figure 3, each of these candidates (blue nodes) forms local hubs surrounded by multiple red nodes (anticancer drugs), suggesting shared expression-based characteristics.

3.4. Concordance of Candidate Discovery

To validate the candidates obtained with UMAP, we performed PCA and t-SNE followed by the nearest-neighbor analysis. Despite only moderate global agreement between methods, we observed several common results. Two drugs were shared by UMAP and PCA (Ergometrine and (S)-blebbistatin; UMAP frequencies 31 and 27; PCA frequencies 22 and 27), and one by UMAP and t-SNE (Mupirocin; frequencies 29 and 25). These results support Ergometrine, (S)-blebbistatin, and Mupirocin as potential anticancer candidates, each recovered by two of the three methods.

In addition, we performed a local neighborhood enrichment around each candidate. Regarding Ergometrine (DB01253), in UMAP, 44/50 neighbors were anticancer (88.0% vs. global baseline 70.5%; hypergeometric BH-adjusted p = 0.0039). In PCA this was 41/50 (82.0%; p_adj = 0.1282), in t-SNE 38/50 (76.0%; p_adj = 0.2405). Jaccard index considering the Top-25 nearest non-anticancer neighbors for each dimensionality reduction technique was UMAP–PCA = 0.471, UMAP–t-SNE = 0.562, and PCA–t-SNE = 0.429.

Considering (S)-blebbistatin (DB01944) in UMAP, 41/50 neighbors were anticancer (82.0% vs. 70.5%; BH-adjusted p = 0.0435). PCA yielded 38/50 (76.0%; p_adj = 0.2405), t-SNE 40/50 (80.0%; p_adj = 0.1282). Jaccard index was UMAP–PCA = 0.562, UMAP–t-SNE = 0.515, and PCA–t-SNE = 0.515.

Mupirocin (DB00410) in UMAP had 44/50 neighbors as anticancer (88.0% vs. 70.5%; BH-adjusted p = 0.0039). PCA showed 40/50 (80.0%; p_adj = 0.1282), t-SNE 42/50 (84.0%; p_adj = 0.0589), and the gene-space cosine 40/50 (80.0%; p_adj = 0.2564). Jaccard index was UMAP–PCA = 0.471, UMAP–t-SNE = 0.429, and PCA–t-SNE = 0.471. Overall, all three candidates show significant anticancer enrichment in UMAP (BH-adjusted p ≤ 0.0435) and moderate cross-embedding neighborhood stability (Jaccard 0.43–0.56).

3.5. Pathway Activity

Using PROGENy, we derived per-drug pathway activity z-scores and summarized the top five up-regulated and top five down-regulated pathways for each compound. Figure 4 shows the pathway activity for Ergometrine, (S)-blebbistatin, and Mupirocin. A clear, concordant pattern emerged across the three drugs: EGFR, PI3K, MAPK, WNT, and Estrogen signaling were consistently elevated (among the top up-regulated pathways in all three cases). In contrast, NF-κB and TNF-α signaling were consistently reduced across all drugs, with Hypoxia and JAK–STAT showing decreases in two of three compounds.

Drug-specific differences accompanied these shared trends. For Ergometrine, down-regulated pathways included Hypoxia, TRAIL, NF-κB, p53, and TNF-α. (S)-blebbistatin showed reduced NF-κB, TNF-α, TGF-β, Androgen, and JAK–STAT signaling. Mupirocin displayed decreases in NF-κB, Hypoxia, JAK–STAT, TNF-α, and TRAIL. Taken together, the profile indicates a shift toward mitogenic/growth-factor programs (EGFR/PI3K/MAPK/WNT/Estrogen) with concurrent attenuation of inflammatory and stress pathways (NF-κB/TNF-α, with context-dependent suppression of Hypoxia and JAK–STAT).

4. Discussion

In this study, we analyzed 1170 drug-induced transcriptomic signatures from LINCS L1000 and investigated local neighborhoods in low-dimensional representations to identify candidates for anticancer drug repurposing.

The idea is that investigating the neighborhood of anticancer drugs in low dimensional transcriptomic representations (e.g., UMAP/PCA/t-SNE) offers a good strategy for drug discovery and repurposing. Indeed, the compounds close to anticancer drugs tend to share mechanisms of action, making them promising candidates for further investigation.

In the UMAP plot, anticancer and non-anticancer drugs were mixed together, and k-means alone did not clearly separate the two groups (χ² = 2.111, df = 5, p = 0.834). This suggests that global separation in 2D is not sufficient for discovery. Instead, we focused on local neighborhoods using a k-NN recurrence score: for each anticancer drug we retrieved the five (k = 5) closest non-anticancer neighbors and ranked candidates by how often they recurred across all anticancer queries. This yielded three drugs (Ergometrine, Mupirocin, and (S)-blebbistatin) among the most frequent neighbors.

Because dimensionality-reduction methods can differ in how they preserve structure, we repeated the k-NN analysis in PCA and t-SNE spaces. These analyses supported the UMAP findings: two drugs were shared by UMAP and PCA (Ergometrine and (S)-blebbistatin), and one by UMAP and t-SNE (Mupirocin).

Ergometrine showed significant enrichment for anticancer drugs in UMAP (44/50 anticancer, 88.0% vs. global baseline 70.5%; BH-adjusted p = 0.0039), with non-significant trends in PCA/t-SNE/cosine; (S)-blebbistatin was significant in UMAP (41/50; p = 0.0435); Mupirocin was significant in UMAP (44/50; p = 0.0039). Neighborhood stability for these drugs, measured as the Jaccard index on Top-25 non-anticancer neighbors, was consistently in the 0.43–0.56 range across pairs of dimensionality-reduction methods.

Pathway activity analysis using PROGENy indicated consistent suppression of NF-κB/TNF-α and context-dependent decreases in Hypoxia/JAK–STAT/TGF-β; however, concurrent activation of EGFR/PI3K/MAPK/WNT/Estrogen argues against a uniformly anti-proliferative state, suggesting compensatory signaling rather than a canonical anticancer transcriptional response. (S)-blebbistatin showed the broadest suppression of pro-survival and inflammatory signaling, with marked decreases in NF-κB and TNF-α, together with reductions in JAK–STAT, TGF-β, and Androgen pathways. This pattern is directionally consistent with anticancer activity. Thus, (S)-blebbistatin emerges as the most favorable of the three based on pathway activity, but further validation is needed.

The three compounds have distinct pharmacological backgrounds. Ergometrine is an ergot alkaloid widely used as a uterotonic to treat or prevent postpartum hemorrhage; it increases uterine smooth-muscle tone and produces sustained contractions [21]. Mechanistically, it acts as a partial agonist at 5-HT₂ receptors and also engages α-adrenergic and dopaminergic receptors, consistent with its contractile and vasoconstrictive effects [22]. Clinically it is administered intramuscularly (250–500 µg) and is effective, but limited by adverse effects (e.g., hypertension/vasospasm) and stability constraints that preclude oral use [21].

(S)-Blebbistatin is a selective inhibitor of non-muscle myosin II ATPase activity and is broadly used as a research tool to suppress actomyosin contractility; the (S)-enantiomer is active while (R) is largely inactive. Although not a therapeutic agent, it has revealed key roles for myosin II in tumor cell invasion and mechanics; myosin II inhibition reduces 3D invasion and can suppress glioma invasion in preclinical models [23]. A practical limitation is phototoxicity/photodegradation under blue light, which has spurred the development of improved derivatives (e.g., 3′-hydroxy/3′-amino-blebbistatin).

Mupirocin (pseudomonic acid A) is a topical antibiotic produced by Pseudomonas fluorescens that inhibits isoleucyl-tRNA synthetase, blocking bacterial protein synthesis [24]. It is widely used for impetigo and nasal decolonization of Staphylococcus aureus (including MRSA), typically as a 5-day intranasal regimen; resistance can emerge via mupA/mupB determinants [25]. Systemic use is limited by rapid metabolism and toxicity, so current indications are topical/decolonization.

The three drugs have some links to cancer. Ergometrine and Mupirocin are not anticancer drugs. However, Ergometrine, as a ligand of serotonergic and adrenergic receptors, can modulate proliferation, survival, and cytoskeletal dynamics in tumor cells and stromal compartments. That biology offers a plausible mechanistic bridge to anticancer activity [26].

The connection of Mupirocin to oncology is likely indirect: (i) in vitro, strong inhibition of aminoacyl-tRNA synthetases can trigger integrated stress response/translation stress programs that sometimes resemble anticancer perturbations; (ii) in specific settings, targeting the tumor-associated microbiome is being explored, but mupirocin’s established role is decolonization, not tumor therapy [27].

(S)-Blebbistatin is deeply involved in tumor cell migration and invasion, mechanotransduction, and tissue architecture. In diverse preclinical systems, myosin II inhibition reduces invasive behavior and can blunt dissemination. Limitations for translation include phototoxicity/photoinstability of blebbistatin and a lack of a clinically suitable analog so far.

In our data, the k-NN frequency score in UMAP was the most informative for the three focal candidates, whereas PCA and t-SNE provided confirmatory evidence that helped filter out embedding-specific artifacts.

At the same time, comparative evaluations caution that dimensionality-reduction methods differ in how they preserve local vs. global structure and can be sensitive to parameter choices. Systematic benchmarks across PCA, t-SNE, UMAP, and related methods therefore recommend robustness checks rather than relying on a single projection, especially when downstream analyses depend on neighborhood structure. This informs our decision to repeat all neighborhood-based analyses across multiple embeddings and to quantify their agreement [10].

We proposed a computational pipeline based on dimensionality reduction combined with neighborhood-based analysis for prioritizing drug repurposing candidates in oncology. The pipeline can be directly applied to large-scale public pharmacogenomic resources, enabling faster hypothesis generation, cost reduction in early-stage discovery, and improved prioritization of compounds for in vitro or in vivo testing. In this way, our work bridges methodological innovation with translational applications.

5. Limitations and Future Directions

This study has several limitations that should be acknowledged. First, the findings are entirely based on in silico analyses. While the computational framework provides a systematic way to highlight repurposing candidates, it cannot by itself establish therapeutic efficacy or safety. Second, our analysis relies on a single resource—the LINCS L1000 dataset—which, although comprehensive, captures drug-induced perturbations only under selected cell line and experimental conditions. This may limit the generalizability of the results to other cellular or in vivo contexts. However, although here we focused on DrugBank, the pipeline is general and could be extended to other compound classes, including natural products, provided that transcriptomic perturbation data are available.

Third, pathway activity inference showed partially ambiguous results, with concurrent activation of proliferative signaling (e.g., EGFR/PI3K/MAPK) alongside suppression of stress and inflammatory pathways. This suggests compensatory signaling and highlights the need for cautious interpretation. Finally, dimensionality-reduction methods can vary in how they preserve local versus global structure, and results may be sensitive to parameter choices. Although we addressed this by comparing UMAP, PCA, and t-SNE, further benchmarking is warranted.

Future work should aim to validate the computationally identified candidates through experimental assays. A first step could be in vitro screening of Ergometrine, Mupirocin, and (S)-Blebbistatin across diverse cancer cell lines, followed by functional studies on invasion, proliferation, and survival. On the computational side, extending the approach to integrate multi-omics (e.g., proteomics, epigenomics) or incorporating drug–target interaction networks may improve biological interpretability and robustness. Together, these steps will help translate our findings into actionable insights for oncology drug repurposing.

Author Contributions

Conceptualization, C.C.; methodology, C.C.; formal analysis, C.C.; investigation, C.C.; writing—original draft preparation, C.C.; supervision, I.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. The drug-induced transcriptomic signatures come from the LINCS L1000 resource (consensus z-scores; drugs × genes). No new raw data were generated.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Cava, C.; Castiglioni, I. Drug repositioning based on mutual information for the treatment of Alzheimer’s disease patients. Med. Biol. Eng. Comput. 2025, 63, 2249–2257.
2. Ding, Z.; Zu, S.; Gu, J. Evaluating the molecule-based prediction of clinical drug responses in cancer. Bioinformatics 2016, 32, 2891–2895.
3. Wacker, S.A.; Houghtaling, B.R.; Elemento, O.; Kapoor, T.M. Using transcriptome sequencing to identify mechanisms of drug action and resistance. Nat. Chem. Biol. 2012, 8, 235–237.
4. Samart, K.; Tuyishime, P.; Krishnan, A.; Ravi, J. Reconciling multiple connectivity scores for drug repurposing. Brief. Bioinform. 2021, 22, bbab161.
5. Liu, C.; Su, J.; Yang, F.; Wei, K.; Ma, J.; Zhou, X. Compound signature detection on LINCS L1000 big data. Mol. Biosyst. 2015, 11, 714–722.
6. Jeon, M.; Xie, Z.; Evangelista, J.E.; Wojciechowicz, M.L.; Clarke, D.J.B.; Ma’ayan, A. Transforming L1000 profiles to RNA-seq-like profiles with deep learning. BMC Bioinform. 2022, 23, 374.
7. Hosseini-Gerami, L.; Higgins, I.A.; Collier, D.A.; Laing, E.; Evans, D.; Broughton, H.; Bender, A. Benchmarking causal reasoning algorithms for gene expression-based compound mechanism of action analysis. BMC Bioinform. 2023, 24, 154.
8. Becht, E.; McInnes, L.; Healy, J.; Dutertre, C.A.; Kwok, I.W.H.; Ng, L.G.; Ginhoux, F.; Newell, E.W. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 2018, 37, 38–44.
9. Dorrity, M.W.; Saunders, L.M.; Queitsch, C.; Fields, S.; Trapnell, C. Dimensionality reduction by UMAP to visualize physical and genetic interactions. Nat. Commun. 2020, 11, 1537.
10. Huang, H.; Wang, Y.; Rudin, C.; Browne, E.P. Towards a comprehensive evaluation of dimension reduction methods for transcriptomic data visualization. Commun. Biol. 2022, 5, 719.
11. Kwon, Y.; Park, S.; Park, S.; Lee, H. Benchmarking of dimensionality reduction methods to capture drug response in transcriptome data. Sci. Rep. 2025, 15, 32173.
12. Ozgür, A.; Vu, T.; Erkan, G.; Radev, D.R. Identifying gene-disease associations using centrality on a literature mined gene-interaction network. Bioinformatics 2008, 24, i277–i285.
13. Creixell, P.; Reimand, J.; Haider, S.; Wu, G.; Shibata, T.; Vazquez, M.; Mustonen, V.; Gonzalez-Perez, A.; Pearson, J.; Sander, C.; et al. Pathway and network analysis of cancer genomes. Nat. Methods 2015, 12, 615–621.
14. Cover, T.M.; Hart, P.E. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 1967, 13, 21–27.
15. Zhu, B.; Xu, Y.; Zhao, P.; Yiu, S.M.; Yu, H.; Shi, J.Y. NNAN: Nearest Neighbor Attention Network to Predict Drug-Microbe Associations. Front. Microbiol. 2022, 13, 846915.
16. Duan, Q.; Reid, S.P.; Clark, N.R.; Wang, Z.; Fernandez, N.F.; Rouillard, A.D.; Readhead, B.; Tritsch, S.R.; Hodos, R.; Hafner, M.; et al. L1000CDS2: LINCS L1000 characteristic direction signatures search engine. npj Syst. Biol. Appl. 2016, 2, 16015.
17. Wiegers, T.C.; Davis, A.P.; Wiegers, J.; Sciaky, D.; Barkalow, F.; Wyatt, B.; Strong, M.; McMorran, R.; Abrar, S.; Mattingly, C.J. Integrating AI-powered text mining from PubTator into the manual curation workflow at the Comparative Toxicogenomics Database. Database 2025, 2025, baaf013.
18. Allaoui, M.; Kherfi, M.L.; Cheriet, A. Considerably Improving Clustering Algorithms Using UMAP Dimensionality Reduction Technique: A Comparative Study. Image Signal Process. 2020, 12119, 317–325.
19. Sarmina, B.G.; Sun, G.H.; Dong, S.H. Principal Component Analysis and t-Distributed Stochastic Neighbor Embedding Analysis in the Study of Quantum Approximate Optimization Algorithm Entangled and Non-Entangled Mixing Operators. Entropy 2023, 25, 1499.
20. Schubert, M.; Klinger, B.; Klünemann, M.; Sieber, A.; Uhlitz, F.; Sauer, S.; Garnett, M.J.; Blüthgen, N.; Saez-Rodriguez, J. Perturbation-response genes reveal signaling footprints in cancer gene expression. Nat. Commun. 2018, 9, 20.
21. Gallos, I.D.; Williams, H.M.; Price, M.J.; Merriel, A.; Gee, H.; Lissauer, D.; Moorthy, V.; Tobias, A.; Deeks, J.J.; Widmer, M.; et al. Uterotonic agents for preventing postpartum haemorrhage: A network meta-analysis. Cochrane Database Syst. Rev. 2018, 4, CD011689.
22. Schiff, P.L. Ergot and its alkaloids. Am. J. Pharm. Educ. 2006, 70, 98.
23. Ivkovic, S.; Beadle, C.; Noticewala, S.; Massey, S.C.; Swanson, K.R.; Toro, L.N.; Bresnick, A.R.; Canoll, P.; Rosenfeld, S.S. Direct inhibition of myosin II effectively blocks glioma invasion in the presence of multiple motogens. Mol. Biol. Cell 2012, 23, 533–542.
24. Poovelikunnel, T.; Gethin, G.; Humphreys, H. Mupirocin resistance: Clinical implications and potential alternatives for the eradication of MRSA. J. Antimicrob. Chemother. 2015, 70, 2681–2692.
25. Coates, T.; Bax, R.; Coates, A. Nasal decolonization of Staphylococcus aureus with mupirocin: Strengths, weaknesses and future prospects. J. Antimicrob. Chemother. 2009, 64, 9–15.
26. Chen, L.; Huang, S.; Wu, X.; He, W.; Song, M. Serotonin signalling in cancer: Emerging mechanisms and therapeutic opportunities. Clin. Transl. Med. 2024, 14, e1750.
27. Kim, Y.; Sundrud, M.S.; Zhou, C.; Edenius, M.; Zocco, D.; Powers, K.; Zhang, M.; Mazitschek, R.; Rao, A.; Yeo, C.Y.; et al. Aminoacyl-tRNA synthetase inhibition activates a pathway that branches from the canonical amino acid response in mammalian cells. Proc. Natl. Acad. Sci. USA 2020, 117, 8900–8911.

Figure 1. Uniform Manifold Approximation and Projection (UMAP). Each point represents a single compound from the DrugBank dataset, projected based on its transcriptomic profile. Points are color-coded according to their known classification: red indicates compounds annotated as anticancer, while gray denotes all other drugs. 0 = non-anticancer drug, 1 = anticancer drug.

Figure 2. Clustering of drugs in UMAP space using k-means (k = 6). Each point represents a drug, positioned according to its two-dimensional UMAP embedding of gene expression profiles. Colors indicate cluster membership as assigned by k-means. Although some spatial grouping is visible, anticancer and non-anticancer drugs are distributed across all six clusters.

Figure 3. Network of candidate drugs connected to anticancer compounds in the UMAP space. Blue nodes represent non-anticancer candidate drugs, while red nodes indicate known anticancer drugs. An edge links a candidate to an anticancer drug if the candidate was among its five nearest neighbors in the UMAP-embedded space.

Figure 4. PROGENy pathway activity (z-score) for three DrugBank compounds.

Table 1. Non-anticancer and anticancer drugs within each cluster.

Cluster	Non-Anticancer	Anticancer
1	80	216
2	85	191
3	17	46
4	24	60
5	28	69
6	112	242
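
The counts in Table 1 can be turned into per-cluster anticancer fractions to check how evenly the two classes are spread over the k-means clusters. The short sketch below is an illustration only, not the authors' code; the names `TABLE_1` and `anticancer_fraction` are introduced here for convenience. With these counts, every cluster falls between roughly 68% and 73% anticancer, close to the overall proportion of 824/1170.

```python
# Counts copied from Table 1: cluster -> (non-anticancer, anticancer).
TABLE_1 = {
    1: (80, 216),
    2: (85, 191),
    3: (17, 46),
    4: (24, 60),
    5: (28, 69),
    6: (112, 242),
}


def anticancer_fraction(non_anticancer: int, anticancer: int) -> float:
    """Share of anticancer drugs among all drugs in a group."""
    return anticancer / (non_anticancer + anticancer)


# Column totals across the six clusters.
total_non = sum(non for non, _ in TABLE_1.values())
total_anti = sum(anti for _, anti in TABLE_1.values())
overall = anticancer_fraction(total_non, total_anti)

# Per-cluster anticancer share.
fractions = {
    cluster: anticancer_fraction(non, anti)
    for cluster, (non, anti) in TABLE_1.items()
}

for cluster, frac in fractions.items():
    print(f"cluster {cluster}: {frac:.1%} anticancer")
print(f"overall: {overall:.1%} ({total_anti}/{total_non + total_anti})")
```

The column totals reproduce the dataset composition reported in the abstract (346 non-anticancer and 824 anticancer drugs).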

Table 2. Top 10 Most Frequent Candidate Drugs.

DrugBankID	Frequency	Common Name
DB01253	31	Ergometrine
DB00410	29	Mupirocin
DB01944	27	(S)-blebbistatin
DB09009	27	Articaine
DB00938	26	Salmeterol
DB00940	26	Methantheline
DB01064	26	Isoprenaline
DB00159	25	Icosapent
DB07350	25	(2E)-N-hydroxy-enamide
DB08039	25	(3Z)-N,N-dimethyl-enamide
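
The frequencies in Table 2 count how often a non-anticancer drug appears among the k = 5 nearest non-anticancer neighbors of an anticancer drug. A minimal sketch of that counting step is given below, assuming Euclidean distance in a two-dimensional embedding; the function name `rank_candidates`, the drug labels, and the coordinates are hypothetical toy values, not data from the study, and the toy example uses k = 2 only to keep it small.

```python
import math
from collections import Counter


def rank_candidates(anticancer, non_anticancer, k=5):
    """Rank non-anticancer drugs by how often they are among the k nearest
    non-anticancer neighbors of an anticancer drug.

    Both arguments map a drug name to its (x, y) embedding coordinates.
    Returns a list of (name, frequency), most frequent first.
    """
    counts = Counter()
    for query in anticancer.values():
        # Sort candidate names by Euclidean distance to the query point.
        nearest = sorted(
            non_anticancer,
            key=lambda name: math.dist(query, non_anticancer[name]),
        )[:k]
        counts.update(nearest)
    return counts.most_common()


# Hypothetical toy embedding (not data from the study).
anticancer = {"A1": (0.0, 0.0), "A2": (1.0, 0.0), "A3": (5.0, 5.0)}
non_anticancer = {
    "N1": (0.5, 0.1),
    "N2": (0.4, -0.2),
    "N3": (5.0, 4.0),
    "N4": (10.0, 10.0),
}

ranking = rank_candidates(anticancer, non_anticancer, k=2)
print(ranking)
```

Drugs that never enter any neighbor list, such as `N4` in the toy data, do not appear in the ranking at all.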

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

