1. Introduction
Mitochondria are functionally rich and diverse organelles. In addition to their primary role in energy production, mitochondria regulate various biological processes, including signal transduction, immune response, cell differentiation, and cell death [
1].
Mitochondrial toxicity refers to the damage caused to mitochondria by various substances, including drugs, chemical compounds, and environmental toxins. Mitochondrial toxicity can have wide-ranging effects, given the central role of mitochondria in energy production and other cellular functions. Understanding mitochondrial toxicity is crucial in drug development, as many drugs have been withdrawn from the market or are limited in use due to their toxic effects on mitochondria [
2]. As such, screening for mitochondrial toxicity is an important part of the safety evaluation of new pharmaceuticals. Mitochondrial toxicity remains a complex and evolving field of study, with ongoing research aimed at better understanding the mechanisms of damage and developing effective treatments.
Mitochondrial toxicity has traditionally been investigated through in vitro and animal-model experiments. The mitochondrial membrane potential (MMP) is an important indicator of mitochondrial function and structural integrity. A decrease in MMP can be detected with lipophilic cationic fluorescent dyes such as TMRM and Mito-MPS [
3]. The MMP assay can be multiplexed with a cell viability assay that measures ATP levels to identify compounds that induce concurrent cytotoxicity [
4]. The glucose-galactose (Glu-Gal) assay is another common method employed to investigate mitochondrial toxicity, based on the principle that cells with compromised mitochondrial function are unable to generate ATP using galactose as efficiently as glucose [
5,
6]. The extracellular flux assay [
5] measures changes in mitochondrial respiratory function and glycolysis through continuous measurements of the oxygen consumption rate (OCR) and extracellular acidification rate (ECAR), offering insights into cellular energy metabolism and mitochondrial function. These experimental methods have provided valuable information; however, they are costly, time-consuming, and condition-dependent. Therefore, researchers are seeking new strategies and tools to improve the prediction of drug-induced mitochondrial toxicity.
In the past, quantitative structure–activity relationship (QSAR) methods were predominantly employed for predicting mitochondrial toxicity by utilizing chemical structures [
7] or molecular descriptors [
8] as the basis [
9]. Recent advances in machine learning applications for investigating mitochondrial toxicity have involved developing models that predict the mitochondrial toxicity of chemicals [
10]. Various machine learning techniques, including SVM [
11], Bayesian methods [
12,
13], and random forest models [
8], have been applied to address classification problems associated with mitochondrial toxicity prediction. Machine learning was also applied to explore structural alerts to predict and optimize lead compounds [
14]. Machine learning has proven to be a valuable approach in drug development, enabling the understanding and prediction of mitochondrial toxicity. However, a key challenge in using machine learning is the difficulty of interpreting the underlying decision-making processes. To address this, recent efforts have focused on developing interpretable machine learning frameworks that provide insight into how models arrive at their predictions, enhancing trust and applicability in drug safety and development. For instance, Jaganathan et al. [
15] introduced XML-CIMT, an explainable machine learning model that integrates feature selection and SHapley Additive exPlanations (SHAP) for model interpretation. Li et al. [
16] provide a comprehensive survey of interpretable deep learning methods, categorizing various interpretation algorithms that aim to reveal how deep models make decisions and assess their trustworthiness.
Despite these advances, a critical gap remains in evaluating whether models trained on a single assay type generalize to structurally novel compounds or to mechanistically related but distinct endpoints. To address these limitations, this study evaluated machine learning and deep learning models for mitochondrial toxicity prediction with an explicit focus on generalizability. We curated a training dataset from MMP assay data across Tox21, PubChem, MitoTox, and literature sources [
2,
5,
6,
11,
17,
18,
19,
20,
21,
22,
23], and assembled an independent external evaluation set from non-MMP mitochondrial stress assays [
5,
6,
11,
17] to probe cross-assay robustness under endpoint shift. Six molecular representations were compared [
24,
25,
26,
27,
28] across both random and Bemis–Murcko scaffold-based splits [
29] using six classical classifiers and two deep learning architectures. Class imbalance was addressed using SMOTE [
30] and SMILES-based and stereoisomer augmentation [
31,
32], applied selectively to the minority class. For interpretability, SHAP attribution [
33] was performed on the best-performing CatBoost [
34] and Random Forest models using Mordred descriptors, and the top-ranked features were characterised through distributional analysis, substructure enrichment, and stratified odds ratio analysis to identify context-dependent physicochemical drivers of mitochondrial toxicity.
2. Materials and Methods
The study of mitochondrial toxicity (
Figure 1) begins with data curation and standardization, including removal of duplicates and harmonization of toxicity labels across assays. Molecular representations are then generated for each compound using both structural fingerprints and physicochemical descriptors. Prior to model training, exploratory structure–activity analysis is performed to characterize the dataset and identify potential associations between chemical features and toxicity.
Multiple machine learning models are trained using different feature representations, including molecular fingerprints and Mordred descriptors, and evaluated under both random and scaffold-based splitting strategies. Model performance is assessed on internal and external test sets to examine generalization across assay contexts. Data augmentation strategies are additionally applied for deep learning models to evaluate their effect on model optimization and robustness.
For interpretability, SHAP analysis is conducted on the best-performing classical models, specifically CatBoost and Random Forest, using descriptor-based representations. Global feature importance is derived from aggregated SHAP values, and key descriptors are further analyzed through statistical and distributional approaches. This combined framework enables identification of the physicochemical properties and context-dependent patterns underlying mitochondrial toxicity.
Figure 1. Schematic overview of the study pipeline. MMP data were split into training and internal test sets using random and scaffold strategies, and an independent non-MMP dataset was used for external evaluation. Molecular representations were generated using fingerprints, Mordred descriptors, and Mol2Vec embeddings, with optional data augmentation applied. Multiple machine learning and deep learning models were trained and evaluated across datasets. SHAP analysis was performed on the best-performing classical models to identify important descriptors, followed by statistical distributional analysis, motif enrichment, and stratified odds ratio analysis to support mechanistic interpretation of mitochondrial toxicity.
2.1. Data Collection
Mitochondrial toxicity annotations were curated from multiple public resources, including Tox21 [
18], PubChem BioAssay [
19], MitoTox [
2], and additional literature reports [
14,
20,
21,
22], with a primary focus on standardized mitochondrial membrane potential (MMP) assays. For PubChem, we queried BioAssay using the keyword “qHTS mitochondrial membrane potential”, which yielded the following assay identifiers: AID 720634, 720635, 720637, 651754, 651755, and 1347389. To further assess model generalizability beyond MMP-based labeling, we assembled an external evaluation set comprising mitochondrial toxicity-related endpoints that are not strictly MMP assays, including in vitro flux measurement [
5], Glu–Gal assays [
6], and other non-MMP mitochondrial stress readouts reported in prior studies [
11,
17]. The full data sources and class distributions are summarized in
Table 1. We defined an internal evaluation set (Test set 1) drawn from the curated MMP-centered dataset and an external evaluation set (Test set 2) assembled from additional mitochondrial stress assays that include non-MMP endpoints. Because the external evaluation set includes mitochondrial stress assays beyond MMP, its labels are not strictly equivalent to the MMP-based definition used for model training. We therefore interpret performance on Test set 2 as a robustness assessment under endpoint shift and cross-source variation. External performance reflects both model generalization and differences in endpoint semantics across assays.
Records were standardized at the structure level using InChI identifiers. Because toxicity labels originated from heterogeneous sources and some compounds appeared in more than one dataset or assay, we integrated labels at the compound (structure) level using a simple aggregation rule. Within each assay, compounds labeled as toxic and non-toxic were encoded as 1 and 0, respectively. For compounds with multiple occurrences across assays/sources, we computed an average toxicity score by summing the binary labels across occurrences and dividing by the number of occurrences. Compounds with an average score ≥0.5 were assigned as toxic (label = 1); otherwise, they were assigned as non-toxic (label = 0).
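The aggregation rule above can be illustrated with a minimal sketch. The function name `aggregate_labels` and the placeholder identifiers are hypothetical and are not taken from the study's code; only the averaging step and the 0.5 cut-off follow the text.

```python
from collections import defaultdict


def aggregate_labels(records, threshold=0.5):
    """Collapse per-assay binary labels into one consensus label per structure.

    records: iterable of (inchi, label) pairs, with label 1 = toxic, 0 = non-toxic.
    Returns {inchi: (average_score, consensus_label)}.
    """
    sums = defaultdict(int)
    counts = defaultdict(int)
    for inchi, label in records:
        sums[inchi] += label
        counts[inchi] += 1

    consensus = {}
    for inchi, total in sums.items():
        score = total / counts[inchi]  # mean of the binary labels
        consensus[inchi] = (score, 1 if score >= threshold else 0)
    return consensus


# Placeholder identifiers (hypothetical); real keys would be full InChI strings.
records = [
    ("InChI=A", 1), ("InChI=A", 0),                  # 1 of 2 toxic -> score 0.5
    ("InChI=B", 0), ("InChI=B", 0), ("InChI=B", 1),  # 1 of 3 toxic -> score 0.33
    ("InChI=C", 1),                                  # single record
]
consensus = aggregate_labels(records)
```

Under this rule a compound with an even split of toxic and non-toxic records (score exactly 0.5) is labeled toxic, which is the conservative choice implied by the ≥0.5 threshold.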
2.2. Molecular Featurization
To capture complementary structural and physicochemical properties of molecules, six distinct molecular representation strategies were employed.
Morgan fingerprints. Extended-connectivity circular fingerprints (ECFPs) were generated using RDKit with a radius of 2 and a bit length of 2048 [
26]. This representation encodes local atomic environments and is widely used for structure–activity relationship modeling.
MACCS keys. The 167-bit MACCS structural key fingerprint was computed using RDKit [
27]. This predefined key set captures the presence or absence of common functional groups and substructures, providing a complementary, interpretable representation.
Combined fingerprint. To integrate both data-driven substructure patterns and predefined chemical motifs, Morgan (2048-bit) and MACCS (167-bit) fingerprints were concatenated into a single 2215-dimensional feature vector.
Mordred descriptors. A comprehensive set of 2D molecular descriptors, including constitutional, topological, and autocorrelation features, was computed using the Mordred library [
24], excluding all 3D-dependent descriptors to ensure consistency across molecules without conformational information. Descriptor columns yielding non-numeric values were removed. The remaining features were standardized using a z-score transformation.
PubChem fingerprints. The 881-bit PubChem fingerprint was computed using the DeepChem PubChemFingerprint featurizer [
28], which retrieves 881-bit substructure key vectors via the PubChem REST API. To improve computational efficiency and ensure reproducibility, fingerprint vectors were cached using canonical SMILES as unique identifiers, thereby avoiding redundant API calls across experiments.
Mol2vec embeddings. Continuous molecular embeddings were generated using the Mol2vec approach [
25], in which molecules are represented as the sum of vector embeddings of their constituent substructures derived from a pretrained Word2Vec model. Substructures not present in the pretrained vocabulary were assigned zero vectors. In addition to aggregated embeddings, sequences of substructure tokens were retained for input into a BiLSTM-based architecture, enabling the model to capture sequential dependencies between substructures.
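The handling of out-of-vocabulary substructures in the Mol2vec representation can be sketched as follows. This is a simplified illustration under stated assumptions: the toy vocabulary, the token names, and the helper functions `sum_embedding` and `token_vectors` are hypothetical, and a real pretrained model would supply vectors keyed by Morgan substructure identifiers.

```python
def sum_embedding(tokens, vocab, dim):
    """Sum the vectors of all substructure tokens; unknown tokens contribute zeros."""
    total = [0.0] * dim
    for token in tokens:
        vector = vocab.get(token, [0.0] * dim)  # zero vector for out-of-vocabulary tokens
        total = [t + v for t, v in zip(total, vector)]
    return total


def token_vectors(tokens, vocab, dim):
    """Keep one vector per token, in order, as input for a sequence model."""
    return [list(vocab.get(token, [0.0] * dim)) for token in tokens]


# Toy 3-dimensional vocabulary (hypothetical values).
toy_vocab = {"id_a": [1.0, 0.0, 2.0], "id_b": [0.5, 0.5, 0.5]}
tokens = ["id_a", "id_b", "id_unseen", "id_a"]

molecule_vector = sum_embedding(tokens, toy_vocab, dim=3)  # [2.5, 0.5, 4.5]
sequence = token_vectors(tokens, toy_vocab, dim=3)         # 4 vectors, third is all zeros
```

The summed vector corresponds to the aggregated embedding, whereas the ordered list of per-token vectors corresponds to the sequence input retained for the BiLSTM.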
2.3. Data Splitting
To robustly evaluate model performance and minimize information leakage, two complementary data splitting strategies were employed.
Random split. A stratified random split was applied to the combined Training and Test 1 datasets to generate training/validation and test subsets, with a test fraction of approximately 12% (as shown in
Table 1). Stratification was performed with respect to the target variable to preserve class distributions across splits and mitigate bias arising from class imbalance. This strategy assumes that training and test samples are independently and identically distributed (i.i.d.), and serves as a baseline for evaluating model performance under standard benchmarking conditions.
Scaffold split. To more rigorously assess generalization to structurally novel compounds, a Bemis–Murcko scaffold-based splitting strategy was employed [
29]. Murcko scaffolds were computed using RDKit. Molecules sharing the same scaffold were grouped together to ensure that structurally related compounds were not distributed across training and test sets, thereby preventing scaffold-level information leakage. Scaffold groups were sorted in descending order of size and assigned to the training set in a greedy manner until the target training proportion (88%) was reached. The remaining scaffold groups were assigned to the test set. This deterministic procedure guarantees complete scaffold exclusivity between training and test sets. Compared to random splitting, scaffold-based splitting provides a more stringent and realistic evaluation of model performance in prospective applications.
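The greedy assignment described above can be sketched as follows, assuming the scaffold of each molecule has already been computed (for example as a scaffold SMILES string from RDKit). The function name `scaffold_split`, the alphabetical tie-breaking rule, and the stopping criterion (groups are added until the training set reaches the target size) are one possible reading of the procedure, not necessarily the exact implementation used in the study.

```python
from collections import defaultdict


def scaffold_split(scaffolds, train_fraction=0.88):
    """Greedy scaffold split; scaffolds[i] is the scaffold string of molecule i."""
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)

    # Largest groups first; ties broken by scaffold string so the result is deterministic.
    ordered = sorted(groups.items(), key=lambda item: (-len(item[1]), item[0]))

    target = train_fraction * len(scaffolds)
    train, test = [], []
    for _, members in ordered:
        if len(train) < target:
            train.extend(members)  # whole group goes to training
        else:
            test.extend(members)   # every remaining group goes to the test set
    return train, test


# Toy example: 10 molecules, 4 scaffold groups ("" stands for acyclic molecules).
scaffolds = ["c1ccccc1"] * 5 + ["c1ccncc1"] * 3 + ["C1CCCCC1"] + [""]
train_idx, test_idx = scaffold_split(scaffolds, train_fraction=0.8)
```

Because whole groups are assigned at once, the realized training proportion can overshoot the target slightly, but no scaffold ever appears on both sides of the split.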
2.4. SMILES Augmentation
To address class imbalance and improve model robustness to representation variability, two complementary data augmentation strategies were applied to the training set. For random SMILES augmentation, all training molecules (both toxic and non-toxic) were expanded into multiple randomized SMILES representations to increase the diversity of input strings. For stereoisomer augmentation, expansion was applied selectively to toxic molecules only, so that the minority class was enlarged without artificially inflating the majority (non-toxic) class, thereby reducing bias during model training.
Random SMILES enumeration. For each training molecule, up to 50 unique randomised SMILES representations were generated using RDKit (Chem.MolToSmiles(mol, doRandom=True)). Randomization was performed with molecule-specific seeds [
31] derived from a global random seed (seed = 42) to ensure full reproducibility across runs. This augmentation strategy preserves molecular identity while exposing the model to multiple equivalent string representations, thereby improving robustness to SMILES-dependent encoding artefacts. All augmented representations were included in model training regardless of toxicity class.
Stereoisomer enumeration. To account for stereochemical variability, all possible stereoisomers of each toxic molecule were enumerated using RDKit’s EnumerateStereoisomers utility [
32]. This approach introduces chemically relevant structural diversity, particularly for molecules where stereochemistry may influence biological activity. Non-toxic molecules were retained as their canonical SMILES representations to avoid disproportionate expansion of the majority class.
2.5. Model Architectures
To ensure robust model optimization and fair performance comparison, a unified training and evaluation framework was applied across all models, with specific adaptations for deep learning and classical approaches.
Fully connected neural network (FC model). A seven-layer feed-forward neural network was employed to map fixed-length molecular fingerprints to a binary toxicity probability. The network architecture followed a progressive dimensionality reduction scheme (1024 → 512 → 256 → 128 → 64 → 32 → 1). Each hidden layer consisted of a linear transformation followed by batch normalization, ReLU activation, and dropout (rate = 0.5). The final output layer applied a sigmoid activation function. Models were trained with a mini-batch size of 128 using the Adam optimizer.
Sequence-based model (BiLSTM model). For the Mol2vec-based BiLSTM model, the embedding layer was initialized with pretrained Mol2vec weights [
25] to incorporate chemically informed substructure representations. Molecular sequences were constructed from Morgan substructure identifiers (radius = 1), consistent with the pretrained model. The architecture consisted of an embedding layer (dimension = 64, dropout = 0.1), followed by a two-layer bidirectional LSTM (hidden size = 256, inter-layer dropout = 0.3). The final forward and backward hidden states were concatenated to form a 512-dimensional representation, which was passed to a classifier head.
Classical machine learning models. In addition to deep learning approaches, six classical classifiers were benchmarked: Random Forest (RF), k-Nearest Neighbours (KNN), Radius Neighbour Classifier, Support Vector Machine (SVM), Multi-layer Perceptron (MLP), and CatBoost [
34]. These models were trained on the same featurized datasets as the deep learning models to enable fair comparison. Hyperparameter optimization was performed using RandomizedSearchCV with 5-fold cross-validation over 10 iterations. Model selection was based on a composite scoring metric defined as S = (MCC + F1)/2, which balances correlation-based and threshold-dependent performance measures. The Matthews correlation coefficient (MCC) measures the correlation between predicted and true binary labels across all four confusion matrix cells and is robust to class imbalance; the F1 score is the harmonic mean of precision and recall and captures the balance between false positives and false negatives.
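The composite selection score S = (MCC + F1)/2 can be computed directly from the confusion matrix, as in the following minimal sketch. The helper names are hypothetical, and the use of the positive-class (toxic) F1 rather than a macro-averaged F1 is an assumption, since the text does not specify the averaging used for model selection.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels with 1 = toxic."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall for the positive class."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0 when a margin is empty."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


def composite_score(y_true, y_pred):
    """S = (MCC + F1) / 2."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (mcc(tp, tn, fp, fn) + f1_score(tp, fp, fn)) / 2


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
score = composite_score(y_true, y_pred)  # F1 = 2/3, MCC = 7/15, S = 17/30
```

In a scikit-learn workflow, a function of this form could be wrapped with `make_scorer` and passed as the `scoring` argument of `RandomizedSearchCV`.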
2.6. Model Training, Validation, and Evaluation
Deep learning models were trained using stratified 5-fold cross-validation to preserve class distributions across folds and reduce variance in performance estimates. Optimisation was performed using the Adam optimizer (learning rate = 0.01, L2 weight decay = 1 × 10⁻⁵), with an additional L1 regularization term (λ = 5 × 10⁻⁴) applied to all model parameters. A dynamic learning rate schedule, ReduceLROnPlateau, was employed, reducing the learning rate by a factor of 0.5 after three consecutive epochs without improvement in validation loss. Early stopping was applied if validation loss did not decrease for 16 consecutive epochs, with a maximum training budget of 10,000 epochs. All experiments were conducted with a fixed random seed (seed = 42) to ensure reproducibility.
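The interplay between the learning rate schedule and early stopping can be illustrated with a framework-free simulation. This is a simplified reading of the rules stated above, not PyTorch's exact `ReduceLROnPlateau` logic, which additionally applies an improvement threshold and counts stale epochs slightly differently. The function name and the synthetic loss curve are hypothetical.

```python
def simulate_training_control(val_losses, lr=0.01, factor=0.5,
                              lr_patience=3, stop_patience=16):
    """Replay a validation-loss curve and return (stopping epoch, lr per epoch)."""
    best = float("inf")
    stale_for_lr = 0    # epochs without improvement since the last LR change
    stale_for_stop = 0  # epochs without improvement since the last best loss
    lr_history = []

    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            stale_for_lr = 0
            stale_for_stop = 0
        else:
            stale_for_lr += 1
            stale_for_stop += 1

        if stale_for_lr >= lr_patience:
            lr *= factor       # halve the learning rate on a plateau
            stale_for_lr = 0

        lr_history.append(lr)
        if stale_for_stop >= stop_patience:
            return epoch, lr_history  # early stopping triggered

    return len(val_losses), lr_history


# Synthetic curve: three improving epochs followed by a long plateau.
val_losses = [1.0, 0.8, 0.7] + [0.75] * 20
stopped_epoch, lr_history = simulate_training_control(val_losses)
```

With these settings, a run that stops improving has its learning rate halved five times before the 16-epoch early-stopping criterion ends training, so the 10,000-epoch budget acts only as an upper bound.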
Class imbalance was addressed through SMOTE augmentation and data-level rebalancing. The Synthetic Minority Oversampling Technique (SMOTE) [
30] was optionally applied to augment the minority class by generating synthetic samples through interpolation between minority-class instances in feature space. Model performance was evaluated using macro-averaged F1 score, Matthews correlation coefficient (MCC), and balanced accuracy. Binary classification decisions were obtained using a default probability threshold of 0.5.
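The interpolation step at the core of SMOTE can be sketched in a few lines. This is an illustrative re-implementation under simplifying assumptions (Euclidean distance, uniform choice among the k nearest minority neighbours), not the library routine used in the study; the function name `smote_samples` and the toy feature vectors are hypothetical.

```python
import random


def smote_samples(minority, n_new, k=5, seed=42):
    """Create n_new synthetic minority samples by neighbour interpolation."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by squared Euclidean distance to the base.
        others = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(base, minority[j])),
        )
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy minority class in a 2-dimensional feature space.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_samples(minority, n_new=6, k=2)
```

Each synthetic sample lies on the line segment between two real minority samples. For binary fingerprints this yields non-binary feature values, which may partly explain why interpolation in feature space does not necessarily correspond to chemically meaningful new structures.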
2.7. SHAP Attribution and Distribution Analysis
To quantify and interpret the contribution of molecular descriptors to model predictions, SHAP (SHapley Additive exPlanations) values were computed [
33] for the best-performing classical machine learning models, specifically CatBoost and Random Forest. SHAP analysis was performed on both the internal and external test sets to ensure that feature importance reflected generalizable patterns rather than dataset-specific biases. For each descriptor, mean absolute SHAP values were aggregated across all test instances to obtain a global importance ranking. The top 20 Mordred descriptors were selected for downstream analyses.
To further characterize these SHAP-derived features, statistical distributional analyses were performed separately on Test 1 and Test 2. Mordred descriptors were recomputed directly from raw InChI strings using RDKit and the Mordred calculator to ensure independence from prior preprocessing. For each descriptor and dataset, per-class summary statistics were computed, and differences between toxic and non-toxic compounds were assessed using the two-sided Mann–Whitney U test, with effect sizes quantified using the rank-biserial correlation. p-values were adjusted using the Benjamini–Hochberg procedure.
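The effect size and multiple-testing steps can be sketched without external dependencies. In the sketch below, the rank-biserial correlation is computed from pairwise comparisons (equivalent to 2U/(n₁n₂) − 1, with ties counted as half), and the Benjamini–Hochberg adjustment is applied to a list of p-values. The function names and example values are hypothetical; in practice the p-values themselves would come from a routine such as `scipy.stats.mannwhitneyu`.

```python
def rank_biserial(toxic, nontoxic):
    """Rank-biserial correlation: (wins - losses) / number of pairs."""
    wins = sum(1 for t in toxic for n in nontoxic if t > n)
    losses = sum(1 for t in toxic for n in nontoxic if t < n)
    return (wins - losses) / (len(toxic) * len(nontoxic))


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values (q-values) in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy descriptor values for the two classes (hypothetical).
effect = rank_biserial(toxic=[3, 4, 5], nontoxic=[1, 2, 3])  # 8 wins, 1 tie, 0 losses

# Toy p-values for four descriptors (hypothetical).
qvalues = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

A positive rank-biserial value indicates that the descriptor tends to be larger in toxic compounds, and a negative value indicates the opposite.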
2.8. Substructure Enrichment Analysis
To identify chemically interpretable structural motifs associated with toxicity, a curated panel of 16 substructure classes was defined using a combination of SMARTS pattern matching and programmatic structural rules implemented in RDKit. Selected motifs include quinones [
35] and nitroaromatics (associated with redox cycling and ROS generation), phenols and nitrophenols (protonophoric uncouplers that disrupt membrane potential) [
36,
37], Michael acceptors (electrophilic agents capable of covalent protein modification) [
38,
39], polycyclic aromatics and long aliphatic chains (lipophilic membrane-accumulating structures) [
40], aromatic amines and heteroaromatics [
41], as well as broader structural categories and functional groups included as reference features. Enrichment was quantified as the ratio of motif prevalence in toxic versus non-toxic molecules, and statistical significance was assessed using two-sided Fisher’s exact tests with Benjamini–Hochberg FDR correction.
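The enrichment ratio and the two-sided Fisher's exact test can be sketched from the four cell counts of a motif-by-class table. The implementation below uses the common definition of the two-sided p-value (the sum of all table probabilities no larger than that of the observed table); the function names and the toy counts are hypothetical.

```python
from math import comb


def enrichment_ratio(tox_with, tox_without, non_with, non_without):
    """Motif prevalence in toxic molecules divided by prevalence in non-toxic ones."""
    tox_prev = tox_with / (tox_with + tox_without)
    non_prev = non_with / (non_with + non_without)
    return tox_prev / non_prev if non_prev else float("inf")


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    low, high = max(0, row1 - (n - col1)), min(row1, col1)
    total = comb(n, row1)
    # Hypergeometric probability of every table with the same margins.
    probs = {x: comb(col1, x) * comb(n - col1, row1 - x) / total
             for x in range(low, high + 1)}
    observed = probs[a]
    p_value = sum(p for p in probs.values() if p <= observed * (1 + 1e-9))
    return min(1.0, p_value)


# Toy counts: rows = toxic / non-toxic, columns = motif present / absent.
ratio = enrichment_ratio(3, 1, 1, 3)          # (3/4) / (1/4) = 3.0
p_value = fisher_exact_two_sided(3, 1, 1, 3)  # 34/70
```

The toy example shows why the significance test is needed alongside the ratio: a three-fold enrichment in a table of only eight molecules is far from significant.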
2.9. Stratified Odds Ratio Analysis
To investigate whether key molecular descriptors modulate toxicity risk within specific structural contexts, a stratified odds ratio (OR) analysis was performed by jointly considering SHAP-identified descriptors and motif-defined subpopulations. For each combination of the top 20 SHAP-selected descriptors and each motif-defined stratum, a 2 × 2 contingency table was constructed. Binary descriptors were defined using fixed thresholds, while continuous Mordred descriptors were binarized at their dataset-wide median. Odds ratios and 95% confidence intervals were estimated using the Woolf log-odds method, with Benjamini–Hochberg correction applied across all descriptor–motif combinations.
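The construction of a single stratum-level estimate can be sketched as follows. Two details are assumptions rather than statements from the text: values strictly above the median are treated as "high", and a 0.5 continuity correction is added to all cells when any cell is empty. The function names and toy data are hypothetical.

```python
import math
from statistics import median


def binarize_at_median(values):
    """1 if the descriptor value is above the median, else 0 (assumed convention)."""
    cut = median(values)
    return [1 if v > cut else 0 for v in values]


def contingency(high_flags, toxic_flags):
    """Return (a, b, c, d) = (high&toxic, high&non-toxic, low&toxic, low&non-toxic)."""
    pairs = list(zip(high_flags, toxic_flags))
    return (pairs.count((1, 1)), pairs.count((1, 0)),
            pairs.count((0, 1)), pairs.count((0, 0)))


def woolf_odds_ratio(a, b, c, d, z=1.959964):
    """Odds ratio with a Woolf (log-odds) 95% confidence interval."""
    if 0 in (a, b, c, d):
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))  # assumed zero-cell correction
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    return odds_ratio, math.exp(log_or - z * se), math.exp(log_or + z * se)


# Toy stratum: descriptor values and toxicity labels for eight molecules.
values = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1, 0, 1]
table = contingency(binarize_at_median(values), labels)  # (3, 1, 1, 3)
odds_ratio, ci_low, ci_high = woolf_odds_ratio(*table)   # OR = 9.0
```

In this toy stratum the point estimate is large but the interval spans 1, illustrating why small motif-defined strata require both confidence intervals and multiplicity correction before interpretation.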
2.10. Computational Environment
All computations were performed on the Taiwan Computing Cloud (TWCC) high-performance computing (HPC) platform, Nano5. Model training was conducted on GPU-enabled nodes equipped with NVIDIA H100 GPUs and Intel Xeon processors (Platinum 8480CL). The software environment was based on Python (version 3.11.14), with key libraries including PyTorch (version 2.10.0), scikit-learn (version 1.8.0), RDKit (version 2025.03.6), Mordred (version 1.2.0), and SHAP (version 0.49.1). All experiments were conducted with a fixed random seed (seed = 42) to ensure reproducibility.
3. Results
3.1. Dataset Analysis and Tanimoto Analysis
To obtain an overview of the physicochemical properties of the datasets used in this study, a graphical analysis of LogP and molecular weight was performed (
Figure 2A). LogP is a metric describing the ability of a compound to partition between water and oil, also known as hydrophobicity. The molecular weight reflects the size and complexity of compounds and is of great significance for understanding the physical properties and biological activities of compounds.
Molecular weights are distributed mainly in the range of 0 to 2000, while LogP values lie roughly between −20 and 20. Toxic (positive) compounds in the training set are mostly located in the LogP range of −5 to 10, whereas those in the test set span a narrower range of 0 to 10. Among toxicants, compounds with higher molecular weights often exhibit increased structural complexity and frequently harbor multiple functional groups capable of specific interactions with biomolecules. In addition, hydrophobicity (LogP) is closely related to the lipid solubility and cell permeability of the compounds, influencing their ability to accumulate in cellular membranes and mitochondria and thereby trigger a toxic response.
Tanimoto analysis was used to assess the structural similarity between positive and negative compounds in the training set. Among the molecular fingerprints tested, the PubChem fingerprint yielded the highest Tanimoto similarity scores, followed by the MACCS keys fingerprint, indicating that similarity scores depend strongly on the algorithm used to generate the fingerprints (
Figure 2B). Both the PubChem and MACCS keys fingerprints encode the presence of predefined functional groups and substructures as binary vectors, so high similarity scores are expected for these representations.
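For reference, the Tanimoto coefficient between two binary fingerprints is the number of shared on-bits divided by the number of bits set in either fingerprint. A minimal sketch, with fingerprints represented as sets of on-bit indices (the bit indices are hypothetical):

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as on-bit index sets."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


# Two toy fingerprints sharing bits 3 and 4 out of five distinct on-bits.
similarity = tanimoto({1, 2, 3, 4}, {3, 4, 5})  # 2 / 5 = 0.4
```

Because the score depends only on shared versus total on-bits, fingerprints built from a short list of common predefined keys tend to share more bits between unrelated molecules than sparse hashed circular fingerprints do.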
To characterize the chemical diversity of the datasets and evaluate structural overlap between evaluation sets, dimensionality reduction analysis was performed on molecular fingerprint representations. Principal component analysis (PCA) and t-distributed stochastic neighbour embedding (t-SNE) were applied to Morgan (2048-bit), MACCS keys (167-bit), and PubChem (881-bit) fingerprint matrices for all compounds in the internal MMP test set (Test 1) and the external non-MMP test set (Test 2) (
Figure 2C,D). Both PCA and t-SNE showed that Test 1 and Test 2 compounds were broadly intermingled across the projected chemical space, with no clear separation into non-overlapping clusters, consistent with the two evaluation sets sharing similar overall physicochemical and structural characteristics.
3.2. Classical Machine Learning Benchmarks
Six classical machine learning models were benchmarked across multiple feature representations, data augmentation strategies, and splitting schemes. Performance was evaluated on Test 1 and Test 2 to assess both in-distribution performance and cross-assay generalization.
Under the random split, several classical models achieved strong performance on Test 1, with F1 scores exceeding 0.80 and MCC values around 0.70, but performance dropped substantially on Test 2 across all models (
Figure 3A). The best-performing configuration under random split was KNN with SMOTE and combined features, reaching F1 = 0.8436 and MCC = 0.7081 on Test 1, but only F1 = 0.2984 and MCC = 0.2375 on Test 2 (
Table 2). Similar behavior was observed for Random Forest and CatBoost, where high Test 1 performance (F1 ≈ 0.83) did not translate to strong Test 2 performance.
Feature representation had a clear impact, with Mordred descriptors consistently yielding the highest Test 2 performance, including the best overall result achieved by Random Forest (F1 = 0.3483, MCC = 0.2990) under random split. This trend remained under scaffold splitting, where Test 1 performance decreased substantially (F1 ≈ 0.62–0.64), while Test 2 performance remained relatively stable (F1 ≈ 0.32–0.34) (
Figure 3D;
Table 2). The effect of SMOTE was modest, providing slight improvements in Test 1 but no consistent gain on Test 2 (
Figure 3C). Across all configurations, Test 2 performance remained constrained within a narrow range regardless of model or feature choice (
Figure 3E), demonstrating a systematic generalization gap between assays, with Mordred descriptors providing the most transferable signals among the evaluated features.
3.3. Deep Learning Performance Across Augmentation Strategies
Deep learning models were evaluated across feature representations, augmentation strategies, and splitting schemes to assess their ability to improve predictive performance and generalization.
Under random split, the fully connected model (FC model) achieved strong performance on Test 1, with F1 scores reaching approximately 0.81 depending on feature type, but substantially lower performance on Test 2 across all configurations (
Figure 4A;
Table 3). SMILES-based augmentation consistently improved Test 1 performance, particularly for fingerprint-based features, whereas stereoisomer augmentation showed smaller or inconsistent gains. However, neither strategy improved Test 2 performance, which remained constrained within a narrow range.
The Mol2vec-based BiLSTM model showed moderate performance on Test 1 (F1 score = 0.56–0.65) and similarly low performance on Test 2 (0.25–0.28), with only minor differences between augmentation strategies (
Figure 4B). Compared to the FC model, the BiLSTM did not provide performance gains and showed similar limitations in cross-assay generalisation. Under scaffold splitting, Test 1 performance decreased substantially (0.45–0.60), and the effect of augmentation was reduced, whereas Test 2 performance remained similar to that observed under random split (
Figure 4D). The best Test 2 performance achieved by deep learning models remained below that of the best classical models, indicating that increased model complexity and data augmentation do not overcome the generalisation gap.
3.4. Internal Versus External Generalisation Reveals a Systematic Performance Gap
Across all models, feature representations, and training strategies, a consistent performance gap was observed between Test 1 and Test 2 (
Figure 5). The distribution of F1 differences shows a mean gap of approximately 0.39, with 55% of configurations exceeding 0.30 and 42% exceeding 0.40 (
Figure 5A). In addition, more than half of the configurations produced Test 2 F1 scores below 0.30, indicating limited predictive performance on the external assay.
This gap was consistent across model architectures. All models, including SVM, KNN, Random Forest, CatBoost, and neural networks, showed similar mean differences ranging from approximately 0.33 to 0.39 (
Figure 5B), indicating that the generalization gap is not driven by model choice. Data splitting strategy influenced the magnitude of the gap: random split produced higher Test 1 performance but similar Test 2 performance compared to scaffold split, resulting in a larger gap (
Figure 5C). Data augmentation had limited impact on cross-assay generalization—SMOTE and SMILES-based augmentation improved Test 1 performance but did not improve Test 2 performance (
Figure 5D). Feature representation had little effect on Test 2 performance, which remained constrained across all feature types, although Mordred descriptors showed slightly higher values on average (
Figure 5E).
3.5. Descriptor Importance and Distributional Analysis
SHAP analysis of the best-performing classical models, CatBoost and Random Forest, revealed a consistent set of important descriptors across both datasets (
Figure 6). Mean absolute SHAP values were aggregated to obtain a global ranking, from which the top 20 Mordred descriptors were selected (
Table 4). Several descriptors, including C3SP3, AATS2s, and GATS3s, were consistently ranked among the most important features across both models, indicating a robust consensus signal. Most top-ranked descriptors belong to the Broto–Moreau and Geary autocorrelation families, indicating that model predictions are driven primarily by global physicochemical organisation rather than isolated structural motifs.
Figure 6. SHAP feature importance aggregated across model type (Random Forest and CatBoost), data splitting strategy (random and scaffold), and test set (Test 1: MMP assay; Test 2: non-MMP assay). Bars represent proportional SHAP importance, defined as the fraction of total absolute SHAP values within each condition. Color intensity indicates the consistency of each descriptor across conditions, ranging from presence in one condition to all four conditions within each assay.
Despite its dominant SHAP importance, C3SP3 showed no significant difference between toxic and non-toxic compounds in either dataset, with highly overlapping distributions (
Figure 7). This indicates that C3SP3 does not act as a marginal discriminator but instead contributes through interactions with other descriptors. In contrast, ATS5i showed higher values in toxic compounds in both datasets and was statistically significant in both cases, while AATS2s and GATS3s were lower in toxic compounds. The descriptor nAcid showed consistently higher values in non-toxic compounds in both datasets, indicating that acidic functionality is negatively associated with toxicity at the marginal level but has limited standalone predictive power.
Violin plots showing the distribution of five representative descriptors (C3SP3, nAcid, ATS5i, AATS2s, and GATS3s) stratified by toxicity class for both Test 1 (MMP assay) and Test 2 (non-MMP assay). Individual data points are overlaid to illustrate distribution density.
C3SP3 shows substantial overlap between toxic and non-toxic compounds in both assays, indicating a lack of univariate separation despite high SHAP importance. In contrast, ATS5i exhibits higher values in toxic compounds, while AATS2s and GATS3s show lower values in toxic compounds across both assays. The descriptor nAcid is enriched in non-toxic compounds. These patterns distinguish descriptors with marginal discriminative power from those contributing through interaction effects.
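One way to quantify the univariate separation described above is the probability that a randomly chosen toxic compound has a higher descriptor value than a randomly chosen non-toxic one (the Mann-Whitney U statistic divided by the number of pairs). The sketch below is illustrative only: the descriptor values are invented, the function name `separation_auc` is hypothetical, and the statistical test actually used in the study is not specified in this section.

```python
def separation_auc(toxic_values, nontoxic_values):
    """Probability that a random toxic value exceeds a random non-toxic value.

    Ties count as 0.5. A result near 0.5 means the two distributions overlap
    almost completely; values near 0 or 1 mean strong univariate separation.
    """
    wins = 0.0
    for t in toxic_values:
        for n in nontoxic_values:
            if t > n:
                wins += 1.0
            elif t == n:
                wins += 0.5
    return wins / (len(toxic_values) * len(nontoxic_values))


# Invented values: an overlapping descriptor and a well-separated one.
overlapping = separation_auc([1, 2, 3], [1, 2, 3])
separated = separation_auc([4, 5], [1, 2])
```

A descriptor can score close to 0.5 on this measure and still carry high SHAP importance if its contribution depends on the values of other descriptors.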
3.6. Structural Motif Analysis and Context-Dependent Effects
Substructure enrichment analysis identified several motifs associated with toxicity in the MMP assay, but these associations were not consistently preserved in the non-MMP assay (
Figure 8). In the MMP assay, multiple motifs including quinones, nitroaromatics, polycyclic aromatics, aromatic amines, and phenols were enriched in toxic compounds, whereas the non-MMP assay showed fewer significant motifs and weaker enrichment patterns. These differences indicate that motif-level associations are strongly assay-dependent.
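Motif enrichment of this kind is commonly assessed with a one-sided Fisher's exact test on a 2x2 contingency table. The sketch below assumes that test, which this section does not confirm, and uses invented counts; `fisher_enrichment_p` is a hypothetical helper name.

```python
from math import comb


def fisher_enrichment_p(a, b, c, d):
    """One-sided Fisher's exact p-value for enrichment of a motif in toxic compounds.

    a: toxic with motif        b: non-toxic with motif
    c: toxic without motif     d: non-toxic without motif
    Returns P(X >= a), where X is the hypergeometric count of toxic compounds
    among the motif-bearing ones.
    """
    n_motif = a + b
    n_toxic = a + c
    n_total = a + b + c + d
    upper = min(n_motif, n_toxic)
    tail = sum(
        comb(n_toxic, k) * comb(n_total - n_toxic, n_motif - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(n_total, n_motif)


# Invented counts: 8 of 10 motif-bearing compounds are toxic, versus 12 of 90 others.
p_enriched = fisher_enrichment_p(8, 2, 12, 78)

# Invented counts with no toxic motif-bearing compounds: no evidence of enrichment.
p_absent = fisher_enrichment_p(0, 10, 20, 70)
```

Running the same test separately on each assay makes the assay dependence explicit: a motif may yield a small p-value in one assay and not in the other.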
Stratified odds ratio analysis further revealed context-dependent effects for key descriptors. The descriptor nAcid showed a global negative association with toxicity, with an odds ratio of 0.44 (q = 0.005) in Test 1 and 0.77 (q = 0.798) in Test 2 (
Figure 9), although stratified analysis showed that this relationship can vary across structural contexts. C3SP3 showed odds ratios close to 1 across most strata in both assays, with only a single significant association, consistent with its role as an interaction-dependent descriptor (
Figure 10). ATS5i displayed positive associations with toxicity in multiple contexts, particularly in the carboxylic acid stratum (OR = 3.20, q = 0.024 in Test 1 and OR = 3.96, q = 0.006 in Test 2) and the polycyclic aromatic stratum (OR = 3.12, q = 0.001 in Test 2) (
Figure 11). AATS2s showed a consistent negative association with toxicity across multiple structural strata in the non-MMP assay, particularly within aromatic and polycyclic aromatic compounds, suggesting it captures a transferable physicochemical signal beyond the MMP training context (
Figure 12). Together, these results indicate that mitochondrial toxicity in these datasets is not explained by single structural motifs or individual descriptors, but instead reflects interactions between molecular features.
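The stratified odds ratios and q-values reported above can be illustrated with a short sketch. It assumes that each descriptor is dichotomized into "high" and "low" groups within a structural stratum and that q-values come from Benjamini-Hochberg correction; neither assumption is confirmed in this section, and all counts and p-values below are hypothetical.

```python
def odds_ratio(high_toxic, high_nontoxic, low_toxic, low_nontoxic):
    """Odds ratio for a 2x2 table; adds 0.5 to every cell if any cell is zero."""
    cells = [high_toxic, high_nontoxic, low_toxic, low_nontoxic]
    if 0 in cells:
        cells = [x + 0.5 for x in cells]  # Haldane-Anscombe correction
    a, b, c, d = cells
    return (a * d) / (b * c)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted q-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q


# Hypothetical counts per stratum: (high_toxic, high_nontoxic, low_toxic, low_nontoxic).
strata = {
    "carboxylic acid": (12, 20, 5, 40),
    "polycyclic aromatic": (9, 6, 0, 12),
}
ors = {name: odds_ratio(*cells) for name, cells in strata.items()}

# Hypothetical raw p-values, one per stratum-level test.
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

Correcting across all strata at once is what allows a large odds ratio in a small stratum to be reported alongside a non-significant q-value.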
4. Discussion
Mitochondrial toxicity is a multifactorial phenomenon that has been linked to diverse mechanisms, including disruption of the electron transport chain, redox cycling, and membrane depolarization. In this study, we evaluated machine learning models for mitochondrial toxicity prediction and combined model interpretation with statistical analysis to identify physicochemical determinants of toxicity.
A central finding is the pronounced generalization gap between the internal MMP test set and the external non-MMP assay set (comprising flux and Glu/Gal data). All models achieved strong performance on the MMP-based test set but degraded substantially on the external set, regardless of model type or feature representation. This gap has a plausible mechanistic basis: MMP assays are most sensitive to membrane depolarization and protonophoric uncoupling, whereas Glu/Gal and flux assays detect the broader consequences of impaired oxidative phosphorylation under metabolic stress, with different sensitivity profiles for electron transport chain (ETC) complex inhibitors, fatty acid oxidation inhibitors, and ionophores [
5,
6]. Classical ETC complex inhibitors such as rotenone and antimycin A, for example, produce modest or no MMP signal under standard glucose conditions, yet are reliably detected as mitochondrial toxicants under galactose-sensitized conditions [
42]. Conversely, uncouplers such as FCCP depolarize the membrane and are readily captured by MMP assays but may produce a qualitatively different cytotoxicity profile in the Glu/Gal assay. This mechanistic divergence suggests that a classifier trained on MMP labels may capture assay-associated physicochemical patterns enriched among membrane-depolarizing or uncoupling-like compounds. These patterns may not transfer equally well to compounds whose mitochondrial effects are primarily reflected by ETC inhibition, fatty acid oxidation impairment, or metabolic stress responses. Thus, an MMP-trained model should be interpreted as learning MMP-associated chemical profiles rather than a fully generalizable representation of mitochondrial toxicity. This observation is consistent with reported limitations of single-assay toxicity models under distribution shift [
14,
23], and underscores the broader point that no single in vitro assay comprehensively represents mitochondrial toxicity as a biological phenomenon.
The reduced performance under scaffold splitting further indicates that part of the predictive signal arises from structural similarity between training and test compounds rather than from transferable features. Descriptor-based features were somewhat more stable across datasets than fingerprint-based representations: although Test 2 performance remained constrained for all feature types, Mordred descriptors scored slightly higher on average than structural fingerprints, suggesting that global physicochemical properties are more informative for generalizable prediction. This cross-assay transferability may reflect the physicochemical nature of mitochondrial toxicants. Unlike receptor-mediated endpoints, where specific structural motifs govern activity through a defined lock-and-key mechanism, mitochondrial toxicity is thought to be influenced by global molecular properties such as membrane partitioning, charge distribution, and electron transfer capacity. Fingerprint-based models, by contrast, may learn scaffold-specific associations that are coincidental to the training distribution and may therefore generalize less reliably across mechanistically distinct assay endpoints, particularly when the external set probes a different aspect of mitochondrial dysfunction, such as oxidative phosphorylation capacity under metabolic stress rather than membrane depolarization. This is consistent with previous studies showing that mitochondrial toxicity is associated with properties such as lipophilicity, molecular size, and electronic characteristics, which influence membrane permeability and subcellular accumulation [
7,
8,
11].
Interpretability analysis provides further insight into these patterns. SHAP attribution revealed that model predictions are dominated by autocorrelation descriptors, which encode the spatial distribution of atomic properties across the molecular graph, capturing global organization of physicochemical features rather than local functional groups. Notably, the most important descriptor, C3SP3, showed no significant difference between toxic and non-toxic compounds, indicating that its contribution arises from interactions with other features rather than independent predictive power. This distinction highlights a limitation of interpreting feature importance without considering feature interactions.
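The distinction between marginal and interaction-driven importance can be made concrete with a deliberately artificial example. In the sketch below, `x1` stands in for a descriptor such as C3SP3 and `x2` for a second descriptor; the data are synthetic and the toxicity rule (toxic only when exactly one feature is "on") is an assumption chosen purely for illustration, not a property of the study's data.

```python
from itertools import product

# Synthetic data: two binary features; label is 1 (toxic) only when exactly one is on.
data = [(x1, x2, int(x1 != x2)) for x1, x2 in product([0, 1], repeat=2)] * 25


def toxic_rate(rows):
    """Fraction of rows labelled toxic."""
    return sum(label for _, _, label in rows) / len(rows)


# Marginal view: the toxicity rate is identical whether x1 is off or on.
rate_x1_off = toxic_rate([r for r in data if r[0] == 0])
rate_x1_on = toxic_rate([r for r in data if r[0] == 1])

# Conditional view: within each x2 stratum, x1 fully determines the label.
rate_x1_on_given_x2_off = toxic_rate([r for r in data if r[0] == 1 and r[1] == 0])
rate_x1_on_given_x2_on = toxic_rate([r for r in data if r[0] == 1 and r[1] == 1])
```

A univariate comparison of `x1` between classes would show no difference, yet any model that predicts these labels well must rely on `x1`, so an attribution method would assign it substantial importance.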
Descriptors with consistent marginal effects suggest specific physicochemical trends. ATS5i, which reflects ionization potential-related autocorrelation, was higher in toxic compounds across both assays, suggesting a possible role for electronic properties in toxicity. This association may be broadly compatible with mechanisms involving redox cycling or interference with the electron transport chain, though the observed pattern reflects a statistical tendency across structurally diverse compounds rather than a direct mechanistic link. In contrast, descriptors such as AATS2s and GATS3s were lower in toxic compounds, indicating that changes in the spatial distribution of electronic or topological properties may also contribute to toxicity.
The behavior of nAcid highlights the importance of context. Acidic functionality was negatively associated with toxicity in both datasets, although the association was statistically significant only in Test 1 (q = 0.005, versus q = 0.798 in Test 2), indicating that polar, ionizable compounds are less likely to be toxic on average. This is consistent with reduced membrane permeability limiting mitochondrial accumulation. However, stratified analysis showed that this relationship can vary across structural contexts, suggesting that acidity may contribute to toxicity when combined with sufficient lipophilicity, as is the case for weak-acid uncouplers.
Motif-level analysis further demonstrates the context-dependence of structural alerts [
43]. Several motifs, including quinones, nitroaromatics, phenols, and aromatic amines, were significantly enriched in toxic compounds in the MMP assay but showed weaker or absent enrichment in the non-MMP assay. This cross-assay inconsistency is further illustrated by the 1,4-dihydropyridine (1,4-DHP) scaffold. In an initial structural alert screen, 22 compounds bearing this substructure were identified as MMP-active, including clinically used calcium channel blockers such as amlodipine, felodipine, and nicardipine [
44,
45]. These compounds are known to partition into lipophilic membranes and have been associated with inhibition of mitochondrial electron transport chain complexes, providing a plausible mechanism for their MMP activity [
44,
45,
46]. However, 1,4-DHP membership does not predict toxicity in the non-MMP external set. This pattern is consistent with previous analyses showing that structural alert generalization across different toxicological endpoints is often limited [
14,
23]. These results support a model in which toxicity arises from multivariate physicochemical interactions rather than single defining structural features.
This study has several implications for computational mitochondrial toxicity assessment. First, cross-assay evaluation on an independent endpoint is essential; random-split benchmarking on a single assay substantially overestimates prospective performance. Second, descriptor-based representations, particularly Mordred autocorrelation descriptors, provide more transferable physicochemical signals than fingerprint-based approaches in this setting. Third, interpretability analyses should account for feature interactions and structural context rather than relying solely on global importance rankings or single-motif enrichment. Future work should focus on multi-assay training strategies, integration of mechanistic pathway information, and development of applicability domain methods tailored to cross-assay prediction scenarios.
Several limitations of this study should be acknowledged. The external test set is relatively small and heavily class-imbalanced (approximately 10:1 non-toxic to toxic), which limits the precision of cross-assay performance estimates. Model training relied exclusively on MMP assay labels, a proximal measure of mitochondrial dysfunction; the external set introduces endpoint heterogeneity that the models were not trained to address. Only 2D Mordred descriptors were used; 3D conformational or graph-based representations may capture additional predictive information. No applicability domain analysis was performed to identify regions of chemical space where predictions may be less reliable. These limitations suggest that the reported cross-assay performance values represent a conservative estimate of achievable generalization given the current data, and that future improvements in dataset breadth and endpoint diversity will be important for advancing this field.