Interpretable Machine Learning for Mitochondrial Toxicity Prediction: Cross-Assay Generalization, Descriptor Transferability, and Context-Dependent Structural Effects

Lin, Yu-Heng; Lin, Yu-Te; Wei, An-Chi

doi:10.3390/clinbioenerg2030011

Open Access Article

by Yu-Heng Lin, Yu-Te Lin and An-Chi Wei *

Graduate Institute for Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei 10607, Taiwan

* Author to whom correspondence should be addressed.

Clin. Bioenerg. 2026, 2(3), 11; https://doi.org/10.3390/clinbioenerg2030011 (registering DOI)

Submission received: 19 December 2025 / Revised: 7 June 2026 / Accepted: 11 June 2026 / Published: 27 June 2026

Abstract

Mitochondrial toxicity is a major concern in drug development and safety assessment. Machine learning models trained on mitochondrial membrane potential (MMP) assay data offer promising toxicity predictions, but cross-scaffold and cross-assay transferability remain uncharacterized. We evaluated classical machine learning and deep learning architectures across six molecular representations and both random and scaffold-based splitting strategies, assessing performance on an internal MMP test set and an independent external set comprising flux and glucose–galactose assay data. Across all model configurations, we observed a consistent cross-assay generalization gap, independent of model type, feature choice, or augmentation strategy. Mordred descriptors provided the most transferable predictive signal, outperforming fingerprint-based representations on the external set. SHAP analysis of the best CatBoost and Random Forest models identified autocorrelation-family descriptors as dominant predictors. C3SP3 contributed through feature interactions while ATS5i showed positive association with toxicity across assays. Motif-level and descriptor-level associations proved to be strongly assay-dependent, supporting a mechanism in which mitochondrial toxicity arises from multivariate physicochemical interactions rather than single structural alerts.

Keywords: mitochondrial toxicity; machine learning; molecular descriptors; SHAP; scaffold split; generalizability; interpretability

1. Introduction

Mitochondria are functionally rich and diverse organelles. In addition to their primary role in energy production, mitochondria regulate various biological processes, including signal transduction, immune response, cell differentiation, and cell death [1].

Mitochondrial toxicity refers to the damage caused to mitochondria by various substances, including drugs, chemical compounds, and environmental toxins. Mitochondrial toxicity can have wide-ranging effects, given the central role of mitochondria in energy production and other cellular functions. Understanding mitochondrial toxicity is crucial in drug development, as many drugs have been withdrawn from the market or are limited in use due to their toxic effects on mitochondria [2]. As such, screening for mitochondrial toxicity is an important part of the safety evaluation of new pharmaceuticals. Mitochondrial toxicity remains a complex and evolving field of study, with ongoing research aimed at better understanding the mechanisms of damage and developing effective treatments.

Mitochondrial toxicity has traditionally been investigated through in vitro and animal model experiments. The mitochondrial membrane potential (MMP) is an important indicator of mitochondrial function and structural integrity. A decrease in MMP can be detected using lipophilic cationic fluorescent dyes such as TMRM and Mito-MPS dye [3]. The MMP assay can be multiplexed with a cell viability assay that measures ATP levels to identify compounds that induce concurrent cytotoxicity [4]. The glucose–galactose (Glu–Gal) assay is another common method for investigating mitochondrial toxicity, based on the principle that cells with compromised mitochondrial function cannot generate ATP from galactose as efficiently as from glucose [5,6]. The extracellular flux assay [5] measures changes in mitochondrial respiratory function and glycolysis through continuous measurements of the oxygen consumption rate (OCR) and extracellular acidification rate (ECAR), offering insights into cellular energy metabolism and mitochondrial function. These experimental methods have provided valuable information; however, they are also costly, time-consuming, and condition-dependent. Therefore, researchers are seeking new strategies and tools to improve the prediction of drug-induced mitochondrial toxicity.

In the past, quantitative structure–activity relationship (QSAR) methods were the predominant approach to predicting mitochondrial toxicity [9], using chemical structures [7] or molecular descriptors [8] as input. More recent work has applied machine learning to predict the mitochondrial toxicity of chemicals [10]. Various techniques, including SVM [11], Bayesian methods [12,13], and random forest models [8], have been applied to classification problems in mitochondrial toxicity prediction. Machine learning has also been used to explore structural alerts for predicting and optimizing lead compounds [14]. Machine learning has thus proven to be a valuable approach in drug development, enabling the understanding and prediction of mitochondrial toxicity. However, a key challenge is the difficulty of interpreting the underlying decision-making processes. To address this, recent efforts have focused on developing interpretable machine learning frameworks that provide insight into how models arrive at their predictions, enhancing trust and applicability in drug safety and development. For instance, Jaganathan et al. [15] introduced XML-CIMT, an explainable machine learning model that integrates feature selection and SHapley Additive exPlanations (SHAP) for model interpretation. Li et al. [16] provide a comprehensive survey of interpretable deep learning methods, categorizing interpretation algorithms that aim to reveal how deep models make decisions and to assess their trustworthiness.

Despite these advances, a critical gap remains in evaluating whether models trained on a single assay type generalize to structurally novel compounds or to mechanistically related but distinct endpoints. To address these limitations, this study evaluated machine learning and deep learning models for mitochondrial toxicity prediction with an explicit focus on generalizability. We curated a training dataset from MMP assay data across Tox21, PubChem, MitoTox, and literature sources [2,5,6,11,17,18,19,20,21,22,23], and assembled an independent external evaluation set from non-MMP mitochondrial stress assays [5,6,11,17] to probe cross-assay robustness under endpoint shift. Six molecular representations were compared [24,25,26,27,28] across both random and Bemis–Murcko scaffold-based splits [29] using six classical classifiers and two deep learning architectures. Class imbalance was addressed using SMOTE [30] and SMILES-based and stereoisomer augmentation [31,32], applied selectively to the minority class. For interpretability, SHAP attribution [33] was performed on the best-performing CatBoost [34] and Random Forest models using Mordred descriptors, and the top-ranked features were characterised through distributional analysis, substructure enrichment, and stratified odds ratio analysis to identify context-dependent physicochemical drivers of mitochondrial toxicity.

2. Materials and Methods

The study of mitochondrial toxicity (Figure 1) begins with data curation and standardization, including removal of duplicates and harmonization of toxicity labels across assays. Molecular representations are then generated for each compound using both structural fingerprints and physicochemical descriptors. Prior to model training, exploratory structure–activity analysis is performed to characterize the dataset and identify potential associations between chemical features and toxicity.

Multiple machine learning models are trained using different feature representations, including molecular fingerprints and Mordred descriptors, and evaluated under both random and scaffold-based splitting strategies. Model performance is assessed on internal and external test sets to examine generalization across assay contexts. Data augmentation strategies are additionally applied for deep learning models to evaluate their effect on model optimization and robustness.

For interpretability, SHAP analysis is conducted on the best-performing classical models, specifically CatBoost and Random Forest, using descriptor-based representations. Global feature importance is derived from aggregated SHAP values, and key descriptors are further analyzed through statistical and distributional approaches. This combined framework enables identification of the physicochemical properties and context-dependent patterns underlying mitochondrial toxicity.

Figure 1. Schematic overview of the study pipeline. MMP data were split into training and internal test sets using random and scaffold strategies, and an independent non-MMP dataset was used for external evaluation. Molecular representations were generated using fingerprints, Mordred descriptors, and Mol2Vec embeddings, with optional data augmentation applied. Multiple machine learning and deep learning models were trained and evaluated across datasets. SHAP analysis was performed on the best-performing classical models to identify important descriptors, followed by statistical distributional analysis, motif enrichment, and stratified odds ratio analysis to support mechanistic interpretation of mitochondrial toxicity.

2.1. Data Collection

Mitochondrial toxicity annotations were curated from multiple public resources, including Tox21 [18], PubChem BioAssay [19], MitoTox [2], and additional literature reports [14,20,21,22], with a primary focus on standardized mitochondrial membrane potential (MMP) assays. For PubChem, we queried BioAssay using the keyword “qHTS mitochondrial membrane potential”, which yielded the following assay identifiers: AID 720634, 720635, 720637, 651754, 651755, and 1347389. To further assess model generalizability beyond MMP-based labeling, we assembled an external evaluation set comprising mitochondrial toxicity-related endpoints that are not strictly MMP assays, including in vitro flux measurement [5], Glu–Gal assays [6], and other non-MMP mitochondrial stress readouts reported in prior studies [11,17]. The full data sources and class distributions are summarized in Table 1. We defined an internal evaluation set (Test set 1) drawn from the curated MMP-centered dataset and an external evaluation set (Test set 2) assembled from additional mitochondrial stress assays that include non-MMP endpoints. Because the external evaluation set includes mitochondrial stress assays beyond MMP, its labels are not strictly equivalent to the MMP-based definition used for model training. We therefore interpret performance on Test set 2 as a robustness assessment under endpoint shift and cross-source variation. External performance reflects both model generalization and differences in endpoint semantics across assays.

Records were standardized at the structure level using InChI identifiers. Because toxicity labels originated from heterogeneous sources and some compounds appeared in more than one dataset or assay, we integrated labels at the compound (structure) level using a simple aggregation rule. Within each assay, compounds labeled as toxic and non-toxic were encoded as 1 and 0, respectively. For compounds with multiple occurrences across assays/sources, we computed an average toxicity score by summing the binary labels across occurrences and dividing by the number of occurrences. Compounds with an average score ≥0.5 were assigned as toxic (label = 1); otherwise, they were assigned as non-toxic (label = 0).
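The aggregation rule described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function name `aggregate_labels` and the toy identifiers are hypothetical, and the records are assumed to be already keyed by a standardized structure identifier.

```python
from collections import defaultdict


def aggregate_labels(records, threshold=0.5):
    """Collapse repeated (structure_id, binary_label) records to one label per structure.

    A structure is labeled toxic (1) when the mean of its binary labels
    is greater than or equal to `threshold`, and non-toxic (0) otherwise.
    """
    by_structure = defaultdict(list)
    for structure_id, label in records:
        by_structure[structure_id].append(label)
    return {
        sid: int(sum(labels) / len(labels) >= threshold)
        for sid, labels in by_structure.items()
    }


# Hypothetical toy records: (structure identifier, label within one assay/source).
records = [
    ("InChI=A", 1), ("InChI=A", 0),                    # mean 0.50 -> toxic
    ("InChI=B", 0), ("InChI=B", 0), ("InChI=B", 1),    # mean 0.33 -> non-toxic
    ("InChI=C", 1),                                    # single record -> toxic
]
labels = aggregate_labels(records)
```

Note that under this rule a compound with an equal number of toxic and non-toxic records (mean exactly 0.5) is assigned to the toxic class.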

2.2. Molecular Featurization

To capture complementary structural and physicochemical properties of molecules, six distinct molecular representation strategies were employed.

Morgan fingerprints. Extended-connectivity circular fingerprints (ECFPs) were generated using RDKit with a radius of 2 and a bit length of 2048 [26]. This representation encodes local atomic environments and is widely used for structure–activity relationship modeling.

MACCS keys. The 167-bit MACCS structural key fingerprint was computed using RDKit [27]. This predefined key set captures the presence or absence of common functional groups and substructures, providing a complementary, interpretable representation.

Combined fingerprint. To integrate both data-driven substructure patterns and predefined chemical motifs, Morgan (2048-bit) and MACCS (167-bit) fingerprints were concatenated into a single 2215-dimensional feature vector.

Mordred descriptors. A comprehensive set of 2D molecular descriptors, including constitutional, topological, and autocorrelation features, was computed using the Mordred library [24], excluding all 3D-dependent descriptors to ensure consistency across molecules without conformational information. Descriptor columns yielding non-numeric values were removed. The remaining features were standardized using a z-score transformation.
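The descriptor clean-up and z-score step can be illustrated as follows. This is a sketch under stated assumptions, not the authors' pipeline: the helper name `drop_non_numeric_and_zscore` is hypothetical, the population standard deviation is used, and constant columns are mapped to zero (the source does not say how such columns were handled).

```python
import math


def drop_non_numeric_and_zscore(rows, names):
    """Drop descriptor columns containing any non-numeric or non-finite value,
    then z-score each remaining column (population standard deviation)."""

    def is_valid(v):
        return (isinstance(v, (int, float)) and not isinstance(v, bool)
                and math.isfinite(v))

    keep = [j for j in range(len(names)) if all(is_valid(r[j]) for r in rows)]
    columns = []
    for j in keep:
        col = [float(r[j]) for r in rows]
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
        # Assumption: constant columns are mapped to 0 rather than dropped.
        columns.append([(v - mean) / std if std > 0 else 0.0 for v in col])
    kept_names = [names[j] for j in keep]
    scaled_rows = [list(vals) for vals in zip(*columns)] if columns else [[] for _ in rows]
    return kept_names, scaled_rows


# Toy descriptor table: "desc_b" contains a non-numeric entry and is removed.
names = ["desc_a", "desc_b", "desc_c"]
rows = [[1.0, "error", 2.0],
        [3.0, 5.0, 2.0]]
kept_names, scaled_rows = drop_non_numeric_and_zscore(rows, names)
```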

PubChem fingerprints. The 881-bit PubChem fingerprint was computed using the DeepChem PubChemFingerprint featurizer [28], which retrieves 881-bit substructure key vectors via the PubChem REST API. To improve computational efficiency and ensure reproducibility, fingerprint vectors were cached using canonical SMILES as unique identifiers, thereby avoiding redundant API calls across experiments.

Mol2vec embeddings. Continuous molecular embeddings were generated using the Mol2vec approach [25], in which molecules are represented as the sum of vector embeddings of their constituent substructures derived from a pretrained Word2Vec model. Substructures not present in the pretrained vocabulary were assigned zero vectors. In addition to aggregated embeddings, sequences of substructure tokens were retained for input into a BiLSTM-based architecture, enabling the model to capture sequential dependencies between substructures.

2.3. Data Splitting

To robustly evaluate model performance and minimize information leakage, two complementary data splitting strategies were employed.

Random split. A stratified random split was applied to the combined Training and Test 1 datasets to generate training/validation and test subsets, with a test fraction of approximately 12% (as shown in Table 1). Stratification was performed with respect to the target variable to preserve class distributions across splits and mitigate bias arising from class imbalance. This strategy assumes that training and test samples are independently and identically distributed (i.i.d.), and serves as a baseline for evaluating model performance under standard benchmarking conditions.

Scaffold split. To more rigorously assess generalization to structurally novel compounds, a Bemis–Murcko scaffold-based splitting strategy was employed [29]. Murcko scaffolds were computed using RDKit. Molecules sharing the same scaffold were grouped together to ensure that structurally related compounds were not distributed across training and test sets, thereby preventing scaffold-level information leakage. Scaffold groups were sorted in descending order of size and assigned to the training set in a greedy manner until the target training proportion (88%) was reached. The remaining scaffold groups were assigned to the test set. This deterministic procedure guarantees complete scaffold exclusivity between training and test sets. Compared to random splitting, scaffold-based splitting provides a more stringent and realistic evaluation of model performance in prospective applications.
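The greedy assignment described above can be sketched as follows. The code takes precomputed scaffold strings as input (in practice these could come from RDKit's Murcko scaffold utilities) and is only one plausible reading of the procedure: the function name `scaffold_split`, the alphabetical tie-breaking rule, and the stopping behaviour (a group is added while the training set is still below the target size) are assumptions.

```python
from collections import defaultdict


def scaffold_split(scaffolds, train_fraction=0.88):
    """Greedy scaffold split.

    scaffolds: list of scaffold strings, index-aligned with the molecules.
    Returns (train_indices, test_indices) with no scaffold shared between them.
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first; ties broken by scaffold string for determinism.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))

    target = train_fraction * len(scaffolds)
    train, test = [], []
    for _, members in ordered:
        if len(train) < target:
            train.extend(members)   # whole group goes to training
        else:
            test.extend(members)    # remaining groups go to the test set
    return train, test


# Toy example with three scaffolds of sizes 5, 3, and 2.
scaffolds = ["c1ccccc1"] * 5 + ["C1CCCCC1"] * 3 + ["c1ccncc1"] * 2
train_idx, test_idx = scaffold_split(scaffolds, train_fraction=0.8)
train_scaffolds = {scaffolds[i] for i in train_idx}
test_scaffolds = {scaffolds[i] for i in test_idx}
```

Because whole groups are assigned at once, the realized training fraction only approximates the target, and no scaffold appears on both sides of the split.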

2.4. SMILES Augmentation

To address class imbalance and improve model robustness to representation variability, two complementary data augmentation strategies were applied to the training set. For random SMILES augmentation, all training molecules (both toxic and non-toxic) were expanded into multiple randomized SMILES representations to increase the diversity of input representations. For stereoisomer augmentation, expansion was applied selectively to toxic molecules only, as described below, so that the minority class was enlarged without artificially inflating the majority (non-toxic) class, thereby reducing bias during model training.

Random SMILES enumeration. For each training molecule, up to 50 unique randomised SMILES representations were generated using RDKit (Chem.MolToSmiles(mol, doRandom=True)). Randomization was performed with molecule-specific seeds [31] derived from a global random seed (seed = 42) to ensure full reproducibility across runs. This augmentation strategy preserves molecular identity while exposing the model to multiple equivalent string representations, thereby improving robustness to SMILES-dependent encoding artefacts. All augmented representations were included in model training regardless of toxicity class.

Stereoisomer enumeration. To account for stereochemical variability, all possible stereoisomers of each toxic molecule were enumerated using RDKit’s EnumerateStereoisomers utility [32]. This approach introduces chemically relevant structural diversity, particularly for molecules where stereochemistry may influence biological activity. Non-toxic molecules were retained as their canonical SMILES representations to avoid disproportionate expansion of the majority class.

2.5. Model Architectures

To ensure robust model optimization and fair performance comparison, a unified training and evaluation framework was applied across all models, with specific adaptations for deep learning and classical approaches.

Fully connected neural network (FC model). A seven-layer feed-forward neural network was employed to map fixed-length molecular fingerprints to a binary toxicity probability. The network architecture followed a progressive dimensionality reduction scheme (1024 → 512 → 256 → 128 → 64 → 32 → 1). Each hidden layer consisted of a linear transformation followed by batch normalization, ReLU activation, and dropout (rate = 0.5). The final output layer applied a sigmoid activation function. Models were trained with a mini-batch size of 128 using the Adam optimizer.

Sequence-based models (BiLSTM model). The Mol2vec-based model used a BiLSTM architecture in which the embedding layer was initialized with pretrained Mol2vec weights [25] to incorporate chemically informed substructure representations. Molecular sequences were constructed from Morgan substructure identifiers (radius = 1), consistent with the pretrained model. The architecture consisted of an embedding layer (dimension = 64, dropout = 0.1), followed by a two-layer bidirectional LSTM (hidden size = 256, inter-layer dropout = 0.3). The final forward and backward hidden states were concatenated to form a 512-dimensional representation, which was passed to a classifier head.

Classical machine learning models. In addition to deep learning approaches, six classical classifiers were benchmarked: Random Forest (RF), k-Nearest Neighbours (KNN), Radius Neighbour Classifier, Support Vector Machine (SVM), Multi-layer Perceptron (MLP), and CatBoost [34]. These models were trained on the same featurized datasets as the deep learning models to enable fair comparison. Hyperparameter optimization was performed using RandomizedSearchCV with 5-fold cross-validation over 10 iterations. Model selection was based on a composite score, S = (MCC + F1)/2, which balances two complementary measures. The Matthews correlation coefficient (MCC) measures the correlation between predicted and true binary labels across all four confusion matrix cells and is robust to class imbalance. The F1 score is the harmonic mean of precision and recall and captures the balance between false positives and false negatives.
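As a worked illustration of the composite selection score S = (MCC + F1)/2, the sketch below computes both terms from the confusion matrix of a toy prediction vector. The function names are hypothetical, and F1 is computed for the positive (toxic) class, which is an assumption since the text does not state the averaging mode used during model selection.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def matthews_corrcoef(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def f1_positive(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def composite_score(y_true, y_pred):
    """S = (MCC + F1) / 2, using positive-class F1 (assumption)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (matthews_corrcoef(tp, tn, fp, fn) + f1_positive(tp, fp, fn)) / 2


# Toy labels: tp = 2, fn = 1, fp = 1, tn = 4.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
score = composite_score(y_true, y_pred)  # MCC = 7/15, F1 = 2/3, S = 17/30
```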

2.6. Model Training, Validation, and Evaluation

Deep learning models were trained using stratified 5-fold cross-validation to preserve class distributions across folds and reduce variance in performance estimates. Optimisation was performed using the Adam optimizer (learning rate = 0.01, L2 weight decay = 1 × 10⁻⁵), with an additional L1 regularization term (λ = 5 × 10⁻⁴) applied to all model parameters. A dynamic learning rate schedule, ReduceLROnPlateau, was employed, reducing the learning rate by a factor of 0.5 after three consecutive epochs without improvement in validation loss. Early stopping was applied if validation loss did not decrease for 16 consecutive epochs, with a maximum training budget of 10,000 epochs. All experiments were conducted with a fixed random seed (seed = 42) to ensure reproducibility.
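The interaction between the learning rate schedule and early stopping can be made concrete with a small simulation. This is a simplified re-implementation of the logic as described in the text, not a call to any library scheduler; the function name `simulate_schedule` is hypothetical, and the exact epoch at which a library scheduler triggers depends on how its patience parameter is counted.

```python
def simulate_schedule(val_losses, lr=0.01, factor=0.5,
                      lr_patience=3, stop_patience=16):
    """Replay a sequence of validation losses.

    The learning rate is multiplied by `factor` after `lr_patience`
    consecutive non-improving epochs, and training stops after
    `stop_patience` consecutive non-improving epochs.
    Returns (last_epoch_index, final_learning_rate).
    """
    best = float("inf")
    since_lr_step = 0   # non-improving epochs since last improvement or LR cut
    since_best = 0      # non-improving epochs since last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            since_lr_step = 0
            since_best = 0
        else:
            since_lr_step += 1
            since_best += 1
            if since_lr_step >= lr_patience:
                lr *= factor
                since_lr_step = 0
        if since_best >= stop_patience:
            return epoch, lr
    return len(val_losses) - 1, lr


# Toy loss curve: improves for two epochs, then plateaus.
losses = [1.0, 0.9] + [0.95] * 20
stopped_epoch, final_lr = simulate_schedule(losses)
```

In this toy run the learning rate is halved five times during the plateau before early stopping ends training, which shows why the 10,000-epoch budget acts only as an upper bound.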

Class imbalance was addressed through SMOTE augmentation and data-level rebalancing. The Synthetic Minority Oversampling Technique (SMOTE) [30] was optionally applied to augment the minority class by generating synthetic samples through interpolation between minority-class instances in feature space. Model performance was evaluated using macro-averaged F1 score, Matthews correlation coefficient (MCC), and balanced accuracy. Binary classification decisions were obtained using a default probability threshold of 0.5.
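The interpolation idea behind SMOTE can be sketched without any external library. The code below is a simplified illustration, not the reference SMOTE implementation: the function name `smote_like`, the neighbour count `k=2`, and the toy feature vectors are assumptions chosen for brevity.

```python
import random


def smote_like(minority, n_new, k=2, seed=42):
    """Generate `n_new` synthetic minority samples by linear interpolation
    between a randomly chosen minority sample and one of its k nearest
    minority neighbours (squared Euclidean distance)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        others = [m for j, m in enumerate(minority) if j != i]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))
        neighbour = rng.choice(others[:k])
        u = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + u * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


# Toy minority-class feature vectors (two features each).
minority = [[1.0, 1.0], [2.0, 1.0], [1.0, 3.0], [4.0, 4.0]]
new_points = smote_like(minority, n_new=4)
```

Each synthetic point lies on the line segment between two existing minority samples, so the new samples stay within the region of feature space already occupied by the minority class.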

2.7. SHAP Attribution and Distribution Analysis

To quantify and interpret the contribution of molecular descriptors to model predictions, SHAP (SHapley Additive exPlanations) values were computed [33] for the best-performing classical machine learning models, specifically CatBoost and Random Forest. SHAP analysis was performed on both the internal and external test sets to ensure that feature importance reflected generalizable patterns rather than dataset-specific biases. For each descriptor, mean absolute SHAP values were aggregated across all test instances to obtain a global importance ranking. The top 20 Mordred descriptors were selected for downstream analyses.
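The global ranking step can be illustrated on a toy attribution matrix. The sketch assumes that per-instance SHAP values have already been computed and stored as rows of a matrix; the function name `global_importance`, the placeholder descriptor names, and the numeric values are invented for illustration only.

```python
def global_importance(shap_matrix, names, top_k=20):
    """Rank features by mean absolute SHAP value across instances.

    shap_matrix: list of rows, one per test instance, one value per feature.
    Returns the top_k (name, mean_abs_shap, proportion) tuples, where
    proportion is the feature's share of the total mean absolute SHAP.
    """
    n = len(shap_matrix)
    mean_abs = [sum(abs(row[j]) for row in shap_matrix) / n
                for j in range(len(names))]
    total = sum(mean_abs)
    ranked = sorted(zip(names, mean_abs), key=lambda kv: kv[1], reverse=True)
    return [(name, value, value / total if total else 0.0)
            for name, value in ranked[:top_k]]


# Invented SHAP values for two instances and three placeholder descriptors.
names = ["d1", "d2", "d3"]
shap_matrix = [[0.4, -0.1, 0.0],
               [-0.6, 0.2, -0.1]]
top = global_importance(shap_matrix, names, top_k=2)
```

Taking absolute values before averaging matters: the first descriptor has contributions of opposite sign in the two instances, which would largely cancel under a signed mean.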

To further characterize these SHAP-derived features, statistical distributional analyses were performed separately on Test 1 and Test 2. Mordred descriptors were recomputed directly from raw InChI strings using RDKit and the Mordred calculator to ensure independence from prior preprocessing. For each descriptor and dataset, per-class summary statistics were computed, and differences between toxic and non-toxic compounds were assessed using the two-sided Mann–Whitney U test, with effect sizes quantified using the rank-biserial correlation. p-values were adjusted using the Benjamini–Hochberg procedure.
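Two of the quantities used above, the rank-biserial effect size and the Benjamini–Hochberg adjustment, are simple enough to sketch directly. The code is illustrative only: the function names and toy inputs are hypothetical, the rank-biserial correlation is computed with the pairwise (simple difference) formula, and the Mann–Whitney p-values themselves are assumed to come from a statistics library.

```python
def rank_biserial(x, y):
    """Rank-biserial correlation between two samples.

    Computed as (pairs with x > y minus pairs with x < y) / (all pairs);
    ties contribute zero. Positive values mean x tends to be larger.
    """
    greater = sum(1 for a in x for b in y if a > b)
    less = sum(1 for a in x for b in y if a < b)
    return (greater - less) / (len(x) * len(y))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy descriptor values for toxic and non-toxic compounds.
effect = rank_biserial([3, 4, 5], [1, 2, 4])          # (7 - 1) / 9

# Toy raw p-values for four descriptors.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```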

2.8. Substructure Enrichment Analysis

To identify chemically interpretable structural motifs associated with toxicity, a curated panel of 16 substructure classes was defined using a combination of SMARTS pattern matching and programmatic structural rules implemented in RDKit. Selected motifs include quinones [35] and nitroaromatics (associated with redox cycling and ROS generation), phenols and nitrophenols (protonophoric uncouplers that disrupt membrane potential) [36,37], Michael acceptors (electrophilic agents capable of covalent protein modification) [38,39], polycyclic aromatics and long aliphatic chains (lipophilic membrane-accumulating structures) [40], aromatic amines and heteroaromatics [41], as well as broader structural categories and functional groups included as reference features. Enrichment was quantified as the ratio of motif prevalence in toxic versus non-toxic molecules, and statistical significance was assessed using two-sided Fisher’s exact tests with Benjamini–Hochberg FDR correction.
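The enrichment ratio and the two-sided Fisher's exact test can be written out for a single motif as follows. This is a minimal standard-library sketch with hypothetical function names and invented counts; the two-sided p-value sums the probabilities of all tables that are no more likely than the observed one, which is one common definition.

```python
from math import comb


def enrichment_ratio(a, b, c, d):
    """Motif prevalence in toxic compounds divided by prevalence in non-toxic.

    a: toxic with motif, b: toxic without, c: non-toxic with, d: non-toxic without.
    Assumes at least one non-toxic compound carries the motif (c > 0).
    """
    return (a / (a + b)) / (c / (c + d))


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1 = a + b          # toxic compounds
    col1 = a + c          # compounds with the motif
    lo = max(0, row1 - (n - col1))
    hi = min(row1, col1)

    def weight(x):
        # Unnormalized hypergeometric probability of x toxic compounds with motif.
        return comb(col1, x) * comb(n - col1, row1 - x)

    observed = weight(a)
    extreme = sum(weight(x) for x in range(lo, hi + 1) if weight(x) <= observed)
    return extreme / comb(n, row1)


# Invented counts: 8 of 10 toxic and 1 of 6 non-toxic compounds carry the motif.
ratio = enrichment_ratio(8, 2, 1, 5)
p_value = fisher_exact_two_sided(8, 2, 1, 5)
```

The resulting p-values, one per motif, would then be passed to a Benjamini–Hochberg adjustment.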

2.9. Stratified Odds Ratio Analysis

To investigate whether key molecular descriptors modulate toxicity risk within specific structural contexts, a stratified odds ratio (OR) analysis was performed by jointly considering SHAP-identified descriptors and motif-defined subpopulations. For each combination of the top 20 SHAP-selected descriptors and each motif-defined stratum, a 2 × 2 contingency table was constructed. Binary descriptors were defined using fixed thresholds, while continuous Mordred descriptors were binarized at their dataset-wide median. Odds ratios and 95% confidence intervals were estimated using the Woolf log-odds method, with Benjamini–Hochberg correction applied across all descriptor–motif combinations.
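For one descriptor within one motif-defined stratum, the analysis reduces to a median split followed by a Woolf confidence interval. The sketch below illustrates this under stated assumptions: the function names are hypothetical, values strictly above the dataset-wide median count as "high", tables with a zero cell are rejected rather than corrected, and all numbers are invented.

```python
import math
from statistics import median


def median_split_table(values, toxic, cut):
    """Build (a, b, c, d) for one stratum.

    a: high & toxic, b: high & non-toxic, c: low & toxic, d: low & non-toxic,
    where "high" means value > cut (assumption: strict inequality).
    """
    a = b = c = d = 0
    for v, t in zip(values, toxic):
        high = v > cut
        if high and t:
            a += 1
        elif high:
            b += 1
        elif t:
            c += 1
        else:
            d += 1
    return a, b, c, d


def woolf_odds_ratio(a, b, c, d, z=1.96):
    """Odds ratio with a Woolf (log-odds) confidence interval."""
    if min(a, b, c, d) == 0:
        raise ValueError("Woolf interval is undefined when a cell is zero")
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    return odds_ratio, math.exp(log_or - z * se), math.exp(log_or + z * se)


# Dataset-wide median of a toy descriptor, then one stratum of six molecules.
dataset_values = [0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 6.5]
cut = median(dataset_values)                      # 3.5
stratum_values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
stratum_toxic = [0, 0, 1, 0, 1, 1]
table = median_split_table(stratum_values, stratum_toxic, cut)
odds_ratio, ci_low, ci_high = woolf_odds_ratio(*table)
```

With so few molecules the interval is very wide and spans 1, which illustrates why small strata rarely yield significant odds ratios after multiple-testing correction.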

2.10. Computational Environment

All computations were performed on the Taiwan Computing Cloud (TWCC) high-performance computing (HPC) platform, Nano5. Model training was conducted on GPU-enabled nodes equipped with NVIDIA H100 GPUs and Intel Xeon processors (Platinum 8480CL). The software environment was based on Python (version 3.11.14), with key libraries including PyTorch (version 2.10.0), scikit-learn (version 1.8.0), RDKit (version 2025.03.6), Mordred (version 1.2.0), and SHAP (version 0.49.1). All experiments were conducted with a fixed random seed (seed = 42) to ensure reproducibility.

3. Results

3.1. Dataset Analysis and Tanimoto Analysis

To obtain an overview of the physicochemical properties of the datasets used in this study, the distributions of LogP and molecular weight were examined graphically (Figure 2A). LogP, the logarithm of the octanol–water partition coefficient, quantifies a compound's hydrophobicity. Molecular weight reflects the size and complexity of a compound and is relevant to both its physical properties and its biological activity.

Molecular weights are distributed mainly between 0 and 2000, and LogP values roughly between −20 and 20. Toxic (positive) compounds in the training set mostly fall within a LogP range of −5 to 10, while those in the test set span a narrower range of 0 to 10. Among toxicants, compounds with higher molecular weights often exhibit increased structural complexity and frequently harbor multiple functional groups capable of specific interactions with biomolecules. In addition, hydrophobicity (LogP) is closely related to the lipid solubility and cell permeability of the compounds, influencing their ability to accumulate in cellular membranes and mitochondria and thereby trigger a toxic response.

Tanimoto analysis was used to assess the structural similarity between positive and negative compounds in the training set. Among the molecular fingerprints tested, the PubChem fingerprint yielded the highest Tanimoto similarity scores, followed by MACCS keys, indicating that similarity scores depend strongly on the algorithm used to generate the fingerprints (Figure 2B). Both PubChem and MACCS keys fingerprints encode the presence of predefined functional groups and substructures as binary vectors, so high similarity scores are expected for these representations.
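For binary fingerprints, the Tanimoto similarity is the number of shared on-bits divided by the number of on-bits present in either fingerprint. The sketch below represents each fingerprint as a set of on-bit indices; the function name and the toy bit sets are hypothetical, and returning 0.0 for two empty fingerprints is an assumption.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # assumption: two empty fingerprints are treated as dissimilar
    return len(bits_a & bits_b) / union


# Toy fingerprints: two shared bits out of five distinct on-bits.
fp_a = {1, 2, 3, 4}
fp_b = {3, 4, 5}
similarity = tanimoto(fp_a, fp_b)
```

Key-based fingerprints with a small number of predefined bits tend to share many on-bits between unrelated molecules, which is consistent with the higher similarity scores reported for PubChem and MACCS keys.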

To characterize the chemical diversity of the datasets and evaluate structural overlap between evaluation sets, dimensionality reduction analysis was performed on molecular fingerprint representations. Principal component analysis (PCA) and t-distributed stochastic neighbour embedding (t-SNE) were applied to Morgan (2048-bit), MACCS keys (167-bit), and PubChem (881-bit) fingerprint matrices for all compounds in the internal MMP test set (Test 1) and the external non-MMP test set (Test 2) (Figure 2C,D). Both PCA and t-SNE showed that Test 1 and Test 2 compounds were broadly intermingled across the projected chemical space, with no clear separation into non-overlapping clusters, consistent with the two evaluation sets sharing similar overall physicochemical and structural characteristics.

3.2. Classical Machine Learning Benchmarks

Six classical machine learning models were benchmarked across multiple feature representations, data augmentation strategies, and splitting schemes. Performance was evaluated on Test 1 and Test 2 to assess both in-distribution performance and cross-assay generalization.

Under the random split, several classical models achieved strong performance on Test 1, with F1 scores exceeding 0.80 and MCC values around 0.70, but performance dropped substantially on Test 2 across all models (Figure 3A). The best-performing configuration under random split was KNN with SMOTE and combined features, reaching F1 = 0.8436 and MCC = 0.7081 on Test 1, but only F1 = 0.2984 and MCC = 0.2375 on Test 2 (Table 2). Similar behavior was observed for Random Forest and CatBoost, where high Test 1 performance (F1 ≈ 0.83) did not translate to strong Test 2 performance.

Feature representation had a clear impact, with Mordred descriptors consistently yielding the highest Test 2 performance, including the best overall result achieved by Random Forest (F1 = 0.3483, MCC = 0.2990) under random split. This trend remained under scaffold splitting, where Test 1 performance decreased substantially (F1 ≈ 0.62–0.64), while Test 2 performance remained relatively stable (F1 ≈ 0.32–0.34) (Figure 3D; Table 2). The effect of SMOTE was modest, providing slight improvements in Test 1 but no consistent gain on Test 2 (Figure 3C). Across all configurations, Test 2 performance remained constrained within a narrow range regardless of model or feature choice (Figure 3E), demonstrating a systematic generalization gap between assays, with Mordred descriptors providing the most transferable signals among the evaluated features.

3.3. Deep Learning Performance Across Augmentation Strategies

Deep learning models were evaluated across feature representations, augmentation strategies, and splitting schemes to assess their ability to improve predictive performance and generalization.

Under random split, the fully connected model (FC model) achieved strong performance on Test 1, with F1 scores reaching approximately 0.81 depending on feature type, but substantially lower performance on Test 2 across all configurations (Figure 4A; Table 3). SMILES-based augmentation consistently improved Test 1 performance, particularly for fingerprint-based features, whereas stereoisomer augmentation showed smaller or inconsistent gains. However, neither strategy improved Test 2 performance, which remained constrained within a narrow range.

The Mol2vec-based BiLSTM model showed moderate performance on Test 1 (F1 score = 0.56–0.65) and similarly low performance on Test 2 (0.25–0.28), with only minor differences between augmentation strategies (Figure 4B). Compared to the FC model, the BiLSTM did not provide performance gains and showed similar limitations in cross-assay generalisation. Under scaffold splitting, Test 1 performance decreased substantially (0.45–0.60), and the effect of augmentation was reduced, whereas Test 2 performance remained similar to that observed under random split (Figure 4D). The best Test 2 performance achieved by deep learning models remained below that of the best classical models, indicating that increased model complexity and data augmentation do not overcome the generalisation gap.

3.4. Internal Versus External Generalisation Reveals a Systematic Performance Gap

Across all models, feature representations, and training strategies, a consistent performance gap was observed between Test 1 and Test 2 (Figure 5). The distribution of F1 differences shows a mean gap of approximately 0.39, with 55% of configurations exceeding 0.30 and 42% exceeding 0.40 (Figure 5A). In addition, more than half of the configurations produced Test 2 F1 scores below 0.30, indicating limited predictive performance on the external assay.

This gap was consistent across model architectures. All models, including SVM, KNN, Random Forest, CatBoost, and neural networks, showed similar mean differences ranging from approximately 0.33 to 0.39 (Figure 5B), indicating that the generalization gap is not driven by model choice. Data splitting strategy influenced the magnitude of the gap: random split produced higher Test 1 performance but similar Test 2 performance compared to scaffold split, resulting in a larger gap (Figure 5C). Data augmentation had limited impact on cross-assay generalization—SMOTE and SMILES-based augmentation improved Test 1 performance but did not improve Test 2 performance (Figure 5D). Feature representation had little effect on Test 2 performance, which remained constrained across all feature types, although Mordred descriptors showed slightly higher values on average (Figure 5E).

3.5. Descriptor Importance and Distributional Analysis

SHAP analysis of the best-performing classical models, CatBoost and Random Forest, revealed a consistent set of important descriptors across both datasets (Figure 6). Mean absolute SHAP values were aggregated to obtain a global ranking, from which the top 20 Mordred descriptors were selected (Table 4). Several descriptors, including C3SP3, AATS2s, and GATS3s, were consistently ranked among the most important features across both models, indicating a robust consensus signal. Most top-ranked descriptors belong to the Broto–Moreau and Geary autocorrelation families, indicating that model predictions are driven primarily by global physicochemical organization rather than isolated structural motifs.

Figure 6. SHAP feature importance aggregated across model type (Random Forest and CatBoost), data splitting strategy (random and scaffold), and test set (Test 1: MMP assay; Test 2: non-MMP assay). Bars represent proportional SHAP importance, defined as the fraction of total absolute SHAP values within each condition. Color intensity indicates the consistency of each descriptor across conditions, ranging from presence in one condition to all four conditions within each assay.
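Proportional SHAP importance, as defined for Figure 6, is each descriptor's share of the total absolute SHAP value within one condition. The sketch below illustrates that normalization on a small invented matrix; it assumes the per-compound SHAP values have already been computed, and the function name `proportional_importance`, the descriptor labels, and the numbers are all placeholders rather than values from this study.

```python
def proportional_importance(shap_rows, feature_names):
    """Return each feature's fraction of the total absolute SHAP value.

    shap_rows: one list of per-feature SHAP values per compound.
    """
    totals = [0.0] * len(feature_names)
    for row in shap_rows:
        for j, value in enumerate(row):
            totals[j] += abs(value)  # sign is discarded; only magnitude counts
    grand_total = sum(totals)
    return {name: total / grand_total for name, total in zip(feature_names, totals)}


# Invented toy values for three placeholder descriptors and three compounds.
toy_names = ["desc_a", "desc_b", "desc_c"]
toy_shap = [
    [0.2, -0.1, 0.0],
    [-0.4, 0.1, 0.1],
    [0.2, 0.0, -0.1],
]

shares = proportional_importance(toy_shap, toy_names)
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {share:.3f}")
```

Normalizing within each condition makes the bars comparable across models and splits even when the raw SHAP scales differ, which is presumably why a fraction rather than a raw mean is plotted.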

Despite its dominant SHAP importance, C3SP3 showed no significant difference between toxic and non-toxic compounds in either dataset, with highly overlapping distributions (Figure 7). This indicates that C3SP3 does not act as a marginal discriminator but instead contributes through interactions with other descriptors. In contrast, ATS5i showed higher values in toxic compounds in both datasets and was statistically significant in both cases, while AATS2s and GATS3s were lower in toxic compounds. The descriptor nAcid showed consistently higher values in non-toxic compounds in both datasets, indicating that acidic functionality is negatively associated with toxicity at the marginal level but has limited standalone predictive power.

Figure 7. Violin plots showing the distribution of five representative descriptors (C3SP3, nAcid, ATS5i, AATS2s, and GATS3s) stratified by toxicity class for both Test 1 (MMP assay) and Test 2 (non-MMP assay). Individual data points are overlaid to illustrate distribution density. C3SP3 shows substantial overlap between toxic and non-toxic compounds in both assays, indicating a lack of univariate separation despite high SHAP importance. In contrast, ATS5i exhibits higher values in toxic compounds, while AATS2s and GATS3s show lower values in toxic compounds across both assays. The descriptor nAcid is enriched in non-toxic compounds. These patterns distinguish descriptors with marginal discriminative power from those contributing through interaction effects.

3.6. Structural Motif Analysis and Context-Dependent Effects

Substructure enrichment analysis identified several motifs associated with toxicity in the MMP assay, but these associations were not consistently preserved in the non-MMP assay (Figure 8). In the MMP assay, multiple motifs including quinones, nitroaromatics, polycyclic aromatics, aromatic amines, and phenols were enriched in toxic compounds, whereas the non-MMP assay showed fewer significant motifs and weaker enrichment patterns. These differences indicate that motif-level associations are strongly assay-dependent.
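The Figure 8 caption reports enrichment ratios and significance after Benjamini–Hochberg (BH) correction. The sketch below shows one way these two quantities could be computed. It rests on assumptions: the enrichment ratio is taken to be motif prevalence in toxic compounds divided by prevalence in non-toxic compounds, the raw p-values are treated as given because the underlying test is not stated here, and all counts and p-values are invented for illustration.

```python
def enrichment_ratio(n_motif_toxic, n_toxic, n_motif_nontoxic, n_nontoxic):
    """Motif prevalence among toxic compounds relative to non-toxic compounds."""
    prev_toxic = n_motif_toxic / n_toxic
    prev_nontoxic = n_motif_nontoxic / n_nontoxic
    if prev_nontoxic == 0:
        return float("inf")  # motif never seen in the non-toxic class
    return prev_toxic / prev_nontoxic


def benjamini_hochberg(p_values):
    """Return BH-adjusted q-values in the same order as the input p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


# Invented example: four motifs with hypothetical raw p-values.
raw_p = [0.01, 0.04, 0.03, 0.20]
q = benjamini_hochberg(raw_p)
ratio = enrichment_ratio(30, 100, 10, 100)
print([round(v, 4) for v in q], round(ratio, 2))
```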

Stratified odds ratio analysis further revealed context-dependent effects for key descriptors. The descriptor nAcid showed a global negative association with toxicity, with an odds ratio of 0.44 (q = 0.005) in Test 1 and 0.77 (q = 0.798) in Test 2 (Figure 9), although stratified analysis showed that this relationship can vary across structural contexts. C3SP3 showed odds ratios close to 1 across most strata in both assays, with only a single significant association, confirming its role as an interaction-dependent descriptor (Figure 10). ATS5i displayed positive associations with toxicity in multiple contexts, particularly in the carboxylic acid stratum (OR = 3.20, q = 0.024 in Test 1 and OR = 3.96, q = 0.006 in Test 2) and the polycyclic aromatic stratum (OR = 3.12, q = 0.001 in Test 2) (Figure 11). AATS2s showed a consistent negative association with toxicity across multiple structural strata in the non-MMP assay, particularly within aromatic and polycyclic aromatic compounds, suggesting it captures a transferable physicochemical signal beyond the MMP training context (Figure 12). Together, these results demonstrate that mitochondrial toxicity cannot be explained by single structural motifs or individual descriptors, but instead arises from interactions between molecular features.
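For reference, an odds ratio of the kind reported above can be computed from a 2×2 table of descriptor level (high or low) against toxicity class within one structural stratum. The sketch below is illustrative and makes several assumptions not confirmed by the text: the descriptor is dichotomized beforehand, the confidence interval uses the Woolf (log) method, and a 0.5 continuity correction is applied when any cell is empty. The counts are invented.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and approximate 95% CI for a 2x2 table.

    a: high descriptor, toxic       b: high descriptor, non-toxic
    c: low descriptor, toxic        d: low descriptor, non-toxic
    """
    if min(a, b, c, d) == 0:
        # Haldane-Anscombe correction avoids division by zero (an assumption here).
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Invented counts for a single hypothetical stratum.
or_value, ci_low, ci_high = odds_ratio_ci(a=20, b=10, c=10, d=20)
print(f"OR = {or_value:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

Repeating this calculation per stratum and adjusting the resulting p-values for multiple testing would yield q-values comparable in form to those quoted in the text.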

4. Discussion

Mitochondrial toxicity is a multifactorial phenomenon that has been linked to diverse mechanisms, including disruption of the electron transport chain, redox cycling, and membrane depolarization. In this study, we evaluated machine learning models for mitochondrial toxicity prediction and combined model interpretation with statistical analysis to identify physicochemical determinants of toxicity.

A central finding is the pronounced generalization gap between the internal MMP test set and the external non-MMP assay (comprising flux and Glu/Gal data). All models achieved strong performance on the MMP-based test set but showed substantial degradation on the external assay, regardless of model type or feature representation. This finding has a mechanistic basis: MMP assays are most sensitive to membrane depolarization and protonophoric uncoupling, whereas Glu/Gal and flux assays detect the broader consequence of impaired oxidative phosphorylation under metabolic stress, with different sensitivity profiles for ETC complex inhibitors, fatty acid oxidation inhibitors, and ionophores [5,6]. Classical ETC complex inhibitors such as rotenone and antimycin A, for example, produce modest or no MMP signal under standard glucose conditions, yet are reliably detected as mitochondrial toxicants under galactose-sensitized conditions [42]. Conversely, uncouplers such as FCCP depolarize the membrane and are readily captured by MMP assays but may produce a qualitatively different cytotoxicity profile in the Glu/Gal assay. This mechanistic divergence suggests that a classifier trained on MMP labels may capture assay-associated physicochemical patterns enriched among membrane-depolarizing or uncoupling-like compounds. These patterns may not transfer equally well to compounds whose mitochondrial effects are primarily reflected by ETC inhibition, fatty acid oxidation impairment, or metabolic stress responses. Thus, an MMP-trained model should be interpreted as learning MMP-associated chemical profiles rather than a fully generalizable representation of mitochondrial toxicity. This observation is consistent with reported limitations of single-assay toxicity models under distribution shift [14,23], and underscores the broader point that no single in vitro assay comprehensively represents mitochondrial toxicity as a biological phenomenon. 
The reduced performance under scaffold splitting further indicates that part of the predictive signal arises from structural similarity rather than transferable features. In contrast to fingerprint-based representations, descriptor-based features provided more stable performance across datasets. Mordred descriptors outperformed structural fingerprints in cross-assay settings, indicating that global physicochemical properties are more informative for generalizable prediction. The cross-assay transferability of Mordred descriptors may reflect the physicochemical properties of mitochondrial toxicants. Unlike receptor-mediated endpoints where specific structural motifs govern activity through a defined lock-and-key mechanism, mitochondrial toxicity is thought to be influenced by global molecular properties, such as membrane partitioning, charge distribution, and electron transfer capacity. Fingerprint-based models, by contrast, may learn scaffold-specific associations that are coincidental to the training distribution and may therefore generalize less reliably across mechanistically distinct assay endpoints, particularly when the external set probes a different aspect of mitochondrial dysfunction, such as oxidative phosphorylation capacity under metabolic stress rather than membrane depolarization. This is consistent with previous studies showing that mitochondrial toxicity is associated with properties such as lipophilicity, molecular size, and electronic characteristics, which influence membrane permeability and subcellular accumulation [7,8,11].
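The principle behind scaffold-based splitting can be sketched as a grouped split in which all compounds sharing a scaffold are assigned to the same partition. The example below is a simplified illustration, not the procedure used in this study: it assumes scaffold identifiers (for example, Bemis–Murcko scaffold SMILES) have already been computed, and the function name `scaffold_group_split` and the greedy assignment rule are hypothetical choices.

```python
from collections import defaultdict


def scaffold_group_split(scaffolds, test_fraction=0.2):
    """Split compound indices so that no scaffold appears in both partitions."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    n_total = len(scaffolds)
    n_train_target = n_total - round(n_total * test_fraction)

    # Greedy assignment: large scaffold groups fill the training set first,
    # so rarer scaffolds tend to end up in the held-out set.
    train_idx, test_idx = [], []
    for members in sorted(groups.values(), key=len, reverse=True):
        if len(train_idx) + len(members) <= n_train_target:
            train_idx.extend(members)
        else:
            test_idx.extend(members)
    return train_idx, test_idx


# Placeholder scaffold labels standing in for precomputed scaffold SMILES.
example_scaffolds = ["A"] * 4 + ["B"] * 3 + ["C"] * 2 + ["D"]
train_idx, test_idx = scaffold_group_split(example_scaffolds, test_fraction=0.2)
print("train:", train_idx)
print("test:", test_idx)
```

Because whole scaffold groups are held out, the model cannot score well on the test partition simply by recognising close analogues of training compounds, which is consistent with the lower Test 1 scores observed under scaffold splitting.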

Interpretability analysis provides further insight into these patterns. SHAP attribution revealed that model predictions are dominated by autocorrelation descriptors, which encode the spatial distribution of atomic properties across the molecular graph, capturing global organization of physicochemical features rather than local functional groups. Notably, the most important descriptor, C3SP3, showed no significant difference between toxic and non-toxic compounds, indicating that its contribution arises from interactions with other features rather than independent predictive power. This distinction highlights a limitation of interpreting feature importance without considering feature interactions.

Descriptors with consistent marginal effects suggest specific physicochemical trends. ATS5i, a Broto–Moreau autocorrelation descriptor weighted by ionization potential, was higher in toxic compounds across both assays, suggesting a possible role for electronic properties in toxicity. This association may be broadly compatible with mechanisms involving redox cycling or interference with the electron transport chain, though the observed pattern reflects a statistical tendency across structurally diverse compounds rather than a direct mechanistic link. In contrast, descriptors such as AATS2s and GATS3s were lower in toxic compounds, indicating that changes in the spatial distribution of electronic or topological properties may also contribute to toxicity.

The behavior of nAcid highlights the importance of context. Acidic functionality was negatively associated with toxicity in both datasets, indicating that polar, ionizable compounds are less likely to be toxic on average. This is consistent with reduced membrane permeability limiting mitochondrial accumulation. However, stratified analysis showed that this relationship can vary across structural contexts, suggesting that acidity may contribute to toxicity when combined with sufficient lipophilicity, as is the case for lipophilic weak-acid uncouplers.

Motif-level analysis further demonstrates the context-dependence of structural alerts [43]. Several motifs, including quinones, nitroaromatics, phenols, and aromatic amines, were significantly enriched in toxic compounds in the MMP assay but showed weaker or absent enrichment in the non-MMP assay. This cross-assay inconsistency can be further illustrated by the example of the 1,4-dihydropyridine (1,4-DHP) scaffold. In an initial structural alert screen, 22 compounds bearing this substructure were identified as MMP-active, including clinically used calcium channel blockers such as amlodipine, felodipine, and nicardipine [44,45]. These compounds are known to partition into lipophilic membranes and have been associated with inhibition of mitochondrial electron transport chain complexes, providing a plausible mechanism for their MMP activity [44,45,46]. However, 1,4-DHP membership does not predict toxicity in the non-MMP external set. This pattern is consistent with previous analyses showing that the generalization of structural alerts across different toxicological endpoints is often limited [14,23]. These results support a model in which toxicity arises from multivariate physicochemical interactions rather than single defining structural features.

This study has several implications for computational mitochondrial toxicity assessment. First, cross-assay evaluation on an independent endpoint is essential; random-split benchmarking on a single assay substantially overestimates prospective performance. Second, descriptor-based representations, particularly Mordred autocorrelation descriptors, provide more transferable physicochemical signals than fingerprint-based approaches in this setting. Third, interpretability analyses should account for feature interactions and structural context rather than relying solely on global importance rankings or single-motif enrichment. Future work should focus on multi-assay training strategies, integration of mechanistic pathway information, and development of applicability domain methods tailored to cross-assay prediction scenarios.

Several limitations of this study should be acknowledged. The external test set is relatively small and heavily class-imbalanced (approximately 10:1 non-toxic to toxic), which limits the precision of cross-assay performance estimates. Model training relied exclusively on MMP assay labels, a proximal measure of mitochondrial dysfunction; the external set introduces endpoint heterogeneity that the models were not trained to address. Only 2D Mordred descriptors were used; 3D conformational or graph-based representations may capture additional predictive information. No applicability domain analysis was performed to identify regions of chemical space where predictions may be less reliable. These limitations suggest that the reported cross-assay performance values represent a conservative estimate of achievable generalization given the current data, and that future improvements in dataset breadth and endpoint diversity will be important for advancing this field.
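One way to put F1 scores from sets with different class balance in context is to compute the score of a trivial classifier that labels every compound as toxic. The sketch below does this using the class counts in Table 1. It assumes that the toxic class is the positive class for F1, which is not stated explicitly in this section, and the helper name `f1_from_counts` is hypothetical.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Class counts from Table 1 (toxic assumed to be the positive class).
TEST1_TOXIC, TEST1_NONTOXIC = 500, 500
TEST2_TOXIC, TEST2_NONTOXIC = 217, 2080

# A classifier that predicts "toxic" for everything: every toxic compound is a
# true positive, every non-toxic compound a false positive, no false negatives.
baseline_test1 = f1_from_counts(tp=TEST1_TOXIC, fp=TEST1_NONTOXIC, fn=0)
baseline_test2 = f1_from_counts(tp=TEST2_TOXIC, fp=TEST2_NONTOXIC, fn=0)

print(f"All-toxic baseline F1, Test 1: {baseline_test1:.3f}")
print(f"All-toxic baseline F1, Test 2: {baseline_test2:.3f}")
```

Because F1 depends on class prevalence, this baseline differs between the balanced Test 1 set and the imbalanced Test 2 set; reporting MCC alongside F1, as in Tables 2 and 3, gives a complementary view that is less sensitive to this effect.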

5. Conclusions

In this study, we evaluated machine learning and deep learning models for mitochondrial toxicity prediction and assessed their cross-assay generalizability. A consistent generalization gap of approximately 0.39 F1 units was observed between the internal MMP test set and the external non-MMP assay across all model configurations, demonstrating that single-assay random-split benchmarks do not reliably predict performance on alternative endpoints. Mordred descriptors provided the most transferable predictive signals, outperforming fingerprint-based features in the cross-assay setting, while scaffold-based splitting confirmed that a proportion of standard random-split performance reflects structural similarity leakage. SHAP analysis identified autocorrelation-family descriptors as dominant contributors, with C3SP3 acting through feature interactions and ATS5i showing consistent positive association with toxicity across assays. Substructure enrichment and stratified odds ratio analyses demonstrated that motif-level and descriptor-level associations are assay-dependent. Together, these results indicate that mitochondrial toxicity is related to multivariate physicochemical patterns rather than single structural features. These findings highlight the importance of multi-assay evaluation frameworks, descriptor-based feature engineering, and context-aware interpretability in developing reliable computational tools for mitochondrial toxicity assessment.

Author Contributions

Conceptualization, A.-C.W. and Y.-H.L.; methodology, Y.-H.L. and Y.-T.L.; formal analysis, Y.-H.L. and Y.-T.L.; data curation, Y.-H.L.; validation, Y.-T.L.; writing—original draft preparation, Y.-H.L., Y.-T.L. and A.-C.W.; writing—review and editing, Y.-H.L., Y.-T.L. and A.-C.W.; visualization, Y.-H.L. and Y.-T.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Science and Technology Council, Taiwan (NSTC 113-2221-E-002-048-MY3) and the National Taiwan University Center for Advanced Computing and Imaging in Biomedicine (NTU-114L900701).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets analyzed in this study were obtained from public and literature sources, as cited. The code is available at GitHub: https://github.com/ntumitolab/mmp-toxicity-model-yuhen (accessed on 10 March 2026).

Acknowledgments

We thank the National Center for High-Performance Computing (NCHC) in Taiwan for providing computational and storage resources.

Conflicts of Interest

The authors have no competing interests to declare that are relevant to the content of this article.

References

1. Zolotukhin, P.V.; Belanova, A.A.; Prazdnova, E.V.; Mazanko, M.S.; Batiushin, M.M.; Chmyhalo, V.K.; Chistyakov, V.A. Mitochondria as a signaling hub and target for phenoptosis shutdown. Biochemistry 2016, 81, 329–337.
2. Lin, Y.T.; Lin, K.H.; Huang, C.J.; Wei, A.C. MitoTox: A comprehensive mitochondrial toxicity database. BMC Bioinform. 2021, 22, 369.
3. Sakamuru, S.; Li, X.; Attene-Ramos, M.S.; Huang, R.; Lu, J.; Shou, L.; Shen, M.; Tice, R.R.; Austin, C.P.; Xia, M. Application of a homogenous membrane potential assay to assess mitochondrial function. Physiol. Genom. 2012, 44, 495–503.
4. Attene-Ramos, M.S.; Huang, R.; Michael, S.; Witt, K.L.; Richard, A.; Tice, R.R.; Simeonov, A.; Austin, C.P.; Xia, M. Profiling of the Tox21 chemical collection for mitochondrial function to identify compounds that acutely decrease mitochondrial membrane potential. Environ. Health Perspect. 2015, 123, 49–56.
5. Eakins, J.; Bauch, C.; Woodhouse, H.; Park, B.; Bevan, S.; Dilworth, C.; Walker, P. A combined in vitro approach to improve the prediction of mitochondrial toxicants. Toxicol. In Vitro 2016, 34, 161–170.
6. Gohil, V.M.; Sheth, S.A.; Nilsson, R.; Wojtovich, A.P.; Lee, J.H.; Perocchi, F.; Chen, W.; Clish, C.B.; Ayata, C.; Brookes, P.S.; et al. Nutrient-sensitized screening for drugs that shift energy metabolism from mitochondrial respiration to glycolysis. Nat. Biotechnol. 2010, 28, 249–255.
7. Rosell-Hidalgo, A.; Moore, A.L.; Ghafourian, T. Prediction of drug-induced mitochondrial dysfunction using succinate-cytochrome c reductase activity, QSAR and molecular docking. Toxicology 2023, 485, 153412.
8. Wang, S.; Zhang, X.; Gui, B.; Xu, X.; Su, L.; Zhao, Y.H.; Martyniuk, C.J. Comparison of modes of action between fish, cell and mitochondrial toxicity based on toxicity correlation, excess toxicity and QSAR for class-based compounds. Toxicology 2022, 470, 153155.
9. Sinha, M.; Praveen, G.; Sachan, D.K.; Parthasarathi, R. Artificial intelligence in clinical toxicology. In Artificial Intelligence in Medicine; Lidströmer, N., Ashrafian, H., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 1487–1501.
10. Mayr, A.; Klambauer, G.; Unterthiner, T.; Hochreiter, S. DeepTox: Toxicity prediction using deep learning. Front. Environ. Sci. 2016, 3, 80.
11. Zhang, H.; Chen, Q.Y.; Xiang, M.L.; Ma, C.Y.; Huang, Q.; Yang, S.Y. In silico prediction of mitochondrial toxicity by using GA-CG-SVM approach. Toxicol. In Vitro 2009, 23, 134–140.
12. Semenova, E.; Williams, D.P.; Afzal, A.M.; Lazic, S.E. A Bayesian neural network for toxicity prediction. Comput. Toxicol. 2020, 16, 100133.
13. Zhao, P.; Peng, Y.; Xu, X.; Wang, Z.; Wu, Z.; Li, W.; Tang, Y.; Liu, G. In silico prediction of mitochondrial toxicity of chemicals using machine learning methods. J. Appl. Toxicol. 2021, 41, 1518–1526.
14. Hemmerich, J.; Troger, F.; Füzi, B.; Ecker, G.F. Using machine learning methods and structural alerts for prediction of mitochondrial toxicity. Mol. Inform. 2020, 39, e2000005.
15. Jaganathan, K.; Rehman, M.U.; Tayara, H.; Chong, K.T. XML-CIMT: Explainable machine learning (XML) model for predicting chemical-induced mitochondrial toxicity. Int. J. Mol. Sci. 2022, 23, 15655.
16. Li, X.; Xiong, H.; Li, X.; Wu, X.; Zhang, X.; Liu, J.; Bian, J.; Dou, D. Interpretable deep learning: Interpretation, interpretability, trustworthiness, and beyond. Knowl. Inf. Syst. 2022, 64, 3197–3234.
17. Yang, L.; Canaveras, J.C.G.; Chen, Z.; Wang, L.; Liang, L.; Jang, C.; Mayr, J.A.; Zhang, Z.; Ghergurovich, J.M.; Zhan, L.; et al. Serine catabolism feeds NADH when respiration is impaired. Cell Metab. 2020, 31, 809–821.e806.
18. Richard, A.M.; Huang, R.; Waidyanatha, S.; Shinn, P.; Collins, B.J.; Thillainadarajah, I.; Grulke, C.M.; Williams, A.J.; Lougee, R.R.; Judson, R.S.; et al. The Tox21 10K compound library: Collaborative chemistry advancing toxicology. Chem. Res. Toxicol. 2021, 34, 189–216.
19. Kim, S.; Thiessen, P.A.; Bolton, E.E.; Chen, J.; Fu, G.; Gindulyte, A.; Han, L.; He, J.; He, S.; Shoemaker, B.A.; et al. PubChem substance and compound databases. Nucleic Acids Res. 2015, 44, D1202–D1213.
20. Bringezu, F.; Gómez-Tamayo, J.C.; Pastor, M. Ensemble prediction of mitochondrial toxicity using machine learning technology. Comput. Toxicol. 2021, 20, 100189.
21. Xia, M.; Huang, R.; Shi, Q.; Boyd, W.A.; Zhao, J.; Sun, N.; Rice, J.R.; Dunlap, P.E.; Hackstadt, A.J.; Bridge, M.F.; et al. Comprehensive analyses and prioritization of Tox21 10K chemicals affecting mitochondrial function by in-depth mechanistic studies. Environ. Health Perspect. 2018, 126, 077010.
22. Wagner, B.K.; Kitami, T.; Gilbert, T.J.; Peck, D.; Ramanathan, A.; Schreiber, S.L.; Golub, T.R.; Mootha, V.K. Large-scale chemical dissection of mitochondrial function. Nat. Biotechnol. 2008, 26, 343–351.
23. Seal, S.; Carreras-Puigvert, J.; Trapotsi, M.A.; Yang, H.; Spjuth, O.; Bender, A. Integrating cell morphology with gene expression and chemical structure to aid mitochondrial toxicity detection. Commun. Biol. 2022, 5, 858.
24. Moriwaki, H.; Tian, Y.-S.; Kawashita, N.; Takagi, T. Mordred: A molecular descriptor calculator. J. Cheminform. 2018, 10, 4.
25. Jaeger, S.; Fulle, S.; Turk, S. Mol2vec: Unsupervised machine learning approach with chemical intuition. J. Chem. Inf. Model. 2018, 58, 27–35.
26. Rogers, D.; Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754.
27. RDKit: Open-Source Cheminformatics Software, Version 2025_03_6; Zenodo: Genève, Switzerland, 2025.
28. Kim, S.; Chen, J.; Cheng, T.; Gindulyte, A.; He, J.; He, S.; Li, Q.; Shoemaker, B.A.; Thiessen, P.A.; Yu, B.; et al. PubChem 2023 update. Nucleic Acids Res. 2023, 51, D1373–D1380.
29. Bemis, G.W.; Murcko, M.A. The properties of known drugs. 1. Molecular frameworks. J. Med. Chem. 1996, 39, 2887–2893.
30. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
31. Bjerrum, E.J. SMILES enumeration as data augmentation for neural network modeling of molecules. arXiv 2017, arXiv:1703.07076.
32. Contreras, M.L.; Alvarez, J.; Guajardo, D.; Rozas, R. Algorithm for exhaustive and nonredundant organic stereoisomer generation. J. Chem. Inf. Model. 2006, 46, 2288–2298.
33. Lundberg, S.M.; Lee, S.-I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4765–4774.
34. Dorogush, A.V.; Ershov, V.; Gulin, A. CatBoost: Gradient boosting with categorical features support. arXiv 2018, arXiv:1810.11363.
35. Bolton, J.L.; Trush, M.A.; Penning, T.M.; Dryhurst, G.; Monks, T.J. Role of quinones in toxicology. Chem. Res. Toxicol. 2000, 13, 135–160.
36. Terada, H. Uncouplers of oxidative phosphorylation. Environ. Health Perspect. 1990, 87, 213–218.
37. Lou, P.-H.; Hansen, B.S.; Olsen, P.H.; Tullin, S.; Murphy, M.P.; Brand, M.D. Mitochondrial uncouplers with an extraordinary dynamic range. Biochem. J. 2007, 407, 129–140.
38. LoPachin, R.M.; Gavin, T. Molecular mechanisms of aldehyde toxicity: A chemical perspective. Chem. Res. Toxicol. 2014, 27, 1081–1091.
39. Enoch, S.J.; Ellison, C.M.; Schultz, T.W.; Cronin, M.T.D. A review of the electrophilic reaction chemistry involved in covalent protein binding relevant to toxicity. Crit. Rev. Toxicol. 2011, 41, 783–802.
40. Escher, B.I.; Hermens, J.L.M. Modes of action in ecotoxicology: Their role in body burdens, species sensitivity, QSARs, and mixture effects. Environ. Sci. Technol. 2002, 36, 4201–4217.
41. Patrick, G.L. An Introduction to Medicinal Chemistry, 6th ed.; Oxford University Press: Oxford, UK, 2017.
42. Tsiper, M.V.; Sturgis, J.; Avramova, L.V.; Parakh, S.; Fatig, R.; Juan-García, A.; Li, N.; Rajwa, B.; Narayanan, P.; Qualls, C.W.; et al. Differential mitochondrial toxicity screening and multi-parametric data analysis. PLoS ONE 2012, 7, e45226.
43. Yang, H.; Li, J.; Wu, Z.; Li, W.; Liu, G.; Tang, Y. Evaluation of different methods for identification of structural alerts using chemical Ames mutagenicity data set as a benchmark. Chem. Res. Toxicol. 2017, 30, 1355–1364.
44. Velena, A.; Zarkovic, N.; Troselj, K.G.; Bisenieks, E.; Krauze, A.; Poikans, J.; Duburs, G. 1,4-dihydropyridine derivatives: Dihydronicotinamide analogues—Model compounds targeting oxidative stress. Oxidative Med. Cell. Longev. 2016, 2016, 1892412.
45. Alves, V.M.; Muratov, E.N.; Capuzzi, S.J.; Politi, R.; Low, Y.; Braga, R.C.; Zakharov, A.V.; Sedykh, A.; Mokshyna, E.; Farag, S.; et al. Alarms about structural alerts. Green Chem. 2016, 18, 4348–4360.
46. Lin, Y.-H. Using Machine Learning to Comparing Molecular Fingerprints and Descriptors for Interpretable Prediction of Mitochondrial Toxicity. Master’s Thesis, Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei, Taiwan, November 2023.

Figure 1. Overview of the mitochondrial toxicity prediction workflow.

Figure 2. Chemical space characterization of the MMP and non-MMP datasets. (A) LogP versus molecular weight scatter plot of the training set and external test set, Test 2. (B) Tanimoto similarity violin plots for Morgan, MACCS keys, and PubChem fingerprints, comparing all toxic–toxic and toxic–non-toxic pairwise similarities across the full combined dataset. Black points show mean ± SD; white lines show medians. All comparisons significant at p < 0.001 (Mann–Whitney U test). (C) PCA projections of Morgan, MACCS, and PubChem fingerprints for all compounds in the internal MMP test set and external non-MMP test. (D) t-SNE projections of the same compounds and fingerprint representations as panel (C).

Figure 3. Performance of classical machine learning models across datasets and evaluation settings. (A) Best F1 scores achieved by each model, showing consistently high performance on Test 1 and substantial reduction on Test 2. (B) Mean F1 scores across feature representations, indicating that Mordred descriptors provide the highest overall performance. (C) Effect of SMOTE oversampling on model performance, showing minimal improvement on Test 2. (D) Comparison of random and scaffold splits, demonstrating reduced Test 1 performance under scaffold splitting while Test 2 remains largely unchanged. (E) Relationship between Test 1 and Test 2 performance across all configurations, with the dashed line indicating equal performance. Most points fall below the diagonal, highlighting a generalization gap. The best Test 2 performance is achieved by CatBoost with Mordred descriptors.

Figure 4. Deep learning performance across augmentation strategies and splitting schemes. (A) FCModel performance under random split across feature types and augmentation methods, showing improved Test 1 performance with SMILES augmentation but limited effect on Test 2. (B) Mol2vec-BiLSTM model performance under random and scaffold splits, indicating modest gains from augmentation and consistent performance reduction under scaffold splitting. (C) Mean FCModel performance across features under random split, highlighting that SMILES augmentation improves Test 1 but not Test 2. (D) FCModel performance under scaffold split, showing reduced Test 1 performance and minimal effect of augmentation on Test 2. (E) Relationship between Test 1 and Test 2 performance across all deep learning configurations, with most points below the identity line, indicating a persistent generalization gap.

Figure 5. Internal versus external generalization reveals a systematic performance gap. Comprehensive analysis of cross-assay generalization performance across all model configurations (n = 156). (A) Distribution of the F1 performance gap (Test 1–Test 2), showing a mean gap of approximately 0.39, with 55% of configurations exceeding a gap of 0.30 and 42% exceeding 0.40. (B) Cross-assay performance gap stratified by model architecture, indicating consistently large gaps across all models, with mean differences ranging from approximately 0.33 to 0.39. (C) Effect of splitting strategy, showing higher Test 1 performance under random split but similar Test 2 performance, resulting in a larger gap compared to scaffold split. (D) Effect of augmentation strategies, demonstrating that augmentation improves Test 1 performance but does not reduce the cross-assay gap. (E) Performance across feature representations, showing similar Test 2 performance across all features despite variation in Test 1. (F) Relationship between Test 1 and Test 2 performance across all configurations, with most points below the identity line, indicating systematic degradation on the external assay. Mordred-based models show slightly improved Test 2 performance but do not eliminate the gap.

Figure 6. Consensus SHAP importance across models, splits, and test sets.

Figure 7. Distribution of representative descriptors by toxicity class.

Figure 8. Motif enrichment across assays. Comparison of motif prevalence between toxic and non-toxic compounds in the MMP assay and the non-MMP assay. Bars represent the percentage of compounds containing each motif, with enrichment ratios indicated. Statistical significance is assessed using BH multiple testing correction. Several motifs, including phenol and aromatic amine, are significantly enriched in the MMP assay but not in the non-MMP assay. Polycyclic aromatic and halogenated aromatic motifs are the only classes consistently enriched across both assays, highlighting limited cross-assay generalizability of motif-level signals.

Figure 9. Stratified odds ratios for nAcid. Forest plots showing odds ratios for nAcid across structural strata in both assays. The overall association in the MMP assay indicates reduced toxicity for compounds with acidic groups, while stratified results reveal substantial variability across structural contexts. In the non-MMP assay, no consistent pattern is observed. These results demonstrate that the effect of acidic functionality is highly context-dependent.

Figure 10. Stratified odds ratios for C3SP3. Forest plots showing odds ratios for C3SP3 across structural strata in both assays. Each point represents the odds ratio with 95% confidence interval for the association between high C3SP3 and toxicity within a given stratum. Most estimates are close to unity and not statistically significant, indicating weak marginal effects. This contrasts with the high SHAP importance of C3SP3 and supports its role as an interaction-dependent descriptor.

Figure 11. Stratified odds ratios for ATS5i. ATS5i shows positive associations with toxicity in several contexts, including carboxylic acid and polycyclic aromatic strata.

Figure 12. Stratified odds ratios for AATS2s. In the non-MMP assay, AATS2s is consistently associated with reduced toxicity across multiple strata, particularly within aromatic and polycyclic aromatic compounds. This pattern indicates that AATS2s captures a transferable physicochemical signal associated with lower toxicity.

Table 1. Composition of the dataset used for mitochondrial toxicity machine learning.

Dataset	Nontoxic	Toxic	Data Type	Reference Data Source
Training dataset	8534	2138	MMP	[2,3,4,18,19,23]
Testing dataset 1	500	500	MMP	[14,20,21,22]
Testing dataset 2	2080	217	non-MMP	[5,6,11,17]

Table 2. Best-performing configurations of classical machine learning models under random and scaffold splits, including augmentation strategies and feature representations, evaluated on Test 1 (MMP) and Test 2 (non-MMP).

Split Method	Model	Augmentation	Feature	Test Set 1 F1	Test Set 1 MCC	Test Set 2 F1	Test Set 2 MCC
Random	KNN	SMOTE	Combined	0.8436	0.7081	0.2984	0.2375
Random	Random Forest	SMOTE	Mordred	0.8344	0.7056	0.3483	0.2990
Random	KNN	SMOTE	Morgan	0.8342	0.6959	0.3062	0.2431
Scaffold	CatBoost	SMOTE	Mordred	0.6431	0.5053	0.3341	0.2713
Scaffold	Random Forest	SMOTE	Mordred	0.6424	0.4798	0.3414	0.2919
Scaffold	SVM	-	Mordred	0.6240	0.4602	0.3186	0.2678
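
For readers less familiar with the two metrics reported in Tables 2 and 3, the sketch below computes the F1 score and the Matthews correlation coefficient (MCC) from confusion-matrix counts using their standard definitions. The function name and the example counts are hypothetical; they are not taken from the study.

```python
from math import sqrt


def f1_and_mcc(tp, fp, fn, tn):
    """Return (F1, MCC) for a binary classifier with toxic as the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    # MCC uses all four cells, so it also reflects performance on the negative class.
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc


# Hypothetical confusion matrix for illustration only.
f1, mcc = f1_and_mcc(tp=40, fp=10, fn=20, tn=30)
print(f"F1 = {f1:.4f}, MCC = {mcc:.4f}")
```

F1 ignores true negatives while MCC does not, which is one reason the two metrics can rank configurations differently on an imbalanced set such as Testing dataset 2.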

Table 3. Best-performing configurations of deep learning models under random and scaffold splits, including augmentation strategies and feature representations, evaluated by F1 score and MCC on Test 1 (MMP) and Test 2 (non-MMP).

Split Method	Model	Augmentation	Feature	Test Set 1 F1	Test Set 1 MCC	Test Set 2 F1	Test Set 2 MCC
Random	FC	Random SMILES	Combined	0.8124	0.6371	0.2829	0.2225
Random	FC	Random SMILES	MACCS	0.8107	0.6515	0.2992	0.2309
Random	LSTM	Stereoisomer enumeration	Mol2vec	0.6537	0.4069	0.2618	0.1762
Scaffold	FC	Random SMILES	PubChem	0.6333	0.4507	0.2952	0.2302
Scaffold	FC	-	Combined	0.6205	0.4469	0.3229	0.2590
Scaffold	LSTM	-	Mol2vec	0.5651	0.3837	0.2532	0.1578
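
The cross-assay gap summarized in Figure 5 (Test 1 F1 minus Test 2 F1) can be reproduced for the rows listed in Tables 2 and 3. The sketch below uses only those twelve best-performing configurations, so its summary values are illustrative and will not match the statistics over all 156 configurations. The variable and function names are hypothetical.

```python
from statistics import mean

# (split, model, feature, Test 1 F1, Test 2 F1), copied from Tables 2 and 3.
ROWS = [
    ("Random", "KNN", "Combined", 0.8436, 0.2984),
    ("Random", "Random Forest", "Mordred", 0.8344, 0.3483),
    ("Random", "KNN", "Morgan", 0.8342, 0.3062),
    ("Scaffold", "CatBoost", "Mordred", 0.6431, 0.3341),
    ("Scaffold", "Random Forest", "Mordred", 0.6424, 0.3414),
    ("Scaffold", "SVM", "Mordred", 0.6240, 0.3186),
    ("Random", "FC", "Combined", 0.8124, 0.2829),
    ("Random", "FC", "MACCS", 0.8107, 0.2992),
    ("Random", "LSTM", "Mol2vec", 0.6537, 0.2618),
    ("Scaffold", "FC", "PubChem", 0.6333, 0.2952),
    ("Scaffold", "FC", "Combined", 0.6205, 0.3229),
    ("Scaffold", "LSTM", "Mol2vec", 0.5651, 0.2532),
]


def f1_gaps(rows):
    """Return the F1 gap (Test 1 minus Test 2) for each configuration."""
    return [test1 - test2 for _, _, _, test1, test2 in rows]


def mean_gap_by_split(rows):
    """Average the F1 gap separately for each splitting strategy."""
    grouped = {}
    for split, _, _, test1, test2 in rows:
        grouped.setdefault(split, []).append(test1 - test2)
    return {split: mean(values) for split, values in grouped.items()}


gaps = f1_gaps(ROWS)
by_split = mean_gap_by_split(ROWS)
n_above_030 = sum(gap > 0.30 for gap in gaps)

print({split: round(value, 4) for split, value in by_split.items()})
print(f"{n_above_030} of {len(gaps)} listed configurations have a gap above 0.30")
```

For these rows the gap is larger under the random split than under the scaffold split, in line with panel C of Figure 5: the difference comes from higher Test 1 scores, not from better Test 2 scores.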

Table 4. Top SHAP-ranked molecular descriptors and their contributions to toxicity prediction.

Descriptor Feature	General Meaning	Times in Top-20 Lists (n = 8)	Average Mean |SHAP|	Proportional SHAP Importance (Fraction)
C3SP3	Count of sp3 carbon atoms bonded to exactly three other carbons; index of aliphatic branching density	8	0.1936	0.1530
AATS1pe	Average Broto–Moreau autocorrelation, lag 1, Pauling electronegativity weights; captures mean electronegativity contrast between bonded atom pairs	7	0.0600	0.1002
AATSC1d	Centred Broto–Moreau autocorrelation, lag 1, sigma-donor capacity (Crippen delta) weights; deviation from mean donor capacity at adjacent atoms	3	0.2373	0.0828
AATS2s	Average Broto–Moreau autocorrelation, lag 2, Kier-Hall electrotopological state weights; electronic environment two bonds apart	8	0.0952	0.0554
GATS3s	Geary autocorrelation, lag 3, intrinsic-state weights; dissimilarity of electrotopological state between atoms separated by three bonds	8	0.1296	0.0547
ATS5i	Broto–Moreau autocorrelation, lag 5, ionisation-potential weights; sum of IP products at five-bond topological distance	6	0.1124	0.0425
MATS2p	Moran autocorrelation, lag 2, polarisability weights; normalised co-variance of atomic polarisability at two-bond distance	4	0.0278	0.0422
GATS4pe	Geary autocorrelation, lag 4, Pauling electronegativity; dissimilarity of electronegativities four bonds apart	1	0.0103	0.0381
GATS1m	Geary autocorrelation, lag 1, atomic mass; mass similarity/dissimilarity between directly bonded atoms	6	0.0948	0.0376
GATS4Z	Geary autocorrelation, lag 4, atomic number; heavy-atom number dissimilarity at four-bond topological distance	2	0.0249	0.0363
AATS1are	Average Broto–Moreau autocorrelation, lag 1, Allred–Rochow electronegativity weights; mean product of electronegativities of directly bonded atoms	3	0.0103	0.0355
GATS1pe	Geary autocorrelation, lag 1, Pauling electronegativity; dissimilarity of electronegativity between directly bonded atoms	1	0.1383	0.0349
GATS3se	Geary autocorrelation, lag 3, Sanderson electronegativity; electronegativity dissimilarity three bonds apart	1	0.0091	0.0341
GATS2m	Geary autocorrelation, lag 2, atomic mass; mass dissimilarity at two-bond topological distance	4	0.0796	0.0339
GATS5i	Geary autocorrelation, lag 5, ionisation potential; ionisation-potential dissimilarity over five-bond topological distance	4	0.0096	0.0338
ATS1se	Broto–Moreau autocorrelation, lag 1, Sanderson electronegativity; sum of electronegativity products at adjacent atoms	6	0.0802	0.0333
GATS5se	Geary autocorrelation, lag 5, Sanderson electronegativity; electronegativity dissimilarity at long topological distance	5	0.0818	0.0320
nAcid	Count of acidic functional groups (pKa-based; includes carboxylic acids, phenols, sulfonamides, tetrazoles)	6	0.0900	0.0316
GATS3m	Geary autocorrelation, lag 3, atomic mass; mass dissimilarity three bonds apart	2	0.1175	0.0293
GATS1s	Geary autocorrelation, lag 1, Kier-Hall intrinsic state; electrotopological state dissimilarity between directly bonded atoms	2	0.0845	0.0278
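
The three numeric columns of Table 4 can be read as a consensus over several per-model descriptor rankings. The sketch below shows one way such a consensus could be computed; it is an assumption, not the authors' documented procedure. It assumes that each list holds mean |SHAP| values for its top descriptors, that the average is taken over the lists in which a descriptor appears, and that the proportional importance is the descriptor's share of each list's total, averaged over all lists. The function name and the toy values are hypothetical.

```python
from collections import defaultdict


def consensus_importance(top_lists):
    """top_lists: one dict per model/split/test-set combination, descriptor -> mean |SHAP|."""
    counts = defaultdict(int)
    shap_sum = defaultdict(float)
    fraction_sum = defaultdict(float)
    for ranking in top_lists:
        total = sum(ranking.values())
        for name, value in ranking.items():
            counts[name] += 1
            shap_sum[name] += value
            # Share of this list's total attribution carried by the descriptor.
            fraction_sum[name] += value / total
    n_lists = len(top_lists)
    return {
        name: {
            "times_in_lists": counts[name],
            "avg_mean_abs_shap": shap_sum[name] / counts[name],
            "proportional_importance": fraction_sum[name] / n_lists,
        }
        for name in counts
    }


# Toy rankings for illustration only; real lists would hold 20 descriptors each.
toy_lists = [
    {"C3SP3": 0.2, "nAcid": 0.1, "ATS5i": 0.1},
    {"C3SP3": 0.3, "AATS2s": 0.1},
]
summary = consensus_importance(toy_lists)
for name, stats in sorted(summary.items(), key=lambda kv: -kv[1]["proportional_importance"]):
    print(name, stats)
```

Under these assumptions a descriptor can have a high average mean |SHAP| but a modest proportional importance if it appears in only a few lists, which is the kind of pattern seen for AATSC1d in Table 4.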
