1. Introduction
Mitochondria are double-membrane organelles that serve as the fundamental hub for cellular energy transduction, metabolic integration, and programmed cell death [1]. Beyond their primary role in ATP synthesis through oxidative phosphorylation, mitochondria are involved in critical processes such as calcium signaling, heme biosynthesis, and the regulation of innate immunity [2]. The mammalian mitochondrial proteome comprises approximately 1100 to 1500 distinct proteins, yet only a small fraction is encoded by the mitochondrial DNA (mtDNA). The vast majority of these proteins are encoded by the nuclear genome, synthesized on cytosolic ribosomes, and subsequently imported into the organelle through sophisticated protein translocases [3,4]. Within the mitochondrion, proteins are strictly partitioned into four distinct sub-compartments: the outer mitochondrial membrane, the inner mitochondrial membrane, the intermembrane space, and the matrix. Each compartment possesses a unique microenvironment and protein composition tailored to its specific physiological functions; for instance, the inner mitochondrial membrane houses the respiratory chain complexes, while the matrix contains the enzymes for the tricarboxylic acid (TCA) cycle. Consequently, a protein’s sub-organellar localization is a prerequisite for its functional maturity. Mislocalization is often associated with severe pathological conditions, including neurodegenerative diseases and metabolic syndromes [5]. Given the labor-intensive nature of experimental techniques such as immunofluorescence and sub-cellular fractionation, there is an urgent need for high-throughput computational tools to accurately predict submitochondrial localization from primary sequence data.
In recent years, the explosive growth of protein sequence data has enabled researchers to extract deeper structural information, significantly advancing research in computer-aided protein sub-subcellular prediction, including subnuclear [6,7,8,9,10,11], subchloroplast [12,13,14,15,16], and submitochondrial localization [17,18,19,20,21]. The computational prediction of protein submitochondrial localization has progressed through several distinct stages of development. Early research focused on the identification of N-terminal presequences, which are characterized by an abundance of positively charged residues and the ability to form amphiphilic α-helices. Claros and Vincens developed MitoProt II [22], which utilized these statistical properties to estimate the probability of mitochondrial import. Similarly, Emanuelsson and colleagues introduced TargetP [23,24], which employed artificial neural networks to discriminate between mitochondrial targeting signals and other sorting peptides based on N-terminal residue patterns.
As the field moved toward resolving internal sub-organellar partitioning, researchers integrated a wider array of biological descriptors. Du and Li developed SubMito [17], which hybridized pseudo-amino acid composition with various physicochemical features of segmented sequences using Support Vector Machines (SVMs). Du and Li further improved this by incorporating evolutionary information into the Pseudo-Amino Acid Composition (PseAAC) framework [25]. Building upon these traditional machine learning approaches, SubMito-XGBoost [20] was developed to leverage the power of ensemble learning. By fusing multiple types of feature information, including evolutionary profiles and physicochemical properties, and employing the eXtreme Gradient Boosting (XGBoost) algorithm, this method significantly improved the robustness and predictive accuracy of submitochondrial localization.
The introduction of deep learning allowed for the extraction of non-linear sequential dependencies. Savojardo et al. developed DeepMito [18], which used multi-branch convolutional neural networks (CNNs) to process evolutionary profiles. Subsequently, the same group released BUSCA [26] for integrative sub-cellular mapping. More recently, several specialized frameworks have further refined protein submitochondrial localization prediction. iDeepSubMito [19] implemented a deep learning framework optimized for hierarchical feature extraction from sequences. DeepPred-SubMito [27] introduced a multi-channel CNN combined with dataset balancing treatments to improve performance on minority classes. Furthermore, Jiang et al. proposed a framework focused on residue-level interpretation [28], allowing for the visualization of amino acids that contribute to localization decisions.
More recently, the emergence of protein language models (pLMs), such as ESM-2, has further pushed the performance boundaries by leveraging high-dimensional embeddings that encode deep evolutionary and structural information. Recent pLM-based methodologies typically involve transformer architecture fine-tuning or the integration of complex attention mechanisms on top of the sequence embeddings. While theoretically highly expressive, these transformer-heavy approaches face significant limitations in submitochondrial localization. First, they are intrinsically data-hungry and heavily parameterized, making them highly susceptible to severe overfitting when applied to the small-scale and imbalanced datasets available in this domain (e.g., containing fewer than 600 sequences). Second, they frequently rely on global attention or mean pooling mechanisms across the entire sequence. In long mitochondrial proteins, this mathematically dilutes the high-resolution signal of short N-terminal motifs within the massive non-targeting sequence data, causing the model to miss critical ‘sorting codes’.
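The dilution argument above can be made concrete with simple arithmetic. The sketch below is illustrative only: it uses a made-up scalar feature that is 1.0 inside a hypothetical 20-residue targeting signal and 0.0 elsewhere, and shows how the mean-pooled value shrinks as the protein gets longer. The function names and the 20-residue length are assumptions, not values from this study.

```python
def motif_weight_under_mean_pooling(motif_len: int, seq_len: int) -> float:
    """Fraction of a mean-pooled representation contributed by the motif residues."""
    if not 0 < motif_len <= seq_len:
        raise ValueError("motif_len must be between 1 and seq_len")
    return motif_len / seq_len


def mean_pool(values):
    """Average a list of per-residue scalar features."""
    return sum(values) / len(values)


# Hypothetical signal length; real presequences vary in length.
MOTIF_LEN = 20

for seq_len in (80, 400, 1000):
    # Toy feature: 1.0 on signal residues, 0.0 on the mature protein.
    features = [1.0] * MOTIF_LEN + [0.0] * (seq_len - MOTIF_LEN)
    pooled = mean_pool(features)
    print(f"length {seq_len}: pooled signal = {pooled:.3f}")
```

With a fixed motif length, the pooled value falls in proportion to 1/length, which is the sense in which a short N-terminal signal is diluted in a long protein.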
To address the dual bottlenecks of signal dilution and model overfitting, we propose Mito-Mixer, a multi-scale feature mixing framework. We introduce a Mixer-based architecture [29] that employs two orthogonal operations: Token-mixing (analogous to learning the ‘spatial melody’ or positional arrangement of amino acids within targeting motifs) and Channel-mixing (which refines the ‘biochemical vocabulary’ or specific physical properties at each position). By utilizing a parameter-efficient MLP-Mixer architecture on top of frozen ESM-2 embeddings, Mito-Mixer bypasses the computational overhead and overfitting risks associated with transformer fine-tuning while explicitly preserving localized N-terminal features. This multi-scale approach ensures that short, critical motifs remain prominent even in exceptionally long proteins, bridging the gap between global evolutionary context and local targeting signals.
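To make the two mixing operations concrete, the following dependency-free sketch implements a single generic Mixer block on a small positions-by-channels matrix. It is not the Mito-Mixer implementation: layer normalization and biases are omitted, and the weight shapes, hidden sizes, and helper names (`mixer_block`, `full`) are assumptions chosen for readability.

```python
import math


def gelu(x):
    """Exact GELU activation."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def matmul(a, b):
    """Multiply an (n x m) list-of-lists matrix by an (m x p) one."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def mlp(x, w1, w2):
    """Two-layer MLP applied to each row of x: gelu(x @ w1) @ w2."""
    hidden = [[gelu(v) for v in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)


def residual(x, update):
    """Element-wise skip connection."""
    return [[a + b for a, b in zip(rx, ru)] for rx, ru in zip(x, update)]


def mixer_block(x, token_w1, token_w2, channel_w1, channel_w2):
    """x has shape (positions, channels)."""
    # Token-mixing: transpose so each row holds one channel across all positions.
    x = residual(x, transpose(mlp(transpose(x), token_w1, token_w2)))
    # Channel-mixing: each row holds all channels of a single position.
    x = residual(x, mlp(x, channel_w1, channel_w2))
    return x


def full(rows, cols, value):
    """Constant matrix helper used for the toy weights below."""
    return [[value] * cols for _ in range(rows)]


# Toy input: 3 positions x 2 channels, hidden width 4 (all hypothetical sizes).
x = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
out = mixer_block(x, full(3, 4, 0.5), full(4, 3, 0.5), full(2, 4, 0.5), full(4, 2, 0.5))
print(len(out), len(out[0]))  # the block preserves the (positions, channels) shape
```

The transpose is the only difference between the two steps: token-mixing lets information flow between positions, while channel-mixing acts on each position independently.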
3. Results and Discussion
3.1. Hyperparameter Optimization
To identify the optimal architectural and training configuration for Mito-Mixer, we conducted an automated hyperparameter search using Optuna version 4.7.0 [34], a state-of-the-art optimization framework. The search process utilized Bayesian Optimization via the Tree-structured Parzen Estimator (TPE) algorithm, which explores the search space by balancing exploration of new parameter combinations with exploitation of previously identified high-performance regions. The Hyperparameter Optimization (HPO) space was carefully defined to cover structural components of the Mixer blocks, the classification head, and various optimization settings. The specific search ranges are detailed in Table 2.
During the HPO process, the N-terminal truncation length (N) was fixed at 80 residues. We explicitly excluded N from the Optuna search space to prevent conflating the optimization of network capacity with the biological boundaries of targeting signals. A dedicated parametric sweep of N was subsequently performed (detailed in Section 3.2).
The optimization was implemented using PyTorch version 2.1.2 and executed on an NVIDIA GeForce RTX 4090 GPU. To ensure the robustness and generalizability of the selected parameters, we employed a stratified 5-fold cross-validation strategy. This stratification guarantees that the highly imbalanced class proportions (e.g., the sparse intermembrane space proteins) are preserved across all folds. During each trial, the model was trained for a maximum of 50 epochs. To prevent over-fitting and reduce unnecessary computational expenditure, an early stopping mechanism was implemented, which terminated the training if the validation loss failed to improve for 20 consecutive epochs. The objective function for Optuna was defined as the maximization of the average Generalized Correlation Coefficient (GCC) across all five folds, ensuring that the final model configuration excels in global multi-class classification consistency.
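Two pieces of this protocol, stratified fold assignment and patience-based early stopping, can be sketched without any framework. The code below is a simplified illustration rather than the authors' pipeline: the round-robin fold assignment omits shuffling, the class labels are toy values, and the function names are hypothetical.

```python
from collections import defaultdict


def stratified_folds(labels, k=5):
    """Assign sample indices to k folds so each class is spread evenly (no shuffling)."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        # Continue the round-robin across classes so fold sizes stay balanced.
        for j, idx in enumerate(by_class[label]):
            folds[(offset + j) % k].append(idx)
        offset += len(by_class[label])
    return folds


def early_stopping_epochs(val_losses, max_epochs=50, patience=20):
    """Return (best_epoch, last_epoch_run), both 1-based, for a list of validation losses."""
    best_loss, best_epoch, stale, last_epoch = float("inf"), None, 0, None
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        last_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_epoch, last_epoch


# Toy labels (M = matrix, I = inner, O = outer, T = intermembrane space).
toy_labels = ["M"] * 10 + ["I"] * 7 + ["O"] * 5 + ["T"] * 3
for fold in stratified_folds(toy_labels):
    print(sorted(toy_labels[i] for i in fold))
```

Under this counting convention, a patience of 20 within a 50-epoch budget only shortens a run whose best validation loss occurs by epoch 29; later minima simply run to the epoch limit.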
Following the completion of 50 optimization trials, the best-performing configurations were identified. While the search space was shared, the optimal parameters converged to slightly different values for the SM424-18 and SubMitoPred datasets, reflecting their differing scales and sequence diversities. The final selected parameters, which were used to generate the benchmark results in the following sections, are presented in Table 3.
3.2. Analysis of N-Terminal Truncation Length (N)
Since Mitochondrial Targeting Signals (MTS) are typically located at the N-terminus, the choice of N-terminal Truncation Length (N) represents a trade-off between capturing sufficient signaling information and avoiding the inclusion of excessive “noise” from the mature protein region. To investigate this relationship, we conducted a sensitivity analysis by varying N from 20 to 200 residues while keeping all other hyperparameters constant. This parameter sweep for N was systematically evaluated using the same stratified 5-fold cross-validation protocol employed during the main hyperparameter optimization, ensuring that the selected lengths are robust and not artifacts of a specific data split. The results across both benchmark datasets exhibit a distinct non-linear relationship between sequence length and predictive performance.
In the SM424-18 dataset, model performance is sensitive to the truncation length. As illustrated in Figure 2, starting from a baseline GCC of approximately 0.685 at the shortest truncation length, performance climbs steadily as more sequence context is included. The model reaches its global maximum at N = 80, achieving a GCC of 0.7443. This suggests that, for this dataset, 80 residues are sufficient to encompass the majority of mitochondrial targeting presequences (MTS) while maintaining a high signal-to-noise ratio. Beyond N = 80, the GCC fluctuates and generally trends downward, dipping as low as 0.696. This decline is consistent with our hypothesis that including excessive residues from the mature protein can dilute the specialized targeting features, making it harder for the Mixer blocks to extract relevant localization cues.
The SubMitoPred dataset demonstrates a similar overall trend but favors a significantly longer truncation length to reach peak accuracy. As shown in Figure 3, performance on SubMitoPred follows a more prolonged upward trajectory than on SM424-18, maintaining a GCC above 0.730 for most intervals. The optimal performance, a GCC of 0.7878, is achieved at a larger N than on SM424-18. The requirement for a longer sequence length here likely reflects the presence of more complex or distal targeting signals within the SubMitoPred protein distribution, such as those found in certain inner membrane or intermembrane space proteins that rely on internal motifs rather than just N-terminal presequences. As with the first dataset, performance begins to drop once the length exceeds the optimal threshold, falling toward 0.749.
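The sweep described in this section amounts to truncating each protein's per-residue representation to its first N positions and selecting the N with the best mean cross-validated score. A minimal sketch follows; the 20-residue step size and the `placeholder_score` function are assumptions for illustration and do not reproduce any reported result.

```python
def truncate_n_terminal(per_residue_features, n):
    """Keep the first n residues (or the whole protein if it is shorter than n)."""
    return per_residue_features[:n]


def sweep_truncation_length(candidates, evaluate):
    """Return (best_n, scores), where evaluate(n) yields a mean cross-validated score."""
    scores = {n: evaluate(n) for n in candidates}
    best_n = max(scores, key=scores.get)
    return best_n, scores


def placeholder_score(n):
    # Stand-in for "mean GCC over stratified 5-fold CV at length n".
    # Purely synthetic: peaks at 80 by construction, NOT a reported result.
    return 1.0 - abs(n - 80) / 200.0


# Assumed grid: 20 to 200 residues in steps of 20.
best_n, scores = sweep_truncation_length(range(20, 201, 20), placeholder_score)
print(best_n)
print(truncate_n_terminal("MLSLRQSIRFFKPATRTLCSSRYLL", 10))
```

In a real sweep, `evaluate` would retrain and score the model at each length with all other hyperparameters held fixed, as the text describes.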
3.3. Performance on Benchmark Datasets
We evaluated the performance of Mito-Mixer on two widely used benchmark datasets: SM424-18 and SubMitoPred. To provide a rigorous assessment, we compared our model against several state-of-the-art (SOTA) predictors, including iDeepSubMito, DeepMito, and SubMito. The performance was quantified using the class-specific Matthews’ Correlation Coefficient (MCC) for the four sub-compartments (matrix, inner membrane, outer membrane, and intermembrane space) and the Generalized Correlation Coefficient (GCC) for overall classification consistency. The performance metrics reported for Mito-Mixer in Table 4 and Table 5 represent the mean evaluation scores obtained from the stratified 5-fold cross-validation process. This methodology aligns with standard practices for evaluating models on these specific benchmarks, ensuring fair comparison against SOTA predictors.
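Since this section does not restate the metric formulas, the sketch below assumes the definitions commonly used in this literature: a one-vs-rest MCC per class, and a GCC computed from the chi-square statistic of the confusion matrix, normalized by N(K - 1). The zero-handling choices (returning 0 when the MCC denominator vanishes, skipping cells with zero expected count) are conventions adopted here, not details confirmed by the text.

```python
import math


def one_vs_rest_mcc(confusion, k):
    """MCC of class k from a square confusion matrix (rows = true, columns = predicted)."""
    n = sum(map(sum, confusion))
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp
    fp = sum(row[k] for row in confusion) - tp
    tn = n - tp - fn - fp
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: report 0 when the coefficient is undefined (e.g., class absent).
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def gcc(confusion):
    """Generalized correlation coefficient: sqrt(chi2 / (N * (K - 1)))."""
    k = len(confusion)
    n = sum(map(sum, confusion))
    row_tot = [sum(row) for row in confusion]
    col_tot = [sum(row[j] for row in confusion) for j in range(k)]
    chi2 = 0.0
    for i in range(k):
        for j in range(k):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:  # skip cells with no expected mass
                chi2 += (confusion[i][j] - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (k - 1)))


# Toy 2-class matrix: here GCC coincides with |MCC|.
toy = [[4, 1], [2, 3]]
print(round(one_vs_rest_mcc(toy, 0), 4), round(gcc(toy), 4))
```

One consequence of the zero convention is that a class with no true members in a test set yields an MCC of exactly 0 regardless of the predictor.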
The SM424-18 dataset serves as a rigorous benchmark due to its high sequence diversity and the inherent difficulty in distinguishing between the four submitochondrial compartments. As demonstrated in Table 4, Mito-Mixer achieves a superior performance profile across all evaluation metrics compared to established deep learning architectures such as DeepMito and iDeepSubMito.
Specifically, our model achieves a GCC of 0.7443, representing a significant improvement over iDeepSubMito (0.6803) and DeepMito (0.54). This global metric underscores Mito-Mixer’s ability to maintain high classification consistency across the entire dataset. Mito-Mixer reaches an MCC(M) of 0.8212 and an MCC(I) of 0.7011, outperforming iDeepSubMito by approximately 14% and 12%, respectively. This gain is largely attributed to the Channel-mixing module’s ability to refine biochemical features, such as the high hydrophobicity characteristic of the inner membrane proteins. Our method records an MCC(O) of 0.7479, a marked increase over the 0.6583 achieved by the previous state-of-the-art iDeepSubMito. For the historically challenging intermembrane space category, Mito-Mixer achieves an MCC(T) of 0.6870. This is a substantial leap from DeepMito’s 0.53, validating our strategy of using Token-mixing to preserve and amplify short, high-resolution N-terminal signals (such as cysteine-rich motifs) that are otherwise lost in global pooling methods. Specifically, Token-mixing is highly adept at capturing the strict spatial spacing of cysteine residues, such as the twin CX9C or CX3C motifs, which are the essential hallmarks recognized by the MIA (Mitochondrial Intermembrane space Assembly) pathway for IMS import.
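The fixed cysteine spacing mentioned above can be illustrated with a plain regular-expression scan. This is not how Mito-Mixer detects such motifs; it only shows what a C-x9-C or C-x3-C spacing pattern looks like in a sequence. The toy sequence is synthetic, and treating a "twin" motif as two non-overlapping occurrences is a simplifying assumption.

```python
import re


def find_cx_n_c(sequence, spacing):
    """0-based start indices of C-x(spacing)-C patterns, overlapping matches included."""
    pattern = re.compile(r"(?=C.{%d}C)" % spacing)
    return [m.start() for m in pattern.finditer(sequence)]


def has_twin_motif(sequence, spacing):
    """True if two non-overlapping C-x(spacing)-C segments occur in the sequence."""
    span = spacing + 2  # two cysteines plus the residues between them
    starts = find_cx_n_c(sequence, spacing)
    return any(b - a >= span for a in starts for b in starts)


# Synthetic example, not a real protein: two C-x9-C segments joined by a short loop.
toy = "MA" + "C" + "A" * 9 + "C" + "GGGGG" + "C" + "A" * 9 + "C" + "K"
print(find_cx_n_c(toy, 9))
print(has_twin_motif(toy, 9), has_twin_motif(toy, 3))
```

The point of the example is that the motif is defined by relative positions rather than by composition, which is the kind of positional regularity a token-mixing layer can in principle represent.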
To further validate the robustness and generalization of the Mito-Mixer framework, we evaluated its performance on the SubMitoPred dataset. This dataset provides an alternative distribution of mitochondrial proteins, allowing for a comprehensive comparison against existing high-performance predictors. As summarized in Table 5, Mito-Mixer consistently exceeds the performance of both DeepMito and iDeepSubMito across all metrics.
Mito-Mixer achieved an overall GCC of 0.7878, a substantial improvement over the 0.6783 reported for iDeepSubMito. This gain demonstrates that our model’s multi-scale feature integration is highly effective across different data sources. The class-specific analysis via MCC reveals several critical insights. The model shows exceptional proficiency in identifying membrane-associated proteins, achieving an MCC(I) of 0.7682 and a remarkable MCC(O) of 0.8404. This represents an improvement of approximately 12% and 14% over iDeepSubMito, respectively. The significantly higher scores in these categories reinforce our hypothesis that the Channel-mixing module successfully distills biochemical features like hydrophobicity and transmembrane helix propensity from the ESM-2 embeddings. For matrix proteins, Mito-Mixer recorded an MCC(M) of 0.7886. Compared to DeepMito (0.76) and iDeepSubMito (0.7335), our model more accurately captures the traditional N-terminal presequences that guide proteins to the mitochondrial interior. Notably, Mito-Mixer achieved an MCC(T) of 0.7298, which is the highest recorded for this difficult-to-predict compartment (Intermembrane Space). This is a 33% relative improvement over iDeepSubMito (0.5494) and significantly higher than DeepMito (0.46). This result confirms that our Token-mixing approach is uniquely capable of identifying the subtle, non-canonical signals, such as cysteine-rich motifs, that characterize intermembrane space-bound proteins.
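Because the comparisons in this section mix absolute differences and relative percentages, it can help to compute both from the paired scores quoted in the text. The helper below is a small worked calculation; only value pairs stated explicitly in the prose (Mito-Mixer versus iDeepSubMito) are included.

```python
def improvement(new, old):
    """Return (absolute gain, relative gain in percent) of `new` over `old`."""
    return new - old, 100.0 * (new - old) / old


# (Mito-Mixer, iDeepSubMito) pairs quoted in the text.
reported = {
    "SM424-18 GCC": (0.7443, 0.6803),
    "SM424-18 MCC(O)": (0.7479, 0.6583),
    "SubMitoPred GCC": (0.7878, 0.6783),
    "SubMitoPred MCC(T)": (0.7298, 0.5494),
}

for name, (ours, baseline) in reported.items():
    abs_gain, rel_gain = improvement(ours, baseline)
    print(f"{name}: +{abs_gain:.4f} absolute, +{rel_gain:.1f}% relative")
```

For the intermembrane space class on SubMitoPred, the relative gain rounds to the 33% stated above, while the corresponding absolute gain is about 0.18 MCC units.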
3.4. Generalization on Independent Test Set (M983)
To evaluate the practical utility and robustness of Mito-Mixer, we conducted an independent generalization test using the M983 dataset. For this generalization experiment, we utilized the Mito-Mixer model that was fully trained and optimized on the SubMitoPred benchmark dataset. This model was subsequently deployed to predict the localization of the entirely unseen sequences in the M983 dataset. This is a critical measure of whether the model has learned general biological principles of protein sorting rather than merely over-fitting to specific dataset biases.
The results, detailed in Table 6, demonstrate that Mito-Mixer maintains a high level of predictive accuracy even when confronted with diverse sequences from the M983 dataset. Mito-Mixer achieved a GCC of 0.6945, significantly outperforming iDeepSubMito (0.6321) and DeepMito (0.5558). This result confirms that the hierarchical mixing architecture effectively captures evolutionary and structural signatures that remain consistent across different proteomic sources.
Specifically, our model achieved an MCC(O) of 0.9302, a substantial gain over previous methods. This suggests that the Channel-mixing component reliably identifies the biochemical signatures of outer membrane-resident proteins, such as beta-barrel structures or distinctive hydrophobic anchors. The model maintained strong performance in the matrix and inner membrane categories, with MCC(M) of 0.7656 and MCC(I) of 0.7713, respectively. These scores indicate that the multi-scale integration of N-terminal “Head” features and global context allows the model to generalize the complex “MTS plus transmembrane domain” logic required for these compartments. A notable observation in Table 6 is the 0.0000 MCC(T) recorded across all models, including Mito-Mixer. This result does not reflect model failure; it is a characteristic of the M983 independent test set. Unlike the benchmark datasets used for training, M983 does not contain any proteins annotated to the intermembrane space, so no true positives are possible for this class and its MCC is reported as zero. Despite the reduced number of target classes, Mito-Mixer’s higher GCC indicates that it is the most robust of the compared models at identifying the remaining mitochondrial sub-compartments. We explicitly acknowledge this as a limitation of the current generalization test: the model’s capacity to identify novel IMS proteins cannot be externally validated on this dataset.
To provide deeper visual insight into the performance metrics, Figure 4 presents the 4-class confusion matrix generated from the independent M983 test set. A detailed analysis of the diagonal elements (true positives) and off-diagonal elements (misclassifications) reveals where the model excels and where it encounters biological ambiguity. The model demonstrates exceptional accuracy in identifying Matrix proteins (with 173 out of 177 correctly classified) and Outer Membrane proteins (133 out of 145). The most prominent source of misclassification originates from the Inner Membrane category, where 77 proteins were incorrectly predicted as Matrix residents and 31 as Intermembrane Space proteins. This specific misclassification pattern is highly consistent with mitochondrial biology: many Inner Membrane proteins share classical N-terminal targeting presequences with Matrix proteins (as both are initially imported via the TOM/TIM23 translocase complexes) and rely on secondary, downstream hydrophobic “stop-transfer” signals for membrane anchoring. It is inherently challenging for purely sequence-based models to perfectly disambiguate these compartments when secondary signals are subtle or atypical. Furthermore, as previously noted, the true label row for the Intermembrane Space is entirely zero due to the M983 dataset’s specific composition. Nevertheless, the matrix confirms that the model maintained stable control over false positive predictions for this class (only 35 total false assignments out of 983 samples), which was crucial for preserving the high overall Generalized Correlation Coefficient (GCC).
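The counts quoted from the confusion matrix translate directly into per-class recall. The short calculation below uses only numbers stated in the text; the Inner Membrane total is inferred by subtraction under the assumption that M983 contains only matrix, inner membrane, and outer membrane proteins, and its recall is an upper bound because the number of Inner Membrane proteins predicted as Outer Membrane is not given here.

```python
TOTAL = 983
matrix_total, matrix_correct = 177, 173
outer_total, outer_correct = 145, 133
inner_as_matrix, inner_as_ims = 77, 31
ims_false_positives = 35


def recall(correct, total):
    """Fraction of a class's true members that were predicted correctly."""
    return correct / total


# Inferred, not reported: remaining samples are taken to be Inner Membrane proteins.
inner_total = TOTAL - matrix_total - outer_total
# Upper bound: inner-to-outer errors are unknown and would lower this further.
inner_recall_upper_bound = recall(inner_total - inner_as_matrix - inner_as_ims, inner_total)

print(f"Matrix recall: {recall(matrix_correct, matrix_total):.3f}")
print(f"Outer Membrane recall: {recall(outer_correct, outer_total):.3f}")
print(f"Inner Membrane recall (upper bound): {inner_recall_upper_bound:.3f}")
print(f"Share of samples falsely assigned to IMS: {ims_false_positives / TOTAL:.4f}")
```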
3.5. Ablation Study
To systematically evaluate the contribution of each core component in Mito-Mixer, we conducted a series of ablation experiments across two benchmark datasets: SM424-18 and SubMitoPred. These experiments were designed to disentangle the benefits of our multi-scale input strategy from the architectural innovations of the Mixer block. To ensure statistical consistency, all ablation variants were evaluated using the same stratified 5-fold cross-validation protocol employed for the primary model evaluation. The results are summarized in Table 7.
The first two experiments address the “signal dilution” problem by testing the processing mechanism and the input scope respectively.
Exp A (No Mixer—Architecture Ablation): We replaced the Mixer blocks with a standard Global Average Pooling layer applied to the N-terminal slice. While this model still utilizes the N-terminal information, it lacks the non-linear interaction capacity of the Mixer blocks. In SM424-18, the GCC dropped from 0.7443 to 0.7143, and in SubMitoPred, it fell from 0.7878 to 0.7380. This indicates that simply isolating the N-terminus is insufficient; the model requires the Mixer’s ability to decode complex spatial interdependencies to achieve optimal accuracy.
Exp B (No N-terminal Slicing—Input Ablation): We tested the necessity of explicit N-terminal focus by feeding the full protein sequence directly into the Mixer blocks. Despite the powerful modeling capability of the Mixer architecture, the GCC declined to 0.6911 on SM424-18 and 0.7536 on SubMitoPred. This confirms that in long sequences, localized targeting signals are mathematically “diluted” by the massive amount of mature protein data, proving that the specialized strategy is essential to preserve high-resolution targeting codes.
We further investigated the individual contributions of the two fundamental mixing mechanisms within the Mixer block. The results reveal that neither dimension alone is sufficient to maintain the model’s predictive power.
Token-only Variant: This configuration focuses purely on spatial correlations and positional motifs. The GCC showed a clear decrease to 0.7251 on SM424-18 and 0.7403 on SubMitoPred. While this variant can still recognize some “rhythmic” arrangements like amphipathic α-helices, the performance gap compared to the full model highlights that spatial patterns alone cannot accurately resolve sub-compartment identities without the refined biochemical context provided by channel interaction.
Channel-only Variant: This version treats each residue as an independent biochemical entity, ignoring its position. This led to a further decline in performance, with GCC values falling to 0.7121 and 0.7386, respectively. The drop in consistency proves that treating residues as isolated chemical units, ignoring their sequential context, severely handicaps the model’s ability to recognize structural motifs, such as the cysteine-rich patterns essential for the intermembrane space pathway.
Overall, the full Mito-Mixer architecture consistently outperforms all ablated versions across both datasets. This proves that state-of-the-art accuracy in protein submitochondrial localization requires a synergistic integration of multi-scale spatial focus and dual-dimensional feature mixing, as the “sorting code” is composed of both spatial arrangements and specific biochemical features.
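The ablation results quoted above can be compared side by side by computing each variant's GCC drop relative to the full model. The sketch below uses only the GCC values stated in this section; the dictionary layout and function name are illustrative.

```python
FULL = {"SM424-18": 0.7443, "SubMitoPred": 0.7878}

ABLATIONS = {
    "Exp A (no Mixer)": {"SM424-18": 0.7143, "SubMitoPred": 0.7380},
    "Exp B (no N-terminal slicing)": {"SM424-18": 0.6911, "SubMitoPred": 0.7536},
    "Token-only": {"SM424-18": 0.7251, "SubMitoPred": 0.7403},
    "Channel-only": {"SM424-18": 0.7121, "SubMitoPred": 0.7386},
}


def gcc_drops(dataset):
    """GCC lost by each ablated variant relative to the full model, largest first."""
    drops = {name: FULL[dataset] - scores[dataset] for name, scores in ABLATIONS.items()}
    return sorted(drops.items(), key=lambda item: item[1], reverse=True)


for dataset in FULL:
    print(dataset)
    for name, drop in gcc_drops(dataset):
        print(f"  {name}: -{drop:.4f}")
```

Ranked this way, removing N-terminal slicing costs the most on SM424-18, whereas removing the Mixer blocks costs the most on SubMitoPred, so the relative importance of the two components differs between the benchmarks.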
3.6. Computational Case Study: Disentangling Bipartite Targeting Signals
To biologically validate our hypothesis regarding signal dilution, we analyzed the model’s processing of proteins with bipartite targeting signals, such as Cytochrome c1 (a resident of the inner mitochondrial membrane). Biologically, these proteins rely on a complex N-terminal code: an initial positively charged amphipathic α-helix that directs the protein to the Matrix via the TIM23 complex, immediately followed by a hydrophobic “stop-transfer” sequence that arrests translocation and anchors it in the Inner Membrane.
In our ablation studies, models relying solely on global mean pooling frequently misclassified such proteins as Matrix residents. Mathematically, the broader sequence data diluted the subtle, localized hydrophobic stop-transfer signal. However, Mito-Mixer successfully classifies these proteins into the Inner Membrane. By explicitly truncating the input to the high-resolution N-terminal head and processing it through the Channel-mixing module, the framework effectively isolates and amplifies the biochemical signature (hydrophobicity) of the stop-transfer signal against the background of the initial matrix-targeting sequence. This case study confirms that Mito-Mixer aligns with known biological sorting mechanisms, successfully decoding complex, multi-part targeting motifs that are easily lost in global embeddings.
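The contrast between a localized hydrophobic segment and a whole-sequence average can be illustrated with the Kyte-Doolittle hydropathy scale. The sketch below is an analogy, not the model's mechanism: the sequence is synthetic (it is not Cytochrome c1), the 9-residue window is an assumed choice, and the function name is hypothetical.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def best_hydrophobic_window(sequence, window=9):
    """Return (start, mean_hydropathy) of the most hydrophobic window in the sequence."""
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence]
    best_start, best_mean = 0, float("-inf")
    for start in range(len(scores) - window + 1):
        mean = sum(scores[start:start + window]) / window
        if mean > best_mean:
            best_start, best_mean = start, mean
    return best_start, best_mean


# Synthetic bipartite-style sequence (all three parts are made up for illustration).
presequence = "MLSRLRRALKSARS"      # positively charged N-terminal stretch
stop_transfer = "LLVIAVLGLAVFIAL"   # hydrophobic segment
mature = "SDEGTNKQ" * 40            # long polar filler standing in for the mature protein
toy = presequence + stop_transfer + mature

whole_mean = sum(KYTE_DOOLITTLE[aa] for aa in toy) / len(toy)
start, local_mean = best_hydrophobic_window(toy)
print(f"whole-sequence mean hydropathy: {whole_mean:.2f}")
print(f"most hydrophobic window starts at residue index {start}, mean {local_mean:.2f}")
```

In this toy case the whole-sequence average is negative even though a strongly hydrophobic window sits immediately after the presequence, which mirrors the way a stop-transfer signal can vanish under global pooling yet remain visible in an N-terminal slice.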
4. Conclusions
In this study, we introduced Mito-Mixer, a novel multi-scale MLP-Mixer framework specifically engineered to address the “signal dilution” challenge in protein submitochondrial localization. By integrating high-resolution N-terminal features with global evolutionary semantics from the ESM-2 protein language model, Mito-Mixer provides a robust solution for annotating protein positions within the complex mitochondrial structure. Our comprehensive evaluation demonstrates that the model effectively mitigates the dilution of essential sorting codes, a common failure point for traditional global pooling methods. Through the explicit slicing of N-terminal sequences and the specialized use of the Token-mixing module, the framework successfully preserves critical, short-range targeting signals that are often overwhelmed by the larger mature protein body. The experimental results across the SM424-18 and SubMitoPred datasets establish Mito-Mixer as a state-of-the-art tool in the field. Notably, the model achieved a significant leap in identifying the intermembrane space proteins and exhibited high precision in distinguishing between the inner and outer membranes. These improvements in MCC and GCC metrics are further reinforced by the high performance on the M983 independent test set, which confirms that Mito-Mixer has captured fundamental biological principles of protein translocation rather than dataset-specific biases. This robust generalization capability underscores its potential for large-scale proteomic annotation and its utility in advancing our understanding of mitochondrial function and related molecular diseases. Furthermore, our systematic ablation studies revealed that the synergy between Token-mixing for spatial motifs and Channel-mixing for biochemical refinement is the primary driver of the model’s accuracy.
While attention-based architectures are theoretically highly expressive, they are intrinsically data-hungry and prone to severe overfitting when applied to small-scale, highly imbalanced datasets like the currently available submitochondrial benchmarks (e.g., comprising only 424 to 570 sequences). Mito-Mixer explicitly mitigates this risk by utilizing a parameter-efficient MLP-Mixer architecture to decode the deep evolutionary semantics already captured by the pre-trained ESM-2 model. This design balances feature expressiveness against model generalization, avoiding the heavy computational overhead and overfitting risks associated with additional Transformer blocks.
While Mito-Mixer demonstrates robust predictive capabilities, the current framework remains constrained by its reliance on purely sequence-based features and the inherent biases of small-scale experimental datasets. Future developments should address these constraints by incorporating 3D structural information (e.g., predicted via AlphaFold) to further refine the model’s precision, particularly for minority class proteins.