1. Introduction
Mitochondria are present in almost all eukaryotic organisms. They are enclosed by two membranes, and their biogenesis results from delicate coordination between the nuclear and mitochondrial genomes [1]. The mitochondrial intermembrane space lies between the two mitochondrial membranes, and the mitochondrial matrix is surrounded by the inner membrane [2]. Mitochondria are not only the center of energy metabolism in the body; they also participate in many important cellular processes [3,4], such as electron transfer, adenosine triphosphate synthesis, the tricarboxylic acid cycle, fatty acid oxidation, amino acid degradation, and other complex biological processes. For a cell to function normally, its proteins must appear at the right location at the right time to interact with the correct molecular partners. A mislocalized protein cannot reach those partners and therefore cannot be integrated into the proper functional biological networks or pathways. Dysfunctional mitochondria lead to energy metabolism disorders that cause a series of interacting states of injury. A number of diseases are associated with mitochondria, such as the commonly seen polygenic disorders [5], Parkinson’s disease, and diabetes mellitus. Therefore, knowing the submitochondrial location of a protein helps to clarify its function and supports the design of drugs for diseases caused by mitochondrial defects. Unfortunately, experimental methods for determining the submitochondrial location of a protein are expensive and time-consuming, so effective computational methods are needed to assist researchers with this problem.
Protein subcellular localization is a significant research area in proteomics, and researchers have made remarkable progress in recent years. Research at the sub-subcellular level has advanced more slowly because it is more complicated than subcellular localization. However, with the increasing amount of sequence data, computational methods suitable for predicting protein submitochondrial location have emerged. Over the last decade, several methods have achieved notable results in this task. For example, Mei et al. [6] presented a multi-kernel transfer learning model (MK-TLM). Lin et al. [7] employed over-represented tetrapeptides to predict the submitochondrial location and established the M495 dataset. Kumar et al. [8] put forward a method that predicts both whether a protein is mitochondrial and its submitochondrial location. Qiu et al. [9] used pseudo-amino acid composition and a pseudo-position-specific scoring matrix to extract features. Yu et al. [10] predicted protein submitochondrial localization with eXtreme gradient boosting. Recently, Savojardo et al. [11] adopted deep learning to predict four submitochondrial locations.
The prediction of protein submitochondrial localization is a multi-label, multi-class problem. A multi-label predictor is hard to train because few proteins carry more than one label. In previous multi-class studies, the mitochondrial intermembrane space proteins were typically excluded. However, the number of known intermembrane space proteins has increased, and those proteins should be considered in future research [12]. Among the existing methods, only those of Kumar et al. and Savojardo et al. can discriminate among four different locations. Thus, a new method that predicts submitochondrial localization, including the intermembrane space, is urgently needed.
Currently, methods for predicting protein submitochondrial localization are mainly based on machine learning algorithms. Traditional machine learning first requires researchers to extract diverse features from protein sequences, such as amino acid composition [13] and pseudo-amino acid composition [14]. The features are then transformed into suitable vectors, and the vectors are classified [15]. Although those methods have achieved good performance, they still have essential drawbacks; for example, manually designed features are very likely to be a suboptimal representation, which limits the performance of the models. In contrast to machine learning methods that require manual feature extraction, deep learning is a feature learning approach that learns directly from the original data and classifies on the basis of strongly correlated, higher-level abstract features. It thereby reduces the noise introduced by manual intervention. Deep learning has proven to be a very powerful method that has been successfully applied in various biological fields, including genomics, transcriptomics, proteomics, structural biology, and chemistry [15,16,17]. The deep learning-based prediction tool “DeepLoc” [18] was proposed for protein subcellular localization. However, the model considers only one possible label for each protein, whereas protein subcellular localization is in general a multi-label, multi-class problem. Long et al. [19] proposed a model combining a convolutional neural network (CNN) and XGBoost to solve the problem. Manaz et al. [20] used a CNN model to predict the subcellular localization of endomembrane system and secretory pathway proteins. To predict RNA-protein sequence and structure binding preferences, Pan et al. [21] proposed a model based on convolutional and recurrent neural networks. All of this demonstrates that the CNN is an effective deep learning method that is widely used in this field.
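As an illustration of the manually designed features discussed above, the simplest of them, amino acid composition, can be sketched in a few lines of Python. This is a minimal sketch and not the implementation used in any of the cited works; the function name `amino_acid_composition` and the toy sequence are assumptions made for illustration only.

```python
from collections import Counter

# The 20 standard amino acids in a fixed order (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_composition(sequence):
    """Return a 20-dimensional vector of residue frequencies.

    Non-standard residue codes (e.g. X, B, Z) are ignored.
    """
    seq = sequence.upper()
    counts = Counter(residue for residue in seq if residue in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        # No standard residues: return an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]

# Toy sequence for illustration only (not taken from any dataset).
features = amino_acid_composition("MKKLA")
```

Such a vector discards the order of the residues entirely, which is one reason hand-crafted representations of this kind may be suboptimal.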
No existing predictor determines submitochondrial location in an end-to-end manner. Although Savojardo et al. [11] employed deep learning to predict the submitochondrial location, their method still relied on manual feature extraction. Another problem remains in subcellular prediction: few researchers have addressed data imbalance before classification, which biases the predictor toward some categories [22,23]. Hence, it is imperative to address the classification of imbalanced datasets. Convolutional neural networks (CNNs) can find motifs in protein sequences, which carry very important information for subcellular localization, so CNNs are well suited to capturing sequence features. Unfortunately, a CNN by itself cannot capture how earlier and later positions in the sequence affect the current position. To address this limitation, we use a multi-channel CNN that considers the entire protein sequence.
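The idea of scanning a sequence for motifs with several convolution channels can be sketched as follows. This is a minimal sketch under stated assumptions: it takes "multi-channel" to mean parallel filters of different widths followed by global max pooling, which this section does not spell out, and it uses hand-built kernels in place of learned weights. All function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(sequence):
    """Encode a sequence as a list of 20-dimensional one-hot rows."""
    return [[1.0 if aa == residue else 0.0 for aa in AMINO_ACIDS]
            for residue in sequence]

def conv1d_max(matrix, kernel):
    """Slide a (width x 20) kernel along the sequence and return the
    largest response (valid convolution, then global max pooling)."""
    width = len(kernel)
    responses = []
    for start in range(len(matrix) - width + 1):
        score = 0.0
        for offset in range(width):
            row, weights = matrix[start + offset], kernel[offset]
            score += sum(x * w for x, w in zip(row, weights))
        responses.append(score)
    # A sequence shorter than the kernel produces no window at all.
    return max(responses) if responses else 0.0

def multi_channel_features(sequence, kernels):
    """Apply one kernel per channel and concatenate the pooled outputs."""
    matrix = one_hot(sequence)
    return [conv1d_max(matrix, kernel) for kernel in kernels]

# Assumption: in a trained network the kernel weights are learned.
# Here each kernel is the one-hot encoding of a toy motif, so its
# response equals the number of residues matching that motif.
kernels = [one_hot("KL"), one_hot("RRA"), one_hot("MKKLA")]
pooled = multi_channel_features("MKKLARRW", kernels)
# pooled == [2.0, 2.0, 5.0]
```

Because each channel is pooled over every window of the sequence, the concatenated output summarizes the whole protein rather than a single local region.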
This paper proposes an end-to-end predictor based on deep learning, named DeepPred-SubMito. First, it applies random over-sampling to the datasets to balance the submitochondrial protein classes. Then, it transforms each protein sequence into a one-hot matrix. Finally, it applies multi-channel convolutional neural networks to extract features from the protein sequences and output the predicted location. We use cross-validation to evaluate the performance of the proposed predictor on two datasets containing four submitochondrial locations and compare it with state-of-the-art methods. To further verify the predictor on a dataset that contains only three submitochondrial locations, without the intermembrane space, we evaluate it on the M983 dataset and compare it with state-of-the-art predictors.
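The first two steps of this pipeline, random over-sampling and one-hot encoding, can be sketched as follows. This is a minimal sketch and not the DeepPred-SubMito implementation: the function names, the fixed length `max_len`, the zero-padding scheme, and the toy class labels are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def random_oversample(sequences, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every
    class has as many samples as the largest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for seq, label in zip(sequences, labels):
        by_class[label].append(seq)
    if not by_class:
        return []
    target = max(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        balanced.extend((seq, label) for seq in items + extra)
    rng.shuffle(balanced)
    return balanced

def one_hot_padded(sequence, max_len):
    """Encode a sequence as a (max_len x 20) matrix. Long sequences are
    truncated, short ones zero-padded (an assumed padding scheme);
    non-standard residues become all-zero rows."""
    rows = [[1 if aa == residue else 0 for aa in AMINO_ACIDS]
            for residue in sequence[:max_len]]
    rows += [[0] * len(AMINO_ACIDS) for _ in range(max_len - len(rows))]
    return rows

# Toy data with hypothetical class names; not from any real dataset.
seqs = ["MKKLA", "MAARL", "MLLSR", "MKTWY", "MGGHC"]
labs = ["inner", "inner", "inner", "matrix", "outer"]
balanced = random_oversample(seqs, labs)
class_counts = Counter(label for _, label in balanced)
matrix = one_hot_padded(balanced[0][0], max_len=8)
```

When over-sampling is combined with cross-validation, it is generally safer to over-sample only the training folds, so that duplicated sequences do not appear in both the training and the test split.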
The rest of this paper is organized as follows.
Section 2 discusses the experimental results of DeepPred-SubMito.
Section 3 introduces the datasets, random over-sampling, convolutional neural networks, and the evaluation metrics.
Section 4 summarizes this paper.