Accurate Prediction and Key Feature Recognition of Immunoglobulin

Abstract: Immunoglobulin, also called an antibody, is a type of serum protein produced by B cells that can specifically bind to the corresponding antigen. Immunoglobulin is closely related to many diseases and plays a key role in medicine and biology. Therefore, effective methods that improve the accuracy of immunoglobulin classification are of great significance for disease research. In this paper, the CC-PSSM and monoTriKGap methods were selected to extract immunoglobulin features, and MRMD1.0 and MRMD2.0 were used to reduce the feature dimension. The ability to distinguish immunoglobulins was then compared between the two-dimensional key features identified by a single dimension reduction method and the mixed two-dimensional key features. The results indicated that monoTriKGap (k = 1) can accurately predict 99.5614% of immunoglobulins under 5-fold cross-validation. In addition, CC-PSSM is the best method for identifying mixed two-dimensional key features and can distinguish 92.1053% of immunoglobulins. These results indicate that the method used in this paper is reliable for predicting immunoglobulins and identifying key features.


Introduction
Immunoglobulin, also known as an antibody, is a serum protein present in humans. When the immune system encounters an invasion, B cells are stimulated, depending on the degree of invasion, to produce different amounts of globulins that can specifically bind to the corresponding antigen and provide immune functions. Immunoglobulins, therefore, play a key role in protecting the human body from internal and external threats and help maintain the stability of the immune system and self-tolerance [1]. Immunoglobulins are closely related to disease treatment and have long been used in the study of multiple autoimmune diseases. For example, treatment with intravenous immunoglobulin in patients with systemic sclerosis not only alleviates muscle symptoms but also ameliorates systemic inflammation in skin disease [2]. Immunoglobulins also exert a remission effect against different forms of lupus erythematosus skin disease, and, when used in the treatment of Behcet's disease, they produce a sustained response over time without any side effects [3,4]. The in-depth study of immunoglobulins can help to determine immune mechanisms and to develop effective drugs to treat diseases [5].
The detection of immunoglobulins has attracted the attention of researchers. Marcatili et al. developed a strategy to predict the 3D structure of antibodies [6]. This strategy takes only approximately ten minutes on average to build a structural model of an antibody. The process is fully automated and achieves a very satisfactory level of accuracy. To identify antigen-specific human monoclonal antibodies, Liu et al. developed an antibody clone screening strategy based on clone kinetics and relative frequency [7]. This method can simplify the subsequent experimental screening. The enzyme-linked immunosorbent assay showed that at least 52% of the putative positive immunoglobulin heavy chains constituted antigen-specific antibodies. In addition, Salvo et al. introduced biosensors for detecting the total content of immunoglobulins, including electrochemical, optical, and piezoelectric biosensors [8]. Immunoglobulin optical biosensors are mainly based on surface plasmon resonance detection, but their current limitation is that almost all of them work only in buffer solutions. These state-of-the-art technologies do help the study of immunoglobulins, but biochemical experiments usually require considerable money or time [9,10].
With the proliferation of protein data, effective and efficient computational methods to identify immunoglobulins are urgently needed, because accurate identification is the first step toward revealing immunoglobulin function [11,12]. Over the past decade, a remarkable number of machine learning-based techniques for protein sequence analysis have been developed [13,14]. The amino acid composition is an important factor for protein identification. The amino acid composition (AAC) model, a commonly used feature representation method, represents the normalized frequency of the natural amino acids in a peptide chain [15][16][17][18]. Subsequently, the concept of pseudo amino acid composition (PseAAC), also based on the amino acid sequence, was proposed and became widely used [19][20][21][22][23]. PseAAC performs better than plain amino acid composition in protein prediction models because it includes not only the amino acid composition but also the physicochemical correlations between pairs of residues [24][25][26][27]. Inspired by PseAAC, the pseudo k-tuple reduced amino acid composition (PseKRAAC) model was proposed, which lowers computational complexity by reducing the amino acid alphabet [28].
The above methods focus mainly on the feature representation of protein sequences. To discriminate between immunoglobulins and non-immunoglobulins, Tang et al. subsequently proposed a prediction model based on a support vector machine that combines pseudo amino acid composition with nine physical and chemical properties of amino acids [29]. However, this model used 105 features, and jackknife experimental results indicated that 96.3% of the immunoglobulins could be correctly predicted, a result that leaves room for improvement. How key features can be further exploited to recognize immunoglobulins also remains to be investigated.
In this paper, two feature representation methods, profile-based cross covariance (CC-PSSM) [30] and monoTriKGap [31], were selected to explore the accurate prediction of immunoglobulins. With the MRMD1.0 and MRMD2.0 feature selection techniques, the feature dimension was reduced to two-dimensional key features that still achieved a high identification effect. The results showed that the best feature subset generated by the monoTriKGap feature extraction method was able to correctly predict 99.6% of the immunoglobulins with the support vector machine (SVM) classifier [32] based on sequential minimal optimization. The CC-PSSM feature extraction method was better able to identify key features discriminating immunoglobulins; the identified two-dimensional mixed key features were validated with the multilayer perceptron classifier and could correctly identify 92.1% of the immunoglobulins. In addition, the performance of different feature extraction methods under different classifiers was compared, which indicates that the method in this paper is reliable for immunoglobulin research.
A comparison with other classifiers shows that the three classifiers chosen in this paper work better than the alternatives. Meanwhile, the independent test sets show that the model in this paper has good generalization performance.
Figure 1. The overall framework of immunoglobulin prediction and key feature recognition. First, the dataset was established, and, next, the protein sequence was represented by CC-PSSM and monoTriKGap. Then, the work was divided into two parts: identifying key features and predicting and evaluating. In the above steps, MRMD1.0, MRMD2.0, and MRMD1.0+2.0 were used to obtain key features, and k-fold cross-validation was performed under Naïve Bayes, multilayer perceptron, and SVM classifiers (k = 5).

Dataset Construction
This section describes how the benchmark dataset was established. Since immunoglobulins are often found on or outside the cell membrane, we selected immunoglobulins located both at the cell membrane and extracellularly to ensure proper discrimination. Immunoglobulin and non-immunoglobulin sequences were downloaded from the UniProt [33] database. To establish the benchmark dataset, the following steps were taken. First, protein sequences containing the nonstandard amino acid characters "B", "J", "O", "X", "U", and "Z" were deleted. Second, to avoid overfitting caused by homology bias and to reduce redundancy, the CD-HIT program [34,35] was used with a 60% sequence identity cutoff to remove highly similar sequences. Finally, any protein sequence that was a subsequence of another protein was also removed. To avoid the influence of differences between protein sequences from different species on the prediction results, we selected only human, mouse, and rat samples.
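The first and last filtering steps can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' pipeline: the function name `filter_sequences` is hypothetical, "subsequence" is assumed to mean a contiguous substring, exact duplicates are assumed to be collapsed to one copy, and the CD-HIT redundancy step (an external program) is left out.

```python
# Nonstandard residue codes named in the text.
NONSTANDARD = set("BJOXUZ")

def filter_sequences(seqs):
    """Drop sequences with nonstandard residues, then drop any sequence that is
    a contiguous substring of another retained sequence (assumed meaning)."""
    # Step 1: remove sequences containing any nonstandard character.
    clean = [s for s in seqs if not (set(s) & NONSTANDARD)]
    # Assumption: exact duplicates are collapsed, preserving input order.
    unique = list(dict.fromkeys(clean))
    # Step 3: remove sequences contained in another sequence.
    # (Step 2, CD-HIT at 60% identity, would run between these two steps.)
    return [s for s in unique if not any(s != t and s in t for t in unique)]

example = ["ACDEFG", "CDE", "ACXDE", "MKV", "MKV"]
kept = filter_sequences(example)
print(kept)
```

Here "ACXDE" is dropped because it contains "X", and "CDE" is dropped because it occurs inside "ACDEFG".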
After filtering, the immunoglobulin samples are denoted by $I^{+}$ and the non-immunoglobulin samples by $I^{-}$, and the benchmark dataset $I$ is the combination of the two:

$$I = I^{+} \cup I^{-}$$

The $I^{+}$ dataset includes 109 positive samples, and the $I^{-}$ dataset includes 119 negative samples. Therefore, the benchmark dataset consists of 228 protein sequences; detailed information is shown in Table 1. The data can be downloaded for free from https://github.com/gongxiaodou/Immunoglobulin (accessed on 21 July 2021). To further validate the accuracy of the method in this paper for immunoglobulin prediction and the reliability of key feature identification, we used two datasets for independent testing.

CC-PSSM
The CC-PSSM feature representation method uses the position-specific scoring matrix (PSSM) [36,37] as its input. Each protein sequence is searched with PSI-BLAST [38] against NCBI's NR database to obtain its PSSM. The element $S_{j,i}$ of the PSSM represents the substitution score of amino acid $i$ at sequence position $j$. Each protein sequence containing $D$ residues can be represented as

$$R = R_{1} R_{2} R_{3} \ldots R_{D}$$

where $R_{j}$ ($j = 1, 2, 3, \ldots, D$) represents the residue at position $j$ in the protein sequence sample $R$. CC-PSSM [39] transforms PSSM matrices of different sizes into vectors of the same length. The cross covariance (CC) [40] measures the correlation between the scores of two different amino acids at positions separated by a lag $LG$ along the sequence. It is calculated as follows:

$$CC(i_{1}, i_{2}, LG) = \frac{1}{D - LG} \sum_{j=1}^{D-LG} \left(S_{j,i_{1}} - \bar{S}_{i_{1}}\right)\left(S_{j+LG,i_{2}} - \bar{S}_{i_{2}}\right)$$

where $i_{1}$ and $i_{2}$ represent two different amino acids, $\bar{S}_{i_{1}}$ and $\bar{S}_{i_{2}}$ represent the means of the substitution scores for amino acids $i_{1}$ and $i_{2}$ along the sequence, and $D$ represents the length of the protein sequence. Calculated in this way, the PSSM of each protein sequence is transformed into a vector of length 380 × lag (20 × 19 = 380 ordered pairs of different amino acids), where lag is the maximum value of $LG$ ($LG = 1, 2, \ldots,$ lag). In this study, we set the maximum lag to 2.
When LG = 1, the extracted features, such as "CC(A,R,1)", "CC(A,N,1)", "CC(A,D,1)", etc., form a vector of length 380. When LG = 2, the extracted features, such as "CC(A,R,2)", "CC(A,N,2)", "CC(A,D,2)", etc., form another vector of length 380. Therefore, each protein sample was finally transformed into a vector of length 760. The demonstration process is shown in Figure 2.
Appl. Sci. 2021, 11, x FOR PEER REVIEW 5 of 17
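The cross-covariance transformation can be illustrated with a short, dependency-free Python sketch. It assumes the PSSM is already available as a list of D rows with 20 scores each (running PSI-BLAST is not shown), uses the standard cross-covariance definition described in the text, and the function name `cc_pssm` is hypothetical rather than taken from any published tool.

```python
from itertools import permutations

def cc_pssm(pssm, lag=2):
    """Cross covariance of PSSM columns for LG = 1..lag.

    pssm: list of D rows, each a list of per-amino-acid substitution scores.
    Requires D > lag. Returns a flat list with n*(n-1) values per lag.
    """
    d = len(pssm)
    n = len(pssm[0])
    # Mean substitution score of each amino acid (column) along the sequence.
    means = [sum(row[i] for row in pssm) / d for i in range(n)]
    features = []
    for lg in range(1, lag + 1):
        # Ordered pairs of different amino acids: 20 * 19 = 380 for a full PSSM.
        for i1, i2 in permutations(range(n), 2):
            total = sum(
                (pssm[j][i1] - means[i1]) * (pssm[j + lg][i2] - means[i2])
                for j in range(d - lg)
            )
            features.append(total / (d - lg))
    return features

# Toy PSSM with 5 positions and 20 columns (values are arbitrary placeholders).
toy = [[(r * c) % 7 - 3 for c in range(20)] for r in range(5)]
vector = cc_pssm(toy, lag=2)
print(len(vector))
```

With 20 columns and lag = 2 the sketch yields 2 × 380 = 760 values, matching the vector length stated above.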

Figure 2. Demonstration of the CC-PSSM feature extraction process. (c) The first 20 columns of alignment information were intercepted to obtain the PSSM matrix corresponding to each protein. (d) CC was calculated for each PSSM matrix, and lag = 2 was set. First, the feature vector of length 380 for LG = 1 was obtained. (e) Then, the CC of each PSSM matrix was calculated for LG = 2, and another feature vector of length 380 was obtained. Finally, with lag = 2, each protein sample is converted into a vector of length 760.

monoTriKGap
The monoTriKGap feature extraction method used in this article stems from PyFeat. This method has been widely used in the prediction of proteins and other biological sequences [41][42][43]. PyFeat differs from earlier k-mer [44] frequency approaches by introducing the important parameter KGap. The k-mer frequency has long been a principal method for extracting local sequence characteristics. However, as the subsequence length k increases, the number of features increases sharply. For proteins, this surge is especially large because of the larger number of amino acids. monoTriKGap therefore uses the KGap parameter to address this limitation [45,46]. In the monoTriKGap model, the parameter KGap can be set to 1, 2, or 3.
Importantly, after generating the full feature set, monoTriKGap applies the AdaBoost classification model [47] to remove redundant features and generate the best feature set. Utilizing this model not only reduces the feature dimensionality but also maintains robustness under high-dimensional feature multicollinearity. In this study, to reduce data sparsity, KGap was set to 1. When KGap = 1, features have the form X_XXX, where X is any of the twenty natural amino acids, so features such as "A_AAA", "A_AAC", etc., are generated. The full feature set generated by the monoTriKGap model in this setting had 160,000 features and was automatically optimized by AdaBoost to generate the best feature set containing 212 more discriminative features. Taking the sequence "EAHAAALAACAAAYHYYWLECYRYYY" as an example with KGap = 1, the generation of the feature dataset is demonstrated in Figure 3.
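The X_XXX counting scheme can be sketched as follows. This is a simplified illustration rather than the PyFeat implementation: the function name `mono_tri_kgap_counts` is hypothetical, raw occurrence counts are returned (whether and how the counts are normalized is left open here), and the AdaBoost-based selection of the 212 best features is not shown.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the twenty natural amino acids

def mono_tri_kgap_counts(seq, gap=1):
    """Count features made of one residue, `gap` skipped positions, and the
    following three residues (X_XXX when gap = 1)."""
    counts = Counter()
    window = 1 + gap + 3  # total span covered by one feature
    for i in range(len(seq) - window + 1):
        feature = seq[i] + "_" * gap + seq[i + gap + 1 : i + window]
        counts[feature] += 1
    return counts

seq = "EAHAAALAACAAAYHYYWLECYRYYY"  # example sequence from the text
counts = mono_tri_kgap_counts(seq, gap=1)
print(len(AMINO_ACIDS) ** 4)   # size of the full X_XXX feature space
print(sum(counts.values()))    # number of windows in the example sequence
```

The full feature space has 20^4 = 160,000 possible X_XXX patterns, which agrees with the feature count given above; a single sequence populates only a small fraction of them, which is why a subsequent selection step is useful.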

Figure 3. The calculation principle of the protein sequence expressed by monoTriKGap is explained. In this study, KGap = 1 was set, and each feature was represented as X_XXX, where X represents the 20 kinds of natural amino acids. The feature value was obtained by calculating the frequency of occurrence of each feature X_XXX.

Classifier
To predict whether a protein sequence is an immunoglobulin, the task is treated as a binary classification problem [48,49]. Three classifiers were compared in this paper to select those that predict immunoglobulins most accurately: Naïve Bayes, SVM, and multilayer perceptron.
Naïve Bayes (NB) [50,51] is a Bayesian probabilistic classifier that assumes the features are independent of one another and contribute equally to the classification. Because of this independence assumption, interactions between features do not interfere with the classification result. The classifier is therefore easy to implement, runs fast, and is insensitive to noise in high-dimensional feature spaces, which is beneficial for applications.
As a supervised machine learning method, the support vector machine can solve both classification and regression problems. For the binary classification problem in this paper, its basic idea is to separate samples of different categories by finding a separating hyperplane. To reduce the amount of computation and memory, John C. Platt proposed sequential minimal optimization (SMO) [52,53] for training support vector machines (SVMs) [54][55][56]. SMO is widely used because it decomposes the large quadratic programming (QP) problem that SVM training must solve into a series of the smallest possible QP subproblems, avoiding a time-consuming numerical QP optimization as an inner loop. The choice of kernel function for support vector machines is very important. On the same dataset, different kernel functions may give different prediction results. In general, an appropriate kernel function, such as the linear kernel, the polynomial kernel, or the radial basis function (RBF), can improve the prediction accuracy of the model.
The multilayer perceptron (MLP) [57] is a simple artificial neural network in which neurons are connected between adjacent layers and there is no connection between neurons within a layer. It maps the input dataset to the output set in a feedforward manner, and the output of each node is a weighted sum followed by a nonlinear activation function, which allows the network to distinguish data that are not linearly separable. The multilayer perceptron is usually trained using backpropagation, and its effectiveness for classification problems has been well verified in previous work.

Key Feature Recognition
In the feature extraction subsection, several hundred features were extracted by the CC-PSSM and monoTriKGap methods. However, there was redundancy among these features. This section introduces the identification of two-dimensional key features by means of the MRMD1.0 and MRMD2.0 feature selection techniques, with the aim of predicting immunoglobulins with fewer features.
The MRMD1.0 feature selection method [58,59] is determined by two main parts. The first is the relevance between a feature and the class label, which is calculated with the Pearson correlation coefficient. The second is the redundancy among features, which is calculated with three distance functions: Euclidean distance, cosine distance, and the Tanimoto coefficient. A larger Pearson correlation coefficient indicates that a feature is more closely related to the class label, and a larger distance indicates less redundancy among the features. Finally, MRMD1.0 generates a feature ranking in which highly ranked features have strong correlation with the class labels and low redundancy with other features. Here, we selected the first two features in this ranking as the two-dimensional key features identified by MRMD1.0.
MRMD2.0 [60], a commonly used feature ranking and dimension reduction tool, contains seven feature ranking methods, including MRMR, LASSO, ANOVA, and MRMD [61]. MRMD2.0 uses the PageRank algorithm to calculate a directed graph score for each feature and ranks the features according to this score. Users can also specify the number of features to select, yielding a feature subset that balances maximum relevance and minimum redundancy. Here, we chose to screen two optimal features as the key features for immunoglobulin recognition.
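The relevance-plus-distance idea behind MRMD1.0 can be illustrated with a deliberately simplified sketch. It is not the MRMD tool: only the Euclidean distance is used (the cosine and Tanimoto terms are omitted), relevance and mean distance are simply added without weighting, and the names `pearson`, `euclidean`, and `mrmd_like_ranking` are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant vector."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mrmd_like_ranking(columns, labels):
    """Rank features by |Pearson relevance| + mean Euclidean distance to the
    other features (assumed, unweighted combination)."""
    names = list(columns)
    scores = {}
    for name in names:
        relevance = abs(pearson(columns[name], labels))
        others = [euclidean(columns[name], columns[o]) for o in names if o != name]
        distance = sum(others) / len(others) if others else 0.0
        scores[name] = relevance + distance
    ranking = sorted(names, key=scores.get, reverse=True)
    return ranking, scores

# Toy data: two label-aligned features and one uncorrelated feature.
labels = [0, 0, 1, 1]
columns = {
    "f_good": [0, 0, 1, 1],
    "f_copy": [0, 0, 1, 1],
    "f_noise": [1, 0, 1, 0],
}
ranking, scores = mrmd_like_ranking(columns, labels)
print(ranking[:2])  # the top two entries would serve as 2-D key features
```

Taking the first two entries of the ranking mirrors the selection of two-dimensional key features described above; in practice the distance terms would also need scaling so that they do not dominate the relevance term.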

Performance Evaluation
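To further estimate the classification performance of our selected feature set and two-dimensional key features, the TP rate (TPR), FP rate (FPR) [62], precision [63][64][65][66], Matthews correlation coefficient (MCC) [67], and accuracy (ACC) [68][69][70][71] were calculated and compared to obtain the best method for accurate immunoglobulin prediction and key feature identification. The individual performance metrics were calculated as follows:

$$TPR = \frac{TP}{TP + FN}$$

$$FPR = \frac{FP}{FP + TN}$$

$$Precision = \frac{TP}{TP + FP}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

$$ACC = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP is the number of correctly predicted immunoglobulin samples, TN is the number of correctly predicted non-immunoglobulin samples, FP is the number of non-immunoglobulin samples incorrectly predicted as immunoglobulins, and FN is the number of immunoglobulin samples incorrectly predicted as non-immunoglobulins [72]. TPR represents the proportion of immunoglobulins that are correctly predicted, and FPR represents the proportion of non-immunoglobulins that are incorrectly predicted. Precision indicates the proportion of samples predicted as positive that are truly positive. MCC indicates the correlation between the actual and predicted classifications. ACC indicates the proportion of all samples that are correctly classified.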
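As a small worked example, the metrics can be computed directly from the four confusion-matrix counts using their standard definitions. The function name `classification_metrics` is hypothetical, and the counts in the example are illustrative only: they are chosen so that a single one of the 228 benchmark samples is misclassified, and are not taken from the paper's tables.

```python
import math

def classification_metrics(tp, fn, fp, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    acc = (tp + tn) / (tp + fn + fp + tn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"TPR": tpr, "FPR": fpr, "Precision": precision, "MCC": mcc, "ACC": acc}

# Hypothetical counts: 109 positives with one missed, 119 negatives all correct.
metrics = classification_metrics(tp=108, fn=1, fp=0, tn=119)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With one error among 228 samples, the accuracy is 227/228, or about 99.5614%.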

Comparison of Different Feature Extraction and Classification Methods
As described above, this study compared the prediction performance of three classifiers: Naïve Bayes, SVM, and multilayer perceptron. All three classifiers used their built-in default parameters. The default kernel function of the SVM was the linear kernel, and the penalty coefficient was C = 1.0. The multilayer perceptron was a simple 3-layer network, comprising an input layer, a hidden layer, and an output layer, with the sigmoid function as the activation function. The CC-PSSM and monoTriKGap feature extraction methods were compared with previous research results, and we tested the classification accuracy on the immunoglobulin dataset through 5-fold cross validation [73]. The predictions obtained with the three classifiers from the 760 features extracted by the CC-PSSM method and from the best feature subset of 212 features generated by the monoTriKGap method are presented in Table 2, and the ACC values are contrasted in Figure 4.
The data in Table 2 show that, for the CC-PSSM feature extraction method, the ACC value of the multilayer perceptron classifier is higher than those of Naïve Bayes and SVM. With the multilayer perceptron, the TPR reached 0.961, the FPR 0.041, the MCC 0.921, and the ACC 96.0526%, and the area under the ROC curve was 0.994. Compared with the Naïve Bayes classifier, the ACC value increased by 3.1%. For the monoTriKGap feature extraction method, the TPR, precision, ACC, and other values of the SVM classifier were higher than those of Naïve Bayes and multilayer perceptron. With the SVM, the TPR reached 0.996, the FPR 0.004, the MCC 0.991, and the ACC 99.5614%, and the area under the ROC curve was 0.996. Compared with the Naïve Bayes classifier, the ACC value increased by 7.5%.
Through comparison and analysis, the study found that the SVM classification result of the best feature subset extracted by monoTriKGap improved compared with the multilayer perceptron classification result of the feature subset extracted by CC-PSSM and the prediction model proposed by Tang et al. [29]. This shows that employing the monoTriKGap feature extraction method to generate the best feature subset and SVM can achieve a higher prediction effect, which is most conducive to the accurate prediction of immunoglobulins.
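The 5-fold cross validation protocol used in these comparisons can be sketched without any machine learning library. The sketch below only produces the train/test index splits; it assumes a stratified split (each fold keeps roughly the class ratio), which the text does not state explicitly, and the function name `stratified_k_fold` and the fixed seed are hypothetical.

```python
import random

def stratified_k_fold(labels, k=5, seed=0):
    """Return k (train_indices, test_indices) pairs with class-balanced folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        # Deal the shuffled members of each class round-robin into the folds.
        for position, index in enumerate(members):
            folds[position % k].append(index)
    all_indices = set(range(len(labels)))
    return [(sorted(all_indices - set(fold)), sorted(fold)) for fold in folds]

# Benchmark composition from the text: 109 positive and 119 negative samples.
labels = [1] * 109 + [0] * 119
splits = stratified_k_fold(labels, k=5)
for train_idx, test_idx in splits:
    positives = sum(labels[i] for i in test_idx)
    print(len(train_idx), len(test_idx), positives)
```

Each sample appears in exactly one test fold, so a classifier trained on the remaining four folds is always evaluated on sequences it has not seen, and the per-fold results are pooled or averaged to obtain the reported metrics.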
Then, we compared the performance of the SVM under three kernel functions (linear, quadratic polynomial, and radial basis function), as shown in Table 3. It can be seen from Table 3 that, with the best feature subset of monoTriKGap, the linear kernel gave better predictions than the other two kernels. With the linear kernel, the accuracy of immunoglobulin prediction reached 99.56%, the MCC reached 0.991, and the precision reached 0.996. The ACC value was 13.16% higher than that of the polynomial kernel and 25% higher than that of the radial basis function kernel. Therefore, this paper adopted the linear kernel as the kernel function of the support vector machine.
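For reference, the three kernel families compared here can be written down in a few lines. The exact parameterization used in the experiments is not given in the text, so the constant term `coef0 = 1.0` of the polynomial kernel and the width `gamma = 0.5` of the RBF kernel below are assumptions chosen only for illustration.

```python
import math

def linear_kernel(x, y):
    """K(x, y) = x . y"""
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """K(x, y) = (x . y + coef0) ** degree; degree 2 is the quadratic case."""
    return (linear_kernel(x, y) + coef0) ** degree

def rbf_kernel(x, y, gamma=0.5):
    """K(x, y) = exp(-gamma * ||x - y||^2)"""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

x, y = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(x, y), polynomial_kernel(x, y), rbf_kernel(x, y))
```

The linear kernel has no extra shape parameters, whereas the RBF kernel depends strongly on `gamma`, so its performance under default settings may differ from what a tuned value would give.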

Key Feature Analysis
This study introduced the use of MRMD1.0 and MRMD2.0 for two-dimensional key feature recognition. Key feature recognition and analysis were performed on the feature subsets extracted based on CC-PSSM and monoTriKGap, respectively.
First, MRMD1.0 was used to reduce the dimensionality of the 760 features extracted by CC-PSSM, and the first two features in the ranking, CC(P, D, 2) and CC(E, R, 1), were selected as the first group of two-dimensional key features. Second, MRMD2.0 was used to reduce the dimensionality of the original feature set with the number of generated features set to two; the two resulting features, CC(D, T, 2) and CC(H, C, 1), formed the second group of two-dimensional key features. Then, each feature of the first group was paired with each feature of the second group, generating four 2-dimensional feature combinations that differed from the first two groups. The combination with the largest ACC value, namely CC(E, R, 1) and CC(D, T, 2), was selected as the mixed two-dimensional key features of MRMD1.0 and MRMD2.0.
The same steps were performed on the 212 best features extracted by monoTriKGap. The first group of two-dimensional key features, D_DDD and H_HHD, was obtained through MRMD1.0. The second group, F_HHV and D_HHF, was obtained through MRMD2.0. By comparing the ACC values of the four mixed two-dimensional feature combinations, we obtained the mixed two-dimensional key features of MRMD1.0 and MRMD2.0, namely H_HHD and F_HHV.
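The construction of the mixed key features can be sketched as a cross-pairing of the two groups followed by a selection step. The helper names `mixed_pairs` and `select_best` are hypothetical, and the scoring function is deliberately left as a caller-supplied callback (for example, a cross-validated ACC computed elsewhere), so no accuracy values are assumed here.

```python
from itertools import product

def mixed_pairs(group_a, group_b):
    """Pair every feature of one group with every feature of the other."""
    return list(product(group_a, group_b))

def select_best(candidates, evaluate):
    """Return the candidate pair with the highest score from `evaluate`,
    a user-supplied function such as a cross-validated accuracy."""
    return max(candidates, key=evaluate)

# Key features reported in the text for CC-PSSM.
mrmd1_features = ["CC(P,D,2)", "CC(E,R,1)"]
mrmd2_features = ["CC(D,T,2)", "CC(H,C,1)"]

candidates = mixed_pairs(mrmd1_features, mrmd2_features)
for pair in candidates:
    print(pair)
```

Two features per group give 2 × 2 = 4 mixed combinations, none of which coincides with either original group; the pair CC(E, R, 1) and CC(D, T, 2) reported above is one of them.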
To analyse the ability of each group of key features to distinguish immunoglobulins, the classification performance of the three groups of key features of CC-PSSM and monoTriKGap was evaluated with the Naïve Bayes, SVM, and multilayer perceptron classifiers under 5-fold cross validation. The results of the two-dimensional key feature analysis are shown in Table 4.
The results in Table 4 show that, for the features extracted by CC-PSSM, the mixed two-dimensional key features of MRMD1.0 and MRMD2.0 have better classification performance than the two-dimensional key features obtained by either method alone. With the multilayer perceptron classifier, the ACC value of the mixed two-dimensional key features was 92.1053%, higher than the ACC values of the single groups of two-dimensional key features, which were 90.3509% and 84.6491%, respectively. In this case, the TPR was 0.921, the FPR was 0.081, the MCC was 0.842, and the area under the ROC curve was 0.934. Similarly, for the features extracted by monoTriKGap, the classification performance of the mixed two-dimensional key features was also better than that of a single group of two-dimensional key features. In this case, the mixed two-dimensional key features of MRMD1.0 and MRMD2.0 achieved an ACC value of 53.9474% with the Naïve Bayes classifier.
Most importantly, the mixed two-dimensional key features of the CC-PSSM feature extraction method performed better than those of monoTriKGap, which indicates that the CC-PSSM feature extraction method can better identify the key features that distinguish immunoglobulins. Figure 5 shows the scatter plot of the 2-dimensional mixed features recognized by CC-PSSM to distinguish immunoglobulins from non-immunoglobulins.

Compared with Other Classifiers
For a fair comparison, we further studied the performance of the other three classifiers on the same benchmark dataset, namely k-Nearest Neighbor (KNN) [74], C4.5 [75], and random forest (RF) [76,77]. The parameters of the classifier were set to default values. The basic idea of KNN is that there are always k most similar samples in the feature space. If most of the samples belong to a certain category, the sample also belongs to this category. Here, we set the value of k in our model to be 3. The C4.5 algorithm, as a classification decision tree algorithm [78] that uses the information gain rate to select node attributes, is pruned in the tree construction and the generated classification rules are easy to understand. Here we set the default confidence factor for pruning c = 0.25.
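The KNN decision rule with k = 3 can be sketched in a few lines. This is a generic illustration using Euclidean distance and a simple majority vote, both assumed here as the common default; the function name `knn_predict` and the toy points are hypothetical and unrelated to the benchmark data.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest
    training samples (Euclidean distance)."""
    neighbours = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]

# Toy two-dimensional feature vectors with two well-separated classes.
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_x, train_y, (0.2, 0.1)))
print(knn_predict(train_x, train_y, (5.5, 5.5)))
```

An odd k such as 3 avoids tied votes in a two-class problem like immunoglobulin versus non-immunoglobulin.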
The previous analysis showed that monoTriKGap had a better predictive ability for immunoglobulins under the SVM classifier. Next, we used the optimal subset of 212 features extracted by monoTriKGap for performance evaluation under KNN, C4.5, and RF. The comparison results are recorded in Table 5. The data in Table 5 further verify that the best feature subset generated by monoTriKGap, used with the SVM classifier, had high predictive performance and could accurately distinguish immunoglobulins from non-immunoglobulins. For the recognition of key features, the previous analysis showed that the mixed two-dimensional key features from CC-PSSM were best recognized by the multilayer perceptron classifier. Next, the mixed two-dimensional key features CC(E, R, 1) and CC(D, T, 2) were used to explore the classification performance under KNN, C4.5, and RF. The comparison results are recorded in Table 6. The data in Table 6 show that the classifier we originally used does perform better than the other methods.
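To make the notation CC(E, R, 1) concrete, the sketch below computes one cross-covariance feature from a PSSM. It assumes the commonly used CC-PSSM definition, in which the column of the first amino acid at position j is paired with the column of the second amino acid at position j + lag after subtracting each column mean. The column order in `AMINO_ACIDS`, the helper `toy_row`, and the toy matrix are assumptions made for illustration, not values from this study.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # assumed PSSM column order


def cc_feature(pssm, aa1, aa2, lag):
    """Cross-covariance between the PSSM columns of two amino acids.

    `pssm` is a list of rows, one per sequence position, each with 20 scores.
    """
    i1, i2 = AMINO_ACIDS.index(aa1), AMINO_ACIDS.index(aa2)
    length = len(pssm)
    if not 0 < lag < length:
        raise ValueError("lag must be between 1 and len(pssm) - 1")
    mean1 = sum(row[i1] for row in pssm) / length
    mean2 = sum(row[i2] for row in pssm) / length
    total = sum(
        (pssm[j][i1] - mean1) * (pssm[j + lag][i2] - mean2)
        for j in range(length - lag)
    )
    return total / (length - lag)


def toy_row(scores):
    """Build a 20-column row that is zero except for the given amino acids."""
    row = [0.0] * len(AMINO_ACIDS)
    for aa, value in scores.items():
        row[AMINO_ACIDS.index(aa)] = value
    return row


# Toy 4-position PSSM, invented for illustration only.
toy_pssm = [toy_row({"E": e, "R": r}) for e, r in [(1, 0), (2, 1), (3, 2), (4, 3)]]
print(cc_feature(toy_pssm, "E", "R", 1))

# Ordered pairs of distinct amino acids at lags 1 and 2.
n_cc_features = len(AMINO_ACIDS) * (len(AMINO_ACIDS) - 1) * 2
print(n_cc_features)  # 760
```

Under this definition, 20 × 19 ordered pairs of distinct amino acids at lags 1 and 2 give 760 features, which is consistent with the CC-PSSM feature count used in this paper.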

Independent Test Set Evaluation
To evaluate the generalization ability of the monoTriKGap and CC-PSSM models, we conducted two independent tests. Each test had two tasks: one was to evaluate how well the optimal monoTriKGap feature subset generalizes for the accurate prediction of immunoglobulins under SVM, and the other was to evaluate how well CC-PSSM generalizes for the recognition of key immunoglobulin features under the multilayer perceptron (MLP).
In the first group, 112 sequences from human and rat data were selected as the training set, and 116 sequences from mouse data (33 immunoglobulins and 83 non-immunoglobulins) were selected as the test set. The second group also used 112 human and rat sequences from the benchmark dataset as the training set, and 33 immunoglobulin sequences and 33 non-immunoglobulin sequences from mouse formed the test set. Details are shown in Table 7. For each group, the optimal subset of 212 monoTriKGap features was used to train different classifiers, and the ACC values on the test set are compared in Figure 6. Figure 6a shows that, on the first test set, SVM accurately predicted 87.93% of immunoglobulins, which is higher than the accuracies of C4.5, KNN, and RF (72.14%, 77.58%, and 60.34%, respectively). Figure 6b shows that, on the second test set, SVM accurately predicted 87.88% of immunoglobulins, which is higher than the accuracies of C4.5, KNN, and RF (77.27%, 69.69%, and 78.78%, respectively). Therefore, the two sets of data show that monoTriKGap does have a good generalization ability for the accurate prediction of immunoglobulins.
In addition, for each group, the 760 features extracted by CC-PSSM were reduced in dimension by MRMD1.0 and MRMD2.0, and classifiers were trained on the identified mixed two-dimensional key features CC(E, R, 1) and CC(D, T, 2). The ACC values on the test set are compared in Figure 7. Figure 7a shows that, on the first test set, the multilayer perceptron classifier correctly predicted 91.07% of immunoglobulins, which is 0.9% higher than the C4.5 algorithm, 7.45% higher than the KNN algorithm, and 13.49% higher than the RF algorithm. Figure 7b shows that, on the second test set, the multilayer perceptron classifier correctly predicted 90.90% of immunoglobulins, which is 3.03% higher than the C4.5 algorithm, 10.6% higher than the KNN algorithm, and 13.36% higher than the RF algorithm. Taken together, the two groups of data indicate that CC-PSSM has a good generalization ability for key feature recognition. However, to further confirm the prediction and recognition ability of the model for immunoglobulins, our future work will extend the dataset.
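The construction of the two independent test sets described above can be sketched as a species hold-out split followed by optional class balancing. This is an illustration under stated assumptions: the record layout, the function names, and the use of random downsampling to obtain 33 non-immunoglobulins are hypothetical, since the text does not say how the 33 non-immunoglobulin mouse sequences were chosen. The placeholder records mirror only the counts given in the text, and the class labels of the human and rat records are arbitrary.

```python
import random


def species_holdout(records, test_species):
    """Split records so that no test species appears in the training set."""
    train = [r for r in records if r["species"] not in test_species]
    test = [r for r in records if r["species"] in test_species]
    return train, test


def balance_classes(records, seed=0):
    """Randomly downsample the larger class so both classes have equal size."""
    positives = [r for r in records if r["label"] == 1]
    negatives = [r for r in records if r["label"] == 0]
    n = min(len(positives), len(negatives))
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    return rng.sample(positives, n) + rng.sample(negatives, n)


# Placeholder records: only the counts come from the text; the labels of the
# human/rat records are arbitrary (label 1 = immunoglobulin).
records = (
    [{"species": "human_or_rat", "label": i % 2} for i in range(112)]
    + [{"species": "mouse", "label": 1} for _ in range(33)]
    + [{"species": "mouse", "label": 0} for _ in range(83)]
)

train, test_full = species_holdout(records, {"mouse"})
test_balanced = balance_classes(test_full)
print(len(train), len(test_full), len(test_balanced))  # 112 116 66
```

Holding out an entire species means the test sequences come from an organism the classifier never saw during training, which is a stricter check than a random split of the pooled data.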

Conclusions
Protein prediction consists of two main steps: selecting the feature representation method, and reducing the feature dimension to identify key features. As a significant component of the immune system, immunoglobulin is closely related to various diseases, and its accurate prediction can benefit drug development and disease treatment. This research focuses on the accurate prediction of immunoglobulin and the recognition of key features. By comparing the CC-PSSM and monoTriKGap feature representation methods, we found that the best feature set generated by monoTriKGap through the AdaBoost classification model can accurately predict 99.5614% of immunoglobulins under the SVM classifier. For the identification of key features, unlike previous work, we used both MRMD1.0 and MRMD2.0 for key feature screening and considered mixed two-dimensional key features. The results show that the mixed two-dimensional key features identified from the CC-PSSM features can distinguish 92.1053% of immunoglobulins under the multilayer perceptron classifier. Therefore, the method used in this article can serve as a powerful means of studying immunoglobulins. In future work, we will collect more data to expand the dataset and further verify the effectiveness of the model. To improve the performance of the SVM algorithm, some of its important parameters (such as the penalty coefficient C) will be optimized. In addition, to avoid overfitting, we will consider adding related regularization tests.
