iAcety–SmRF: Identification of Acetylation Protein by Using Statistical Moments and Random Forest

Acetylation is one of the most important post-translational modifications (PTMs) in eukaryotes; it affects proteins at many levels by transferring an acetyl group from acetyl coenzyme A to a specific site on a polypeptide chain. Acetylation sites play many important roles, including regulating membrane protein functions, strongly affecting the membrane interaction of proteins, and influencing membrane remodeling. Because of these properties, their correct identification is essential for understanding the acetylation mechanism in biological systems. Traditional methods, such as mass spectrometry and site-directed mutagenesis, are used for this purpose, but they are tedious and time-consuming. To overcome such limitations, many computational models have been developed to distinguish acetylated from non-acetylated sequences, but they have poor efficiency in terms of accuracy, sensitivity, and specificity. This work proposes an efficient and accurate computational model for predicting acetylation using machine learning approaches. The proposed model achieved an accuracy of 100% in the 10-fold cross-validation test based on the Random Forest classifier, along with a feature extraction approach using statistical moments. The model was also validated by the jackknife, self-consistency, and independent tests, which achieved accuracies of 100%, 100%, and 97%, respectively, results far better than those of the models already available in the literature.


Introduction
Proteins are the basic and key components of the human body that perform many kinds of major functions inside and outside a cell. Proteins are translated, or synthesized, from messenger RNA, which is first decoded by ribosomes to produce a chain of amino acids, or polypeptide. After the translation process, certain amino acids can undergo chemical changes at the protein's C-terminus or N-terminus or in amino acid side chains, known as post-translational modifications (PTMs). A PTM can modify an existing functional group of a protein or introduce a new one, as in acetylation; an example of acetyllysine is shown in Figure 1 [1]. It plays a key role in making protein products [2][3][4]. Each protein in the proteome may be altered either before or after it is translated. The charge state, hydrophobicity, conformation, and stability of a protein are all affected by such changes, which, in turn, influence its function. Protein modification serves a variety of functions in different organs: (1) it ensures the fast and complex response of cells by regulating intra-cellular communication and the division and growth of cells; (2) it is also pivotal for various physiological and pathological mechanisms.
Protein acetylation can be achieved using a variety of methods; it adds an acetyl functional group to a chemical compound, yielding an ester, the acetate.
Lysine residues, in particular, are usually acetylated [1][2][3][4]. The active substance acetic anhydride is commonly used as an acetylating agent that reacts with free hydroxyl groups. Acetylation also plays a prominent role in numerous important cellular processes, such as the stability and localization of proteins [4,5]. In addition, modification of S/T/Y sites by acetylation, glycosylation, sulfation, and nitration has been reported [5,6]. Moreover, acetylation plays a role in the modulation of gene expression through histone alteration and has a very significant function in controlling cellular metabolism and protein folding [15][16][17][18].
From the above discussion, it is evident that acetylation is an important post-translational modification and that its sites must be identified correctly; however, understanding the functions and regulation of the molecular acetylation mechanism remains a major challenge. Many traditional approaches are in use for such identification, including high-throughput mass spectrometry (MS) [19,20]. However, since the acetylation mechanism is complicated, rapid, and reversible, such methods remain time-consuming, expensive, and laborious [21,22].
Reference [1] proposed a novel prediction method, iAcet-PseFDA, for acetylation classification. Further, acetylation sites also play an important role in regulating membrane protein functions of multiple families, as documented in Reference [5]. It is supported by examples that acetylation is significantly enhanced in membrane-binding regions, where it is often located directly in critical membrane-binding pockets ideally positioned to modulate membrane interactions. Moreover, it was found that acetylation and acetylation mimetics strongly affected the membrane interaction of proteins, resulting in decreased membrane affinity and, in the case of amphiphysin and EHD2, altered membrane remodeling [6]. In cells, mimicking even a single acetylation event within the membrane interaction region reduced the binding affinity to membranes, resulting in cytoplasmic dispersal. In another report, acetylation affected both membrane interaction and membrane remodeling [7]. Similarly, the ability to control the membrane-binding activity of C2 domains via acetylation could allow the cell to further regulate Ca-dependent transmembrane transport and signaling events. It has also been documented that acetylation is present on the membrane-binding surface of the phosphatase domain at K163/K164, as Alanine mutations at these sites appear to reduce membrane binding [8]. In addition, two reports on proteins with PH domains indicate that acetylation has opposite effects on membrane localization in cells (either increased or decreased) [9,10]. Much has been learned about the acetyltransferases and deacetylases that regulate protein-DNA and protein-protein interactions [2,5,11]. Some of these enzymes may also be involved in controlling protein membrane interactions.

Materials and Methods
In this work, we follow the 5-step procedure mentioned in References [16,41] to establish a predictor for biological sequences. It consists of the following steps: (1) choosing or generating an appropriate benchmark dataset to be used for training and testing; (2) transforming a biological sequence into an equivalent mathematical form that preserves the required correlation with the biological sequence; (3) developing or using an existing classifier for the required predictor; (4) using cross-validation experiments to determine the accuracy of the suggested predictor; and, finally, (5) developing a publicly accessible website/GitHub resource for the benefit of future research and development. All the above steps are presented in Figure 2. Further details on the above can be found in the following subsections.

Benchmark Dataset
We begin with the collection of a valid benchmark dataset for training and testing, which is the first step in the 5-step rule [1]. The dataset was collected from the well-known data repository UniProt (http://www.uniprot.org, retrieved 8 December 2021). It contains 2900 protein samples, of which 725 are positive, denoted by S_posi, and 2175 are negative, denoted by S_negt. Further, for the effectiveness of the proposed predictor, the set of negative samples is divided equally into three disjoint sets, S1, S2, and S3, as in Reference [1], such that S1 ∩ S2 = S1 ∩ S3 = S2 ∩ S3 = ∅, where the sign ∩ represents the intersection of sets. Furthermore, we individually combined each of these three negative datasets with S_posi and created three new datasets with the same number of positive and negative samples; as expressed in Equation (1), each new dataset is the union S_posi ∪ S_i (i = 1, 2, 3), where the symbol ∪ denotes the union of two sets. It is important to note that the dataset under discussion was used for prediction in [1], where the positive dataset was collected in three steps: (1) candidate acetylated proteins were identified by a set of fixed keywords, i.e., {N-acetylcysteine, N-acetylserine, N-acetylglutamate, N-acetylglycine, N-acetylcholine, N-acetylthreonine, N-acetylmethionine, N-acetylmethionyl, N-acetyltyrosine, O-acetylserine, N6-acetyllysine, or O-acetylthreonine}; (2) the protein collection was validated using an assertion technique; and (3) proteins with fewer than 30 amino acids, along with redundant proteins, were removed, as discussed in [1].
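The splitting-and-pairing scheme above can be sketched in a few lines of Python. This is an illustration only: the shuffling and seeding details are our assumptions, while the benchmark itself was constructed as described in Reference [1].

```python
import random

def make_balanced_datasets(pos, neg, seed=0):
    """Split the negatives into three equal, disjoint subsets
    (S1, S2, S3 with pairwise empty intersections) and pair each
    with the full positive set, mirroring Equation (1)."""
    rng = random.Random(seed)
    neg = neg[:]            # avoid mutating the caller's list
    rng.shuffle(neg)
    third = len(neg) // 3
    s1, s2, s3 = neg[:third], neg[third:2 * third], neg[2 * third:3 * third]
    # Each benchmark dataset is the union of the positives with one negative subset.
    return [pos + s for s in (s1, s2, s3)]

# Example with 725 positives and 2175 negatives, as in the benchmark dataset:
pos = [f"P{i}" for i in range(725)]
neg = [f"N{i}" for i in range(2175)]
d1, d2, d3 = make_balanced_datasets(pos, neg)
print(len(d1), len(d2), len(d3))   # each 1450: 725 positive + 725 negative
```

Since 2175 / 3 = 725, each of the three resulting datasets is exactly balanced.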

The negative data were generated in a similar way to the positive data, except that the selected proteins were not returned by the above keyword search. As this produced a large number of negative samples, a random selection was then made to balance their number with the positive samples.

Feature Extraction
The existing traditional classifiers, such as SVM, KNN, ANN, and many others, cannot operate directly on biological sequence data to make the required prediction. Therefore, a medium is needed to convert biological data into a suitable numerical form for such classifiers. For this purpose, many models have been developed to extract the required characteristics from biological data, e.g., PseAAC, AAC, Pse-in-One, Pse-Analysis, and many more [27][28][29]. In feature extraction, the emphasis is on preserving the critical properties of the protein, its location, and its functions. Statistical moments [42] are used to derive the features in this study, as discussed in detail below.

Statistical Moments
In statistics, certain summaries of a probability distribution remain beneficial when analyzing a particular sequence. The study of such configurations of collected data in pattern matching is known as moments [25]. Moments are useful in various pattern-recognition problems related to feature development that do not depend on the provided pattern or sequence parameters [27,[29][30][31]43]. Particular moments are used to quantify data size, data alignment, and data eccentricity. In this study, we extract the necessary features of acetylation proteins using raw, central, and Hahn moments. The raw moments estimate the probability distribution through the mean, variance, and asymmetry; they are neither location invariant nor scale invariant [32]. The central moments are computed in the same way, but the calculation is based on the data centroid; they are location invariant but scale variant. The Hahn moments, on the other hand, are based on Hahn polynomials and are neither scale invariant nor location invariant [33,34,44]. These moments are very important for extracting obscure features from protein sequences, as they capture complex ordering details of biotic sequences. In the proposed work, a linearly ordered structural representation of a protein sequence is used, as given in Equation (2).
Equation (2) writes the protein as P = R1 R2 R3 . . . RL, where R1 is the 1st amino acid of protein P, RL is the last, and the total length is 'L'. The information of the protein's linear structure given in Equation (2) is transformed into a 2D square matrix of dimension k, computed in Equation (3) as k = ⌈√L⌉, where L is the protein sequence length and k is the dimension of the obtained 2D square matrix. Hence, Equation (4) constructs the matrix, denoted N, by arranging the sequence elements row-wise in the order obtained from Equation (3). The raw moment of order (a + b) is computed from the values of N′, a discrete 2D function, as shown in Equation (5): R_ab = Σ_i Σ_j i^a j^b N′_ij. The raw moments are calculated up to order 3 and are measured using the data origin as the reference point [45][46][47][48]. R00, R01, R10, R11, R02, R20, R21, R30, and R03 are the raw-moment features, weighed up to the 3rd order.
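A minimal sketch of the reshaping and raw-moment computation described above. The integer residue encoding and the 0-based grid indices are our own illustrative choices; the paper's actual mapping is defined by Equations (3)-(5).

```python
import math

AMINO = "ACDEFGHIKLMNPQRSTVWY"
# Hypothetical residue encoding for illustration; the paper's actual
# numerical mapping may differ.
CODE = {aa: i + 1 for i, aa in enumerate(AMINO)}

def to_square_matrix(seq):
    """Reshape the encoded sequence row-wise into a k x k matrix with
    k = ceil(sqrt(L)), zero-padding the tail (cf. Equations (2)-(4))."""
    k = math.ceil(math.sqrt(len(seq)))
    vals = [CODE[a] for a in seq] + [0] * (k * k - len(seq))
    return [vals[r * k:(r + 1) * k] for r in range(k)]

def raw_moments(N, max_order=3):
    """Raw moments R_ab = sum_i sum_j i^a j^b N[i][j] for a + b <= max_order
    (Equation (5)); 0-based grid indices are an implementation choice."""
    k = len(N)
    return {(a, b): sum(i**a * j**b * N[i][j]
                        for i in range(k) for j in range(k))
            for a in range(max_order + 1)
            for b in range(max_order + 1) if a + b <= max_order}

N = to_square_matrix("MKVLAAGICGT")   # L = 11, so k = 4
R = raw_moments(N)
print(R[(0, 0)])                      # R00: the sum of all matrix entries
```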
The centroid is the point from which all data points are equivalently dispersed in all directions in a weighted-average sense [45,[48][49][50]. Equation (6), which uses the centroid, calculates the special characteristics of the central moments up to order 3. The unique features up to the 3rd order are C00, C01, C10, C11, C02, C20, C12, C21, C30, and C03. Further, the centroids, denoted p and q, are calculated as given by Equations (7) and (8).
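The central-moment computation can be sketched as follows. The formula C_ab = Σ_i Σ_j (i − x̄)^a (j − ȳ)^b N[i][j], with (x̄, ȳ) the weighted centroid, is our reconstruction of Equations (6)-(8); the 0-based indices are an implementation choice.

```python
def central_moments(N, max_order=3):
    """Central moments C_ab about the data centroid (cf. Equations (6)-(8));
    translating the data does not change these values (location invariance)."""
    k = len(N)
    total = sum(N[i][j] for i in range(k) for j in range(k))
    # Centroid coordinates, the weighted averages of the grid indices.
    xbar = sum(i * N[i][j] for i in range(k) for j in range(k)) / total
    ybar = sum(j * N[i][j] for i in range(k) for j in range(k)) / total
    return {(a, b): sum((i - xbar)**a * (j - ybar)**b * N[i][j]
                        for i in range(k) for j in range(k))
            for a in range(max_order + 1)
            for b in range(max_order + 1) if a + b <= max_order}

N = [[1, 2], [3, 4]]
C = central_moments(N)
print(C[(0, 0)])              # equals the total mass of the matrix
print(C[(1, 0)], C[(0, 1)])   # first-order central moments vanish at the centroid
```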
The Hahn moments must be converted from 1D notation to a square matrix before they can be calculated: discrete Hahn moments, also known as 2D moments, necessitate square-matrix input data in a 2D structure [51]. Since these moments are orthogonal and possess inverse properties, the original data can be reconstructed using the inverse discrete Hahn moment; consequently, the positional and compositional features are largely preserved in the measured moments [25,[32][33][34]44,52]. Two-dimensional input data in the form N′ is used to calculate the orthogonal Hahn moments, as seen in Equation (9), where 'p' and 'q' (p > −1, q > −1) are adjustable parameters controlling the shape of the polynomials. The Pochhammer symbol is defined by Equation (10) and is further simplified by the Gamma operator in Equation (11). The raw values of the Hahn moments are usually scaled using a square-norm and weighting formula, as seen in Equation (12).
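The Pochhammer symbol, the building block of the Hahn polynomials, can be illustrated directly; the product form and its Gamma-function simplification are standard identities and agree numerically.

```python
import math

def pochhammer(a, k):
    """Pochhammer symbol (rising factorial) of Equation (10):
    (a)_k = a * (a + 1) * ... * (a + k - 1), with (a)_0 = 1."""
    result = 1.0
    for i in range(k):
        result *= a + i
    return result

# Equation (11) simplifies the same quantity with the Gamma operator:
# (a)_k = Gamma(a + k) / Gamma(a), which agrees with the product form.
assert math.isclose(pochhammer(2.5, 3), math.gamma(2.5 + 3) / math.gamma(2.5))
print(pochhammer(3, 4))   # 3 * 4 * 5 * 6 = 360.0
```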

Membranes 2022, 11, x FOR PEER REVIEW
Meanwhile, as given by Equations (13) and (14), the Hahn moments are computed for the discrete 2D data up to the 3rd order. The special features based on the Hahn moments are represented by H00, H01, H10, H11, H02, H20, H12, H21, H30, and H03. For each protein sequence, we thus produced 10 central, 10 raw, and 10 Hahn moments up to the third order and added them at random to the miscellaneous Super Feature Vector (SFV).

Position Relative Incident Matrix (PRIM)
The amino acids' order and location in a protein sequence have crucial importance for the recognition of protein characteristics [47,50,53]. In any protein sequence, the relative position of an amino acid remains an essential pattern for understanding its physical properties. The Position Relative Incident Matrix (PRIM) uses a square matrix of order 20 to depict the relative locations of amino acids in protein sequences, as expressed by Equation (15). Here, O i→j represents the position of the jth amino acid relative to the first occurrence of the ith amino acid in the chain, and the accumulated score reflects the biological evolutionary information carried by amino acids of type 'j'. The matrix N PRIM has 400 coefficients based on the relative positions of amino acid occurrences.
Ten central moments, 10 raw moments, and 10 Hahn moments are calculated from the 2D N PRIM matrix, and these 30 additional special features are randomly appended to the miscellaneous SFV.
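A sketch of the PRIM construction. The exact accumulation rule of Equation (15) is not fully reproduced in the text, so the rule below (summing each residue's position relative to the first occurrence of every residue type) is our reading and should be treated as an assumption.

```python
AMINO = "ACDEFGHIKLMNPQRSTVWY"
IDX = {aa: i for i, aa in enumerate(AMINO)}

def prim(seq):
    """20 x 20 Position Relative Incident Matrix (cf. Equation (15)).
    Entry [i][j] accumulates the positions of residue type j relative to the
    first occurrence of residue type i; this accumulation rule is our reading
    of the text. R-PRIM (Equation (16)) is the same computation applied to
    the reversed sequence."""
    M = [[0] * 20 for _ in range(20)]
    first = {}
    for pos, aa in enumerate(seq, start=1):
        first.setdefault(aa, pos)        # 1-based first occurrence per type
    for i_aa, i0 in first.items():
        for pos, j_aa in enumerate(seq, start=1):
            M[IDX[i_aa]][IDX[j_aa]] += pos - i0
    return M

M = prim("MKVLM")
print(len(M), len(M[0]))   # 20 20 -> 400 coefficients in total
```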

Reverse Position Relative Incident Matrix (R-PRIM)
There are several instances in cell biology where biochemical sequences are homologous in origin. This normally occurs when several sequences evolve from a single ancestor. In such situations, these homologous sequences have a significant impact on the classifier's output. To obtain correct results, effective and efficient sequence-similarity searching is carried out. In machine learning, efficient and accurate feature-extraction algorithms are urgently needed so that the most relevant features are extracted precisely from biological data [43,47,50,53].
The computations used in R-PRIM and PRIM are the same, except that R-PRIM operates on the reverse ordering of the protein sequence. The R-PRIM computations reveal hidden trends in the data and remove ambiguities between homologous sequences. R-PRIM is a 20 × 20 matrix containing 400 coefficients, as seen in Equation (16). The N R-PRIM 2D matrix is used to measure 10 raw, 10 central, and 10 Hahn moments up to the 3rd order, and these 30 special features are randomly appended to the SFV.

Frequency Distribution Vector (FDV)
A frequency distribution vector (FDV) can be generated by using the distribution rate of each amino acid in a protein chain, as expressed in Equation (17).
Here, a i is the frequency of occurrence of the ith (1 ≤ i ≤ 20) amino acid in the protein chain. These 20 additional special features are randomly added to the SFV's miscellany.
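The FDV computation is a simple counting pass. The sketch below returns raw occurrence counts in a fixed alphabet order; whether Equation (17) normalizes the counts by sequence length is not specified in the text, so we leave them unnormalized.

```python
from collections import Counter

AMINO = "ACDEFGHIKLMNPQRSTVWY"

def fdv(seq):
    """Frequency Distribution Vector (cf. Equation (17)): the occurrence
    count a_i of each of the 20 native amino acids, in fixed alphabet order."""
    counts = Counter(seq)
    return [counts.get(aa, 0) for aa in AMINO]

v = fdv("MKVLAAGICGT")
print(len(v), sum(v))   # 20 features; the counts sum to the sequence length
```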

AAPIV (Accumulative Absolute Position Incidence Vector)
The AAPIV is used to retrieve relevant amino acid positional information; it retrieves and stores positional information for the 20 native amino acids in a protein sequence [50,53]. This creates 20 critical features associated with each amino acid in a sequence, as expressed by Equation (18): AAPIV = {µ 1 , µ 2 , µ 3 , . . . , µ 20 }, where each µ i is expressed by Equation (19). These 20 new features were added to the SFV at random.
Each µ i is the accumulated positional count over the n occurrences of the ith amino acid in the protein sequence, as determined by Equation (19).

R-AAPIV (Reverse Accumulative Absolute Position Incidence Vector)
R-AAPIV extracts and stores positional information for the 20 native amino acids in a protein sequence using reverse sequence ordering, i.e., in reverse order relative to AAPIV [50,53].
This creates 20 critical features, one per amino acid, as expressed by Equation (20), where µ i is expressed by Equation (21).
Here, R 1 , R 2 , R 3 , . . . , R n are the ordinal locations at which the residues of the protein sequence occur in the reversed sequence. The value of an arbitrary element µ i is given by Equation (21).
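A minimal sketch of the AAPIV and R-AAPIV computations described above, assuming µ i is the sum of the 1-based positions at which the ith amino acid occurs (the names and this reading of the equations are illustrative):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aapiv(sequence):
    """Accumulative absolute position incidence vector: for each native amino
    acid, the sum of the 1-based positions at which it occurs."""
    vec = [0] * 20
    for pos, residue in enumerate(sequence, start=1):
        idx = AMINO_ACIDS.find(residue)
        if idx >= 0:
            vec[idx] += pos
    return vec

def r_aapiv(sequence):
    """The same computation on the reversed sequence ordering."""
    return aapiv(sequence[::-1])

# In "MKKA", 'K' occurs at positions 2 and 3, so its AAPIV entry is 5.
```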

Machine Learning Classifiers
The Random Forest and the Probabilistic Neural Network are used as training models to predict acetylation and non-acetylation sites in this research. The following sections describe these classification algorithms in greater depth.

Probabilistic Neural Network
In this paper, we used a Probabilistic Neural Network (PNN) as a classifier. The PNN is a powerful algorithm that is mostly employed for classification problems. It was first introduced by Specht in 1990 [54]. It is a feedforward neural network whose operation is related to Kernel Fisher Discriminant Analysis. The PNN uses probability density functions and generates better prediction results than many other neural network algorithms [55,56].
The PNN operates on four layers: (1) the input layer, (2) the pattern (hidden) layer, (3) the summation layer, and (4) the output layer [57]. The input layer accepts the training samples; the data are then forwarded to the pattern layer, where each node is a computational unit with weighted input connections. The summation layer receives the probability results for each class [58]. The output layer makes the decision and assigns the respective class label to the unknown sample. A schematic view of the PNN is shown in Figure 3.
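The four-layer flow described above can be sketched as a minimal Gaussian-kernel PNN; the smoothing parameter sigma, the kernel form, and all names are illustrative assumptions, not details from the paper.

```python
import numpy as np

def pnn_predict(X_train, y_train, x, sigma=1.0):
    """Assign x the class whose Gaussian kernel density response is largest."""
    scores = {}
    for label in np.unique(y_train):
        members = X_train[y_train == label]           # pattern layer: one node per training sample
        d2 = np.sum((members - x) ** 2, axis=1)       # squared distance of x to each pattern node
        scores[label] = np.mean(np.exp(-d2 / (2.0 * sigma ** 2)))  # summation layer
    return max(scores, key=scores.get)                # output layer: winning class

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
y = np.array([0, 0, 1, 1])
label = pnn_predict(X, y, np.array([4.9, 5.2]))  # closest to the class-1 cluster
```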

Random Forest
In this paper, we used Random Forest (RF) as a classifier [59]. RF is a widely used ensemble learning method for classification and regression problems [60][61][62]. As its name implies, an RF contains a large number of individual decision trees that operate as an ensemble. Each individual tree in the Random Forest is built from its own random split of the dataset into training and testing subsets: the training subset is used for training, and the testing subset for assessing the trained tree's predictions. During classification, the class label is assigned to the testing sample by the majority vote of all trees. Because of this majority-voting principle, the variation or bias of a single tree has little effect on the overall prediction accuracy. RF also implements a weighting scheme in which a tree that produces a high error rate receives a low weight. By building multiple decision trees and merging them, RF obtains a more accurate and stable prediction. RF has nearly the same hyperparameters as a decision tree or a bagging classifier, but adds additional randomness to the model while growing the trees [59]. The working of the Random Forest algorithm is illustrated by the following procedure.
The first step starts with the selection of random samples from a given dataset; in the second step, the algorithm constructs a decision tree for every sample; and, in the last step, voting is performed over every predicted result [63].
We tested the proposed model's performance using different numbers of trees and found the best results with 500 trees. RF offers several advantages, including its optimal accuracy, its efficiency on large datasets, and its detection of outliers. Many researchers have used RF for solving biological problems and achieved better performance. The working mechanism of RF is depicted in Figure 4.
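A hedged sketch of this setup using scikit-learn's RandomForestClassifier with the 500 trees reported above as best; the synthetic data and labeling rule are purely illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # a simple separable labeling rule

# 500 trees, as reported above; each tree votes and the majority wins.
clf = RandomForestClassifier(n_estimators=500, random_state=0)
clf.fit(X, y)
train_acc = clf.score(X, y)
```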

Performance Evaluation Parameters
The performance of the proposed model can be measured by effective evaluation metrics. We use the following four metrics to measure the prediction quality: (1) Overall Accuracy (ACC), (2) Sensitivity (Sn), (3) Specificity (Sp), and (4) Matthews Correlation Coefficient (MCC). These remain the most common metrics used to measure the efficiency of a proposed model. We computed these parameters from a binary confusion matrix using Equation (22):

ACC = (TP + TN) / (TP + TN + FP + FN)
Sn = TP / (TP + FN)
Sp = TN / (TN + FP)
MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))   (22)

where TP is True Positive, FN is False Negative, TN is True Negative, and FP is False Positive, respectively.
As per the confusion matrix shown in Table 1, we subsequently calculate the following:

Table 1. Confusion matrix.

Status                        Predicted Patient (1)    Predicted Healthy Person (0)
Actual patient (1)            TP                       FN
Actual healthy person (0)     FP                       TN

TP: True Positive, where an acetylation protein is correctly classified as acetylation.
TN: True Negative, where a non-acetylation protein is correctly classified as non-acetylation.
FP: False Positive, where a non-acetylation protein is incorrectly classified as acetylation, known as a "type 1 error".
FN: False Negative, where an acetylation protein is incorrectly classified as non-acetylation; this is the "type 2 error".
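The four metrics can be computed directly from the confusion-matrix counts; this sketch uses our own variable names.

```python
import math

def metrics(tp, tn, fp, fn):
    acc = (tp + tn) / (tp + tn + fp + fn)            # overall accuracy
    sn = tp / (tp + fn)                              # sensitivity
    sp = tn / (tn + fp)                              # specificity
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, sn, sp, mcc

acc, sn, sp, mcc = metrics(tp=50, tn=40, fp=10, fn=0)
```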
ROC and AUC: Receiver operating characteristic (ROC) curves evaluate the predictive ability of the machine learning classifiers at various threshold settings. The ROC curve is a graphical representation that plots the "true positive rate" against the "false positive rate" over the classification results of the machine learning algorithm. The AUC summarizes a classifier's ROC curve: an AUC above 0.5 suggests discrimination between the positive and negative classes, whereas a value of 0.5 indicates no discrimination; the higher the AUC, the better the classifier's performance.

Testing Methods
Several cross-validation techniques have been used in the literature to examine a statistical predictor's results. The jackknife test, the independent dataset test, and the k-fold cross-validation test are the three most commonly used by researchers.
To estimate the expected accuracy of the forecaster, we use the following cross-validation methods in this paper: self-consistency, independent, k-fold cross-validation, and jackknife testing.
The following sub-sections contain the details.

K-Fold (10-Fold) Cross-Validation Test
The k-fold cross-validation (KFCV) test is a technique for estimating predictive models by partitioning the original dataset A into k disjoint folds {A 1 , A 2 , A 3 , . . . , A k }: the folds A − A i are used for training, and the remaining ith fold A i for testing, where i = 1, 2, 3, . . . , k, as shown in Figure 5. The method iterates this process for each i and calculates the performance, that is, the accuracy, sensitivity, precision, recall, F-measure, and MCC. The overall result is the average over the iterations performed for each fold. This technique has many benefits: it validates the model on multiple splits of the data, reducing bias and giving a stable estimate of how the model performs, which makes it more robust than many other validation techniques. In the literature, the k-fold method is most popular for k = 10 and k = 5. In this research, a 10-fold cross-validation is used: the overall result obtained is 100% with the Random Forest classifier, as presented in Table 2 and through the ROC curves shown in Figure 6.
Using 10-fold cross-validation on the three datasets with the Random Forest classifier, the best result is achieved, with an ACC of 100% and MCC, Sn, Precision, and F-measure all equal to 1.
We also used the 10-fold cross-validation test to evaluate the PNN-based model on the three datasets and obtained an ACC of 66.83, MCC of 0.36, Sn of 0.72, Precision of 0.65, and F-measure of 0.72, as presented in Table 3 and by the ROC curve in Figure 7.
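The fold-partitioning procedure described above can be sketched in pure Python; an interleaved assignment of samples to folds is one simple choice, and evaluate is a stand-in for fitting and scoring an actual classifier.

```python
def k_fold_indices(n_samples, k=10):
    """Partition sample indices into k disjoint folds; yield (train, test)
    index lists where the ith fold is held out for testing."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

def cross_validate(n_samples, evaluate, k=10):
    """Average the per-fold score returned by evaluate(train, test)."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)
```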

Jackknife Test
Several cross-validation tests are widely used to estimate the performance of statistical predictors. Among these, the jackknife test is considered the most consistent and reliable, and it is therefore applied extensively by researchers to estimate the performance of a predictor model. In this test, if the dataset has N records, the model is trained on N − 1 records and tested on the remaining record, which is why it is also called leave-one-out cross-validation. This process is repeated N times, so that the label of each record is predicted, and the results are accumulated into an overall prediction reported as accuracy, sensitivity, precision, recall, F-measure, and MCC.
In the present research work, the jackknife test is used to measure the performance of the models based on the Random Forest and PNN classifiers; we achieved a result of 100% with RF but 66.87 with PNN, as presented in Tables 4 and 5 and by the ROC curves in Figures 8 and 9.
In this evaluation process, three datasets were used, giving the overall accuracy, sensitivity, specificity, precision, MCC, recall, and F-measure detailed in Table 4.
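The leave-one-out procedure can be sketched as follows; predict is a stand-in for training a classifier on the N − 1 records and labeling the held-out one, and the toy 1-nearest-neighbour model is purely illustrative.

```python
def jackknife(records, labels, predict):
    """Leave-one-out: train on the other N-1 records, predict the held-out
    record, and accumulate the fraction predicted correctly."""
    correct = 0
    for i in range(len(records)):
        train_X = records[:i] + records[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_X, train_y, records[i]) == labels[i]:
            correct += 1
    return correct / len(records)

def nearest_label(train_X, train_y, x):
    """A toy 1-nearest-neighbour stand-in for the trained classifier."""
    dists = [abs(v - x) for v in train_X]
    return train_y[dists.index(min(dists))]

acc = jackknife([0.0, 0.1, 5.0, 5.1], [0, 0, 1, 1], nearest_label)
```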

Self-Consistency Test
The self-consistency test is referred to as the ultimate test for validating the efficiency and efficacy of a prediction model; this method uses the same data for both training and testing. By conducting self-consistency testing, the results for acetylation protein prediction based on the Random Forest classifier are presented in Table 6, and the ROC curve is shown in Figure 10.
Similarly, the evaluation results based on the PNN for the three datasets, S1, S2, and S3, are presented in Table 7, and the ROC curve is shown in Figure 11.

Independent Test
An independent test is a validation method that objectively determines the performance metrics of the proposed model, obtaining values from a confusion matrix to evaluate its performance. In this method, the dataset is divided into two parts, a training part and a testing part. In the proposed work, the data are split into 70% for training and the remaining 30% for testing, as shown in Figure 12. The models based on the Random Forest and PNN classifiers were trained and tested with this method, and we obtained a result of 98% with RF and 50.8 with PNN. The areas under the curve (AUC) obtained by Random Forest and PNN are 98% and 50.8%, respectively. The remaining detailed results based on the two classifiers are presented in Tables 8 and 9 and by the ROC curves in Figures 13 and 14, respectively.
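The 70/30 split described above can be sketched as follows; the seed and function name are illustrative, not from the paper.

```python
import random

def split_70_30(n_samples, seed=0):
    """Shuffle sample indices once and split 70% for training, 30% for testing."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    cut = int(0.7 * n_samples)
    return idx[:cut], idx[cut:]

train_idx, test_idx = split_70_30(100)
```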


Results and Discussions
The most critical step is to compare the proposed novel model to other state-of-the-art models in order to assess its prediction accuracy. Of the two well-known classifiers compared in this work, RF and PNN, the model with RF achieved significantly higher accuracy and efficiency in predicting acetylation from non-acetylation, as seen in Table 10 and in Figures 15 and 16.


Comparison to Existing Models
To show the efficiency of our proposed model, iAcety-SmRF, which achieved the highest accuracy of 100% with Sensitivity, Precision, and MCC all equal to 1 under 10-fold cross-validation (Section 4.2 (B) (i)), the model was compared to several existing predictors. These include the recent iAcet-PseFDA [1], which used the same dataset and predicts acetylation proteins by extracting features from sequence-conservation information using a grey-frame model and an ANN score based on functional-domain annotation and subcellular localization; using 5-fold cross-validation on all three datasets, it achieved an average accuracy of 77.10%. The All-Mean JK model by Nakai and Horton [36] achieved an accuracy of 74.64%, and Hunter's predictive model InterPro achieved an accuracy of 68.25% [64]. Table 11 lists the full comparative analysis of the proposed study, based on our two models, iAcety-SmRF and iAcety-SmPNN, in which iAcety-SmRF achieved superior accuracy compared to all existing models.

Conclusions
A computational model for predicting acetylation sites from non-acetylation sites is developed in this paper. The model uses statistical moments to extract features from the original biological data into an equivalent numerical form. Random Forest and Probabilistic Neural Network classifiers were then applied to predict acetylation from non-acetylation. Furthermore, independent testing, 10-fold cross-validation, the self-consistency test, and jackknife testing were used to evaluate accuracy, with results of 97%, 100%, 100%, and 100%, respectively, based on the Random Forest. The model was compared with the relevant models already available in the literature, which revealed the remarkable performance of our work. Finally, a website/GitHub resource was developed for public use, for the benefit of future research and development, which can be accessed at the following link: https://github.com/shaistarahmanmcs/My-Website-identifying-Proteins-Acetylation-.git (accessed on 8 December 2021).