Neural Network Methodology for the Identification and Classification of Lipopeptides Based on SMILES Annotation

Abstract: Artificial neural networks (ANNs) can be applied to identify and classify prospective drug candidates, including complex compounds such as lipopeptides, from their SMILES string representation. The networks are trained on SMILES strings, which encode the structural identity of a compound; the ANNs are capable of correctly classifying all compounds, substructures and their analogues, distinguishing the drugs by their atomic organization to support lead optimization in drug discovery. The proficiency of the trained ANN models in recognizing and classifying analogous compounds was tested on similar compounds that had not been used for training, and these were classified correctly in the validation set. The best result was achieved with 10 hidden layers. The R2 value is 0.90586 for training, 0.99508 for testing, 0.94151 for validation and 0.89456 for the total set. The mean squared error (MSE) is plotted against the epochs (21 in total) to report the performance of the model. A gradient value of 798.1735 was obtained after 21 iterations and 6 validation checks. A successful model was developed for the identification and classification of lipopeptides from their SMILES annotation; it efficiently classifies similar compounds and supports decision making in analogue-based drug discovery. This will help in lead optimization studies for the prediction of potential anticancer and antimicrobial lipopeptide-based therapeutics.


Introduction
Alongside in vitro and in vivo studies, in silico analysis, such as machine learning for predicting the chemical properties of compounds, has become an efficient approach in chemical analysis. One example is the prediction of protein-ligand interactions, which facilitates the identification of novel compounds by screening lead compounds in the drug discovery process [1].
For computational analysis, various file formats are used to define chemical compounds digitally so that they can be read by computers. Formats such as SDF (structure data file), MOL (molfile), SMILES (simplified molecular-input line-entry system) and fingerprints are widely used in computer-aided drug discovery [2]; the .sdf and .mol extensions were developed by MDL (Molecular Design Limited) for storing chemical structures on a computer. The MOL format represents a compound as a graph connection table, in which each node represents an atom and the edges between atoms represent bonds. SDF is an extension of the MOL format that can store more than one compound in a single file [1].
SMILES, the simplified molecular-input line-entry system proposed by Weininger, is widely recognized and used as a standard notation for processing chemical information [3]. SMILES provides a linear notation for representing chemical compounds unambiguously.

Figure 1. The primary structure of the lipopeptide surfactin; n = 9-11.

Data Collection
We first obtained the SMILES strings of the lipopeptide compounds. Surfactin, a broad-spectrum antimicrobial and potential anticancer lipopeptide, was retrieved for the preparation of the compound library, together with its substructures, analogs, isoforms and similar compounds found by virtual screening [2]. The library of surfactin analogs was prepared to study structural modifications in drug compounds and to identify compounds with greater potency for further anticancer studies [15,16]. The canonical SMILES of the peptidolipidic compounds were obtained from the PubChem Compound database of the National Center for Biotechnology Information (NCBI) (https://pubchem.ncbi.nlm.nih.gov) (accessed on 20 May 2021) [17]. A total of 22 compounds were obtained. Ten of the lipopeptides are substructures of surfactin that differ in the length of the carbon chain. The remaining 12 compounds are analogous structures, isoforms and similar compounds.

Data Preparation
Data encoding plays an essential part in improving the performance of the network. ANNs can process data of diverse forms, provided the data are presented to the network in an appropriate format, and a neural network can then distinguish inputs of different classes properly. A data encoding method is therefore used for data presentation. Lipopeptide-based compounds are biomolecules produced as secondary metabolites by various strains of the genus Bacillus [1,12]. They are composed of a cyclic hexa- to decapeptide core attached to a fatty acid side chain. Surfactin is one of the prominent compounds produced by Bacillus subtilis [15,18]. Of the 20 amino acids, a specific few are found predominantly in lipopeptides. A two-dimensional structure of the lipopeptide surfactin is depicted in Figure 1 [14,19,20]. A library of similar compounds is prepared through a virtual screening-based method. The SMILES string of each peptidolipidic compound is used to prepare the input file. The canonical SMILES annotation of the lipopeptide surfactin A is given in Table 1.
A SMILES string represents a compound through an elementary arrangement of its constituent atoms, such as C, O, N, S, Ca and Na, joined by specified bonds, which together establish the compound's structure. SMILES (Simplified Molecular-Input Line-Entry System) is a line notation that describes the structure of a compound through short ASCII strings. SMILES strings can be imported by molecular editors, which convert them back into a 2-D drawing or a 3-D model of the molecule. A molecule can usually be represented in this 1-D format by several valid SMILES strings, such as canonical SMILES and isomeric SMILES, and algorithms have been established for generating these one-dimensional strings. Like DNA and protein sequences, canonical SMILES are unique for each structure.
A string generated by a canonicalization algorithm is called a canonical SMILES [2]. The surfactin substructures and similar compounds are retrieved from the PubChem Compound database. The canonical SMILES strings are used to conduct classification studies with an artificial neural network. The symbols of the SMILES strings are converted into numerical values to generate a tabulated file [1,12,20]. The interpretation of the symbols is given in Table 2.

Table 2. Data interpretation: replacement of the atoms and symbols of the SMILES annotation with numerical values.

Symbol         Replaced with
[...]          15
Empty space    0

Experimental Protocol
We retrieved all the required strings from the PubChem Compound database. The SMILES strings are arranged according to fatty acid carbon chain length, side chain modification, modification with specific additional functional groups, and other similar compounds. The compounds analyzed in the present work are listed in Table 3. The SMILES strings are converted into numerical values and tabulated to generate an input data file. Each sequence is completed (padded) up to 153 places in the columns of the input sheet. Two files are prepared: one sheet for the input data (ipp) and another for the output (opp). MATLAB is then used to run the program.
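The conversion of a SMILES string into one row of the input sheet can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the symbol table `SYMBOL_CODES` and the function `encode_smiles` are hypothetical, and the actual numerical codes are those of Table 2. Only the code 0 for an empty position and the row width of 153 are taken from the text. The sketch encodes one character at a time, so two-letter element symbols such as Ca or Na would need a tokenizer that this simplified version does not include.

```python
# Hypothetical symbol table: the study's actual codes are those of its Table 2.
# Each character gets a positive integer; 0 is reserved for empty positions.
SYMBOLS = "CNOS()=[]@H123456789"
SYMBOL_CODES = {symbol: code for code, symbol in enumerate(SYMBOLS, start=1)}

PAD_CODE = 0     # an empty space is replaced with 0
ROW_WIDTH = 153  # number of columns in the input sheet


def encode_smiles(smiles, width=ROW_WIDTH):
    """Convert a SMILES string to a fixed-width list of integer codes."""
    if len(smiles) > width:
        raise ValueError(f"SMILES string longer than {width} symbols")
    try:
        codes = [SYMBOL_CODES[ch] for ch in smiles]
    except KeyError as err:
        raise ValueError(f"no code defined for symbol {err.args[0]!r}") from err
    # Fill the remaining columns with the padding code.
    return codes + [PAD_CODE] * (width - len(codes))


# Acetic acid is used here only as a short example string.
row = encode_smiles("CC(=O)O")
print(len(row), row[:10])
```

Stacking one such row per compound would give a rectangular input matrix with 153 columns, which is the shape a fixed-size network input layer requires.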

Neural Network Methodology
Various methodologies are used to analyze chemical compounds from the perspective of drug discovery. A few such methods are artificial neural network preliminaries, the Gradient Descent Algorithm (GDA), the Multi-layered Feed Forward ANN (MLFANN), the Levenberg-Marquardt (LM) algorithm and the Conjugate Gradient Descent Algorithm. In this work, the LM algorithm and the Scaled Conjugate Gradient (SCG) algorithm are used. The LM algorithm is a technique for solving nonlinear least-squares problems, which arise when the function being fitted is nonlinear in its parameters. To reduce the sum of squared errors between the measured data points and the function, nonlinear least-squares methods improve the parameter values iteratively. The LM curve-fitting method is in fact a combination of two minimization techniques: the Gauss-Newton method and the gradient descent method.
In the gradient descent method, the sum of squared errors is reduced by updating the parameters in the steepest-descent direction. In the Gauss-Newton method, the sum of squared errors is reduced by assuming that the least-squares function is locally quadratic and finding the minimum of that quadratic. The LM method behaves like the gradient descent method when the parameters are far from their optimal value and like the Gauss-Newton method when the parameters are close to their optimal value [9,12]. In the current work, the input file (ipp) is used for training, and the output file (opp) is obtained after training. Hidden layers play an indispensable role during training, and the number of hidden nodes is a prominent factor on which the final output depends. Training, testing and validation take place together. The two networks are run in parallel. The structure of the neural network is shown in Figure 2.
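The way the LM method interpolates between gradient descent and Gauss-Newton can be made explicit with the standard form of its update step. The notation below is introduced here for illustration and is not taken from the study: p is the parameter vector, f(p) the model output, y the target data, J the Jacobian of f with respect to p, and λ the damping parameter.

```latex
\begin{align}
  E(\mathbf{p}) &= \tfrac{1}{2}\,\lVert \mathbf{y} - \mathbf{f}(\mathbf{p}) \rVert^{2}
    && \text{(sum of squared errors)} \\
  \bigl(\mathbf{J}^{\top}\mathbf{J} + \lambda\,\mathbf{I}\bigr)\,\boldsymbol{\delta}
    &= \mathbf{J}^{\top}\bigl(\mathbf{y} - \mathbf{f}(\mathbf{p})\bigr),
    \qquad \mathbf{p} \leftarrow \mathbf{p} + \boldsymbol{\delta}
    && \text{(LM step)} \\
  \lambda \to 0 &: \quad
    \boldsymbol{\delta} \approx \bigl(\mathbf{J}^{\top}\mathbf{J}\bigr)^{-1}
    \mathbf{J}^{\top}\bigl(\mathbf{y} - \mathbf{f}(\mathbf{p})\bigr)
    && \text{(Gauss-Newton step)} \\
  \lambda \ \text{large} &: \quad
    \boldsymbol{\delta} \approx \tfrac{1}{\lambda}\,
    \mathbf{J}^{\top}\bigl(\mathbf{y} - \mathbf{f}(\mathbf{p})\bigr)
    && \text{(short steepest-descent step)}
\end{align}
```

In typical implementations λ is increased when a step fails to reduce E and decreased when it succeeds, which produces the behavior described above: gradient descent far from the optimum and Gauss-Newton close to it.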

Results and Discussion
Artificial neural networks are applied because they handle large amounts of data efficiently and can be trained easily with a good convergence rate, which makes them suitable for modeling large datasets. This helps in identifying and classifying a lead molecule from a library of compounds of a similar category, which ultimately supports novel drug discovery. The principal purpose of the current work is to develop a neural network model for the accurate identification and classification of lipopeptide-based candidate drugs, which have various medicinal properties (anticancer, antiviral, antibacterial and antifungal), and to deduce a relationship between them [18,19,20].

Neural Network Training Results
The network parameters are set in MATLAB R2016a. The canonical SMILES strings of 22 lipopeptides are used to develop the neural network model: 70%, 15% and 15% of the 22 lipopeptides are used for training, validation and testing, respectively, which corresponds to 16 compounds for training, 3 for validation and 3 for testing. The performance measures of the network, R2 and MSE, are shown in the figures below. Training, testing and validation are also performed with the various other possible combinations. Different data combinations and numbers of hidden layers are tried in order to find the network architecture that gives the least error. The network is trained with various numbers of hidden nodes, with validation, and the test sets are changed each time. The best network obtained after this rigorous training is shown in the figures below, which indicate that the optimum performance was obtained with 10 hidden nodes, a 15% testing set and a 15% validation set. The results for the Levenberg-Marquardt (LM) algorithm are shown as the best validation performance (Figure 3), error histogram (Figure 4), regression plot (Figure 5) and training state (Figure 6). The R2 value is 0.90586 for training, 0.99508 for testing, 0.94151 for validation and 0.89456 for the total set (Figure 5). The experimental results are thus close to the results predicted by the neural network. Figure 3 shows the best validation performance, plotted as MSE against epochs; in total, nine epochs were taken for the modeling. Figure 4 shows the error histogram.
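The arithmetic behind the 16/3/3 allocation can be illustrated with a short sketch. It assumes that the validation and test set sizes are obtained by rounding 15% of 22 (3.3) to 3 and that the training set receives the remaining samples; the function name `split_indices` and the seed are hypothetical, and MATLAB's own random division may differ in detail.

```python
import random


def split_indices(n_samples, val_ratio=0.15, test_ratio=0.15, seed=0):
    """Randomly divide sample indices into training, validation and test sets."""
    n_val = round(val_ratio * n_samples)    # 0.15 * 22 = 3.3 -> 3
    n_test = round(test_ratio * n_samples)  # 0.15 * 22 = 3.3 -> 3
    n_train = n_samples - n_val - n_test    # remainder: 22 - 6 = 16

    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)    # seeded for reproducibility

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train, val, test = split_indices(22)
print(len(train), len(val), len(test))
```

Changing the seed produces a different assignment of compounds to the three sets, which is one way to vary the test set between training runs.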
In each plot, the dashed line denotes the perfect result, outputs = targets, while the solid line indicates the best-fit linear regression between targets and outputs. The R2 value signifies the relationship between the targets and outputs; a value of R2 closer to 1 denotes a more accurate linear relationship between them. The values of MSE and R2 for the training, testing, validation and overall data are shown in Figures 3 and 5, respectively. The validation of predicted against actual output during training of the neural network is depicted in Figure 6. Similarly, the results for the Scaled Conjugate Gradient (SCG) algorithm are shown as the best validation performance (Figure 7), error histogram (Figure 8), regression plot (Figure 9) and training state (Figure 10). Figure 10 shows the training state for the SCG algorithm, in which the error decreases with each iteration.
The ANN model results are quite comparable with response surface methodology (RSM) modeling. In the current work, however, an advanced neural network methodology is presented for classification studies on a library of compounds in drug discovery. Networks with 0-9 hidden nodes and varied numbers of iterations were trained; however, 23 epochs and 10 hidden layers gave the optimum results. The training settings were as follows: the random data division algorithm (dividerand) was used, scaled conjugate gradient (trainscg) was used for training, performance was measured as the mean squared error (MSE) and plotted against 21 epochs, and the default derivative (defaultderiv) was used. The reported progress was a time of 0:00:05 with nine iterations (out of a stated maximum of 1000), a gradient of 798 and six validation checks. The R2 value quantifies the correlation between targets and outputs in the regression curve: a value close to one indicates a close relationship, and a value of zero a random relationship. The mean squared error (MSE) is the average squared difference between targets and outputs. The network is adjusted according to its error on the training samples. The validation samples are used to measure the generalization of the network and to stop the training when generalization stops improving. The test samples do not affect the training and therefore provide an independent measure of network performance during and after training. The training set comprised 70% of the 22 samples; three samples (15%) were used for validation and three samples (15%) for testing.
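The two performance measures can be computed as in the following sketch. The function names `mse` and `pearson_r` are introduced here for illustration. The sketch returns the correlation coefficient R between targets and outputs; whether the values reported in the text correspond to R or to its square is not stated, so squaring is left to the caller.

```python
import math


def mse(targets, outputs):
    """Average squared difference between targets and outputs."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)


def pearson_r(targets, outputs):
    """Linear correlation coefficient R between targets and outputs."""
    n = len(targets)
    mean_t = sum(targets) / n
    mean_o = sum(outputs) / n
    cov = sum((t - mean_t) * (o - mean_o) for t, o in zip(targets, outputs))
    var_t = sum((t - mean_t) ** 2 for t in targets)
    var_o = sum((o - mean_o) ** 2 for o in outputs)
    if var_t == 0 or var_o == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / math.sqrt(var_t * var_o)


# Illustrative values only, not data from the study.
targets = [1.0, 2.0, 3.0, 4.0]
outputs = [1.1, 1.9, 3.2, 3.8]
r = pearson_r(targets, outputs)
print(mse(targets, outputs), r, r ** 2)
```

A lower MSE indicates a smaller error, with zero meaning no error, while R approaches one as the outputs fall closer to a straight-line relationship with the targets.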
Training the model multiple times generates different results because of different initial conditions and sampling. Training stops automatically once generalization stops improving, as indicated by an increase in the mean squared error of the validation samples. Network 1 and network 2, run in parallel, each used 10 hidden nodes but had different numbers of hidden layers. The optimal results were achieved with 10 hidden nodes, and training stopped when generalization stopped improving. The held-out samples are not used to update the network, so they provide an independent measure of network performance.
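The stopping rule described above can be sketched as follows. This is an illustration under stated assumptions: a "validation check" is taken to be an epoch whose validation error is not lower than the best value seen so far, and training stops after six such epochs in a row, matching the six validation checks reported. The function name `early_stopping_epoch` and the error values are hypothetical.

```python
def early_stopping_epoch(val_errors, max_fail=6):
    """Return (stop_epoch, best_epoch) for a sequence of validation errors.

    Training is considered stopped once the validation error has failed to
    improve on its best value for `max_fail` consecutive epochs.
    """
    best_error = float("inf")
    best_epoch = -1
    fails = 0
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            # New best validation error: remember it and reset the counter.
            best_error, best_epoch, fails = error, epoch, 0
        else:
            fails += 1
            if fails >= max_fail:
                return epoch, best_epoch
    # The sequence ended before the failure limit was reached.
    return len(val_errors) - 1, best_epoch


# Illustrative validation errors: improvement for four epochs, then a rise.
errors = [5.0, 4.0, 3.0, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 1.0]
print(early_stopping_epoch(errors))
```

The network weights from the best epoch, rather than from the stopping epoch, would normally be the ones kept, since they correspond to the lowest validation error.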

Conclusions
In this work, we have developed a neural network model based on the SMILES annotation of the lipopeptide surfactin and its library of similar compounds. The model is capable of adequately discriminating between the compounds [2,19]. Computer-aided drug discovery is largely based on the three-dimensional visualization of chemical space and its interaction with the surrounding atoms of the target protein. The current approach of using a linear notation will further help in preparing combinatorial chemistry-based chemical libraries for the identification of novel molecules generated through modifications of a given parent compound. In the identification and classification model developed with the LM and SCG algorithms, an R2 value close to one indicates closely related compounds, and an R2 value close to zero indicates unrelated compounds. Such categorization will help in screening compounds of interest with slight structural variations to obtain better affinity within the binding pocket of disease target proteins; hence, it will give insight into analogue-based drug discovery for lead generation and lead optimization. Such model-based classification studies will boost the computational drug discovery process for complex large molecules such as peptide therapeutics. The aim of the work is to use the chemical information of complex compounds for the efficient categorization of the substructures and analogous compounds of a parent compound. The current study concludes that the performance of the predicted results can be increased by varying the number of neurons in the hidden layers, even with increased complexity of the algorithm, at equal computational time. Prediction performance also increases with more iterations; therefore, more extensive training of the artificial neural network corresponds to better accuracy, within certain limitations.