Application 2D Descriptors and Artificial Neural Networks for Beta-Glucosidase Inhibitors Screening

Beta-glucosidase inhibitors play important medical and biological roles. In this study, simple two-variable artificial neural network (ANN) classification models were developed for beta-glucosidase inhibitors screening. All bioassay data were obtained from the ChEMBL database. The classifiers were generated using 2D molecular descriptors and the data miner tool available in the STATISTICA package (STATISTICA Automated Neural Networks, SANN). In order to evaluate the models’ accuracy and select the best classifiers among automatically generated SANNs, the Matthews correlation coefficient (MCC) was used. The application of the combination of maxHBint3 and SpMax8_Bhs descriptors leads to the highest predicting abilities of SANNs, as evidenced by the averaged test set prediction results (MCC = 0.748) calculated for ten different dataset splits. Additionally, the models were analyzed employing receiver operating characteristics (ROC) and cumulative gain charts. The thirteen final classifiers obtained as a result of the model development procedure were applied for a natural compounds collection available in the BIOFACQUIM database. As a result of this beta-glucosidase inhibitors screening, eight compounds were univocally classified as active by all SANNs.

Glucosidase inhibitors are interesting from several viewpoints. The common feature of this group is the presence of both hydrogen bonds donors and acceptors, its hydrophobic nature, and backbone flexibility [27]. In general, glucosidase inhibitors can be divided into two major categories-glycosidic compounds, such as saccharides and their analogues (thiosugars, iminosugars, carbasugars) and non-glycosidic compounds [1,28]. These compounds affect important metabolic pathways and their pharmacological applications including obesity, diabetes, hyperlipoproteinemia, cancer, HBV, HCV, and HIV treatment were documented [1,[29][30][31][32]. Furthermore, glucosidase inhibitors have been applied for investigating the biochemical paths of various metabolic processes [1,33,34]. From the pharmacological viewpoint, human liposomal glucosidase inhibitors deserve special attention, since these compounds exhibit beneficial effects on the lysosomal storage disorders treatment (Gaucher disease) [35][36][37].
Nowadays, the inhibiting properties can be easily obtained from various sources like the ChEMBL (https://www.ebi.ac.uk/chembl/) [38,39] and PubChem (https://pubchem.ncbi.nlm.nih.gov/) [40] databases. These ligands' libraries along with molecular descriptor calculations allow for developing useful and effective QSAR/QSPR (quantitative structure-activity relationship/quantitative structure-property relationship) models. The main purpose of this study is to develop a simple and efficient classifier utilizing 2D indices for beta-glucosidase inhibitors. The choice of these descriptors was guided by their low computational cost, since these parameters can be computed using only molecular structure represented by the Simplified Molecular Input Line Entry Specification (SMILES) code. Noteworthy model efficiency is particularly important from the computer-aided drug design perspective, due to the possibility of screening thousands of compounds in a short period of time. This purpose is in general more difficult to accomplish using time-consuming computational methods based on molecular dynamics or quantum-chemical calculations. Furthermore, many studies showed the great usefulness of 2D structure-derived features in the modeling of physicochemical properties [41][42][43][44][45][46][47][48][49][50]. In this study, 2D molecular descriptors, calculated for a large dataset built with the aid of available beta-glucosidase inhibition bioassays results, were used to generate artificial neural networks (ANNs) classifiers. Due to their high accuracy, non-linear methods have found wide application in biological activities and the modelling of physicochemical properties. However, the use of these techniques including ANNs is often associated with the risk of the overfitting problem. Therefore, it is reasonable to create the simplest models containing the smallest possible number of variables, which was also taken into account when constructing the model presented in this paper.

Descriptors Selection
Due to the very large number of descriptors which can be efficiently computed using various tools such as PaDEL [51], it is necessary to make an appropriate molecular features selection. Therefore, prior to the machine learning procedure, the set of the most suitable descriptors according to the χ 2 ranking method was selected. This method has been implemented in STATISTICA for automatic descriptor selection and is part of the Data Miner module. It is worth noting that the χ 2 method and other similar methods of feature selection have been widely used in QSPR/QSAR problem solving including artificial neural networks classifiers [52][53][54][55][56][57].
Noteworthily, it happens that many of the selected features are strongly correlated with each other. The list of selected descriptors was summarized in Table 1, while in Figure 1, the correlation matrix was provided. There are significant statistical differences between selected molecular descriptors distributions corresponding to class 0 and class 1 populations, as evidenced by very low p-values (Table 1). These differences can be visualized on boxplots corresponding to descriptors characterized by the highest and the lowest χ 2 values ( Figure 2). As can be seen, even the lowest ranked descriptors quite clearly distinguish active and non-active cases. This high ability of individual descriptors to separate class 0 and class 1 populations is important for prevention of the overfitting problem, since a high quality of prediction can be achieved using a low number of variables. Therefore, each STATISTICA Automated Neural Network (SANN) model developed in this study involves only one orthogonal pair of descriptors (R 2 < 0.5) in the input layer. variables. Therefore, each STATISTICA Automated Neural Network (SANN) model developed in this study involves only one orthogonal pair of descriptors (R 2 < 0.5) in the input layer.  The selected descriptors should reflect the relevant physicochemical features important for a particular QSAR task. However, very often the interpretation of molecular features is not clear. Among the selected descriptors, many typical QSAR parameters, such as Burden-modified eigenvalues (SpMax4_Bhp, SpMin4_Bhi, SpMax8_Bhs) [58][59][60], Broto-Moreau autocorrelation (ATS3s, ATS4e, ATS4i, ATS2i) [58,[61][62][63][64], and atom-type electrotopological state (SHCsats, maxHBint3) indices [65][66][67][68][69], can be found. Noteworthily, several of them are associated with the relevant intermolecular interactions' perspective features such as electronegativity (ATS4e), polarizabilities (SpMax4_Bhp), first ionization potential (SpMin4_Bhi, ATS4i, ATS2i), and intrinsic state (SpMax8_Bhs, ATS3s). The enzymes' inhibition is determined by the ligand-active center interaction's nature related to the molecules' polarity. However, the "global" polarity does not provide sufficient information, since the molecule being an effective inhibitor often contains both polar and non-polar fragments interacting with the appropriate sites in the biocatalyst. The appearance of the SHCsats descriptor in QSAR models, defined as the hydrogen E-states sum on the sp 3 -hybridized carbon atom of the saturated bond, reflects the weak interactions with non-polar sites in the biomolecular targets [70,71]. It is worth noting that many of the active compounds employed for model development are typical Gaucher treatment agents containing a hydrocarbon chain. The presence of both polar (piperidine, pyrrolidine, and pyrrolizidine iminosugar analogues) and nonpolar hydrocarbon moieties (e.g., miglustat, CHEMBL1029) in the active ingredients used for the treatment of lysosomal storage disorders is strictly associated with the specific interaction of these compounds with glucocerebrosidase (human acid β-glucosidase) active centers [72,73].

The SANN Models
In this study, several well-known parameters, such as percentage of properly classified cases (%All), Matthews correlation coefficient (MCC), and area under the receiver operating characteristic (ROC) curve AUCROC, were used to analyze the obtained models. The concept of a popular classifiers quality measure, namely MCC, is based on Pearson's correlation analysis adapted for binary results The selected descriptors should reflect the relevant physicochemical features important for a particular QSAR task. However, very often the interpretation of molecular features is not clear. Among the selected descriptors, many typical QSAR parameters, such as Burden-modified eigenvalues (SpMax4_Bhp, SpMin4_Bhi, SpMax8_Bhs) [58][59][60], Broto-Moreau autocorrelation (ATS3s, ATS4e, ATS4i, ATS2i) [58,[61][62][63][64], and atom-type electrotopological state (SHCsats, maxHBint3) indices [65][66][67][68][69], can be found. Noteworthily, several of them are associated with the relevant intermolecular interactions' perspective features such as electronegativity (ATS4e), polarizabilities (SpMax4_Bhp), first ionization potential (SpMin4_Bhi, ATS4i, ATS2i), and intrinsic state (SpMax8_Bhs, ATS3s). The enzymes' inhibition is determined by the ligand-active center interaction's nature related to the molecules' polarity. However, the "global" polarity does not provide sufficient information, since the molecule being an effective inhibitor often contains both polar and non-polar fragments interacting with the appropriate sites in the biocatalyst. The appearance of the SHCsats descriptor in QSAR models, defined as the hydrogen E-states sum on the sp 3 -hybridized carbon atom of the saturated bond, reflects the weak interactions with non-polar sites in the biomolecular targets [70,71]. It is worth noting that many of the active compounds employed for model development are typical Gaucher treatment agents containing a hydrocarbon chain. The presence of both polar (piperidine, pyrrolidine, and pyrrolizidine iminosugar analogues) and non-polar hydrocarbon moieties (e.g., miglustat, CHEMBL1029) in the active ingredients used for the treatment of lysosomal storage disorders is strictly associated with the specific interaction of these compounds with glucocerebrosidase (human acid β-glucosidase) active centers [72,73].

The SANN Models
In this study, several well-known parameters, such as percentage of properly classified cases (%All), Matthews correlation coefficient (MCC), and area under the receiver operating characteristic (ROC) curve AUC ROC , were used to analyze the obtained models. The concept of a popular classifiers quality measure, namely MCC, is based on Pearson's correlation analysis adapted for binary results [74,75]. Hence, MCC characterizes the quality of a correlation between estimated and actual data. In the case of a perfect classifier, the MCC is 1, while in the case of the worst model, MCC is −1. It is worth noting that in this study, a tenfold division into a training, validation, and test set was carried out, which enables a reliable selection of descriptor pairs characterized by the highest predicting power. Since, as a result of each machine learning step performed for a particular data split, the top five SANNs were saved, the total number of classifiers for each descriptor pair was 50. The predicting abilities of descriptor pairs can be evaluated based on the average parameters calculated for the test sets' predictions. Noteworthily, in this study, the SANN classifiers were externally tested, which confirms the reliability of this analysis. As it can be inferred from Table 2, all two-variable models proved to be characterized by a very high accuracy. In the case of most classifiers, MCC values are higher than 0.7, which is characteristic of very strong positive correlations [76]. The AUC ROC values, which are in general very close to 1, also suggest good performance of the models. It is noteworthy that the AUC ROC parameter has been very commonly applied as a useful machine learning tool [77][78][79][80]. It is not clear which of the AUC ROC or MCC parameters are more appropriate to assess the accuracy of the model since there are critical opinions on both criterions [81][82][83][84][85]. Taking into account all three criterions (MCC, AUC ROC and %All) determined for all SANNs, pair 1 (maxHBint3, SpMax8_Bhs) appears to exhibit the highest predicting abilities. Although pair 3 (maxHBint3, SpMin4_Bhi) is characterized by a slightly higher AUC ROC value, the other two parameters (MCC, %All) indicate a significant advantage of pair 1. Therefore, the models employing this pair of descriptors were further analyzed. Among them, classifiers exhibiting the highest MCC values calculated for the test set were selected. The selected models' details including neural net architecture, structural features, and predictions are summarized in Table 3. Although ten dataset splits were performed, the number of SANNs is thirteen, since for splits 1, 5, and 9, the highest accuracy expressed by the MCC was achieved in the case of two different classifiers. These models were saved in the universal Predictive Model Markup Language (PMML) format (supplementary SANNs.zip file), which allows for their implementation. As can be inferred from Table 3, in most cases, the redundant Byzantine fault tolerance (RBFT) learning algorithm was applied during the SANN procedure. Another algorithm was based on the Broyden-Fletcher-Goldfarb-Shanno (BFGS) approach. In most cases, except two models, the entropy was used as an error function. During the machine learning steps, both multilayer perceptron (MLP) and radial basis function (RBF) networks were allowed to be generated. Due to the SANN methodology limitations, in all cases, only one hidden layer was applied. As it can be inferred from Table 3, the number of neurons in the hidden layer ranges from 4 to 30. Taking into account the relatively high number of instances in the training set (N = 228), the complexity of the SANNs seems to be quite low. In the case of most dataset splits, the RBF networks were preferred. The parameters characterizing summary prediction ability are clearly not sufficient to fully describe the model. Therefore, it is worth analyzing the relationships showing the whole population. The exemplary ROC and cumulative gain charts for SANNs characterized by the highest (split 9, RBF 2-22-2) and the lowest MCC values calculated for the test set (split 5, MLP 2-4-2) are summarized in Figures 3 and 4. As one can see, in both cases, fairly typical shapes of ROC and gain charts characteristic of high classification quality can be observed. In the case of the ROC plot, the good performance of the model is reflected by the high sensitivity observed for low false-positives rate values. In the case of a gain chart, good classification quality is reflected by a significant distance from the baseline corresponding to the random classifier. As can be inferred from each gain chart, in the case of both exemplary classifiers, there are noticeable differences between class 0 and class 1 (Figure 4). The overall accuracy can be illustrated by ROC curves. In the case of the best classifier, the test set is characterized by a much steeper trend than in the case of the training and validation sets. The AUC ROC values for training, validation, and test sets are equal to 0.869, 0.819, and 0.931, respectively. It is worth noting that the higher fraction of properly classified cases in the test set than in the training set indicates the optimal model complexity. This can be explained by the fact that in the case of overfitted classifiers, the training set cases are well reproduced in contrast to the externally excluded ones. Additionally, in the case of the model characterized by the lowest accuracy among selected SANNs, analysis of the ROC charts indicates the optimal complexity of the neural network. In the case of this model, the AUC ROC values corresponding to the training, validation, and test sets are 0.812, 0.935, and 0.813, respectively. This indicates similar training and test set prediction quality. On the other hand, in the case of the validation set, a much steeper plot can be observed, resulting in a higher AUC ROC value. This high quality of prediction can also be observed in the case of the gain chart ( Figure 4). As it can be inferred, this is mainly caused by the high properly assigned active compounds score (class 1). The role of the validation set is associated with model selection during the SANN machine learning process. Therefore, it is somewhat involved in the development of the model, and hence, the external test set is more appropriate for the accuracy evaluation. Nevertheless, a satisfactory result for the validation sets can also be regarded as important information. Fortunately, in most cases, the MCC values calculated for the validation set are higher than 0.7 (Table 3).  From the model validation perspective, it is important to develop a classifier characterized by the application domain, which comprises the range of descriptors values calculated for the test set. This is very important since the applicability domain determines the area in which the obtained results are reliable. According to common practice, the applicability domain is determined by the training set descriptors' values ranges. However, because in the case of SANN, the validation set is also involved in the machine learning process, it seems to be reasonable to assume the total ranges  From the model validation perspective, it is important to develop a classifier characterized by the application domain, which comprises the range of descriptors values calculated for the test set. This is very important since the applicability domain determines the area in which the obtained results are reliable. According to common practice, the applicability domain is determined by the training set descriptors' values ranges. However, because in the case of SANN, the validation set is also involved in the machine learning process, it seems to be reasonable to assume the total ranges From the model validation perspective, it is important to develop a classifier characterized by the application domain, which comprises the range of descriptors values calculated for the test set. This is very important since the applicability domain determines the area in which the obtained results are reliable. According to common practice, the applicability domain is determined by the training set descriptors' values ranges. However, because in the case of SANN, the validation set is also involved in the machine learning process, it seems to be reasonable to assume the total ranges corresponding to these two sets as the applicability domain. As one can see from Table 4, the range of test sets for both the maxHBint3 and SpMax8_Bhs descriptors in most cases does not exceed or only slightly exceeds the applicability domain (Table 4). Nevertheless, the comparability of these ranges is not sufficient. It is understandable that both populations used for model development and external validation ought to be statistically similar [86][87][88]. As was established, training sets do not differ significantly from the corresponding test sets, since in most cases, the p-values calculated using two different statistical tests (Mann-Whitney U and Kolmogorov-Smirnov tests) are higher than 0.1.

Application of SANNs
To evaluate the usefulness of the generated models, additional predictions were performed with the use of a set of naturally occurring compounds. The BIOFACQUIM (https://biofacquim.herokuapp.com/) database containing 423 compounds represented by SMILES codes was selected for this purpose. This database was co-created by four countries (Brazil, France, Panama, Vietnam) with the intention of providing a diverse set of compounds useful for various chemometric purposes, including data mining and drug design [89]. In Figure 5, the potential beta-glucosidase inhibitors selected using thirteen SANN models employing maxHBint3 and SpMax8_Bhs descriptors were presented. The results of the prediction were summarized in Table S3. As a result of such in silico screening, eight compounds were selected. Of course, many more compounds can be classified as active when less restrictive criteria will be used. If the criterion of assignment by the majority of SANNs is taken into account, 49 compounds can be distinguished as being potential inhibitors. As can be noticed from Figure 5, the majority of compounds contain characteristic carbohydrate moieties. The presence of substrate mimicking structural features, such as iminosugar rings, is characteristic for beta-glucosidase inhibitors. Indication of such compounds by SANNs, from the set of various structures available in the BIOFACQUIM database, additionally confirms their usefulness. It is noteworthy that the maxHBint3 and SpMax8_Bhs values corresponding to the compounds assigned by the majority of SANNs as active were within the range of applicability domain determined by the training and validation sets (Table 4). Furthermore, only 16 outliners, among 423 records, can be found when all SANNs are considered.

Dataset Selection and Pretreatment
All data used in this study were obtained from the ChEMBL database (https://www.ebi.ac.uk/ chembl/) [38,39], which is a comprehensive tool allowing researchers to obtain bioactivity information of approved pharmaceuticals and drug-like compounds [38]. Among collected data, only those records for which IC 50 (half maximal inhibitory concentration) information could be found were taken into account. When more than one IC 50 value was available for a particular compound, the arithmetic mean was taken into account. The compounds were classified as low-active or inactive (class 0, IC 50 > 50 µM) and significantly active (class 1, IC 50 ≤ 50 µM). This is an arbitrary criterion; however, the 50 µM threshold has been widely used for evaluating various biological activities including enzymes inhibitors [90][91][92][93][94][95]. All data used for model development and validation are summarized in the Supplementary Materials (Table S1).

Descriptors Calculation
The molecular descriptors were calculated using PaDEL software [51]. As the input file, a list of SMILES codes obtained from the ChEMBL database were used. Among the 1444 available 1D and 2D indices, 293 were removed, since they were not calculable for all of the dataset's objects or because of zero variance.

The Artificial Neutral Networks Models
In this study, a popular and universal STATISTICA Automated Neural Networks (SANNs) approach was applied. Prior to the models' generation, descriptors' predicting abilities were evaluated using the χ 2 test. Based on this analysis, ten selected molecular features characterized by the highest χ 2 were further applied for SANN classifiers generation. All computations including variables selection, model development, and validation were performed automatically using datamining tools available in STATISTICA12 software [96]. During each machine learning procedure, both multilayer perceptron (MLP) and radial basis function (RBF) algorithms were taken into account. The number of generated SANNs was set to 1000, from which 5 of the highest predicting ability were selected automatically by the software. In this study, ten random data splits to the training (70%), test (15%), and validation (15%) sets were applied. The model training, validation, and external testing procedure were performed using default STATISTICA settings. All dataset splits were summarized in Supplementary  Table S2. The selected models were saved in PMML (Predictive Model Markup Language) format (http://dmg.org/pmml/pmml-v3-0.html) and are available in the Supplementary Materials (SANNs.zip).

Classification Quality Evaluation
In order to evaluate the SANNs' performance, several well-known statistical measures were applied. A detailed and comprehensive description of the model can be made by analyzing parameters such as the percentage of properly classified cases (%All), the Matthews correlation coefficient (MCC) [74,75], and area under the receiver operating characteristic (ROC) curve (AUC ROC ). The former two parameters can be directly calculated using the following equations: where TP, FP, TN, and FN denote the number of true positives, false positives, true negatives, and false negatives, respectively. The P and N stand for all positive or negative cases, while the %TP, %FP, %TN, %FN, TPR, FPR, TNR, and FNR parameters are the percentages or rates of true positives, false positives, true negatives, and true positives. The AUC ROC parameter is determined based on the ROC curve which is the relationship between sensitivity expressed by the TPR and 1-specificity term equal to FPR.

Conclusions
The screening of new biologically active compounds requires performing various time-consuming and expensive experiments including synthesis of new compounds and bioassays. Therefore, theoretical models are particularly useful, since they can direct the experimental effort to a selected group of compounds, which are expected to be active. Therefore, accuracy, effectiveness, and simplicity are of a significant importance when developing new models. Beta-glucosidase inhibitors are an important class of compounds, widely used in medicine due to their beneficial effects on health. Probably, the main application of beta-glucosidase inhibitors is the treatment of lysosomal storage disorders including Gaucher disease. In this paper, an accurate QSAR beta-glucosidase classification model based on simple and intuitive methodology was developed. The models described in this study were developed and validated using a universal SANN methodology implemented in STATISTICA. Prior to the SANN models' generation, a variables selection procedure was performed, which resulted in the collection of orthogonal pairs of descriptors. As it was established, the combination of maxHBint3 and SpMax8_Bhs indices leads to the highest accuracy expressed by the highest average MCC value calculated for ten random dataset splits. This simple and intuitive concept of model development seems to be promising in the case of other enzymes' inhibitors. It is noteworthy that the application of the final SANNs for a collection of natural compounds available in the BIOFACQUIM database resulted in the selection of eight compounds univocally classified as active by all models. Most of these compounds contain carbohydrate moieties, which is characteristic for a substrate mimicking glucosidase inhibitors.
Supplementary Materials: The following are available online, Table S1: The dataset used for model development and validation obtained from ChEMBL database, Table S2: The dataset splits into training, validation and test sets, Table S3: The results of prediction performed for BIOFACQUIM database, SANNs.zip file containing PMML codes of SANNs.
Funding: This research received no external funding.

Conflicts of Interest:
The author declares no conflict of interest.