(Q)SAR Models of HIV-1 Protein Inhibition by Drug-Like Compounds

Despite the achievements of antiretroviral therapy, the discovery of new anti-HIV medicines remains an essential task, because the existing drugs do not provide a complete cure for infected patients, exhibit severe adverse effects, and lead to the appearance of resistant strains. To predict the interaction of drug-like compounds with multiple targets for HIV treatment, the ligand-based drug design approach is widely applied. In this study, we evaluated the possibilities and limitations of (Q)SAR analysis aimed at the discovery of novel antiretroviral agents inhibiting the vital HIV enzymes. Local (Q)SAR models are based on the analysis of structure–activity relationships for molecules from the same chemical class, which significantly restricts their applicability domain. In contrast, global (Q)SAR models exploit data from heterogeneous sets of drug-like compounds, which allows their application to databases containing diverse structures. We compared the information on HIV-1 integrase, protease and reverse transcriptase inhibitors available in the EBI ChEMBL, NIAID HIV/OI/TB Therapeutics, and Clarivate Analytics Integrity databases as the sources for (Q)SAR training sets. Using the PASS and GUSAR software, we developed and validated a variety of (Q)SAR models, which can be further used for virtual screening of new antiretrovirals in the SAVI library. The developed models are implemented in the freely available web resource AntiHIV-Pred.


Data Selection and Curation
Experimental data for building models of quantitative structure-activity relationships were downloaded from three databases (DB) and then analyzed.
The NIAID ChemDB HIV, Opportunistic Infection and Tuberculosis Therapeutics Database (NIAID DB) is a publicly available DB containing data extracted from the scientific literature on the structure and activity of compounds tested against HIV, HIV enzymes, or opportunistic pathogens [https://www.niaid.nih.gov/]. It is created and supported by the National Institute of Allergy and Infectious Diseases (NIAID), NIH. To analyze the structure-activity relationships, we extracted from this DB information about the structure and antiretroviral activity of 10377, 7604 and 8936 molecules tested as HIV-1 integrase, protease and reverse transcriptase inhibitors, respectively.
ChEMBL is a publicly available DB containing information about the structure and biological activity of drug-like molecules [Davies M. et al., 2015, https://www.ebi.ac.uk/chembl/]. It has been created and maintained for the past ten years by the European Bioinformatics Institute of the European Molecular Biology Laboratory (EMBL-EBI). In the current study, we used ChEMBL version 24, which includes data on about 20,000 compounds that interact with seven HIV-1 molecular targets, and more than 50,000 records with information about the associated bioactivities. To analyze the structure-activity relationships, we extracted from this DB information about the structure and antiretroviral activity of 2283, 2387 and 2149 molecules tested as HIV-1 integrase, protease and reverse transcriptase inhibitors, respectively.
Integrity is a commercially available DB provided by Clarivate Analytics [https://clarivate.com/products/integrity/]. To analyze the structure-activity relationships, we extracted from this DB information about the structure and antiretroviral activity of 563, 316 and 731 molecules tested as HIV-1 integrase, protease and reverse transcriptase inhibitors, respectively.
Data derived from the scientific literature turned out to be highly heterogeneous: experiments were performed under different conditions, and various ligand efficiency indices were measured (Ki, IC50, percent of inhibition, etc.). The curation process was performed according to the pipeline shown in Figure 1, which corresponds to best QSAR practice (Tropsha, A., et al. Molecular Informatics, 2010, 29, 476; Fourches, D., et al. Journal of Chemical Information and Modeling, 2016, 56, 1243). The process can be subdivided into three parts: deletion of incorrect entries, chemical data curation, and biological data curation.

Deletion of incorrect entries and standardization:
- Deletion of incomplete entries and of entries flagged as unreliable (e.g., via the DATA_VALIDITY_COMMENT field in the ChEMBL DB), keeping only IC50 entries and converting the values to molar concentrations.
Biological data curation:
- Deletion of entries for mutants, drug-resistant strains and invalid targets;
- Splitting of the data based on experimental protocols;
- Checking of potential duplicates, i.e., identical compounds with different activity values or compounds having the same set of descriptors; depending on the case, the entries were deleted or the median value was calculated;
- Estimation of inter-laboratory error;
- Handling of activity cliff data.
In addition, the data validity was verified by analysis of the information presented in the original publications. The curated data were used to create training sets for the HIV targets. Information about the data sets before curation is aggregated in Tables 1 and 2; information about the curated data sets is aggregated in Tables 3, 4 and 5 and in Figure 2.
Figure 2. Total number of molecules extracted from the three DBs and the intersections between different DBs after the cleaning procedure.

Selected Descriptors
The modelling was accomplished using two types of descriptors: Multilevel Neighborhoods of Atoms (MNA) and Quantitative Neighborhoods of Atoms (QNA) descriptors.
MNA descriptors are based on a molecular structure representation that includes hydrogen atoms according to the valences and partial charges of the other atoms and does not specify bond types. MNA descriptors are generated as a recursively defined sequence: the zero-level MNA descriptor of each atom is the mark A of the atom itself; the next-level MNA descriptor of the atom is the substructure notation A(D1 D2 … Di …), where Di is the previous-level MNA descriptor of the i-th immediate neighbor of atom A.
The mark of an atom may include not only the atomic type but also any additional information about the atom. In particular, if the atom is not included in a ring, it is marked by «-». The neighbor descriptors D1, D2, …, Di, … are arranged in a unique manner, e.g., in lexicographic order. The iterative process of MNA descriptor generation can be continued, covering the first, second, etc. neighborhoods of each atom.
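The recursive generation scheme described above can be sketched in a few lines of Python. The atom marks, the toy formaldehyde input and the function names below are illustrative assumptions, not the actual PASS implementation:

```python
from collections import Counter

def mna_descriptor(atoms, bonds, idx, level):
    """Recursively build the MNA descriptor of a given level for atom `idx`.

    `atoms` maps atom index -> mark (e.g. '-O' for a non-ring oxygen);
    `bonds` maps atom index -> list of neighbour indices (bond types are
    ignored, hydrogens are included explicitly)."""
    if level == 0:
        return atoms[idx]
    # Previous-level descriptors of the immediate neighbours, sorted
    # lexicographically to make the notation unique.
    subs = sorted(mna_descriptor(atoms, bonds, j, level - 1) for j in bonds[idx])
    return atoms[idx] + '(' + ''.join(subs) + ')'

def mna_set(atoms, bonds, level):
    """Multiset of MNA descriptors of the given level for the whole molecule."""
    return Counter(mna_descriptor(atoms, bonds, i, level) for i in atoms)

# Hypothetical toy input: formaldehyde H2C=O in the topological approximation
atoms = {0: '-C', 1: '-O', 2: '-H', 3: '-H'}
bonds = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
```

For the carbon atom, for example, the first-level descriptor is '-C(-H-H-O)': the atom's own mark followed by the lexicographically ordered zero-level marks of its neighbours.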
Descriptors for each molecule are generated on the basis of the connectivity matrix C and the table of atom types. The connectivity matrix C contains data on the covalent bonds in a molecule; the various bond types are not distinguished (topological approximation). All hydrogens are taken into account on the basis of the valences and partial charges of the atoms. The types of atoms are specified according to the data presented in Table 6. The values of the ionization potential (IP) and electron affinity (EA), collected from many different sources and used in this work, are represented in Table 7. Although a combination of these values can by convention be considered as the partial atomic charge (with μ being the chemical potential), in general the derived values are not estimates of partial atomic charges, hardness, or the like.
QNA descriptors describe each of the atoms in a molecule and, at the same time, each of the P and Q values depends on the whole composition and structure of the molecule (Figure 4):
P_i = B_i Σ_k (Exp(−½C))_ik B_k (1)
Q_i = B_i Σ_k (Exp(−½C))_ik B_k A_k (2)
where A_k and B_k are derived from the ionization potential and electron affinity of atom k (A_k = ½(IP_k + EA_k), B_k = (IP_k − EA_k)^−½).
Figure 3. Equations (1) and (2) and the P and Q values for each atom of the formic acid molecule.
From Figure 3c it is clear that any atom influences the others, although the influence decreases as the distance between them increases; e.g., the components of the matrix Exp(−½C) for atom 1 (C) are: 1.40 for atom 1 itself, -0.59 for its immediate neighbor atom 2 (O), -0.57 for atoms 3 (O) and 4 (H), and 0.14 for atom 5 (H).
The algorithm of QNA descriptor calculation is quite simple, since the full matrix Exp(−½C) does not need to be computed exactly.
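A minimal numerical sketch of the P and Q calculation, assuming the definitions P_i = B_i Σ_k (Exp(−½C))_ik B_k and Q_i = B_i Σ_k (Exp(−½C))_ik B_k A_k with A = (IP + EA)/2 and B = (IP − EA)^−½; these formulas and the function name are assumptions based on the description above:

```python
import numpy as np

def qna_descriptors(C, ip, ea):
    """Compute QNA-style P and Q vectors from a symmetric connectivity
    matrix C (hydrogens included, bond types ignored) and per-atom
    ionization potentials / electron affinities."""
    A = (ip + ea) / 2.0          # electronegativity-like atomic term
    B = (ip - ea) ** -0.5
    # Exp(-C/2) via eigendecomposition of the symmetric matrix C.
    w, V = np.linalg.eigh(C)
    E = V @ np.diag(np.exp(-0.5 * w)) @ V.T
    P = B * (E @ B)              # each P_i depends on the whole molecule
    Q = B * (E @ (B * A))
    return P, Q
```

With a zero connectivity matrix, Exp(−½C) is the identity, so P reduces to B² and Q to B²A, which makes the role of the matrix exponential easy to check.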

QSAR Modeling Using GUSAR Program
QNA descriptors describe each particular atom of a molecule, while, at the same time, each P or Q value depends on the composition and structure of the molecule as a whole. To use this representation of molecular structure, we proposed to calculate each function f(S) of the structure S of a molecule as the average value of a function φ(P, Q) of the P and Q variables over those atoms in the molecule that have two or more immediate neighbors. The orthonormalized P and Q values have zero mean and unit variance and are uncorrelated.
Chebyshev polynomials were chosen as the family of functions φ(P, Q), and the orthonormalized P and Q values were additionally transformed using the hyperbolic tangent, so that the "normalized QNA" values vary from -1 to 1. After this, the functions φ(P, Q) in equation (4) are represented using Chebyshev polynomials as:
φ_nm(P, Q) = T_n(P) T_m(Q) = Cos(n · ArcCos(TanH(P))) · Cos(m · ArcCos(TanH(Q))) (6)
where the integer numbers n, m = 0, 1, 2, … define the two-dimensional Chebyshev polynomial degree. The final equation for the estimate f(S) using QNA descriptors is:
f(S) = (1/N) Σ_i (c_0 + Σ_nm c_nm φ_nm(P_i, Q_i)) (7)
where the sum over i runs over the atoms with two or more immediate neighbors and N is their number. QNA descriptors and their polynomial transformations (5)-(7) do not provide information on the shape and volume of a molecule, although this information may be important for the determination of structure-activity relationships. Therefore, these parameters were added to the QNA descriptors. The topological length of a molecule was calculated as the maximal distance between any two atoms, and the volume of a molecule as the sum of the atomic volumes, (4/3)πr³, where r is the atomic radius (see Table 2 (AR)).
The Chebyshev polynomials are arranged in ascending order of their total degree n + m. For n + m = 1 they are φ_{1,0}, φ_{0,1}; for n + m = 2 they are φ_{2,0}, φ_{1,1}, φ_{0,2}, etc. The first, second and third powers of the topological length and volume of a molecule were used. The number of initial variables is equal to the number of Chebyshev polynomials plus the number of powers of the topological length and volume of a molecule.
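The normalization and ordering of the polynomial features can be sketched as follows; the function name and the exclusion of the constant φ_{0,0} term are assumptions:

```python
import numpy as np

def chebyshev_features(p, q, max_degree):
    """Molecule-level features phi_nm(P, Q) = cos(n*arccos(tanh(P))) *
    cos(m*arccos(tanh(Q))), averaged over the atoms' (P, Q) values and
    arranged in ascending order of the total degree n + m."""
    tp = np.arccos(np.tanh(np.asarray(p, dtype=float)))
    tq = np.arccos(np.tanh(np.asarray(q, dtype=float)))
    feats = []
    for degree in range(1, max_degree + 1):
        for n in range(degree, -1, -1):      # pairs (n, m) with n + m = degree
            m = degree - n
            feats.append(np.mean(np.cos(n * tp) * np.cos(m * tq)))
    return np.array(feats)
```

For max_degree = 2 this yields five features, in the order φ_{1,0}, φ_{0,1}, φ_{2,0}, φ_{1,1}, φ_{0,2}.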
The number of initial variables depends on the number of compounds in the training set. If the number of compounds in the training set is less than 25, 24 initial variables are used. If the number of compounds N is between 25 and 100, the number of initial variables V is given by the equation V = ln(N) × 18.755 − 36.37. If the number of compounds in the training set is between 100 and 2000, the number of initial variables is equal to one-half of the number of compounds. If the number of compounds in the training set exceeds 2000, 1000 initial variables are used.
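This piecewise rule can be sketched directly (the function name and the rounding of the logarithmic branch are assumptions):

```python
import math

def n_initial_variables(n_compounds):
    """Number of initial variables as a function of training-set size:
    24 below 25 compounds, a logarithmic interpolation between 25 and 100,
    half the training-set size between 100 and 2000, and 1000 above 2000."""
    if n_compounds < 25:
        return 24
    if n_compounds <= 100:
        return round(math.log(n_compounds) * 18.755 - 36.37)
    if n_compounds <= 2000:
        return n_compounds // 2
    return 1000
```

Note that the logarithmic branch meets both neighbouring branches: it gives 24 variables at N = 25 and 50 variables at N = 100.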
The GUSAR algorithm uses the three procedures described below to generate different QSAR models based on QNA descriptors. 1) The P and Q values from (7) are multiplied by a value randomly chosen from the range 0.98 to 1.09.
3) The mode of QNA descriptor calculation is randomly chosen from two cases: in the first case, the P and Q values are calculated for all atom types; in the second case, the P and Q values are calculated only for atoms connected to more than one carbon.
All these procedures are used in the creation of each QSAR model based on QNA descriptors and allow obtaining different QNA models.

Self-Consistent Regression (SCR)
The classical MLR has a number of limitations. In particular, the number of objects in the training set should significantly exceed the number of independent variables, and it is important to use only non-collinear variables. To overcome these limitations, we employed an approach based on the statistical regularization of ill-posed problems, which results in the regularized least-squares method:
Σ_i (y_i − Σ_k b_k x_ik)² + Σ_k α_k b_k² → min (9)
where y_i is the value of the dependent variable for the i-th object, x_ik is the k-th independent variable of the i-th object, b_k is the k-th regression coefficient, and α_k is the k-th regularization parameter. Equation (9) has the following solution:
b = (XᵀX + Λ)⁻¹ Xᵀ y
where Xᵀ is the transposed regression matrix X and Λ is a diagonal matrix of the regularization parameters. The best regularization satisfies the equations:
α_k (b_k² + σ² D_kk) = σ², k = 1, …, m
where σ is the standard deviation of the residuals and D_kk is the k-th diagonal element of the matrix (XᵀX + Λ)⁻¹.
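The self-consistent loop can be sketched as follows. The fixed-point iteration of the α_k parameters is an illustrative assumption based on the equations above, not the exact GUSAR implementation:

```python
import numpy as np

def self_consistent_regression(X, y, n_iter=50):
    """Regularized least squares with per-variable regularization updated
    self-consistently: b = (X'X + L)^-1 X'y with L = diag(alpha), where
    alpha_k is chosen so that alpha_k * (b_k^2 + s^2 * D_kk) = s^2, with s
    the residual standard deviation and D = (X'X + L)^-1."""
    n, m = X.shape
    alpha = np.ones(m)                      # initial regularization
    for _ in range(n_iter):
        D = np.linalg.inv(X.T @ X + np.diag(alpha))
        b = D @ X.T @ y
        resid = y - X @ b
        s2 = resid @ resid / n
        # Fixed-point update of the self-consistency condition.
        alpha = s2 / (b ** 2 + s2 * np.diag(D))
    return b
```

On noise-free data the residual variance shrinks towards zero, the α_k vanish, and the solution approaches ordinary least squares.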

Use of Nearest Neighbours
The GUSAR algorithm allows the use of nearest neighbours to improve the prediction value obtained from the regression model. The corrected value is estimated by taking an average over the three compounds in the training set that are most similar to the compound under study. The similarity of any pair of chemical compounds is estimated as the Pearson correlation coefficient calculated in the space of independent variables obtained after SCR. This function usually increases the prediction accuracy on external test sets for QSAR models created on the basis of training sets with 500 or more structures. Sometimes the use of this function may decrease the prediction accuracy, depending on the quality of the training set.
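A literal reading of this correction can be sketched as follows (GUSAR may blend the neighbour average with the regression prediction; the function names are assumptions):

```python
import numpy as np

def pearson_similarity(a, b):
    """Pearson correlation of two descriptor vectors."""
    return np.corrcoef(a, b)[0, 1]

def nn_corrected_value(x, X_train, y_train, k=3):
    """Average activity of the k training compounds most similar (by Pearson
    correlation in the post-SCR variable space) to the query compound."""
    sims = np.array([pearson_similarity(x, row) for row in X_train])
    nearest = np.argsort(sims)[-k:]         # indices of the k highest similarities
    return y_train[nearest].mean()
```

For a query identical to one training compound, that compound and the two next-most-correlated rows determine the corrected value.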

Consensus method
The predicted value is estimated as a weighted average of the predicted values from the obtained QSAR models (only predictions that are within the respective applicability domains are used). The prediction obtained from each developed model is weighted by its applicability domain (AD) values. The GUSAR algorithm calculates the similarity value, the leverage value, and the estimated prediction accuracy of the three nearest neighbours for the applicability domain assessment (see the AD description below). Weighting of the predicted value is performed using the similarity and leverage values only.
If all predicted values from the obtained QSAR models exceed both the similarity and leverage thresholds, the consensus prediction is calculated from the similarity-weighted predicted values. If all predicted values are below both the leverage and similarity thresholds, the consensus prediction is calculated from the leverage-weighted predicted values. If the predicted values exceed the similarity threshold but are below the leverage threshold, the consensus prediction is calculated as the average of the leverage-weighted and similarity-weighted predicted values. If the predicted values are below the similarity threshold but exceed the leverage threshold, the consensus prediction is calculated as the average of all weighted predicted values.
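The case analysis above can be sketched as follows; the exact weighting formulas and the handling of the mixed cases are illustrative assumptions:

```python
import numpy as np

def consensus_prediction(preds, sims, levs, sim_thr, lev_thr):
    """Consensus of per-model predictions `preds`, with similarity AD values
    `sims` and leverage AD values `levs` used as weights, following the
    four-case rule described in the text."""
    preds, sims, levs = map(np.asarray, (preds, sims, levs))
    sim_weighted = np.average(preds, weights=sims)
    lev_weighted = np.average(preds, weights=levs)
    above_sim = (sims > sim_thr).all()
    above_lev = (levs > lev_thr).all()
    if above_sim and above_lev:
        return sim_weighted                  # all models pass both thresholds
    if not above_sim and not above_lev:
        return lev_weighted                  # all models fail both thresholds
    # Mixed cases: average of both weighted estimates.
    return 0.5 * (sim_weighted + lev_weighted)
```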

Similarity
Three nearest neighbours from the training set are determined for each chemical compound under study using a similarity estimation. The similarity of a pair of chemical compounds is estimated as the Pearson correlation coefficient calculated in the space of independent variables obtained after SCR. The average similarity of the three nearest neighbours is used for assessing the applicability domain (AD) of the model. If the average similarity exceeds the threshold, the studied chemical compound falls within the AD of the model, and vice versa. The higher the selected threshold, the more similar the compounds that fall within the AD of the model.

Leverage
The hat value (leverage) was used for applicability domain assessment. Hat values from the leverage matrix, representing the "distance" of a molecule to the structural space of the model, were calculated as:
h = xᵀ (Xᵀ X)⁻¹ x
where x is the vector of descriptors of a query compound and X is the matrix whose rows are the descriptors of the molecules from the training set. The hat value was calculated for each compound of the training set, and then the distribution of the obtained values was estimated.
The warning level of the hat value was defined as a percentile of this distribution; the end-user can select the 99th, 95th or 90th percentile. If a chemical compound from the external test set has a hat value exceeding this warning level, the compound is considered to be outside the applicability domain.
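The leverage-based AD check can be sketched with the standard hat-value formula (the function names are assumptions):

```python
import numpy as np

def hat_values(X):
    """Diagonal of the hat matrix H = X (X'X)^-1 X' for the training set."""
    XtX_inv = np.linalg.inv(X.T @ X)
    return np.einsum('ij,jk,ik->i', X, XtX_inv, X)

def leverage_warning_level(X, percentile=95):
    """Warning level as a percentile (99th/95th/90th) of the training hat values."""
    return np.percentile(hat_values(X), percentile)

def in_leverage_domain(x, X, percentile=95):
    """A query compound is inside the AD if its hat value does not exceed
    the warning level derived from the training distribution."""
    h = x @ np.linalg.inv(X.T @ X) @ x
    return h <= leverage_warning_level(X, percentile)
```

A useful sanity check: the training hat values sum to the number of descriptors (the rank of X), and a query far from the training cloud has a large hat value and falls outside the AD.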

Accuracy of three nearest neighbours' predictions
For the assessment of this type of applicability domain, the following equation is used:
d = RMSE_3NN / RMSE_TS
where d is the applicability domain value, RMSE_3NN is the root-mean-square error of prediction for the three most similar compounds from the training set, and RMSE_TS is the root-mean-square error of predictions for the whole training set.
Three nearest neighbours from the training set are determined for each studied chemical compound using the similarity value. The similarity of two chemical compounds is estimated as the Pearson correlation coefficient calculated in the space of independent variables obtained by SCR.
If d is less than the threshold, the studied chemical compound falls into the AD of the model, and vice versa. Therefore, any chemical compound whose three nearest neighbours are predicted worse than the average prediction accuracy of the whole training set is considered to be outside the applicability domain.
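This criterion can be sketched compactly (the function name and the default threshold of 1 are assumptions):

```python
import numpy as np

def ad_by_neighbour_accuracy(pred_errors_3nn, rmse_training, threshold=1.0):
    """d = RMSE of the three most similar training compounds divided by the
    RMSE of the whole training set; the compound is inside the AD when d is
    below the threshold, i.e. its neighbourhood is predicted no worse than
    the training-set average."""
    rmse_3nn = np.sqrt(np.mean(np.square(pred_errors_3nn)))
    d = rmse_3nn / rmse_training
    return d, d < threshold
```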

Leave-Many-Out procedure
This procedure is used for the assessment of external prediction accuracy. The initial data are randomly divided into a training set and an external test set. The end-user can select the following proportions: 10% for the external test set and 90% for the training set; 20% and 80%; 30% and 70%; or 50% and 50%.
To obtain an objective assessment of the predictive accuracy and robustness of the developed models, each splitting should be repeated many times. The end-user can select 1, 5, 10 or 20 repeats.
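The repeated random splitting can be sketched as follows (the function name and seeding are assumptions):

```python
import numpy as np

def leave_many_out_splits(n, test_fraction=0.2, n_repeats=5, seed=0):
    """Yield (train_indices, test_indices) pairs for repeated random splits;
    `test_fraction` is one of 0.1/0.2/0.3/0.5 and `n_repeats` one of
    1/5/10/20, matching the options offered to the end-user."""
    rng = np.random.default_rng(seed)
    n_test = int(round(n * test_fraction))
    for _ in range(n_repeats):
        perm = rng.permutation(n)           # fresh shuffle for each repeat
        yield perm[n_test:], perm[:n_test]
```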

Y-randomization procedure
This procedure helps to ensure that the developed models are not overfitted. The dependent-variable vector Y is randomly shuffled, and a new QSAR model is developed using the original independent-variable matrix. The resulting models are expected to have low Q² values. This procedure can be repeated many times for each model, and the average Q² value is then calculated.
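A sketch of the procedure is shown below; for brevity it scores each shuffled model on its own fitting set rather than by cross-validation, and `fit_predict` stands for any modelling routine:

```python
import numpy as np

def y_randomization_score(X, y, fit_predict, n_shuffles=10, seed=0):
    """Average determination coefficient of models trained on shuffled y;
    a well-behaved modelling procedure should give values near zero here."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_shuffles):
        y_sh = rng.permutation(y)           # destroy the X-y relationship
        pred = fit_predict(X, y_sh)
        ss_res = np.sum((y_sh - pred) ** 2)
        ss_tot = np.sum((y_sh - y_sh.mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return float(np.mean(scores))
```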

PASS (Prediction of Activity Spectra of Substances) Method
In PASS, biological activities are described qualitatively (active or inactive). Any property of chemical compounds that is determined by their structural peculiarities can be predicted by PASS; thus, the applicability of PASS is broader than the prediction of biological activity spectra.
The MNA descriptors (both for predicting the activity spectrum of a compound and for adding substances to SAR Base) are generated only if the molecular structure meets the following criteria: • each atom in the molecule must be represented by an atom symbol from the periodic table; symbols of unspecified atoms (A, Q, *) or R-group labels are not allowed; • each bond in the molecule must be a covalent bond of single, double or triple type only; • the molecular structure must include three or more carbon atoms; • the molecular structure must consist of a single component; single-atom fragments such as HCl, Cl-, OH-, Na+, etc. (hydrogen atoms are not taken into account) are excluded from MNA descriptor generation; • the structure (its main part, see the previous item) must be uncharged; • the molecular weight of the substance must be less than 1250 Da.
A relevant error message is generated whenever a molecular structure does not meet these criteria or the input data contain other errors.
The PASS training sets for HIV-associated comorbidities include about 240,000 thoroughly selected data records on the structure and biological activity of organic compounds from the general PASS training set. The SAR Base knowledgebase is created during training on the basis of the PASS training set.
The PASS estimations of the biological activity spectra of new compounds are based on the Structure-Activity Relationships knowledgebase (SAR Base), which accumulates the results of the training set analysis. SAR Base includes a biological activities dictionary, an MNA descriptors dictionary, data and knowledge on the "structure - biological activity" relationships, and the DB of structures of compounds from the training set with their biological activity spectra. The structure of a compound is represented as a set of MNA descriptors. Unfortunately, it is currently impossible to compile, from publicly available sources alone, a large collection of biologically active compounds for which assay results for all types of biological activity are known.
The activity spectrum prediction algorithm is easily described using the well-known Bayesian approach. For a chemical compound C whose molecular structure is represented by the set {D_1, …, D_n} of MNA descriptors, we estimate the probability P(A|C) that the compound has activity A. According to the Bayes formula:
P(A|C) = P(A) P(C|A) / P(C)
where P(A) is the prior probability of the activity, P(C|A) is the conditional probability of compound C provided that it has activity A, and P(C) is the prior probability of the compound.
On the assumption that the descriptors D_1, …, D_n are independent, one can write the probability P(C|A) as the product of the conditional probabilities of the particular descriptors:
P(C|A) ≈ Π_i P(D_i|A)
This expression is only approximately true, since the MNA descriptors are a fortiori dependent due to the method of their generation. However, there is no acceptable alternative, and we must keep in mind the approximate nature of the obtained formulas.
After simple transformations, we obtain the expression for the log-likelihood ratio of the conditional probabilities P(A|C) for activity A and P(B|C) for activity B as:
Ln[P(A|C)/P(B|C)] = Ln[P(A)/P(B)] + Σ_i {Ln[P(A|D_i)/P(B|D_i)] − Ln[P(A)/P(B)]}
In the particular case where B is the lack of activity A, we obtain:
Ln[P(A|C)/P(notA|C)] = Ln[P(A)/(1 − P(A))] + Σ_i {Ln[P(A|D_i)/(1 − P(A|D_i))] − Ln[P(A)/(1 − P(A))]}
The implication of this expression is quite clear: the logarithm of the posterior likelihood ratio is the sum of the logarithm of the a priori likelihood ratio and the sum of the individual descriptor contributions. If the activity does not depend on a given descriptor, then P(A|D_i) = P(A), such a descriptor has no effect on the result, and its contribution to the sum is zero. This is the classical result of the probabilistic approach. However, apart from the approximation already noted, this result has another significant, well-known disadvantage: when the conditional probability of activity for a descriptor present in the structure is too close to 0 or 1, the contribution of that descriptor becomes too large and suppresses all other terms of the sum. This effect is most pronounced when the probabilities P(A|D) are frequency estimates based on analysis of the training set, for which the values 0 and 1 are the rule rather than the exception.
To overcome this problem, many different approaches can be proposed, and they were tested during the development of PASS. The best result was obtained by using the so-called Fisher ArcSin(2P − 1) conversion instead of Ln[P/(1 − P)]: its shape coincides with the shape of Ln[P/(1 − P)] for almost all values of P, but the ArcSin(2P − 1) values are bounded by ±π/2. The accuracy of prediction was also improved by replacing the sum of descriptor contributions with their average value, which apparently compensates for the assumption of descriptor independence. The logarithm of the a priori likelihood ratio carries little information about a specific predicted organic compound and can be omitted.
The Bayesian approach described above explains why the PASS prediction algorithm is based on the following statistics: for a molecular structure represented by the set {D_1, …, D_n} of MNA descriptors, the value B_A is calculated for each activity A. The PASS prediction algorithm uses the following data on the "structure-activity" relationships: N, the total number of compounds in the SAR Base; N_D, the number of compounds containing descriptor D in their structure description; N_A, the number of compounds with activity A in their activity spectrum; and N_AD, the number of compounds having both activity A and descriptor D.
The simplest frequency estimates of the probabilities P(A) and P(A|D) are given by:
P(A) = N_A / N, P(A|D) = N_AD / N_D
The estimates of PASS prediction accuracy, and the dependencies required for calculating the probabilities Pa and Pi on the basis of the B_A statistics, are the end result of the training procedure, which consists in the following. According to the SAR Base formed on the basis of the training set, the B_A statistic values are calculated for each kind of activity A, for each active, and for each inactive compound. The calculations are carried out in leave-one-out cross-validation (LOO CV) mode, i.e., after the "exclusion" of the compound from SAR Base, for which it is sufficient not to include it in the sums. Smooth estimates of the distribution functions of B_A for active and inactive compounds are based on the obtained sets of statistics.
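The counting and the resulting statistic can be sketched as follows, assuming the statistic is the average of the ArcSin(2P(A|D) − 1) contributions with the prior term omitted, as described above; the exact PASS statistic, the LOO exclusion and the distribution smoothing are not reproduced here:

```python
import math
from collections import defaultdict

def train_counts(compounds):
    """Accumulate PASS-style counts from (descriptor_set, activity_set)
    pairs: N (compounds), N_D (per descriptor), N_A (per activity),
    N_AD (per activity-descriptor pair)."""
    N, Nd, Na, Nad = 0, defaultdict(int), defaultdict(int), defaultdict(int)
    for descriptors, activities in compounds:
        N += 1
        for d in descriptors:
            Nd[d] += 1
        for a in activities:
            Na[a] += 1
            for d in descriptors:
                Nad[(a, d)] += 1
    return N, Nd, Na, Nad

def b_statistic(descriptors, activity, counts):
    """Average of the bounded contributions arcsin(2*P(A|D) - 1) over the
    descriptors of a molecule, with frequency estimates P(A|D) = N_AD/N_D."""
    N, Nd, Na, Nad = counts
    contribs = [math.asin(2.0 * Nad[(activity, d)] / Nd[d] - 1.0)
                for d in descriptors if Nd[d] > 0]
    return sum(contribs) / len(contribs) if contribs else 0.0
```

A descriptor seen equally often with and without the activity (P(A|D) = 1/2) contributes zero, while a descriptor always accompanied by the activity contributes the bounded maximum π/2 instead of an unbounded log-odds term.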
The probabilities Pa and Pi are both measures of belonging to the subsets of "active" and "inactive" compounds and the probabilities of prediction errors of the first and second kind, respectively. These two interpretations of Pa and Pi are equivalent and can be used for understanding the results of prediction. This allows constructing different criteria for analyzing the prediction results according to the specific practical problem being solved.
An important feature of the PASS prediction algorithm is its robustness to incomplete information on the structures and biological activity spectra of the organic compounds in the training set. A special study showed that halving the actual well-established information about the structure or activity of the organic compounds in the training set only slightly reduces the prediction accuracy in cross-validation. It was also demonstrated that the accuracy estimate based on LOO CV is even more stringent than that based on cross-validation.

Qualitative Characteristics of the (Q)SAR Models
Here we summarize the information on the developed quantitative models for HIV-1 protease, reverse transcriptase and integrase.
Determination coefficients (R²), cross-validated determination coefficients (Q²), Fisher coefficients (F) and root-mean-square errors (RMSE) for quantitative models based on training sets before the cleaning procedure are shown in Tables 8, 9 and 10; observed versus predicted plots for these models are given in Figures 5, 6 and 7.
Figure 6. Observed versus predicted plots for QSAR models based on the Integrity training set.
Determination coefficients (R²), cross-validated determination coefficients (Q²), Fisher coefficients (F) and root-mean-square errors (RMSE) for quantitative models based on training sets after the cleaning procedure are shown in Tables 11, 12 and 13; observed versus predicted plots for these models are given in Figures 8, 9 and 10.
Figure 9. Observed versus predicted plots for QSAR models based on the Integrity training set after curation.
Figure 10. Observed versus predicted plots for QSAR models based on the NIAID training set after curation.

Models based on combined data sets and their validation
Taking into account that the NIAID set was the biggest one, for each of the IN, PR and RT inhibitors we developed three models using the following training sets: I, NIAID + ChEMBL; II, NIAID + Integrity; III, NIAID + ChEMBL + Integrity.
Determination coefficients (R²), cross-validated determination coefficients (Q²), Fisher coefficients (F) and root-mean-square errors (RMSE) for quantitative models based on the combined integrase training sets are shown in Table 14; observed versus predicted plots for these models are given in Figure 11. IAP values for classification models based on the combined integrase training sets are shown in Table 15.
Figure 11. Observed versus predicted plots for QSAR models based on the combined training sets for integrase.
The ChEMBL+NIAID integrase model was validated on an external test set obtained by excluding ChEMBL and NIAID structures from the Integrity data set. Prediction results for the continuous-data model and the classification model are presented in Figure 12 and Table 16.

Figure 12. Observed versus predicted plots for predicting all structures (R² = 0.399) and structures in AD (R² = 0.400) for the ChEMBL+NIAID integrase model.
The Integrity+NIAID integrase model was validated on an external test set obtained by excluding Integrity and NIAID structures from the ChEMBL data set. Prediction results for the continuous-data model and the classification model are presented in Figure 13 and Table 17.
Figure 13. Observed versus predicted plots for predicting all structures (R² = 0.667) and structures in AD (R² = 0.668) for the Integrity+NIAID integrase model.
Determination coefficients (R²), cross-validated determination coefficients (Q²), Fisher coefficients (F) and root-mean-square errors (RMSE) for quantitative models based on the combined protease training sets are shown in Table 18; observed versus predicted plots for these models are given in Figure 14. IAP values for classification models based on the combined protease training sets are shown in Table 19.

Figure 14. Observed versus predicted plots for QSAR models based on the combined training sets for protease.
The ChEMBL+NIAID protease model was validated on an external test set obtained by excluding ChEMBL and NIAID structures from the Integrity data set. Prediction results for the continuous-data model and the classification model are presented in Figure 15 and Table 20.
Determination coefficients (R²), cross-validated determination coefficients (Q²), Fisher coefficients (F) and root-mean-square errors (RMSE) for quantitative models based on the combined reverse transcriptase training sets are shown in Table 22; observed versus predicted plots for these models are given in Figure 17. IAP values for classification models based on the combined reverse transcriptase training sets are shown in Table 23.
Figure 17. Observed versus predicted plots for QSAR models based on the combined training sets for reverse transcriptase.
The ChEMBL+NIAID reverse transcriptase model was validated on an external test set obtained by excluding ChEMBL and NIAID structures from the Integrity data set. Prediction results for the continuous-data model and the classification model are presented in Figure 18 and Table 24.
Figure 18. Observed versus predicted plots for predicting all structures (R² = 0.261) and structures in AD (R² = 0.264) for the ChEMBL+NIAID reverse transcriptase model.
The Integrity+NIAID reverse transcriptase model was validated on an external test set obtained by excluding Integrity and NIAID structures from the ChEMBL data set. Prediction results for the continuous-data model and the classification model are presented in Figure 19 and Table 25.
Figure 19. Observed versus predicted plots for predicting all structures (R² = 0.500) and structures in AD (R² = 0.502) for the Integrity+NIAID reverse transcriptase model.