Towards the De Novo Design of HIV-1 Protease Inhibitors Based on Natural Products

Acquired immunodeficiency syndrome (AIDS) caused by the human immunodeficiency virus (HIV) continues to be a public health problem. In 2020, 680,000 people died from HIV-related causes, and 1.5 million people were infected. Antiretrovirals control HIV infection but do not cure AIDS. As such, effective treatments must still be developed. Developing a drug is not an easy task, and it demands an enormous amount of work and economic resources. For this reason, it is highly convenient to employ computer-aided drug design methods, which can help generate and identify novel molecules. Using de novo design, novel molecules can be developed using fragments as building blocks. In this work, we develop a virtual focused compound library of HIV-1 viral protease inhibitors from natural product fragments. Natural products are characterized by a large diversity of functional groups, many sp3 atoms, and chiral centers. Pseudo-natural products are combinations of natural product fragments that keep the desired structural characteristics of different natural products. An interactive version of the chemical space visualization of virtual compounds focused on HIV-1 viral protease inhibitors from natural product fragments is freely available in the supplementary material.


Introduction
The acquired immunodeficiency syndrome (AIDS) caused by the human immunodeficiency virus (HIV) is a major global public health concern. In 2020, the World Health Organization (WHO) reported that approximately 37.7 million people were living with HIV, 24.5 million of them in the African region. In 2020, 680,000 people died from HIV-related causes and 1.5 million people acquired HIV [1]. There is no definitive treatment for AIDS. Therefore, it is necessary to work together to develop a treatment, since the antiretroviral drugs currently approved by the Food and Drug Administration (FDA) for clinical use only control AIDS and prevent HIV-1 transmission between individuals (Figure 1 and Table 1) [2][3][4].
Drug design and development demand many years of hard work and economic investment. Most drug candidates are prone to fail [5]. Of 25,000 compounds that start in the laboratory, only 25 make it through preclinical testing to human testing, and just five of those reach actual clinical use [6]. Computer-aided drug design (CADD) has contributed to bringing several drugs into the clinic, yet it has several challenges ahead [7]. Among the CADD methods, de novo design has gained relevance due to the diversity of structures generated by optimizing the algorithms used. From a methodological point of view, artificial intelligence has boosted the development and application of de novo design [5,8,9]. Notably, de novo design is a structure-based drug design method that benefits from the experimental information available on the binding sites of molecular targets.
Figure 1. Chemical structures of ten FDA-approved HIV-1 protease inhibitors (Amprenavir, Atazanavir, Darunavir, Fosamprenavir, Indinavir, Lopinavir, Nelfinavir, Ritonavir, Saquinavir, Tipranavir). The EC50 is the concentration of drug required to produce 50% of the maximum possible effect.
Table 1. FDA-approved HIV-1 protease inhibitors which will be used as a reference for the de novo design of the new chemical compounds. a Fosamprenavir is the phosphate ester prodrug of amprenavir.

Generic Name    Brand Name    EC50

The main goal of de novo design is to suggest novel molecular structures from scratch with desired activity on a pharmacological target and desired properties [10]. The new structures can be made using two general approaches: fragment-based and atom-based. The advantage of the fragment-based approach is that it narrows down the search in chemical space and maintains good chemical structure diversity [11][12][13]. Additionally, because fragments form fewer interactions, they should be able to bind to a greater number of sites on a greater number of proteins. Fragments are small (less than 20 heavy atoms) and typically soluble; they are likely to have better pharmaceutical properties, as are the new chemical compounds generated from them [14]. Over the last 20 years, four drugs from fragment-based drug discovery (FBDD) have been approved, and 40 compounds are currently in clinical trials [15].
Recently, de novo design and artificial intelligence have been combined to propose novel molecules for the treatment of SARS-CoV-2 based on HIV-1 protease and the approved drugs that inhibit this viral protease [8]. Another successful example of de novo design focusing on HIV research led to four molecules from a new compound library generated from the ZINC database [16]. Other de novo design approaches are based on enumerating libraries using chemical reactions [17,18] and are also promising for expanding the epigenetic-relevant chemical space [19].
The development of new chemical compounds using de novo design can begin from natural product-derived fragments. Natural products are attractive chemical compounds because they are characterized by a larger number of sp3 carbon atoms and chiral centers (associated with structural complexity), a larger scaffold diversity, and more functional groups, hence their relevance for use as building blocks [20,21]. Indeed, larger structural complexity of small organic molecules has been associated with increased selectivity and drug-likeness. In previous studies, we showed that natural products cover regions of chemical space that have not yet been explored by synthetically accessible compounds and those with biological activity [22]. For this reason, natural products could be used as building blocks to develop novel synthetic molecules or pseudo-natural products, which combine the desired structural characteristics of different natural products [23].
The goal of this work was to develop a virtual focused compound library of HIV-1 protease inhibitors from natural product fragments through de novo design. The focused library was compared with two virtual libraries of HIV-1 protease inhibitors developed from commercially available fragment libraries that were used as reference. The commercial reference libraries were 4063 ChemDiv fragments (enriched with sp3 carbons) [24], and 4150 natural product fragments from Enamine [25]. The natural product fragments were built from the COlleCtion of Open NatUral producTs (COCONUT), currently the largest accessible database of natural products, with more than 400,000 nonredundant compounds [26]. Of note, the novel chemoinformatics protocol presented herein is general and can be adapted to generate compound libraries using de novo design with different molecular templates and molecular targets. Herein we focus on HIV-1 protease because of its current relevance in public health. Thus, we hope that the present work will contribute towards the research that leads to effective HIV treatments.

Materials and Methods
The virtual focused compound libraries of HIV-1 viral protease inhibitors from natural product fragments and two commercially available fragments libraries were developed using the protocol outlined in Figure 2.
Biomolecules 2021, 11, x FOR PEER REVIEW

Dataset Curation
The preparation of compounds, encoded in Simplified Molecular Input Line Entry System (SMILES) [27], was performed using the open-source cheminformatics toolkit RDKit version 2021.03.3 [28], the MolVS tool version 0.1.1 [29], and the Python programming language, version 3.7.10. Compounds with valence errors or containing any chemical element other than H, B, C, N, O, F, Si, P, S, Cl, Se, Br, and I were deleted. Stereochemistry information was removed because not all compounds in the datasets have it defined. Compounds with multiple components were split, and the largest component was retained. The remaining compounds were neutralized and reionized, and a canonical tautomer was subsequently generated. Repeated compounds were deleted. To narrow down the search in chemical space, physicochemical properties were computed: hydrogen bond donors (HBD), hydrogen bond acceptors (HBA), topological polar surface area (TPSA), number of rotatable bonds (RB), molecular weight (MW), and partition coefficient octanol/water (SlogP). Compounds that complied with the "rule of five" [30] and Veber [31] criteria (MW ≤ 500, HBD ≤ 5, HBA ≤ 10, SlogP ≤ 5, TPSA ≤ 140, RB ≤ 10) were retained. Of note, although some of the fragments used in this work are generated from natural products (as illustrated in Figure 2), the molecules designed are small organic drug-like molecules.
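The last two curation steps (duplicate removal and the property cut-offs) can be illustrated with a minimal sketch. It assumes that canonical SMILES and descriptors have already been computed elsewhere (for instance with RDKit); the function names, the record layout, and the descriptor values below are hypothetical and serve only as an illustration, not as the authors' script.

```python
# Upper limits quoted in the text (rule of five plus Veber criteria).
DRUGLIKE_MAX = {"MW": 500, "HBD": 5, "HBA": 10, "SlogP": 5, "TPSA": 140, "RB": 10}


def is_druglike(props, limits=DRUGLIKE_MAX):
    """Return True when every descriptor is at or below its upper limit."""
    return all(props[name] <= limit for name, limit in limits.items())


def curate(records):
    """Drop repeated structures, then keep those that pass the cut-offs.

    `records` is an iterable of (canonical_smiles, descriptor_dict) pairs;
    this layout is an assumption made for the sketch.
    """
    seen = set()
    kept = []
    for smiles, props in records:
        if smiles in seen:  # repeated compound
            continue
        seen.add(smiles)
        if is_druglike(props):
            kept.append(smiles)
    return kept


# Placeholder descriptor values, NOT computed with a cheminformatics toolkit.
small = {"MW": 46.1, "HBD": 1, "HBA": 1, "SlogP": 0.0, "TPSA": 20.2, "RB": 0}
large = {"MW": 563.1, "HBD": 0, "HBA": 0, "SlogP": 15.0, "TPSA": 0.0, "RB": 37}
records = [("CCO", small), ("CCO", small), ("C" * 40, large)]
kept = curate(records)
```

In this toy input the duplicate entry is dropped first, and the long alkane is then rejected because its placeholder MW, SlogP, and RB values exceed the limits.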

Generation of Unique Fragments Using Retrosynthetic Rules
Fragment libraries were produced with the Retrosynthetic Combinatorial Analysis Procedure (RECAP) as implemented in RDKit. The RECAP algorithm [32] cleaves a molecule into fragments if it contains any of the following bonds: amide, ester, amine, urea, ether, olefin, quaternary nitrogen, aromatic nitrogen-aliphatic carbon, lactam nitrogen-aliphatic carbon, aromatic carbon-aromatic carbon, and sulphonamide.

De Novo Design
The new chemical structures were built based on the template previously proposed by Zhao et al., developed from the structure-activity relationship (SAR) analysis for the optimization of bevirimat (Figure 3), a compound derived from betulinic acid (Figure 4) [33]. Bevirimat [34,35] is a compound in clinical trials that targets the Gag polyprotein, inhibiting the action of HIV protease at its last cleavage event, that of the capsid protein and spacer peptide 1 (CA-SP1) [36,37]. The template proposed for building new chemical compounds related to bevirimat is shown in Figure 5.

Figure 4. Chemical structures of betulinic acid, betulin, cyclic system skeleton derived from betulinic acid, and COCONUT's fragment with betulinic acid ring skeleton derived from 24-nor-3α,11α-dihydroxy-lup-20(29)-en-23,28-dioic acid.
New molecules were generated using the Python programming language and the toolkit RDKit [28], following the protocol described by Saldívar-González et al. to enumerate chemical libraries [18].
We used COCONUT fragments with a cyclic system skeleton similar to betulinic acid, a hydroxyl group attached to carbon 3, and a carboxylic acid group attached to carbon 17, as shown in Figure 4. The COCONUT fragment selected was derived from 24-nor-3α,11α-dihydroxy-lup-20(29)-en-23,28-dioic acid (COCONUT ID: CNP0243494 or Reaxys ID: 6547020). Betulinic acid was used to build new chemical compounds from ChemDiv fragments and Enamine fragments because these collections contained no fragments with a cyclic system skeleton derived from betulinic acid or analogous triterpenes.
Chemical reactions were represented in SMIRKS, a hybrid notation of SMILES and SMARTS (SMILES Arbitrary Target Specification). Reaction 1, esterification, was carried out between the triterpene alcohol and 2,2-dimethyl succinic acid using SMIRKS 1, as shown in Table 2. Reaction 2, amidation, was built from the carboxyl group attached to carbon 17, as shown in Figure 4, using fragments attached to piperazine, 1,2-diaminoethane, and 1,3-diaminopropane found in COCONUT fragments, ChemDiv fragments, and Enamine fragments. SMIRKS 2.1-2.3 were used in reaction 2 and are shown in Table 2. The compounds and fragments were selected using the functional groups in SMARTS notation described in Table 3. Newly generated chemical structures with valence errors were removed. Canonical SMILES were generated, and duplicate molecules were deleted.
Table 2. SMIRKS used for building the new chemical compounds from natural products fragments.

Description    Scheme
Reaction 1

Table 3. Functional groups using SMARTS notation to filter fragments from natural products.

Functional Groups SMARTS
Aliphatic alcohol (cyclohexanol) Cyclic system skeleton derived from betulinic acid

Structural Diversity and Complexity
The structural diversity of the new chemical compounds generated was evaluated by computing the median of the distribution of pairwise similarity values calculated with the Tanimoto coefficient, using Morgan fingerprints with radius 2 (Morgan2, 1024-bits) [38] and Molecular ACCess System (MACCS) keys (166-bits) [39].
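As an illustration of this diversity measure, the following minimal sketch computes the median pairwise Tanimoto similarity of a small set of fingerprints. Fingerprints are represented here as Python sets of on-bit indices, which is an assumption made to keep the example free of external dependencies; the toy fingerprints and the function names are hypothetical.

```python
from itertools import combinations
from statistics import median


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen for this sketch when both are empty
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def median_pairwise_similarity(fingerprints):
    """Median of the Tanimoto values over all unordered pairs."""
    sims = [tanimoto(a, b) for a, b in combinations(fingerprints, 2)]
    return median(sims)


# Toy fingerprints (hypothetical on-bit indices).
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 6, 7}]
result = median_pairwise_similarity(fps)  # pair values: 3/5, 2/6, 2/6
```

A lower median indicates a more diverse dataset, since most pairs of compounds then share few fingerprint bits.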

Chemical Space Visualization
The chemical space visualization was done using two methods, principal component analysis (PCA) based on physicochemical properties and the Tree MAP (TMAP) algorithm based on molecular fingerprints [40,41].
PCA is a linear dimensionality reduction technique that transforms data with many dimensions into a lower dimensional space while preserving the relationships between the data points as much as possible [42]. PCA was performed on six physicochemical properties (MW, HBD, HBA, SlogP, TPSA, and RB).
TMAP allows the visual representation of many chemical compounds through the distance between the clusters and the cluster's detailed structure, using a Locality Sensitive Hashing (LSH) forest data structure that enables c-approximate k-nearest neighbors (k-NN) searches. Morgan fingerprints for chemical compounds were encoded using the MinHash algorithm. The number of nearest neighbors, k = 50, and the factor used by the augmented query algorithm, kc = 10, were used to develop the TMAP graphs. Morgan fingerprints with radius 2 (Morgan2, 1024-bits) were used to generate the TMAP graphs [38]. Applications of TMAP for chemical space visualization of other compound datasets have been reported [43,44].

Filtering of the New Chemical Compounds Generated
To narrow down the search in chemical space and set the conditions for the newly generated compounds, physicochemical properties were computed for the libraries generated and for FDA-approved HIV-1 protease inhibitors (Table 1 and Figure 1). The maximum values of the physicochemical properties obtained from the HIV-1 protease inhibitors were HBD ≤ 6, HBA ≤ 13, SlogP ≤ 6.7, MW ≤ 720.30, TPSA ≤ 174.60, and RB ≤ 17 (Table 4). Molecules that satisfied at least four of these rules were retained, and the SlogP rule had to be satisfied in every case. These sets of properties and values were used as a heuristic rule that is slightly less stringent than the Lipinski and Veber rules [30,31].
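The heuristic rule can be written as a short sketch. It assumes that the SlogP rule counts towards the four required rules, which is one reading of the text; the function name and the second example are hypothetical. The first example uses the property values reported in the Results for the compound generated from ChemDiv fragments.

```python
# Maximum values taken from the FDA-approved HIV-1 protease inhibitors (see text).
PI_MAX = {"HBD": 6, "HBA": 13, "SlogP": 6.7, "MW": 720.30, "TPSA": 174.60, "RB": 17}


def passes_pi_filter(props, limits=PI_MAX, min_rules=4, mandatory=("SlogP",)):
    """Keep a molecule if it meets at least `min_rules` limits, SlogP included."""
    satisfied = {name for name, limit in limits.items() if props[name] <= limit}
    if not all(name in satisfied for name in mandatory):
        return False  # the mandatory SlogP rule is violated
    return len(satisfied) >= min_rules


# Values reported in the Results for the compound built from ChemDiv fragments.
chemdiv_compound = {"SlogP": 6.4, "MW": 737.47, "RB": 10,
                    "TPSA": 187.47, "HBA": 12, "HBD": 5}

# Hypothetical molecule: every limit is met except the mandatory SlogP rule.
too_lipophilic = {"SlogP": 7.0, "MW": 600.0, "RB": 8,
                  "TPSA": 120.0, "HBA": 9, "HBD": 3}

retained = passes_pi_filter(chemdiv_compound)  # SlogP, RB, HBA, HBD are met
rejected = passes_pi_filter(too_lipophilic)
```

Under this reading, the ChemDiv-derived compound is kept because it meets four rules, including SlogP, even though its MW and TPSA exceed the limits.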

Synthetic Feasibility
The complexity of the compounds generated was estimated using the synthetic accessibility score (SAscore) previously reported [45]. The SAscore implemented in this work is the difference between the fragment score and the complexity penalty. The fragment score captures common structural features in a large number of already synthesized molecules (934,046 representative molecules from PubChem). Molecules are fragmented using extended connectivity fragments (ECFP_4# fragments), and the fragment score is calculated as the sum of the contributions of all fragments in the molecule divided by the number of fragments in the molecule. The frequency of a fragment is related to its synthetic accessibility, since easy-to-prepare substructures are present in molecules quite often. The complexity penalty is calculated as the sum of ring complexity (ring bridge atoms and spiro atoms), the number of stereocenters, large rings (molecular complexity increases for ring sizes greater than eight), and molecule size. The SAscore was calculated for the virtual focused libraries of HIV-1 viral protease inhibitors generated and for two reference datasets: FDA-approved drugs and FDA-approved HIV-1 protease inhibitors [46]. The SAscore was calculated using the Python script published by Ertl and Schuffenhauer [45].
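The verbal definition above can be summarized in a compact form. The symbols N (number of fragments in the molecule) and c_i (contribution of fragment i), as well as the names of the four penalty terms, are labels assumed here for illustration; any rescaling of the raw value applied in the published script [45] is not shown.

```latex
\[
\mathrm{SAscore} = \mathrm{fragmentScore} - \mathrm{complexityPenalty},
\qquad
\mathrm{fragmentScore} = \frac{1}{N}\sum_{i=1}^{N} c_i
\]
\[
\mathrm{complexityPenalty} = \mathrm{ringComplexity} + \mathrm{stereoComplexity}
  + \mathrm{macrocyclePenalty} + \mathrm{sizeComplexity}
\]
```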

ADME-Tox Profiling
Absorption, distribution, metabolism, excretion, and toxicity (ADME-Tox) properties of the virtual focused libraries of HIV-1 viral protease inhibitors generated were calculated using the SwissADME server [47] and the pkCSM-pharmacokinetics server [48]. The ADME-Tox properties of FDA-approved drugs were also computed as reference. The SwissADME server was used to compute descriptors associated with absorption and metabolism. The pkCSM-pharmacokinetics server was used to compute descriptors associated with absorption, distribution, excretion, and toxicity. The evaluation of descriptors related to ADME-Tox properties was carried out as previously described [49]. For absorption, the descriptors were solubility (Silicos-IT LogSw), lipophilicity (consensus LogPo/w), and human intestinal absorption (HIA). For distribution, the descriptors were blood-brain barrier (BBB) permeability, P-glycoprotein substrate, P-glycoprotein I inhibitor, and P-glycoprotein II inhibitor (binary values: yes/no). For metabolism, the descriptors were the inhibition of five main cytochrome enzymes (CYP-1A2, CYP-2C19, CYP-2C9, CYP-2D6, CYP-3A4) (binary values: yes/no). For excretion, the descriptor was total clearance, log (mL/min/kg). For toxicity, the descriptors were hERG I/II inhibition, AMES toxicity, and hepatotoxicity (binary values: yes/no).

Results and Discussion
As mentioned in the Introduction and Methods sections, new chemical compounds were built from two commercially available libraries (4063 ChemDiv fragments enriched with sp3 carbons and 4160 Enamine natural product fragments) and from 184,769 COCONUT fragments computationally generated in house. The numbers of molecules generated were 1534 from COCONUT fragments, 62 from ChemDiv fragments, and 11 from Enamine fragments. Fragments attached to 1,3-diaminopropane were not found in the ChemDiv and Enamine fragment collections. Similarly, fragments attached to 1,2-diaminoethane were not found in the Enamine fragments.

Structural Diversity
The median similarity values computed using Morgan2 and MACCS keys fingerprints are shown in parentheses, in that order, and are reported in Table S1 in the supplementary material. FDA-approved drugs (0.096, 0.293) and FDA-approved HIV-1 protease inhibitors (0.253, 0.558) were the most diverse datasets, followed by compounds derived from COCONUT fragments (0.605, 0.817), ChemDiv fragments (0.676, 0.821), and Enamine fragments (0.682, 0.823). Compounds computationally generated from the fragment datasets were less diverse because these datasets are focused on bevirimat-like compounds.

Chemical Space Visualization
A visual representation of the chemical space based on physicochemical properties (MW, HBD, HBA, SlogP, TPSA, and RB, as stated in the Methods Section 2.5) using PCA is shown in Figure 6. Principal component 1 recovered 73.6% of the variance, and principal component 2 recovered 21.2% of the variance. The accumulated variance recovered by the first two principal components represented in Figure 6 was 94.8%. In this chemical space visualization, the compounds generated from the three fragment libraries are within the space of physicochemical properties of FDA-approved drugs. Likewise, some compounds generated from COCONUT fragments had physicochemical properties similar to those of FDA-approved HIV-1 protease inhibitors.
To quantitatively define which dataset is the most diverse, the coverage of space obtained by convex hull analysis derived from PCA was computed for each dataset (Figure S1). The convex hull is defined as the minimum convex polygon such that every point of the set is either inside this polygon or on its border [50,51]. The computed convex hull areas were: FDA-approved drugs (737.59), HIV-1 protease inhibitors (1.11), compounds from COCONUT fragments (3.18), compounds from ChemDiv fragments (0.79), and compounds from Enamine fragments (0.18). The outcome of this analysis was similar to the results of the structural diversity analysis based on fingerprints (Section 3.1): reference datasets were more diverse than the new chemical compounds generated from the fragment datasets. The new chemical compounds derived from COCONUT fragments were the most diverse, followed by new chemical compounds derived from ChemDiv and Enamine fragments.
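The convex hull area of a set of points in the plane of the first two principal components can be obtained as in the following minimal sketch. It uses Andrew's monotone chain algorithm and the shoelace formula, which are standard techniques chosen here for illustration and are not necessarily the implementation used by the authors; the function names and the toy coordinates are hypothetical.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:  # build the lower chain from left to right
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):  # build the upper chain from right to left
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]  # last point of each chain repeats


def polygon_area(vertices):
    """Area of a simple polygon from the shoelace formula."""
    n = len(vertices)
    if n < 3:
        return 0.0
    twice_area = sum(
        vertices[i][0] * vertices[(i + 1) % n][1]
        - vertices[(i + 1) % n][0] * vertices[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2.0


# Toy (PC1, PC2) coordinates: four corners of a square plus one interior point.
scores = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
hull = convex_hull(scores)
area = polygon_area(hull)
```

The interior point does not belong to the hull, so the area equals that of the enclosing square; a dataset spread over a wider range of property values gives a larger area.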
The visual representation of the chemical space based on molecular fingerprints using the TMAP algorithm is shown in Figure 7. An interactive version of the TMAP is available at https://figshare.com/s/ceb58d58e8f5585ce67e (accessed on 5 November 2021). The chemical structures of the new compounds generated were very different from those of FDA-approved drugs and FDA-approved HIV-1 protease inhibitors. The chemical structures of the new compounds generated from ChemDiv and Enamine fragments were very similar to those of compounds derived from COCONUT fragments. In some cases, the chemical structures of compounds generated from COCONUT fragments were very similar to some FDA-approved drugs, for instance, palbociclib and pipecuronium. In these cases, where commercially available fragments similar to the COCONUT fragments are lacking, palbociclib and pipecuronium could be used instead.

Compound Filtering Based on Physicochemical Properties
Figure 8 shows box-whisker plots of physicochemical properties after applying the empirical rules proposed (Section 2.6). The summary of descriptive statistics is shown in Tables S2-S7 in the supplementary material. A total of 352 compounds generated from COCONUT fragments (20%) and 1 compound generated from ChemDiv fragments (2%) were retained, whereas no compounds generated from Enamine fragments were retained (0%). Based on the distributions shown in the box-whisker plots, the physicochemical properties of compounds generated from COCONUT, ChemDiv, and Enamine fragments differed from those of FDA-approved HIV-1 protease inhibitors and FDA-approved drugs. The maximum values were SlogP ≤ 6.70, MW ≤ 720.31, RB ≤ 17, TPSA ≤ 174.56, HBA ≤ 13, HBD ≤ 6 for FDA-approved HIV-1 protease inhibitors; SlogP ≤ 6.69, MW ≤ 998.63, RB ≤ 15, TPSA ≤ 198.54, HBA ≤ 13, HBD ≤ 7 for compounds generated from COCONUT fragments; and SlogP = 6.4, MW = 737.47, RB = 10, TPSA = 187.47, HBA = 12, HBD = 5 for the compound generated from ChemDiv fragments. The SlogP and RB values of compounds generated from COCONUT fragments and ChemDiv fragments were lower than those of FDA-approved HIV-1 protease inhibitors, and the HBA values were equal to or lower than those of FDA-approved HIV-1 protease inhibitors. The SlogP values of compounds derived from Enamine fragments were larger than those of FDA-approved HIV-1 protease inhibitors, as shown in Figure S2; accordingly, no compound was retained. The MW, TPSA, and HBD values of compounds generated from COCONUT fragments were larger than for FDA-approved HIV-1 protease inhibitors and less than for FDA-approved drugs.
As noted by Ganesan [52], natural products that violate the Lipinski rules remain largely compliant in terms of log P and HBD. He considers that "nature has learned to maintain low hydrophobicity and intermolecular H-bond donating potential when it needs to make biologically active compounds with high molecular weight and a large number of rotatable bonds". Among drugs, most of the molecules that exceed 5 HBD or 10 HBA are natural product-related [53].

Filtering Based on Synthetic Feasibility
The synthetic feasibility was computed for FDA-approved drugs, FDA-approved HIV-1 protease inhibitors, and compounds generated from COCONUT and ChemDiv fragments with physicochemical properties like FDA-approved HIV-1 protease inhibitors. Figure 9 summarizes the results of synthetic feasibility. Molecules with a low SAscore value (< 6) are easily synthetically accessible [45]. A total of 97% of FDA-approved drugs had SAscore < 6, and FDA-approved HIV-1 protease inhibitors had SAscore ≤ 4.24. Similarly, 75% of compounds generated from COCONUT fragments had SAscore ≤ 6.03, and the compound generated from ChemDiv fragments had SAscore = 5.54. Compounds generated from COCONUT fragments had 5.50 ≤ SAscore ≤ 6.03, which is still approximately within the recommended range, so they can be considered synthetically accessible. Moreover, the higher SAscore of the generated compounds relative to FDA-approved HIV-1 protease inhibitors was influenced by the ten stereocenters of betulinic acid and 24-nor-3α,11α-dihydroxy-lup-20(29)-en-23,28-dioic acid. Considering that these stereocenters do not have to be generated during the organic synthesis, the SAscore value would be lower.

Figure 9. Box-whisker plot of synthetic feasibility calculated for FDA-approved drugs (blue), FDA-approved HIV-1 protease inhibitors (purple), and new chemical compounds generated from COCONUT (orange) and ChemDiv (red) fragments with physicochemical properties like FDA-approved HIV-1 protease inhibitors. Black diamonds represent outliers.

ADME-Tox Profiling
The ADME-Tox profiling was computed for 251 compounds generated from COCONUT fragments and 1 compound generated from ChemDiv fragments with physicochemical properties like those of FDA-approved HIV-1 protease inhibitors and estimated as easily synthesizable (i.e., SAscore ≤ 6). Similarly, ADME-Tox profiling was computed for FDA-approved drugs and FDA-approved HIV-1 protease inhibitors.

Absorption
Solubility, lipophilicity, and HIA are summarized in Figure 10 and Tables S9-S11 in the supplementary material. Solubility was expressed as Silicos-IT LogSw and lipophilicity as consensus LogP, both computed with the SwissADME server. The percentage of HIA was computed with the pkCSM-pharmacokinetics server.
New drug candidates often have poor water solubility, which is frequently the result of high lipophilicity. For Log P < 2, the crystal lattice becomes the main determining factor for solubility; for Log P values above 2, lipophilicity is the main factor [54]. FDA-approved HIV-1 protease inhibitors were highly soluble, followed by compounds derived from COCONUT and ChemDiv fragments; both had Log P > 2, in which case solubility is strongly influenced by lipophilicity. In contrast, FDA-approved drugs had Log P close to 2 and were less soluble, so their solubility mainly depends on the crystal lattice. Compounds derived from COCONUT and ChemDiv fragments had higher HIA than FDA-approved HIV-1 protease inhibitors.
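The Log P rule of thumb cited from [54] can be written as a small helper. This is only an illustrative sketch: the function name is hypothetical, and because the text does not say what happens at exactly Log P = 2, that case is reported as "unspecified" here by assumption.

```python
def dominant_solubility_factor(log_p, threshold=2.0):
    """Return the factor the text associates with solubility for a given Log P.

    Below the threshold the crystal lattice dominates; above it, lipophilicity.
    ASSUMPTION: a value exactly at the threshold is labeled 'unspecified'.
    """
    if log_p < threshold:
        return "crystal lattice"
    if log_p > threshold:
        return "lipophilicity"
    return "unspecified"


if __name__ == "__main__":
    # Hypothetical Log P values chosen only to exercise each branch.
    for value in (1.5, 2.0, 4.2):
        print(value, dominant_solubility_factor(value))
```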

Distribution
The relative frequency of BBB permeability is described in Figure 11. The median value of BBB permeability was −0.38 for FDA-approved drugs, −1.21 for compounds generated from COCONUT and ChemDiv fragments, and −1.25 for FDA-approved HIV-1 protease inhibitors. Compounds generated from COCONUT and ChemDiv fragments therefore had BBB permeability similar to that of FDA-approved HIV-1 protease inhibitors.
The percentages of compounds that are P-glycoprotein substrates, P-glycoprotein I inhibitors, and P-glycoprotein II inhibitors are summarized in Figure 12 and Table S13 in the supplementary material. All FDA-approved HIV-1 protease inhibitors and 96% of compounds generated from COCONUT and ChemDiv fragments were P-glycoprotein substrates. Similarly, 66.67% of FDA-approved HIV-1 protease inhibitors and 82.9% of compounds generated from COCONUT and ChemDiv fragments were P-glycoprotein II inhibitors. In contrast, no compounds generated from COCONUT and ChemDiv fragments were P-glycoprotein I inhibitors, whereas 100% of FDA-approved HIV-1 protease inhibitors were.

Figure 12. Percentage of compounds that are P-glycoprotein substrates, P-glycoprotein I inhibitors, and P-glycoprotein II inhibitors. Colors represent compounds: new chemical compounds generated from COCONUT and ChemDiv fragments with physicochemical properties like FDA-approved HIV-1 protease inhibitors and easily synthetically accessible (orange), FDA-approved drugs (blue), and FDA-approved HIV-1 protease inhibitors (purple).

Metabolism
The percentages of compounds that inhibit CYP1A2, CYP2C19, CYP2C9, CYP2D6, and CYP3A4 are described in Figure 13 and Table S14 in the supplementary material. No compounds generated from COCONUT and ChemDiv fragments were inhibitors of CYP1A2, CYP2C19, CYP2C9, CYP2D6, or CYP3A4. Like these compounds, FDA-approved HIV-1 protease inhibitors were not CYP1A2 or CYP2D6 inhibitors. However, 89% of FDA-approved HIV-1 protease inhibitors were CYP3A4 inhibitors, and 33% were CYP2C19 and CYP2C9 inhibitors.

Figure 13. Percentage of compounds that inhibit the main cytochromes, CYP1A2, CYP2C19, CYP2C9, CYP2D6, and CYP3A4. Colors represent compounds: new chemical compounds generated from COCONUT and ChemDiv fragments with physicochemical properties like FDA-approved HIV-1 protease inhibitors and easily synthetically accessible, FDA-approved drugs (blue), and FDA-approved HIV-1 protease inhibitors (purple).

Excretion
Clearance quantifies the irreversible removal of a drug from the measured matrix, generally blood or plasma [55]. The logarithm of the total clearance, with clearance expressed in mL/min/kg, is shown in Figure 14. The summary of descriptive statistics is shown in Table S15 in the Supplementary Materials. The median values of the total clearance logarithm were 0.591 for FDA-approved drugs, 0.494 for FDA-approved HIV-1 protease inhibitors, and −0.618 for compounds derived from COCONUT and ChemDiv fragments. The total clearance of FDA-approved HIV-1 protease inhibitors (0.20 ≤ total clearance ≤ 0.94) was similar to that of 75% of FDA-approved drugs (0.27 ≤ total clearance ≤ 0.85), whereas the total clearance of compounds generated from COCONUT and ChemDiv fragments (−1.34 ≤ total clearance ≤ 0.13) was similar to that of 25% of FDA-approved drugs (−13.94 ≤ total clearance ≤ 0.27). The total clearance of compounds derived from COCONUT and ChemDiv fragments therefore differed from that of FDA-approved HIV-1 protease inhibitors.
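To make the logarithmic medians easier to compare, they can be converted back to a linear scale. The sketch below is illustrative only: the function and dictionary names are hypothetical, and it assumes the reported values are base-10 logarithms of clearance in mL/min/kg, which the text does not state explicitly.

```python
def log_clearance_to_linear(log_cl):
    """Convert a log total clearance to mL/min/kg.

    ASSUMPTION: the logarithm is base 10.
    """
    return 10.0 ** log_cl


# Median log total clearance values quoted in the text.
MEDIAN_LOG_CLEARANCE = {
    "FDA-approved drugs": 0.591,
    "FDA-approved HIV-1 protease inhibitors": 0.494,
    "COCONUT and ChemDiv compounds": -0.618,
}

if __name__ == "__main__":
    for group, log_cl in MEDIAN_LOG_CLEARANCE.items():
        linear = log_clearance_to_linear(log_cl)
        print(f"{group}: {linear:.2f} mL/min/kg")
```

Under the base-10 assumption, the medians correspond to roughly 3.9, 3.1, and 0.24 mL/min/kg, respectively, so the generated compounds are predicted to be cleared more than ten times more slowly than the approved protease inhibitors.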

Toxicity
The percentages of compounds in each dataset that are hERG I inhibitors, hERG II inhibitors, hepatotoxicants (hepatotoxicity), and carcinogens (positive in the Ames test) are described in Figure 15 and Table S16 in the supplementary material. FDA-approved HIV-1 protease inhibitors and compounds generated from COCONUT and ChemDiv fragments were not carcinogens. However, 77.22% of compounds derived from COCONUT and ChemDiv fragments were hepatotoxicants, a percentage lower than that of FDA-approved HIV-1 protease inhibitors (100%) and higher than that of FDA-approved drugs (47.42%). A total of 100% and 98.81% of compounds generated from COCONUT and ChemDiv fragments were not hERG I and hERG II inhibitors, respectively.

Conclusions
We developed an HIV-1 virtual focused library using de novo design based on enumerated libraries of compounds from fragment libraries. The in-house fragment library was built from the COCONUT database, currently the largest accessible database of natural products. Using bevirimat as a template, 251 out of 1534 compounds generated from COCONUT fragments had physicochemical properties like those of FDA-approved HIV-1 protease inhibitors and were estimated as easily synthesizable.
Compounds generated from COCONUT fragments were more diverse than compounds generated from ChemDiv and Enamine fragments, based on chemical structure and physicochemical properties. Visual representation of the chemical space based on TMAP showed that some compounds generated from COCONUT fragments had chemical structures similar to FDA-approved drugs, such as palbociclib and pipecuronium. ADME/Tox profiling showed that compounds generated from COCONUT fragments had absorption (solubility and lipophilicity) and distribution (BBB permeability, P-glycoprotein substrate, and P-glycoprotein II inhibitor) properties similar to those of FDA-approved HIV-1 protease inhibitors. Concerning estimations of metabolism, no compounds generated from COCONUT fragments were CYP1A2, CYP2C19, CYP2C9, CYP2D6, or CYP3A4 inhibitors. Regarding excretion, the total clearance of compounds derived from COCONUT fragments differed from that of FDA-approved HIV-1 protease inhibitors but was similar to that of FDA-approved drugs. Compounds derived from COCONUT fragments were predicted not to inhibit hERG I or hERG II, like 97.7% and 66.4% of FDA-approved drugs, respectively. Compounds derived from COCONUT fragments were also predicted to be non-carcinogenic.
The 251 compounds derived from COCONUT fragments with physicochemical properties like those of FDA-approved HIV-1 protease inhibitors, estimated as easily synthesizable, and with a good ADME/Tox profile can be used in future analyses, such as virtual screening, to select candidates for testing in biological assays. The next logical step of this project, which is beyond the scope of this manuscript, is to conduct the chemical synthesis and experimental screening of selected compounds.
The protocol presented in this work is general and can be used to build other chemical compounds like bevirimat or other maturation inhibitors of HIV-protease. This can be achieved with the SMARTS and SMIRKS proposed to filter functional groups and build new chemical compounds. Notably, the code used to generate new chemical compounds from chemical fragments is freely available (see Data Availability statement).
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.3390/biom11121805/s1, Figure S1: Convex hull area from PCA based on physicochemical properties of new chemical compounds generated and two compound reference libraries. Table S1: Summary of fingerprint-based structural diversity of new chemical compounds generated from COCONUT, ChemDiv, and Enamine fragments, and two compound reference libraries. Figure S2: Box-whisker plots of physicochemical properties of FDA-approved drugs (blue), FDA-approved HIV-1 protease inhibitors (purple), and new chemical compounds generated from COCONUT (orange), ChemDiv (red), and Enamine (green) fragment libraries, before applying physicochemical properties filtering. Table S2: Summary of the descriptive statistics of SlogP. Table S3: Summary of the descriptive statistics of MW. Table S4: Summary of the descriptive statistics of RB. Table S5: Summary of the descriptive statistics of TPSA. Table S6: Summary of the descriptive statistics of HBA. Table S7: Summary of the descriptive statistics of HBD. Table S8: Summary of the descriptive statistics of SAscore. Table S9: Summary of the descriptive statistics of solubility (Silicos-IT LogSw). Table S10: Summary of the descriptive statistics of lipophilicity (Consensus Log P). Table S11: Summary of the descriptive statistics of HIA. Table S12: Summary of the descriptive statistics of BBB permeability. Table S13: Percentage of compounds that are P-glycoprotein substrate, P-glycoprotein I inhibitor, and P-glycoprotein II inhibitor. Table S14: Percentage of compounds that inhibit the main cytochromes, CYP1A2, CYP2C19, CYP2C9, CYP2D6, CYP3A4. Table S15: Summary of the descriptive statistics of total clearance. Table S16: Summary of the descriptive statistics of toxicity descriptors.