A SAR and QSAR Study of New Artemisinin Compounds with Antimalarial Activity

The Hartree-Fock method and the 6-31G** basis set were employed to calculate the molecular properties of artemisinin and 20 derivatives with antimalarial activity. Maps of molecular electrostatic potential (MEPs) and molecular docking were used to investigate the interaction between ligands and the receptor (heme). Principal component analysis and hierarchical cluster analysis were employed to select the most important descriptors related to activity. The correlation between biological activity and molecular properties was obtained using the partial least squares and principal component regression methods. The regression PLS and PCR models built in this study were also used to predict the antimalarial activity of 30 new artemisinin compounds with unknown activity. The models obtained showed not only statistical significance but also predictive ability. The significant molecular descriptors related to the compounds with antimalarial activity were the hydration energy (HE), the charge on the O11 oxygen atom (QO11), the torsion angle O1-O2-Fe-N2 (D2) and the maximum rate of R/Sanderson Electronegativity (RTe+). These variables led to a physical and structural explanation of the molecular properties that should be selected for when designing new ligands to be used as antimalarial agents.


Introduction
Malaria is a very serious infectious disease caused by protozoans of the genus Plasmodium and is transmitted through the bite of infected female Anopheles mosquitoes. Every year, over one million people die from malaria, especially in tropical and subtropical areas. Most of the deaths are attributed to the parasite species Plasmodium falciparum. Many drugs have been investigated for their efficacy in the treatment of the disease, but strains of P. falciparum resistant to some of these drugs have appeared. Hence, the discovery of new classes of more potent compounds to treat the disease is necessary [1][2][3][4][5][6]. Artemisinin (qinghaosu) has been used in traditional Chinese medicine to treat disease for more than two million years. The medicine is extracted from the plant Artemisia annua L. and is used to combat 52 species of diseases in the People's Republic of China [7]. Artemisinin has a unique structure with a stable endoperoxide lactone (1, 2, 13-trioxane) that is totally different from previous antimalarials in its structure and mode of action. Artemisinin is remarkably effective against Plasmodium falciparum and cerebral malaria [8]. Currently, semi-synthetic artemisinin derivatives play an important role in the treatment of P. falciparum malaria [9][10][11]. Although the true mechanism of their biological activity against malaria has not been completely elucidated, various studies suggest that the trioxane ring is essential for antimalarial activity due to the properties displayed by the endoperoxide linkage. The literature also suggests that free heme could be the target of artemisinin in biological systems and that Fe 2+ interacts with the peroxide when artemisinin reacts with heme [12][13][14][15]. Artemisinin and its derivatives induce a rapid reduction in the number of parasites when compared with other known drugs. Consequently, they are of particular interest for severe cases of malaria. The initial decline in the number of parasites is also beneficial for combination therapies. Therefore, there is an enormous interest in the mechanism of action, chemistry and drug development of this new class of antimalarials. The endoperoxide group is essential for the antimalarial activity and is mediated by activated oxygen (superoxide, H 2 O 2 and/or hydroxyl radicals) or carbon free radicals [16][17][18][19].
In the evolution of computational chemistry, the use of molecular modeling (MM) has been one of the most important advances in the design and discovery of new drugs. Currently, MM is an indispensable tool in not only the process of drug discovery but also the optimization of existing prototypes and the rational design of drug candidates [20][21][22][23]. According to IUPAC, MM is the investigation of molecular structures and properties using computational chemistry and graphical visualization techniques to provide a three-dimensional representation of the molecule under a given set of circumstances [21]. The nature of the molecular properties used and the extent to which they describe the structural features of molecules can be related to biological activity, which is an important part of any Structure-Activity Relationship (SAR) or Quantitative Structure-Activity Relationship (QSAR) study. QSAR studies use chemometric methods to describe how a given biological activity or a physicochemical property varies as a function of the molecular descriptors describing the chemical structure of the molecule. Thus, it is possible to replace costly biological tests or experiments using a given physicochemical property (especially those involving hazardous and toxically risky materials or unstable compounds) with calculated descriptors that can, in turn, be used to predict the responses of interest for new compounds [24]. Recently, Cristino et al. studied nineteen 10-substituted deoxoartemisinin derivatives and artemisinin with activity against D-6 strains of malarial falciparum in Sierra Leone. They used chemometric modeling to reduce dimensionality and determine which subset of descriptors are responsible for the classification between more active (MA) and less active (LA) artemisinins. A predictive study was performed with a new set of eight artemisinins using chemometric methods, and five of them were predicted to be active against D-6 strains of falciparum malaria [25].
In this paper, a SAR and QSAR study of artemisinin and 20 derivatives (see Figure 1) with different antimalarial activities, tested in vitro against P. falciparum (W-2), was performed. Initially, the structures were modeled, and many different molecular descriptors were computed. Maps of the molecular electrostatic potential (MEP) and molecular docking were employed to better understand the correlation between structure and activity and the interaction between the ligands (artemisinin and derivatives) and the receptor (heme). Multivariate analysis methods were used to deal with the large number of descriptors and generate a predictive model [26]. Principal Component Analysis (PCA) and Hierarchical Cluster Analysis (HCA) were employed to choose the molecular descriptors that are most related to the biological property investigated. Then, a QSAR model was elaborated through the Principal Component Regression (PCR) and Partial Least Square (PLS) methods that were used to perform predictions of 30 new artemisinin compounds with unknown antimalarial activity and to aid in future studies searching for other new antimalarial drugs [27][28][29].

Optimization of the Geometry of Artemisinin in Different Methods and Basis Sets
In all three basis sets (HF/6-31G, HF/6-31G*, HF/6-31G**), the Hartree-Fock method describes all structural parameters very well in terms of magnitude and sign when compared to the experimental values (see Table 1). This is in contrast to the AM1, PM3, ZINDO and DFT (B3LYP/3-21G, B3LYP/3-21G*, B3LYP/3-21G**) methods, in which there is not good agreement between the experimental and theoretical values for the torsion angles, especially the angle formed by atoms C3O13C12C12a, with deviations <−13.900° (AM1), <−22.489° (PM3), <−7.880° (ZINDO), >0.020° (HF/6-31G), >2.132° (HF/6-31G*), >2.100° (HF/6-31G**) > −3.759° (B3LYP/ 3-21G), >−3.760° (B3LYP/3-21G*) and >−3.780° (B3LYP/3-21G**) and standard deviations of 4.776, 8.388, 4.372, 1.663, 2.484, 1.762, 1.915, 1.855 and 1.987, respectively. By comparing these methods with the HF method, we find that the HF/6-31G and HF/6-31G** basis sets have low standard deviations in relation to the semiempirical and DFT methods. The variation was ±0.099 between HF/6-31G and HF/6-31G**.   Figure 1. Cont. Table 1 shows that the HF/6-31G, HF/6-31G*, HF/6-31G** basis sets show excellent results for bond length compared to the experimental data. The 6-31G basis set described the bond angles well, with values close to the experimental results. However, the minimum bases (6-31G and 3-21G) have several deficiencies; thus, a polarization function was included to improve upon these bases (i.e., p orbitals represented by *). These orbitals follow restricted functions that are centered at the nuclei. However, the atomic orbitals become distorted or polarized when a molecule is formed. Therefore, one must consider the possibility of non-uniform displacement of electric charges outside of the atomic nucleus, i.e., polarization. Thus, it is possible to obtain a better description of the charges and deformations of atomic orbitals within a molecule. A mode of polarization can be considered by introducing functions for which the values of l (quantum number of the orbital angular momentum) are larger than those of the fundamental state of a given atom. For these types, the basis set names denote the polarization functions. Thus, 6-31G* refers to basis set 6-31G with a polarization function for heavy atoms (i.e., atoms other than hydrogen), and 6-31G** refers to the inclusion of a polarization function for hydrogen and helium atoms [30]. When basis sets with polarization functions are used in calculations involving anions, good results are not obtained due to the electronic cloud of anionic systems, which tend to expand. Thus, appropriate diffuse functions must be included because they allow for a greater orbital occupancy in a given region of space. Diffuse functions are important in the calculations of transition metals because metal atoms have "d" orbitals, which tend to be diffuse. It then becomes necessary to include diffuse functions in the basis function associated with the configuration of a neutral metal atom to obtain a better description of the metal complex. The 6-31G** basis is particularly useful in the case of hydrogen bonds [30][31][32][33][34]. This study highlighted that the HF/6-31G** basis set, which is closer to the experimental results and shows good performance in the description when comparing the C3O13C12 and C12aO1O2 bond angles. The torsion angles or dihedral angle also showed good agreement with the experimental values reported in the literature, showing that with the 6-31G** basis set, the torsion angles O1O2C3O13 and C13C12C12aO1 are closer to the crystallographic data. Artemisinin derivatives with antimalarial activity against Plasmodium falciparum, which is resistant to mefloquine, were studied using quantum chemical methods (HF/6-31G*) and the partial least-squares (PLS) method. Three main components explained 89.55% of the total variance, with Q 2 = 0.83 and R 2 = 0.92. From a set of 10 proposed artemisinin derivatives (artemisinin derivatives with unknown antimalarial activity against Plasmodium falciparum), a novel compound was produced with superior antimalarial activity compared to the compounds previously described in the literature [35]. Cardoso et al. [36] used HF/3-21G** ab initio and PLS methods to design new artemisinin derivatives with activity against P. falciparum malaria. The PLS method was used to build a multivariate regression model, which led to new artemisinin derivatives with unknown antimalarial activity. Additionally, MEP maps for the studied and proposed compounds were built and evaluated to identify common features in active molecules.
Cardoso et al. [37] studied artemisinin and some of its derivatives with activity against D-6 strains of Plasmodium falciparum using the HF/3-21G method. To verify the reliability of the geometry obtained, Cardoso et al. compared the structural parameters of the artemisinin trioxane ring with theoretical and experimental values from the literature. Ferreira et al. [16] studied artemisinin and 18 derivatives with antimalarial activity against W-2 strains of Plasmodium falciparum through quantum chemistry and multivariate analysis. The geometry optimization of structures was performed using the Hartree-Fock method and the 3-21G** basis set. Recently, Santos et al. [38] validated the HF/6-31G** computational methods applied in the molecular modeling of artemisinin, proposing a combination of chemical quantum methods and statistical analysis to study geometrical parameters of artemisinin in the region of the 1, 2, 13-trioxane endoperoxide ring. In determining the most stable structures of the studied compounds as well as the molecular properties, the Hartree-Fock method with the 6-31G** valence basis set separately has been used instead of semiempirical approaches such as AM1, PM3 and ZINDO, due to the number of relatively small compounds.

Molecular Docking
Docking calculations showed that the entire ligand molecule is placed parallel to the plane of the porphyrin ring of heme, and the polar part of the ligand, which contains the peroxide bond, is directed toward the polar part of the heme system containing Fe 2+ . This interaction is visualized in Figure 2 for most active compounds (1, 3, 4, 10, 11, 15, 16, 19 and 20). These orientations were assumed to be the most favorable and therefore to represent the real system under investigation, given that they were chosen based on the lowest free-energy of binding (interaction energy). For the compounds in the studied set, the values of d(Fe-O1) ranged from 2.310 to 2.727 Å; however, this interval for the d(Fe-O2) distances ranged from 2.760 to 3.808 Å. The d(Fe-O13) distances ranged from 4.811 to 5.434, and the d(Fe-O11) distances ranged from 4.897 to 5.525, as shown in Table 2.  For artemisinin (1), the d(Fe-O1) calculated distance was 2.542 Å, which is very close to the value reported (2.7 Å) in other theoretical studies [40,41]. There is a clear trend involving interatomic separation between Fe 2+ and the oxygen atom in the trioxane ring because the distances are shorter for the O1 atom than for the O2 atom. This result reinforces the idea that the O1 atom from artemisinin preferentially binds to the Fe 2+ from heme instead of the O2 atom.
Compounds 4, 10, 11 and 20 have higher activity than artemisinin and also higher values of d(Fe-O1). They have a large substituent that certainly causes repulsion due to steric effects, which prevents them from binding closer to the heme. Compounds 5 and 6 were designed to increase lipophilicity because it was observed that higher lipophilicity of artemisinin correlates with greater biological activity. Compounds 15, 16 and 20 present large substituent groups on the -methylene carbon (*C) that substantially increase the antimalarial activity of the compounds due to electronic and steric effects, respectively. Compound 3 demonstrated that the sugar-containing dihydroartemisinin acetylation derivatives have similar or better activities than artemisinin. However, the deacetylation of sugars reduces the antimalarial activity considerably.
The interaction energy for the ligand/receptor complex showed good linear correlation with activity (r = 0.389177) and ranged from −6.54 to −5.03 kcal·mol −1 when compared with Fe-O1, Fe-O2, Fe-O13 and Fe-O11 distances (Å) ( Table 2). In fact, even though some orientations were associated with the lowest interaction energy, they seemed to have strong activity against malaria because they presented the endoperoxide bond away from Fe 2+ . Currently, the most accepted mechanisms of antimalarial action involve the formation of a complex between heme and artemisinin derivatives in which the iron of heme interacts with O1 of the endoperoxide. Moreover, substituent and conformation effects may affect the charge distribution at the oxygen and even the Fe-O1 bond [35]. An increase in the polar area of artemisinin increases the polar interactions between heme, the ligand and the globin.

Molecular Electrostatic Potential Maps
To identify key characteristics of compounds derived from artemisinin, maps of molecular electrostatic potential (MEPs) were evaluated and used for qualitative comparisons in the region of the 1, 2, 13-trioxane ring of artemisinin and its derivatives. The geometrical form of the potential in the region of the 1, 2, 13-trioxane ring is similar for all active compounds and is characterized by negative electrostatic potential (red region) according to the literature [42].
The MEP visualization is shown in Figure 3. Compounds 2-21 have a region of negative potential near the trioxane ring, similar to the MEP of artemisinin (compound 1), which has an electrostatic potential maximum of 0.13378 u.a. (blue region) and a minimum of −0.12617 u.a. (red region). The maximum positive MEP (blue region) varied from 0.14234 u.a. 0.10429 u.a. for active compounds, while less active compounds ranged from 0.18555 u.a. to 0.14360 u.a. The values corresponding to the minimum negative electrostatic potential (red region) for the most active compounds ranged from −0.10750 u.a. to −0.12617 u.a., presenting potential values close to those of artemisinin. The minimum negative electrostatic potential (red region) for less active compounds ranged from −0.10384 u.a. to −0.12065 u.a., which are higher than those of artemisinin.
The region of negative electrostatic potential is due to the binding of the endoperoxide (C-O-O-C), which is the most notable feature of MEP. The distribution of the electron density around the trioxane ring is thought to be responsible for activity against malaria, a belief supported by the fact that the complexation of artemisinin with heme involves an interaction between the peroxide bond, the most negatively charged zone on the ligand, and Fe 2+ , the most positively charged zone on heme (the receptor molecule) [15,43].
The presence of a negative surface close to the trioxane ring suggests that these compounds have a reactive site for electrophilic attack and must possess antimalarial potency; consequently they are being investigated. Thus, in the case of an electrophilic attack of the iron of heme against an electronegative zone, there is a preference for it to occur through the endoperoxide linkage. By analyzing MEP maps, the selection of inactive compounds can be avoided.

PCA Results
The PCA results showed that the most important descriptors were the following: the hydration energy (HE), charge on the oxygen atom O11 (QO11), torsion angle D2 (O2-O1-Fe-N2) and the maximum rate of R/Sanderson electronegativity (RTe + ). The hydration energy is the energy released when water molecules are separated from each other and are attracted by solute molecules or ions. Hydration energy comprises solvent-solvent and solute-solvent interactions [44]. The charge on the O11 atom (QO11) is a measure of the force with which a particle can electrostatically interact with another particle [45]. O RTe + is a GETAWAY (geometry, topology and set of atomic weights) type descriptor associated with the form, symmetry size and molecular distribution of the atom [46,47]. The torsion angle D2 (O2-O1-Fe-N2) is of great importance in our study; according to the proposal of Jefford and colleagues, the iron of heme attacks artemisinin at O1 and generates a free radical in position O2 after the C3-C4 bond is broken, generating a carbon radical at C4 [48]. This free radical at C4 has been suggested to be an important component of antimalarial activity [49]. Molecular docking of artemisinin and its receptor, the heme group, performed by Tonmunphean, Parasuk and Kokpol also indicated that the iron of the heme group preferentially interacts with O1 rather than O2 [41].  The values of the important descriptors of each selected compound identified via PCA as well as the values of logRA, relative activity (RA) and the IC 50 is the 50% inhibitory concentration are shown in Table 3. The Table 3 shows the Pearson correlation matrix between the descriptors and logRA, and the correlation between pairs of descriptors is less than 0.70, while the correlation between the descriptors and logRA is less than 0.87. The descriptors selected by PCA represent the characteristics necessary to quantify the antimalarial activity of these compounds against Plasmodium falciparum W-2.
The results of the SAR model are presented in Table 4. The model was constructed with three main components (3 PCs). The first principal component (PC1) describes 40.8865% of the total information, the second principal component (PC2) describes 22.7045%, and the third (PC3) 11.5660%. PC1 contains 51.1081% of the original data, and the combination of the first two components (PC1 + PC2) contains 79.4887%, and all three (PC1 + PC2 + PC3) explain 93.9461% of the total information, losing only 6.0539% of the original information. The descriptors HE, D2 and QO11 contribute the most to PC1, while in PC2, the descriptor RTe + is the primary contributor. The main components can be written as a linear combination of the selected descriptors. Mathematical expressions for PC1 and PC2 are shown below.      Figure 5 also shows that the higher the contribution of the descriptor RTe + in the second principal component, i.e., the higher the value of the maximum index of R/Sanderson electronegativity for a certain compound, the higher the score value will be, indicating that the compound is less potent than others. The other descriptors contribute to a lesser degree. For example, the descriptor HE has negative weight in PC2, demonstrating that the most potent compounds generally have higher values of this descriptor. Costa et al. [40] showed that the presence of water changed the dihedral angle involved in the heme-artemisinin complex (C-Fe--O1-O2). Thus, this effect is believed to influence the process of molecular recognition between artemisinin and derivatives and heme in aqueous biological systems. The selection of the torsion angle D2 (O2-O1-Fe-N2) descriptor suggests that the action of drugs against malaria depends on electrophilic attack on the endoperoxide bond, particularly on the O1 atom. This result was confirmed by both an analysis of the MEP maps and by molecular docking as discussed previously.

HCA Results
The statistical analysis utilized in this study should group similar compounds into categories. The categories are represented by a two-dimensional diagram known as dendrogram that illustrates the fusions or divisions made at each successive stage of the analysis. Single samples (compounds) are represented by the branches on the bottom of the dendrogram. The similarity among the clusters is given by the length of their branches, so compounds presenting low similarity have long branches whereas compounds of high similarity have short branches. The HCA method classified the compounds into three classes (more active, less active and less active containing sugar) and was based on the Euclidean distance and the incremental method [50]. In the incremental linkage, the distance between two clusters is the maximum distance between a variable in one cluster and a variable in the other cluster. The descriptors employed to perform HCA were the same as those used for PCA, i.e., HE, QO11, D2 (O2-O1-Fe-N2) and RTe + . In the HCA technique, the distances between pairs of samples are computed and compared. Small distances imply that compounds are similar, while dissimilar samples will be separated by relatively large distances. The dendrogram in Figure 6 shows the HCA graphic as well as the compounds separated into three main classes. The scale of similarity varies from 0 for samples with no similarity to 1 for samples with identical similarity. By analyzing the dendrogram, some conclusions can be drawn even though the compounds present some structural diversity. HCA showed results similar to those obtained with PCA. The compounds are grouped according to their biological activities. The most potent compounds are 1, 3, 4, 10, 11, 15, 16,  19 and 20. The less potent compounds are grouped into two clusters, one of which contains compounds 8, 9, 12, 13, 14, 17, 18 and 21, and the other cluster contains artemisinin derivatives that possess a sugar (2, 5, 6 and 7).

Partial Least Squares (PLS) and Principal Component Regression (PCR) Results
The statistical quality [51] of the PLS and PCR models was gauged by parameters such as correlation coefficient or squared correlation coefficient (R 2 ), explained variance (R 2 ajust , i.e., adjusted R 2 ), standard deviation (s), variance ratio (F), cross-validated correlation coefficient (Q 2 ), standard error of validation (SEV), predicted residual error sum of squares (PRESS) and standard deviation of cross-validation (S PRESS ) [52][53][54]. The best regression models were selected based on high values of R 2 , R 2 ajust , Q 2 and F (a statistic of assessing the overall significance) and low values of s, SEV, PRESS and S press . The calculated properties and the experimental activity values for the compounds studied (Table 5) were used to build the regression models. The models built using the PLS and PCR methods were based on three latent variables, 18 test compounds and 3 compounds (2, 12 and 13) from the external validation set.  (4)) models that relate the descriptors and biological activity are the following:  Table 5. For the PLS, model only five compounds (2, 3, 12, 13 and 15) had high validation errors, and the PCR model yielded five compounds (3, 12, 13, 15 and 17) with high residual values. Our PLS and PCR models present the best fit for compounds with high activity because compounds with low activity showed high residuals values.

The regression Equations obtained for PLS (Equation (3)) and PCR (Equation
The measured versus predicted values using our PLS and PCR models are presented in (Figure 7a,b) respectively. The PLS and PCR plots identify compounds with higher activity (blue) and compounds with lower activity (red), including compounds from the external validation set. According to the PLS and PCR models, the four variables present different magnitudes of regression coefficients (in absolute value). The models reveal that compounds with high biological potency against P. falciparum have a combination of higher values of HE and D2 and lower values of QO11 and RTe + for the PLS model, but for the PCR model, compounds have higher values of HE and D2, lower QO11 values and positive values for RTe + . The validation parameters support the fact that the models are efficient and hence satisfactory given the complexity of the antimalarial mechanisms and the small number of descriptors (four) selected to build the QSAR model. The compounds of the set test were molded from the most stable structure of artemisinin, compound 1 of Figure 1, and constructed using GaussView 5.0 program, carrying the complete optimization of the geometry of each compound with the basis set of separated valence 6-31G** using the Hartree-Fock method as implemented in Gaussian 03 program. After obtain the most stable geometry of each compound was determined only selected descriptors in PCA and used in the construction of the QSAR (PLS and PCR) models, namely EH, QO11, RTe + and D2, shown in Table 6.
The QSAR models (PLS and PCR) were built used to predict the unknown antimalarial activity of thirty new artemisinin derivatives shown in Figure 8, compounds 22-51. Table 7 shows the results of the logRA by PCR and PLS models. According to Table 7 the PLS model showed that fifteen compounds of the test set (22, 23, 27, 30, 32, 34, 36-40, 44, 45, 47, 51) are predicted to be more active, they had values of logRA greater than zero (logRA > 0). However, the PCR model only nine compounds of all test sets (23, 25, 37, 38, 43, 46, 48-50) were predicted as most active, which showed values of logRA higher than zero (logRA > 0), a total of 24 compounds proposed as more active of thirty suggested compounds. However, compounds 23, 37 and 38 were the ones that had values of logRA greater than zero (logRA > 0) in both models (PLS and PCR) with residues of prediction ranging from 0.028951 to −0.1351, suggesting that these new compounds in the two models (PLS and PCR) are more potent than artemisinin may be synthesized and tested for antimalarial activity.   Figure 8. Cont. Table 7. Antimalarial activity predicted (log AR) by PCR and PLS models for the test set compounds and residues of prediction between models. The most potent compounds have logRA  0; the less potent compounds have logRA < 0.

Compounds Studied
Initially, 21 molecules were selected from the literature (Figure 1). Compounds 2-7 were proposed by Lin et al. [55], who found that the acetylation of dihydroartemisinin derivatives containing a sugar leads to similar or better activity than that of artemisinin. However, the deacetylation of sugars considerably reduces the antimalarial activity. Compounds 8-13 were chosen to examine the impact of the stereospecificity of the alkyl side chain on the biological properties and were proposed by Lin et al. [56] to obtain compounds with better biological activity than the antimalarials artelinic acid, artemisinin, artemether and arteether. However, the conversion of esters to their corresponding acids dramatically reduces their antimalarial activity. Compounds 14-21 were proposed by Lin et al. [57,58] because large substituents on the α-methylene carbon (*C) substantially increase the antimalarial activity of the compounds on the basis of electronic and steric effects and because the increased lipophilicity of artemisinin derivatives results in increased antimalarial activity. They were chosen for their in vitro bioactivity against the drug-resistant malarial strain P. falciparum (W-2 clone), which is chloroquine resistant but mefloquine sensitive. The numbering of the atoms used in this study is shown in Figure 1  (compound 1-artemisinin). Because biological data were obtained from different sources, the logarithm of the IC 50 value of artemisinin over the IC 50 value of the compounds (logarithm of relative activity, log RA) was used to reduce inconsistencies caused by individual experimental environments: analog) an of IC n / artemisini of (IC log = log 50 50 RA (5) where IC 50 is the 50% inhibitory concentration. In this study, the following classification based on the antimalarial responses was adopted: compounds with logRA ≥ 0.00, ranging from 0.00000 to 0.86031, were assumed to be more potent analogs (1, 3, 4, 10, 11, 15, 16, 19 and 20), and those with logRA < 0.00, ranging from −0.00634 to −2.40049, were considered to be less potent analogs (2, 5-9, 12-14, 17, 18 and 21). Based on the relative activity (RA) values, compounds 3, 4, 10, 15, 16, and 19 are 2-7 times more potent than artemisinin. Compound 15 is the most potent compound in the series studied.

Geometric Optimization and Descriptor Calculations
Molecular modeling started with the construction of the structure of artemisinin using GaussView 3.0 program [59], which was then optimized with different methods and basis sets-semiempirical (AM1, PM3 and ZINDO), ab initio/Hartree-Fock (HF/6-31G, HF/6-31G* and HF/6-31G**) and DFT (B3LYP/3-21G, B3LYP/3-21G* and B3LYP/3-21G**). These calculations were performed to find the method and basis sets with the best fit between the computational time and accuracy of the information compared to the experimental data [39]. After initial determination and structural optimization of artemisinin, the theoretical geometrical parameters of artemisinin in the region of the 1,2,13-trioxane ring (bond length, bond angle and torsion angle of the atoms that form this ring) were determined with the aim of evaluating the quality of the molecular wave function comparing the theoretical geometrical parameters with the experimental data ( Table 1). The experimental structure of artemisinin was taken from the Cambridge Structural Database CSD, with REFCODES: QNGHSU10, crystallographic R factor 3.6 [60]. All the other structures ( Figure 7) were built with the optimized structure of artemisinin using the Gaussian 03 program [61] with the Hartree-Fock (HF) method and 6-31G** basis set. After the structures were determined in three dimensions, various descriptors for each molecule of the set studied were calculated. They represent different sources of chemical information (features) regarding the molecules and include geometric, electronic, quantum-chemical, physical-chemical and topological descriptors, among others. They are important for the quantitative description of molecular structure and to finding appropriate predictive models [62]. The computation of the descriptors was performed employing the following software: Gaussian 03 program, [61] e-Dragon [63,64], Autodock 4.0 [65], Molekel [66] and HyperChem 6.02 [67]. quantum-chemical descriptors: total energy (ET), energy of the highest occupied molecular orbital (HOMO), a level below the energy of the highest occupied molecular orbital (HOMO-1), lowest unoccupied molecular orbital energy (LUMO), a level above the energy of the lowest unoccupied molecular orbital (LUMO + 1), difference in energy between HOMO and LUMO (GAP = HOMO-LUMO), Mulliken electronegativity (χ), molecular hardness (η), molecular softness (1/η), and charge on the atom n (where n = 1, 2, 3, 4, 5, 5a, 6, 7, 8, 8a, 9, 10, 11, 12, 12a, 13). The atomic charges used in this study were obtained with the key word POP = CHELPG using the electrostatic potential [68]. With this strategy, it was possible to obtain the best potential quantum molecular series of points defined around the molecule, and atomic charges offer the general advantage of being physically more satisfactory than Mulliken charges [69]. (c) Descriptors related to quantitative properties of chemical structure and biological activity: In our data matrix, QSAR descriptors were included, i.e., total surface area (TSA), molecular volume (MV), molar refractivity (MR), molar polarizability (MP), coefficient of lipophilicity (logP), molecular mass (MM) and hydration energy (HE) according to the HyperChem 6.02 program. The molecular descriptors were selected to provide valuable information about the influence of electronic, steric, hydrophilic and hydrophobic features on the antimalarial activity of artemisinins.

Interaction between Artemisinins and Heme
The interaction between the ligands (artemisinins) and the receptor (heme) was studied with molecular docking to determine the best geometry for the complex formed between these two molecules. The geometry of artemisinin and its derivatives (ligands) was designed with HF/6-31G**, whereas the geometry of heme (receptor) was obtained from the 1A6M structure in the RCSB protein data bank (PDB) from Vojtechovsky et al. [70]. The arrangement in the docking calculation took into account the presence of the proximal histidine residue under the plane of the porphyrin ring. This histidine moiety is, as usual, perpendicularly coordinated to Fe 2+ through the sp 2 nitrogen atom of its imidazole ring. Such an arrangement allows the Fe 2+ to attain a nearly octahedral hexacoordinated arrangement after binding to the artemisinin molecule [14]. The orientation of the ligand was set just above the plane of the heme. Then, for each ligand/receptor interaction, 100 conformations were calculated, and the most probable one was determined based on the lowest energies of interaction. Automated docking calculations were performed to develop possible conformations for the complex employing the Lamarckian Genetic Algorithm implemented in the package Autodock 4.0. This program starts the docking by displaying the ligand in an arbitrary conformation and position and looking for favorable dockings with the receptor using both simulating annealing and genetic algorithms. AutoDock uses a random number generator to create new poses for the ligand during its search and estimates the free energy of binding of a ligand to its target. The resulting conformations were ranked in order of increasing binding energy of the lowest binding energy conformation in each cluster.

Molecular Electrostatic Potential Maps
An important concept explored in this research was the correlation of the structure activity of the species studied here through the characteristics of the electrostatic potential in the region of the 1,2,13-trioxane ring because in the literature, artemisinin and derivatives with antimalarial activities present similar patterns in their MEP maps [36,37,71]. Such a method enables the use of a qualitative analysis to locate reactive sites in a molecule and determine the roles played by both the electronic and steric (size/shape) effects on its potency. It is worthwhile to note that the visualization of MEP maps provides qualitative information on molecules, such as the behavior of the interaction between a ligand and a receptor. The MEP at a given point (x, y, z) in the vicinity of a molecule is defined in terms of the interaction energy between the electrical charge generated from the molecules electrons and nuclei and a positive test charge (a proton) located at r. For the studied compounds, the V(r) values were calculated by Equation (6) as described previously (see [72]): where Z A is the charge of nucleus A, located at R A , (r') is the electronic density function of the molecule, and r' is the dummy integration variable. The MEP maps for artemisinin and its derivatives were computed from the atomic charge at the HF/6-31G** level using the Gaussian 03 program, and the results are displayed with Molekel software.

Variable Selection and Model Building
After the determination of all molecular descriptors, it was possible to construct a data matrix to develop step multivariate analysis. Before we began the multivariate analysis, it was necessary to make the autoscale or standardizing data matrix X = (n, m) consisting of 21 lines (the compounds studied) and 1,733 columns (in this case, the calculated descriptors for each molecule), where n is the number of studied compounds and m is the number of variables. The aim of using the standardizing matrix is to give each variable equal weight in mathematical terms, so each variable was centered on the mean and scaled to unit variance. To reduce the data set, variables were selected based on the analysis of the correlation matrix between variables (descriptors) and the logarithm of the relative activity (logRA). Those with small or no correlation (under the 0.30 correlation value cutoff) were discarded, except for QSAR and quantum chemical descriptors, resulting in only 230 descriptors remaining from the initial set of 1,733 descriptors. After this data compression, two complementary methods for exploratory data analysis were employed (PCA and HCA) to study intersample and intervariable relationships and to select the properties that contribute the most to the classification of the compounds into two groups. One group contained more potent analogs and the other less potent analogs. PCA was employed to reduce the dimensionality of the data, find descriptors that could be useful in characterizing the behavior of the compounds acting against malaria and look for natural clustering in the data and outlier samples. While performing PCA, several attempts to obtain a good classification of the compounds were made. At each attempt, the score and loading plots were analyzed based on the variables employed in the analysis. The score plot gives information about the compounds (similarity and differences). The loading plot gives information about the variables (how they are connected to each other and which best describe the variance in the original data). The descriptors selected by PCA were used to perform HCA, PLS and PCR. The objective of HCA was to present the compounds distributed in natural groups and the results confirm the PCA results. Thus, several approaches were attempted to establish links between samples/cluster. All of them were of an agglomerative type because each sample was first defined as its own cluster, and then others were grouped together to form new clusters until all the samples were part of a single cluster.
The QSAR models for the artemisinin compounds studied were constructed by the PCR and PLS methods based on the autoscaled data and the leave-one-out crossvalidation procedure [28,29]. The final purpose of the multivariate analysis (PLS and PCR) was the construction of a mathematical model that can be used to predict antimalarial activity. The samples selected to compose the external validation set were 2, 12 and 13. The statistical parameters used to assess the quality of the models were the Prediction Residual Error Sum of Squares (PRESS), Equation (7), the Standard Error of Validation (SEV), Equation (8), the total variance explained, R 2 (correlation between the estimated values predicted by the model built with the full data set and actual values of y), Q 2 (the cross-validated correlation coefficient) and S PRESS (standard deviation of cross-validation) given by Equations (9)-(11), respectively [28,29,[73][74][75] In Equations (7) and (8), n is the number of compounds used for the calibration or validation model, y i is the experimental value of the physicochemical property for the sample and ŷ i is the value predicted by a calibration or validation model. In Equations (9) and (10), PRESS cal is the Calibration Prediction Error Sum of Squares and PRESS val is the Validation Prediction Error Sum of Squares. Both PRESS cal and PRESS val are evaluated from Equation (7) by changing ŷ i for a calibration or validation model. The values of explained variance (R 2 A , i.e., adjusted R 2 ), standard deviation (s) and F (Fisher test) were determined. The multivariate data analyses (PCA, HCA, PLS and PCR) were performed by employing the Pirouette 3.01 software [50].

Conclusions
The HF method and the 6-31G** basis set revealed themselves to be adequate to optimize the structures of artemisinin and derivatives for consequent study. The molecular docking studies reinforced the idea that the Fe 2+ ion from heme preferentially binds the O1 atom from artemisinins rather than the O2 atom and that such a preference may be due to a greater steric hindrance at O2 than O1 and a more negative charge on the latter atom. Both factors are essential for intermolecular approach. MEP maps characterize the region of the 1,2,13-trioxane ring in artemisinin and derivatives as a region of negative electrostatic potential, and the use of MEP maps identified key structural features necessary for antimalarial activity. Investigation of the interaction with the molecular receptor (heme) showed that the presence of a red surface near the 1,2,13-trioxane ring suggested that these compounds have a reactive site for electrophilic attack. This attack preferentially occurs through the endoperoxide linkage. The predictive classification models for artemisinin derivatives were obtained with a set of molecular descriptors selected by chemometric approaches. PCA and HCA methods classified the studied compounds into groups according to their degree of antimalarial activity against P. falciparum (W-2 clone). The descriptors hydration energy (HE), charge on oxygen atom of O11 (QO11), torsion angle O1-O2-Fe-N2 (D2) and maximum rate of R/Sanderson Electronegativity (RTe + ) were responsible for distinguishing compounds with higher and lower antimalarial activity. The molecular features represented by these descriptors are in good agreement with previous SAR analysis performed on artemisinin derivatives. The combination of these structural attributes is believed to govern the antimalarial effects of the compounds studied in this work. The PLS and PCR models obtained here showed not only statistical significance but also predictive ability. Through this strategy and our findings, useful information was obtained that could be of use in experimental syntheses and biological evaluation to understand the molecular and structural requirements for designing new ligands to be used as antimalarial agents.