Quantitative Structure–Activity Relationships for Structurally Diverse Chemotypes Having Anti-Trypanosoma cruzi Activity

Small-molecule compounds that have promising activity against macromolecular targets from Trypanosoma cruzi occasionally fail when tested in whole-cell phenotypic assays. This outcome can be attributed to many factors, including inadequate physicochemical and pharmacokinetic properties. Unsuitable physicochemical profiles usually result in molecules with a poor ability to cross cell membranes. Quantitative structure-activity relationship (QSAR) analysis is a valuable approach to the investigation of how physicochemical characteristics affect biological activity. In this study, artificial neural networks (ANNs) and kernel-based partial least squares regression (KPLS) were developed using anti-T. cruzi activity data for broadly diverse chemotypes. The models exhibited a good predictive ability for the test set compounds, yielding q2 values of 0.81 and 0.84 for the ANN and KPLS models, respectively. The results of this investigation highlighted privileged molecular scaffolds and the optimum physicochemical space associated with high anti-T. cruzi activity, which provided important guidelines for the design of novel trypanocidal agents having drug-like properties.


Introduction
Chagas' disease, which is a neglected tropical disease (as defined by the World Health Organization, WHO) caused by the protozoan Trypanosoma cruzi, is the leading cause of heart failure in Latin America, where it is endemic [1]. According to the WHO, the disease affects 8 million people worldwide and causes 10,000 deaths every year. Moreover, more than 25 million people live in vulnerable areas under the risk of infection [2]. Current chemotherapy for Chagas' disease is limited to nifurtimox and benznidazole, which are two obsolete drugs identified in 1965 and 1971, respectively ( Figure 1). These nitroheterocyclic compounds cause several adverse effects, such as weight loss, neurological damage, anorexia, dermatitis, depression, nausea, and gastrointestinal problems [3,4]. Furthermore, they lack effectiveness in the chronic phase of the disease. Given these drawbacks, novel, effective, and safe drugs for Chagas' disease are urgently needed [5]. T. cruzi parasites interconvert into different morphological phases during their life cycle as they circulate between the insect host (the Triatomine bugs Triatoma infestans, Rhodnius prolixus, and Triatoma dimidiata) and the human host. Replicative epimastigotes and infective metacyclic trypomastigotes develop in Triatomine bugs, whereas replicative intracellular amastigotes and nonreplicative bloodstream trypomastigotes develop in humans [6]. Intracellular amastigotes, which are found in tissues such as cardiac muscles and the digestive system, are the clinically relevant form of the parasite, and thus are the targets of antichagasic agents [6,7]. Occasionally, compounds that are active against isolated macromolecular targets lose their activity when tested in whole-cell phenotypic assays [8][9][10][11]. This activity loss can stem from inappropriate physicochemical properties, which play a key role in the ability of compounds to permeate biological membranes and reach their molecular targets [12,13]. In this context, drug discovery players have unprecedentedly relied on chemoinformatics to better understand the relationships between structure, physicochemical properties, and biological activity [14,15,16]. Quantitative structure-activity relationships (QSAR) have played a major role in this field [17][18][19]. In this study, we developed artificial neural networks (ANNs) and kernel-based partial least squares models (KPLS) aimed at investigating the molecular events underlying the activity of structurally diverse trypanocidal agents [20,21]. The outcome of these models was used to generate a focused fragment collection and physicochemical heat maps, which provide insights into privileged chemotypes and optimum physicochemical property spaces associated with enhanced trypanocidal activity.
ANNs are aimed to mimic biological neural networks and their processing units, the neurons, are composed of dendrites, a cell body, and axons. All input values (the dendrites) are summed and then are assigned to a learning function (the cell body). The input values are the independent variables and the output values are the dependent variables. The signal (axon) can be propagated or inhibited if the value returned by the activation function is above or below a predetermined threshold, respectively [22]. The multi-layer back-propagation algorithm was used in the ANN models. In particular, the back-propagation method uses the forward and backward steps [23]. First, weights are determined, and the biological activity value is predicted for a compound. The error between experimental and predicted values provides support for adjusting the input weight in the first intermediate layer. The main limitation of the algorithm is the convergence of the network due to low and high values in the learning rate. To reduce this limitation, the term momentum ensures that the learning rate is stabilized.
The fingerprint descriptors in the KPLS models are calculated from the smiles representation of each structure in the dataset [24]. These descriptors can be classified as linear, dendritic, radial, and molprint2D [25]. These four descriptors allow the visualization of atomic contribution maps, which depict the contribution of each atom to the dependent variable. The linear fingerprint descriptor uses the information from the linear fragments and ring closure to convert the structures into binary sequences. The dendritic fingerprint includes branched parts of the molecule during the generation of the binary sequence. Also referred to as extended connectivity fingerprints, the radial fingerprint identifies all heavy atoms and encodes the compounds by assigning fragments that emerge radially T. cruzi parasites interconvert into different morphological phases during their life cycle as they circulate between the insect host (the Triatomine bugs Triatoma infestans, Rhodnius prolixus, and Triatoma dimidiata) and the human host. Replicative epimastigotes and infective metacyclic trypomastigotes develop in Triatomine bugs, whereas replicative intracellular amastigotes and non-replicative bloodstream trypomastigotes develop in humans [6]. Intracellular amastigotes, which are found in tissues such as cardiac muscles and the digestive system, are the clinically relevant form of the parasite, and thus are the targets of antichagasic agents [6,7]. Occasionally, compounds that are active against isolated macromolecular targets lose their activity when tested in whole-cell phenotypic assays [8][9][10][11]. This activity loss can stem from inappropriate physicochemical properties, which play a key role in the ability of compounds to permeate biological membranes and reach their molecular targets [12,13]. In this context, drug discovery players have unprecedentedly relied on chemoinformatics to better understand the relationships between structure, physicochemical properties, and biological activity [14][15][16]. Quantitative structure-activity relationships (QSAR) have played a major role in this field [17][18][19]. In this study, we developed artificial neural networks (ANNs) and kernel-based partial least squares models (KPLS) aimed at investigating the molecular events underlying the activity of structurally diverse trypanocidal agents [20,21]. The outcome of these models was used to generate a focused fragment collection and physicochemical heat maps, which provide insights into privileged chemotypes and optimum physicochemical property spaces associated with enhanced trypanocidal activity.
ANNs are aimed to mimic biological neural networks and their processing units, the neurons, are composed of dendrites, a cell body, and axons. All input values (the dendrites) are summed and then are assigned to a learning function (the cell body). The input values are the independent variables and the output values are the dependent variables. The signal (axon) can be propagated or inhibited if the value returned by the activation function is above or below a predetermined threshold, respectively [22]. The multi-layer back-propagation algorithm was used in the ANN models. In particular, the back-propagation method uses the forward and backward steps [23]. First, weights are determined, and the biological activity value is predicted for a compound. The error between experimental and predicted values provides support for adjusting the input weight in the first intermediate layer. The main limitation of the algorithm is the convergence of the network due to low and high values in the learning rate. To reduce this limitation, the term momentum ensures that the learning rate is stabilized.
The fingerprint descriptors in the KPLS models are calculated from the smiles representation of each structure in the dataset [24]. These descriptors can be classified as linear, dendritic, radial, and molprint2D [25]. These four descriptors allow the visualization of atomic contribution maps, which depict the contribution of each atom to the dependent variable. The linear fingerprint descriptor uses the information from the linear fragments and ring closure to convert the structures into binary sequences. The dendritic fingerprint includes branched parts of the molecule during the generation of the binary sequence. Also referred to as extended connectivity fingerprints, the radial fingerprint identifies all heavy atoms and encodes the compounds by assigning fragments that emerge radially 3 of 21 from each atom. Finally, molprint2D is similar to the radial fingerprint and encodes the heavy atom environments by identifying the atom types positioned at different topological distances.

Chemical and Biological Landscape
The dataset used to construct the ANN and KPLS models was selected from the literature and includes 363 structurally diverse compounds . The trypanocidal activity of these molecules is expressed as the concentration of the compound that inhibits 50% the growth of T. cruzi in phenotypic assays (IC 50 ). The IC 50 values range from 2 nM to 97.97 µM (a 48,985-fold activity range) and were converted into pIC 50 values (−log IC 50 ) prior to the QSAR modeling. This wide activity interval follows the broad chemical diversity enclosed in the dataset. The structural and activity landscape covered by the 363 trypanocidal agents is illustrated in Figure 2. In the structure similarity map, the distance among the points is inversely proportional to the structural similarity, and the colors represent different activity ranges (Figure 2A). Based on this map, training and test sets were selected to construct the models. Structurally distinct chemotypes enclosing a wide spectrum of pIC 50 values were included in both the training (280 compounds) and test sets (83 compounds), as depicted in Figure 2B,C. from each atom. Finally, molprint2D is similar to the radial fingerprint and encodes the heavy atom environments by identifying the atom types positioned at different topological distances.

Chemical and Biological Landscape
The dataset used to construct the ANN and KPLS models was selected from the literature and includes 363 structurally diverse compounds . The trypanocidal activity of these molecules is expressed as the concentration of the compound that inhibits 50% the growth of T. cruzi in phenotypic assays (IC50). The IC50 values range from 2 nM to 97.97 µM (a 48,985-fold activity range) and were converted into pIC50 values (−log IC50) prior to the QSAR modeling. This wide activity interval follows the broad chemical diversity enclosed in the dataset. The structural and activity landscape covered by the 363 trypanocidal agents is illustrated in Figure 2. In the structure similarity map, the distance among the points is inversely proportional to the structural similarity, and the colors represent different activity ranges (Figure 2A). Based on this map, training and test sets were selected to construct the models. Structurally distinct chemotypes enclosing a wide spectrum of pIC50 values were included in both the training (280 compounds) and test sets (83 compounds), as depicted in Figure 2B,C.

Artificial Neural Networks
Eleven physicochemical properties were used as molecular descriptors to build the ANNs through which the trypanocidal activity of the dataset compounds were predicted. Hence, prior to running the ANN analyses, the dataset was characterized with respect to its physicochemical profile. Figure 3 shows the distribution of the dataset regarding the following physicochemical descriptors: molecular weight (MW), octanol-water partition coefficient (aLogP), hydrogen bond acceptors (HBA), hydrogen bond donors (HBD), number of rotatable bonds (RB), and heavy atom count (HAC). Figure 4, in turn, shows the physicochemical profile of the dataset regarding ring count (RC), polar surface area (PSA), electrotopological state (E-state), molar refractivity (MR), and molecular polarizability (Polar).

Artificial Neural Networks
Eleven physicochemical properties were used as molecular descriptors to build the ANNs through which the trypanocidal activity of the dataset compounds were predicted. Hence, prior to running the ANN analyses, the dataset was characterized with respect to its physicochemical profile. Figure 3 shows the distribution of the dataset regarding the following physicochemical descriptors: molecular weight (MW), octanol-water partition coefficient (aLogP), hydrogen bond acceptors (HBA), hydrogen bond donors (HBD), number of rotatable bonds (RB), and heavy atom count (HAC). Figure 4, in turn, shows the physicochemical profile of the dataset regarding ring count (RC), polar surface area (PSA), electrotopological state (E-state), molar refractivity (MR), and molecular polarizability (Polar).   The calculated physicochemical descriptors for 280 training set compounds were used as inputs to build backpropagation ANNs. The predictive power of the models was additionally evaluated by using 83 test set molecules. Table 1 shows the performance of the ANNs as a function of the learning rate (LR). The score value was used as the leading parameter to evaluate the performance of the models. The top-scoring model, which was derived with an LR of 0.1, presented a score of 0.80. As the score is derived from the correlation coefficients, this model had good performance regarding these indicators for both the training and test sets (r² = 0.79 and q² = 0.85). The r² value indicates the internal statistical robustness of the models considering the training set only. Otherwise, the q² value indicates the external predictive power, since it considers the performance of the models in predicting the dependent variables of molecules that were not used to train the algorithm. The score value The calculated physicochemical descriptors for 280 training set compounds were used as inputs to build backpropagation ANNs. The predictive power of the models was additionally evaluated by using 83 test set molecules. Table 1 shows the performance of the ANNs as a function of the learning rate (LR). The score value was used as the leading parameter to evaluate the performance of the models. The top-scoring model, which was derived with an LR of 0.1, presented a score of 0.80. As the score is derived from the correlation coefficients, this model had good performance regarding these indicators for both the training and test sets (r 2 = 0.79 and q 2 = 0.85). The r 2 value indicates the internal statistical robustness of the models considering the training set only. Otherwise, the q 2 value indicates the external predictive power, since it considers the performance of the models in predicting the dependent variables of molecules that were not used to train the algorithm. The score value represents the general performance of the models, considering both its internal and external predictive power. Note: LR = learning rate; r 2 = correlation coefficient for the training set; q 2 = correlation coefficient for the test set (r 2 pred ); RMSE = root mean square error; MAE = mean absolute error; RAE = relative absolute error; RRSE = root relative squared error; score Next, the momentum parameter (MP) of the backpropagation algorithm was varied for the top-scoring model of Table 1. This procedure was conducted to search for the best convergence criterion and prevent the ANNs from converging to a local minimum. MP was varied from 0.1 to 0.4 in steps of 0.1, and the resulting models are shown in Table 2. As seen, the variation of MP had no influence on the overall quality of the ANNs, as demonstrated by the resulting scores. Hence, the lowest RMSE values (0.82 and 0.75 for the training and test sets, respectively) were considered to select the model having an MP of 0.2 and a score of 0.80 for further optimization. Importantly, this model maintained good correlation coefficients for both the training (r 2 = 0.79) and test sets (q 2 = 0.85). Note: MP = momentum parameter; r 2 = correlation coefficient for the training set; q 2 = correlation coefficient for the test set (r 2 pred ); RMSE = root mean square error; MAE = mean absolute error; RAE = relative absolute error; RRSE = root relative squared error; score Finally, we varied the number of neurons in the hidden layer from one to 10, while maintaining the optimal values of LR (0.1) and MP (0.2). The best results were produced when seven neurons were added to the hidden layer, as shown in Table 3. A subtle improvement in the score (0.81) was observed in this model compared to the ANN containing the default number of six neurons in the hidden layer.
The statistical indicators shown in Table 3, particularly the correlation coefficient for the test compounds (q 2 = 0.81), show the ability of the best ANN model to predict the trypanocidal activity of novel and structurally diverse compounds. The good agreement between the experimental and predicted pIC 50 values exhibited by this model is graphically illustrated in Figure 5 for the training and test set compounds. Note: NN = number of neurons in the hidden layer; r 2 = correlation coefficient for the training set; q 2 = correlation coefficient for the test set (r 2 pred ); RMSE = root mean square error; MAE = mean absolute error; RAE = relative absolute error; RRSE = root relative squared error; score Note: NN = number of neurons in the hidden layer; r 2 = correlation coefficient for the training set; q 2 = correlation coefficient for the test set (r 2 pred); RMSE = root mean square error; MAE = mean absolute error; RAE = relative absolute error; RRSE = root relative squared error; score = (1-|(r 2 -q 2 )|) × q 2 . # Standard NN.
The statistical indicators shown in Table 3, particularly the correlation coefficient for the test compounds (q² = 0.81), show the ability of the best ANN model to predict the trypanocidal activity of novel and structurally diverse compounds. The good agreement between the experimental and predicted pIC50 values exhibited by this model is graphically illustrated in Figure 5 for the training and test set compounds.

Impact of the Physicochemical Properties on the Trypanocidal Activity
Given the high predictive power of the best ANN (seven neurons in the hidden layer, Table 3), we investigated the role played by each physicochemical descriptor on the trypanocidal activity of the dataset compounds. Each molecular descriptor works as an input applied to the ANN neurons. These descriptors are weighted by an activation function at each neuron, which is responsible for processing and transmitting the signal to the other neurons. These weights can be positive or negative. A positive weight attributed to a given physicochemical descriptor leads to a proportional increase in biological activity; that is, increasing the value of the weight increases the activity, and decreasing this value decreases the biological activity. Otherwise, negative weights lead to an

Impact of the Physicochemical Properties on the Trypanocidal Activity
Given the high predictive power of the best ANN (seven neurons in the hidden layer, Table 3), we investigated the role played by each physicochemical descriptor on the trypanocidal activity of the dataset compounds. Each molecular descriptor works as an input applied to the ANN neurons. These descriptors are weighted by an activation function at each neuron, which is responsible for processing and transmitting the signal to the other neurons. These weights can be positive or negative. A positive weight attributed to a given physicochemical descriptor leads to a proportional increase in biological activity; that is, increasing the value of the weight increases the activity, and decreasing this value decreases the biological activity. Otherwise, negative weights lead to an inversely proportional neuron response, i.e., decreasing the weight value increases the biological activity, and increasing this value decreases the activity. Table 4 shows the weights of each molecular descriptor at each hidden-layer neuron. Among the 11 descriptors, MW showed the greatest difference between the number of positively and negatively weighted neurons. Six neurons had their activation function positively weighted by MW; for most neurons, increments in this property resulted in increased trypanocidal activity. In the second position came aLogP, HBA, HBD, and MR, which positively weighted five out of the seven hidden-layer neurons. The reciprocal ratio was observed for RB, that is, five neurons produced negative values. HAC, RC, and Polar positively weighted four neurons, whereas PSA and E-state showed the inverse ratio, i.e., they negatively weighted four hidden-layer neurons.

Kernel-Based Partial Least Squares
Four molecular fingerprint types (dendritic, linear, molprint2D, and radial) were used as molecular descriptors to generate the 2D QSAR models. These descriptors were correlated with the trypanocidal activity of the dataset compounds using KPLS regression. The statistical indicators for the best models generated with each descriptor are presented in Table 5. All fingerprint types produced models with a similar prediction ability for the test set, with molprint2D performing slightly better (q 2 = 0.84). For the training set, the best correlation coefficients were produced by models generated with dendritic and linear fingerprints (r 2 = 0.89). The highest score, which is the result of the combination of q 2 and r 2 , was produced by the molprint2D-model (score = 0.82). Figure 6 illustrates the good alignment between the experimental and predicted pIC 50 values produced by the molprint2D-model. Considering these results, this model was selected to investigate how the structure of the dataset compounds correlate with the trypanocidal activity. Note: q 2 = correlation coefficient for the test set (r 2 pred ); r 2 = correlation coefficient for the training set; RMSE = root mean square error; SD = standard deviation; N = number of components; score = (1 − |(r 2 − q 2 )|) × q 2 .

Contribution Maps
KPLS models can be assessed for favorable and unfavorable structural characteristics through the generation of atomic contribution maps. Hence, the most relevant structural features for the trypanocidal activity were investigated by generating contribution maps based on the best KPLS model (molprint2D , Table 5). Positive, neutral, and negative contributions are depicted in red, white, and blue, respectively, and the color intensity shows the magnitude of the effect (Figure 7). Overall, heterocyclic aromatic rings contributed positively to activity. Aliphatic hydrocarbon chains showed negative or no influence for most compounds. Halogen substituents showed the full range of contributions-fluorine contributed positively, chlorine and iodine had no influence, and bromine contributed negatively. In general, hydroxyl groups were unfavorable, and piperazine rings were demonstrated to be favorable.

Contribution Maps
KPLS models can be assessed for favorable and unfavorable structural characteristics through the generation of atomic contribution maps. Hence, the most relevant structural features for the trypanocidal activity were investigated by generating contribution maps based on the best KPLS model (molprint2D , Table 5). Positive, neutral, and negative contributions are depicted in red, white, and blue, respectively, and the color intensity shows the magnitude of the effect (Figure 7). Overall, heterocyclic aromatic rings contributed positively to activity. Aliphatic hydrocarbon chains showed negative or no influence for most compounds. Halogen substituents showed the full range of contributions-fluorine contributed positively, chlorine and iodine had no influence, and bromine contributed negatively. In general, hydroxyl groups were unfavorable, and piperazine rings were demonstrated to be favorable.  (Table 5) for some dataset compounds located at both ends of the trypanocidal activity range.

Physicochemical Profile of Favorable Fragments
Twenty-nine active compounds with pIC50 > 6 (see Table S1 for the structures) were selected to construct a collection of 50 fragments. Only molecular fragments that were predicted to enhance the trypanocidal activity (red areas of the contribution maps) were considered in this analysis. Next, physicochemical descriptors were calculated for these fragments and used as an input for the best ANN with the view of predicting their biological activity. Finally, the predicted activity values were correlated with each physicochemical property of the fragment collection. The outcome of this analysis is illustrated as heat maps (Figures 8-10), which allowed us to identify a specific physicochemical space that favors trypanocidal activity. The heat maps also correlate the activity of the compounds from which the fragments were extracted and their physicochemical profile. Figure 8 shows the heat maps for MW, aLogP, HBD, and HBA. Fragments with MW greater than 260 Da were predicted to be the most active (pIC50 > 6). For aLogP, the most active fragments had values predominantly between 2 and 3. Fragments with 0-1 HBD and 1-6 HBAs were predicted to have the highest pIC50 values. Figure 9 illustrates the heat maps for HAC, RB, RC, and PSA. As shown in the figure, fragments with HAC values greater than 20 had the highest pIC50 values. For RB, fragments with two to eight rotatable bonds were the most active. According to the heat maps, fragments with RC values from 2 to 3 were predicted to have the best anti-T. cruzi profile. Finally, fragments with polar surface area (PSA) predominantly between 50 and 80 Å 2 had the highest pIC50 values. Figure 10 shows the heat maps for E-state, MR, and Polar. The ANN predicted the fragments with E-state values between 35 and 63 as being the most active. Fragments with MR ranging from 65 to 115 were predicted to have the highest pIC50 values. Finally, the Polar descriptor was demonstrated to have optimal values ranging from 30 to 53.  (Table 5) for some dataset compounds located at both ends of the trypanocidal activity range.

Physicochemical Profile of Favorable Fragments
Twenty-nine active compounds with pIC 50 > 6 (see Table S1 for the structures) were selected to construct a collection of 50 fragments. Only molecular fragments that were predicted to enhance the trypanocidal activity (red areas of the contribution maps) were considered in this analysis. Next, physicochemical descriptors were calculated for these fragments and used as an input for the best ANN with the view of predicting their biological activity. Finally, the predicted activity values were correlated with each physicochemical property of the fragment collection. The outcome of this analysis is illustrated as heat maps (Figures 8-10), which allowed us to identify a specific physicochemical space that favors trypanocidal activity. The heat maps also correlate the activity of the compounds from which the fragments were extracted and their physicochemical profile. Figure 8 shows the heat maps for MW, aLogP, HBD, and HBA. Fragments with MW greater than 260 Da were predicted to be the most active (pIC 50 > 6). For aLogP, the most active fragments had values predominantly between 2 and 3. Fragments with 0-1 HBD and 1-6 HBAs were predicted to have the highest pIC 50 values. Figure 9 illustrates the heat maps for HAC, RB, RC, and PSA. As shown in the figure, fragments with HAC values greater than 20 had the highest pIC 50 values. For RB, fragments with two to eight rotatable bonds were the most active. According to the heat maps, fragments with RC values from 2 to 3 were predicted to have the best anti-T. cruzi profile. Finally, fragments with polar surface area (PSA) predominantly between 50 and 80 Å 2 had the highest pIC 50 values. Figure 10 shows the heat maps for E-state, MR, and Polar. The ANN predicted the fragments with E-state values between 35 and 63 as being the most active. Fragments with MR ranging from 65 to 115 were predicted to have the highest pIC 50 values. Finally, the Polar descriptor was demonstrated to have optimal values ranging from 30 to 53.   Figure 11 shows the structure and biological activity of 35 fragments that were predicted to be the most promising according to their trypanocidal profile. This group is characterized by a diversity of chemical motifs having two to four rings with the predominant groups being pyridine, pyrimidine, benzene, piperazine, triazole, benzothiazole, benzofuran, oxadiazole, and pyrazolopyrimidine. The four most active fragments from this collection have a phenylsulfonyl-piperazine (fragments 11 and 12) or a phenylpiperazine-carboxamide moiety (fragments 13 and 14) linked to two aromatic rings that are either pyridine, pyrimidine, or benzene. Replacing one of these aromatic rings with a hydrogen, such as in 16 and 22, led to a reduction of the biological activity. The same effect was observed for 24 and 27, in which one aromatic ring was kept and the benzene was replaced with a hydrogen. The replacement of the pyridine in compound 20 with a pyrimidine in compound 21 led to a subtle lowering of the pIC 50 value. Another substitution that affected the biological activity was the exchange between the benzofuran, benzothiazole, and pyrazolopyrimidine in fragments 17, 18, and 19. Among these three compounds, the benzofuran derivative was the most potent.  After analyzing the other molecular scaffolds, it is worth mentioning that for fragments 37 and 38, it was not possible to establish a direct relationship between the presence of the oxadiazole group and trypanocidal activity. Replacing the oxadiazole in compound 37 with a phenyl in 44 decreased the pIC 50 value; however, the same modification involving fragments 35 and 38 produced the opposite effect. Among cyclopentane derivatives 30, 32, and 43, replacing the benzothiazole in fragment 30 with pyrazolopyrimidine and benzofuran in 32 and 43, respectively, decreased the biological activity; the most significant effect occurred for the benzothiazole-benzofuran exchange, which resulted in a decrease of 0.47 in the pIC 50 value. Finally, the insertion of a methyl cyclopentane moiety at the triazole ring of 41 resulted in fragment 36 and increased the trypanocidal activity. Figure 12 shows the overall scheme for the design of novel trypanocidal compounds based on the workflow proposed in this work.  Figure 11 shows the structure and biological activity of 35 fragments that were predicted to be the most promising according to their trypanocidal profile. This group is characterized by a diversity of chemical motifs having two to four rings with the predominant groups being pyridine, pyrimidine, benzene, piperazine, triazole, benzothiazole, benzofuran, oxadiazole, and pyrazolopyrimidine. The four most active fragments from this collection have a phenylsulfonyl-piperazine (fragments 11 and 12) or a phenylpiperazine-carboxamide moiety (fragments 13 and 14) linked to two aromatic rings that are either pyridine, pyrimidine, or benzene. Replacing one of these aromatic rings with a hydrogen, such as in 16 and 22, led to a reduction of the biological activity. The same effect was observed for 24 and 27, in which one aromatic ring was kept and the benzene was replaced with a hydrogen. The replacement of the pyridine in compound 20 with a pyrimidine in compound 21 led to a subtle lowering of the pIC50 value. Another substitution that affected the biological activity was the exchange between the benzofuran, benzothiazole, and pyrazolopyrimidine in fragments 17, 18, and 19. Among these three compounds, the benzofuran derivative was the most potent. Furthermore, replacing the pyrazolopyrimidine in compound 23 with a benzothiazole in compound 15 increased the trypanocidal activity.
After analyzing the other molecular scaffolds, it is worth mentioning that for fragments 37 and 38, it was not possible to establish a direct relationship between the presence of the oxadiazole group and trypanocidal activity. Replacing the oxadiazole in compound 37 with a phenyl in 44 decreased the pIC50 value; however, the same modification involving fragments 35 and 38 produced the opposite effect. Among cyclopentane derivatives 30, 32, and 43, replacing the benzothiazole in fragment 30 with pyrazolopyrimidine and benzofuran in 32 and 43, respectively, decreased the biological activity; the most significant effect occurred for the benzothiazole-benzofuran exchange, which resulted in a decrease of 0.47 in the pIC50 value. Finally, the insertion of a methyl cyclopentane moiety at the triazole ring of 41 resulted in fragment 36 and increased the trypanocidal activity. Figure  12 shows the overall scheme for the design of novel trypanocidal compounds based on the workflow proposed in this work.

Discussion
The physicochemical characterization of the dataset revealed that most compounds follow the Lipinski's Rule of Five, as illustrated in Figure 3 [65]. The determinant role played by these properties

Discussion
The physicochemical characterization of the dataset revealed that most compounds follow the Lipinski's Rule of Five, as illustrated in Figure 3 [65]. The determinant role played by these properties was shown by the analysis of the weights that were attributed to each physicochemical descriptor at the hidden-layer neurons of the best ANN (Table 4). MW, aLogP, HBA, and HBD exhibited the greatest difference between the number of positively and negatively modulated neurons. MW positively weighted six (85.7%) out of the seven hidden-layer neurons, and aLogP, HBA, and HBD positively weighted five (71.4%) neurons. These four descriptors are closely related to bioavailability and the ability to permeate cell membranes, and therefore, the capacity of a compound to reach its molecular target. The number of rotatable bonds, which had a mean value of 5.93 for the whole dataset, also modulated most neurons in the same way-71.4% of the hidden-layer neurons were negatively weighted by this property. HAC and RC positively weighted four out of seven hidden-layer neurons. The predominantly positive weighting profile of HAC and RC can be associated with that of MW and aLogP; an increase in the first two properties generally leads to an increase in the latter two. Another finding worth mentioning is that the KPLS models led to the identification of a set of fragments that are strongly associated with enhanced trypanocidal activity. Most of these fragments contain between two and three rings, which follows the physicochemical profile identified by the ANN and shown in the heat maps for these chemotypes (Figure 9). Aromatic nitrogen-containing rings and fused rings are the most common structural features identified within this collection. Cyclopentane and piperazine are the only representants of aliphatic rings. Functionalized short linkers (from one to four atoms) containing amine, amide, sulfone, or ester groups are found between the cyclic groups. Nonfunctionalized linkers are almost exclusively restricted to methylene groups. Another aspect disclosed in this study was that the heat maps for the favorable fragments showed a more restricted physicochemical space compared to the results for the full molecules. For example, the following physicochemical ranges were predicted to be the most adequate for the fragment collection: MW > 260 Da; aLogP: 2-3; PSA: 50-80 Å 2 ; E-state: 35-63; MR: 65-115; and Polar: 30-53. These findings can be useful guidelines for monitoring the physicochemical profile in Chagas' disease drug design efforts using fragment-like compounds as starting points.

Selection and Construction of the Dataset
The 363 dataset compounds were selected from 39 articles from the Web of Science after eliminating compounds lacking IC 50 values, and duplicated, inorganic, and metal-containing molecules . All structures were built using the default settings of Canvas 2.9 (Schrodinger LLC, New York, NY, USA) [66]. IC 50 values were converted to pIC 50 (−log IC 50 ). The 363 dataset compounds are available in the supporting information (Table S1).

Characterization of the Chemical and Biological Space
To characterize the structural and activity landscape of the dataset, we carried out a principal component analysis (PCA) using SYBYL-X 2.0 and UNITY fingerprints as molecular descriptors (Certara, Princeton, NJ, USA) [67,68]. The PCA result was converted into a structure similarity map that provides information on the structure and activity profiles of the dataset. To generate the structure similarity map, two principal components were initially extracted and applied as initial coordinates of the map. Next, Tanimoto distances between the molecular fingerprints were computed to plot all points of the map. Based on the structure similarity map, we selected the training (280 structures) and test (83 structures) sets to run the QSAR analyses. The training and test set molecules are available in the supporting information (Table S1).

Physicochemical Descriptors
The physicochemical properties used as input to the ANNs were calculated using Canvas 2.9 (Schrodinger LLC, New York, NY, USA) [66]. The following descriptors were calculated: MW, aLogP, HBD, HBA, RB, PSA, E-state, MR, Polar, HAC, and RC.

Backpropagation Artificial Neural Networks
The ANNs were built using the machine learning environment of WEKA 3.6 (University of Waikato, Hamilton, New Zealand) [69]. The calculated physicochemical descriptors and pIC 50 values for the training set were used to train the ANNs [22]. The ANNs were built using the multilayer backpropagation perceptron and a logistic activation function [70]. After the weights were initially determined for each descriptor and the pIC 50 values were predicted, the errors between the experimental and predicted values were used to update the weights for the first intermediate layer [23]. During this procedure, the adjustment of the activation function was performed through the partial derivative of the weights. Variations in the learning rate (0.1-0.4), momentum (0.1-0.4), and number of neurons (1-10) were explored to optimize the ANNs.

Molecular Fingerprints and 2D Contribution Maps
The KPLS models were constructed for the training set using Canvas molecular fingerprints (Schrodinger LLC, New York, NY, USA) as molecular descriptors [66]. Four types of fingerprints were explored: linear, dendritic, radial, and molprint2D [24,25,71]. The best KPLS model, which was obtained with molprint2D descriptors, was used to generate the 2D contribution maps in which red, white, and blue indicate positive, neutral, and negative contributions, respectively.

Heat Maps
Heat maps were constructed to delineate the physicochemical profile of 50 fragments that were highlighted as positive for biological activity. The positive fragments were identified using the 2D contributions maps and extracted from compounds that showed pIC 50 > 6. The physicochemical properties were calculated using Canvas 2.9 (Schrodinger LLC, New York, NY, USA) [66] and used as inputs for the best ANN, which then predicted the pIC 50 for the fragment collection. Next, heat maps were built, in which the physicochemical properties that are associated with higher pIC 50 values could be visually assessed.

Conclusions
The discovery of novel drugs for Chagas' disease remains an outstanding challenge. Progress in this area requires the design of prototypes that combine activities against the molecular target and appropriate pharmacokinetics. Achieving a suitable balance among these different and occasionally conflicting properties requires the development of candidates with finely adjusted physicochemical properties. In this study, a set of 363 structurally diverse compounds covering a broad interval of trypanocidal activity was used to build highly predictive QSAR models. The final ANN showed high predictive power for test set compounds (q 2 = 0.81) and identified critical physicochemical properties associated with the biological activity of the dataset. The best KPLS model, which yielded a q 2 value of 0.84, highlighted key fragments strongly correlated with the anti-T. cruzi activity of the dataset compounds. The integration of the ANN and KPLS analyses enabled the generation of a privileged fragment collection, for which an optimal physicochemical space was determined. The structural information enclosed in the fragment collection along with the delineated physicochemical landscape are valuable information for guiding the design of novel antichagasic agents with improved properties.

Conflicts of Interest:
The authors declare no conflict of interest.

ANN
Artificial Neural Network KPLS Kernel-based Partial Least Squares IC 50 Concentration of compound that inhibits 50% of the growth of T. cruzi in phenotypic assays pIC 50 −