Antiproliferative Activity Predictor: A New Reliable In Silico Tool for Drug Response Prediction against NCI60 Panel

In vitro antiproliferative assays still represent one of the most important tools in the anticancer drug discovery field, especially to gain insights into the mechanisms of action of anticancer small molecules. The NCI-DTP (National Cancer Institute Developmental Therapeutics Program) undoubtedly represents the most famous project aimed at rapidly testing thousands of compounds against multiple tumor cell lines (NCI60). The large amount of biological data stored in the National Cancer Institute (NCI) database and many other databases has led researchers in the fields of computational biology and medicinal chemistry to develop tools to predict the anticancer properties of new agents in advance. In this work, based on the available antiproliferative data collected by the NCI and the manipulation of molecular descriptors, we propose the new in silico Antiproliferative Activity Predictor (AAP) tool to calculate the GI50 values of input structures against the NCI60 panel. This ligand-based protocol, validated by both internal and external sets of structures, has proven to be highly reliable and robust. The obtained GI50 values of a test set of 99 structures present an error of less than ±1 unit. The AAP is more powerful for GI50 calculation in the range of 4–6, showing that the results strictly correlate with the experimental data. The encouraging results were further supported by the examination of an in-house database of curcumin analogues that have already been studied as antiproliferative agents. The AAP tool identified several potentially active compounds, and a subsequent evaluation of a set of molecules selected by the NCI for the one-dose/five-dose antiproliferative assays confirmed the great potential of our protocol for the development of new anticancer small molecules. The integration of the AAP tool in the free web service DRUDIT provides an interesting device for the discovery and/or optimization of anticancer drugs to the medicinal chemistry community. 
The training set will be updated with new NCI-tested compounds to cover more chemical spaces, activities, and cell lines. Currently, the same protocol is being developed for predicting the TGI (total growth inhibition) and LC50 (median lethal concentration) parameters to estimate toxicity profiles of small molecules.


Introduction
In the search for new chemical compounds endowed with anticancer properties, in vitro antiproliferative screening remains one of the most commonly used approaches to identify new biologically active compounds. As demonstrated by the numerous research projects focused on the characterization of tumor cells (including National Cancer Institute Human Tumor Cell Lines Screen, NCI60 [1]; Cancer Cell Line Encyclopedia, CCLE [2]; Genomics of Drug Sensitivity in Cancer, GDSC [3]; The Cancer Genome Atlas Program,
features. As a result, it showed better cell viability (IC50) prediction accuracy in pan-cancer cell lines over two independent cancer cell line datasets. More recently, pdCSM-cancer, which uses a graph-based signature representation, has been used to estimate the antiproliferative activity against multiple cancer cell lines [25].
In this work, considering our expertise in using molecular descriptors for biological purposes [26][27][28][29][30], we propose a new in silico Antiproliferative Activity Predictor (AAP) tool that can predict the antiproliferative activity of compounds against the NCI60 panel. The protocol, which is freely available on the DRUDIT web service (https://www.drudit.com (accessed on 17 November 2022) [28]), was based on both the molecular descriptors and in vitro antiproliferative data of tested compounds belonging to the NCI database.
In this article, the computational functions and the validation of the AAP tool are discussed in detail. Since our research group has focused on the synthesis of new small molecules with anticancer properties for many years [31][32][33], we decided to evaluate the performance of the AAP protocol by predicting the anticancer potential of an in-house small molecule database. The anticancer properties of several curcumin-like derivatives, some of which had already been evaluated as neuroprotective agents, were investigated [34][35][36]. To validate and confirm the AAP in silico data, in vitro antiproliferative assays of selected compounds were performed as a part of the NCI60 DTP screening program.

NCI60 Antiproliferative Activity Predictor Tool
The Antiproliferative Activity Predictor (AAP) tool is a new molecular-descriptor-based protocol for predicting the anticancer activities, expressed as GI50, of small molecules against the NCI60 panel. The AAP has been included as a module in DRUDIT, a web service that has already proven to be a reliable and valuable support for the development of new small molecules with biological activities [28,37].

Description of the Tool Learning Process
The AAP tool consists of sequential steps, as reported in the flowchart (Figure 1).
Figure 1. Flowchart of the antiproliferative predictor protocol. The FP module ranks the training structures by the S score (0-1) and assigns GI50(FP); the final value is $\mathrm{GI}_{50,i} = \mathrm{GI}_{50,i}(\mathrm{FP}) \times S + \mathrm{GI}_{50,i}(\mathrm{CL}) \times (1 - S)$, where GI50i is the GI50 value for cell line i, S is the fingerprint score, GI50i(FP) is the GI50 value predicted by the FP module, and GI50i(CL) is the GI50 value assigned by the CL module.
First, the NCI60 database, which contains in vitro data on the antiproliferative effects, expressed as GI50, of many compounds, was selected [10] (Figure 2).
Figure 2. NCI selection data used in the AAP tool: compounds screened in the five-dose assay and published until 2014 were used as the training set; 5‰ of these structures were used for the internal validation of the protocol (panels A, C, D); structures screened in the five-dose assay and published until 2016 were used as the test set to evaluate the predictive performance of the entire protocol (panels B, E).
From the thousands of structures tested by the NCI (Figure 2A), those that were screened in a five-dose assay were selected by considering the experimental GI50 values (Figure 2C). Then, according to publication dates, they were split into a training set (Figure 2C, data published until September 2014, referred to as NCI2014DB), which was used to build the model, and a test set (Figure 2E, data published until June 2016, referred to as NCI2016DB (Figure 2B)), which was used to validate the tool [38].
The AAP protocol results were obtained from the weighted contributions of two "modules", called Finger-Print (FP) and Cell-Lines (CL); these contributed through a series of steps (Figure 1) to assign -logGI50 values (indicated as GI50 in the following sections) to input structures against the NCI60 panel cell lines (Ln).
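As a small worked illustration of the -logGI50 scale used from here on, the conversion can be sketched as follows. This is a minimal sketch, assuming the GI50 is expressed as a molar concentration (the text does not state the unit explicitly); the function name `neg_log_gi50` is hypothetical and not part of the AAP tool.

```python
import math


def neg_log_gi50(gi50_molar):
    """Convert a GI50 concentration (ASSUMED to be in mol/L) to -log10(GI50)."""
    if gi50_molar <= 0:
        raise ValueError("GI50 must be a positive concentration")
    return -math.log10(gi50_molar)


# Illustrative concentrations only: 100 µM, 10 µM and 10 nM.
for concentration in (1e-4, 1e-5, 1e-8):
    print(concentration, round(neg_log_gi50(concentration), 2))
```

On this scale, larger values correspond to more potent compounds, which is how the GI50 ranges quoted in the following sections should be read.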
Preliminarily, a set of molecular descriptors (D1, D2, D3, ..., Dn) was calculated for the training set (38 k structures, N, from NCI2014DB, Figure 2C). The molecular descriptor calculation was performed by the MOLDESTO software (Supporting Information S1 contains a list of molecular descriptors implemented in MOLDESTO; see the Methods Section), which calculates more than 1000 1D, 2D, and 3D molecular descriptors (Di).
Then, using the molecular descriptor matrix (structures versus molecular descriptors), the FP and CL modules were built as described below.
The FP module relies on structure similarity. It matches the molecular descriptor sequence of the input structure Di(Xj) (Figure 1) with the Di sequences of the structures belonging to the training set and assigns the S score as follows:
$$S = \frac{N(D_i(X_j))}{N(D_i)}$$
where N(Di(Xj)) is the number of molecular descriptors Di(Xj) of the input structure whose value falls in the range Di ± Di × 0.05, and N(Di) is the number of molecular descriptors used. By ranking according to the S score, the protocol assigns to the input structure Xj the experimental GI50 values of the best-scored structure of the training set (GI50(FP)). If experimental GI50 data of the training set structure are missing for one or more cell lines, a GI50 value is not assigned for those cell lines.
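The similarity score described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: descriptors are plain lists of floats, the ±5% window is taken around the training-set value, and the names `fp_score` and `best_fp_match` as well as the toy data are hypothetical, not the DRUDIT implementation.

```python
def fp_score(query, reference, tolerance=0.05):
    """Fraction of query descriptors lying within reference ± tolerance * |reference|.

    A reference value of exactly 0 only matches a query value of exactly 0.
    """
    if not query or len(query) != len(reference):
        raise ValueError("descriptor vectors must be non-empty and of equal length")
    matches = sum(
        1 for q, r in zip(query, reference) if abs(q - r) <= tolerance * abs(r)
    )
    return matches / len(query)


def best_fp_match(query, training_set):
    """Return (name, S, gi50_by_cell_line) for the best-scored training structure.

    training_set maps a structure name to (descriptors, gi50_by_cell_line);
    a cell line with no experimental data is stored as None.
    """
    best_name = max(training_set, key=lambda name: fp_score(query, training_set[name][0]))
    descriptors, gi50_by_cell_line = training_set[best_name]
    return best_name, fp_score(query, descriptors), gi50_by_cell_line


# Toy data (invented for illustration only).
training = {
    "A": ([1.02, 2.5, 3.0, 4.1], {"SK-MEL-28": 5.3, "786-0": None}),
    "B": ([9.0, 9.0, 9.0, 9.0], {"SK-MEL-28": 4.0, "786-0": 4.2}),
}
print(best_fp_match([1.0, 2.0, 3.0, 4.0], training))
```

Keeping the missing values as None mirrors the behaviour described in the text, where no GI50(FP) is assigned for cell lines lacking experimental data.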
With this aim, the structures of the training set were assigned to each template according to the experimental GI50 values. In detail, all the structures that, for the specific cell line CLi, had a GI50 value in the appropriate range were assigned to the related template (Figure 3). Then, the mean (μi) and standard deviation (σi) were computed for all molecular descriptors, considering the structures belonging to each template.
Thus, for each of the sixty cell lines, the molecular descriptor values of the input structure Xj were matched with the 42 cell line templates. The input structure Xj was assigned the GI50 value (GI50(CL)) of the template with the highest matching score (Figure 4). This score was assigned by comparing the sequence of Di values of the input structure Xj with the sequence of μi(Di) ± σi(Di) ranges of the cell line template. If a Di value was in the range, a value of 1 was assigned; otherwise, a value of 0 was assigned. The sum of these binary scores, normalized to the number of molecular descriptors, gave the CL score (Figure 4); the template with the highest CL score determined the GI50(CL) assignment (Figure 4).
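The template matching just described can be illustrated with a short Python sketch. It rests on several assumptions that the text does not settle: each template is labelled by a single GI50 value, the population standard deviation is used, and ties are resolved in favour of the first template encountered. All names (`build_template`, `cl_score`, `assign_gi50_cl`) and the toy numbers are hypothetical.

```python
from statistics import mean, pstdev


def build_template(descriptor_rows):
    """Per-descriptor (mean, standard deviation) over the structures of one template."""
    columns = list(zip(*descriptor_rows))
    return [(mean(column), pstdev(column)) for column in columns]


def cl_score(query, template):
    """Fraction of descriptors of the query falling inside mean ± std of the template."""
    hits = sum(
        1 for d, (mu, sigma) in zip(query, template) if mu - sigma <= d <= mu + sigma
    )
    return hits / len(template)


def assign_gi50_cl(query, templates):
    """Return (gi50_label, score) of the best-matching template for one cell line.

    templates maps a GI50 label to a template; on ties, the first label wins.
    """
    best_label = max(templates, key=lambda label: cl_score(query, templates[label]))
    return best_label, cl_score(query, templates[best_label])


# Toy templates for a single cell line (invented values, two descriptors each).
templates = {
    4.0: build_template([[1.0, 10.0], [3.0, 10.0]]),
    6.0: build_template([[2.0, 11.0], [3.0, 11.0]]),
}
print(assign_gi50_cl([2.5, 11.0], templates))
```

In the protocol this assignment would be repeated for each of the sixty cell lines, each with its own set of templates.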

Once an input structure, Xj, was uploaded into the DRUDIT tools interface, MOLDESTO optimized the geometry in vacuo and calculated the same molecular descriptors described above for the training set. Then, the molecular descriptor values were submitted to the FP and CL modules.
The output data from these modules were weighted as shown below:
$$\mathrm{GI}_{50,i} = \mathrm{GI}_{50,i}(\mathrm{FP}) \times S + \mathrm{GI}_{50,i}(\mathrm{CL}) \times (1 - S)$$
where GI50i is the GI50 value for cell line i, S is the fingerprint score, GI50i(FP) is the GI50 value predicted by the FP module, and GI50i(CL) is the GI50 value assigned by the CL module. According to this formula, if the structural similarity between the input structure and the best-scored structure of the training set was high (S score close to 1), it was assumed that the biological activity of the input structure was very similar to that of the compound from the training set (a similar structure could correspond to a similar biological activity). When S = 1, the input structure was itself included in the training set, and the predicted GI50 therefore corresponded to the experimental value. Instead, if S was not close to 1, GI50(CL) contributed more to the overall result, in proportion to 1 - S. When the GI50(FP) was not available (unavailable data from the NCI screening), the GI50 corresponded to the GI50(CL).
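The weighting rule, including the fallback to GI50(CL) when no FP value exists, can be written as a small Python function. This is a minimal sketch: the function name `combine_gi50` and the use of None for missing data are assumptions made for illustration.

```python
def combine_gi50(s, gi50_fp, gi50_cl):
    """Blend the FP and CL predictions for one cell line.

    s        -- fingerprint score S, expected in [0, 1]
    gi50_fp  -- GI50 from the FP module, or None when no experimental value exists
    gi50_cl  -- GI50 from the CL module
    """
    if not 0.0 <= s <= 1.0:
        raise ValueError("S must lie between 0 and 1")
    if gi50_fp is None:
        # No experimental data for the best-scored structure: rely on CL alone.
        return gi50_cl
    return gi50_fp * s + gi50_cl * (1.0 - s)


# Illustrative values only.
print(combine_gi50(0.8, 6.0, 5.0))   # mostly driven by the FP value
print(combine_gi50(0.2, 6.0, 5.0))   # mostly driven by the CL value
print(combine_gi50(0.2, None, 5.0))  # FP value missing
```

With S = 1 the function returns the FP value unchanged, which corresponds to the case of an input structure that is already in the training set.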

Validation of the AAP Tool
The predictive ability of the AAP was assessed by internal and external validation.
Internal validation: First, 5‰ of the training database structures (193 molecules), randomly selected from NCI2014DB (Figure 2D, Supporting Information S2), were used to validate the CL module by matching the calculated GI50(CL) values with the experimental GI50 data. Because these structures were used to generate the AAP protocol, their experimental GI50 values were returned by the FP module, except for those that were not available (the experimental GI50 values are listed in Supporting Information S3; an empty box in the matrix indicates unavailable experimental data). The 193 structures were clustered into three groups according to their GI50 values: the most active compounds (more than 40/60 GI50 values equal to 8); the structures with GI50 values in the range of 4-8; and the cluster of less active/inactive compounds with GI50 values close to 4. The GI50(CL) values were then calculated for the selected structures by setting the DRUDIT parameters (see the Methods Section for the meaning of the DRUDIT parameters). This step had two aims: it allowed us to verify the predictive capability of the CL module and, more importantly, to tune the DRUDIT parameters for the best prediction of antiproliferative activity for new compounds. Thus, runs 1-18 were performed by modulating the values of the parameters N, Z, and G, as reported in Table 1 (the 18 outputs of the CL module are reported in Supporting Information S4). The eighteen GI50(CL) matrices were matched with the experimental GI50 values to obtain 18 new matrices in which |DTV(GI50)|, i.e., the absolute deviation of the calculated GI50(CL) value from the experimental GI50 value, was reported for each structure. Furthermore, for each entry, the average |DTV(GI50)| over all runs was examined.
From the analysis of these data, it appears that the protocol identifies potentially inactive or moderately active structures (GI50 below 4 or in the range of 4-7) with a remarkable degree of reliability; it is less effective for highly active structures (GI50 in the range of 7-8), although the errors remain acceptable.
Moreover, the quality of the prediction was closely related to the amount of available biological data used to build the model. Since the number of highly active compounds (7 < GI50 < 8) was very low with respect to inactive or moderately active compounds, the prediction for them was negatively affected (higher error). Similarly, the M19-MEL melanoma cell line was excluded from the analysis because of insufficient biological data.
The matrix of GI50 values provided by each run was further elaborated to calculate the overall average |DTV(GI50)| for each cancer cell line. SK-MEL-28, belonging to the melanoma panel, gave the best predictions (|DTV(GI50)| of 1.14).
To select the optimized DRUDIT parameters, |DTV(GI50)| was calculated for each GI50(CL) matrix (Table 1). The parameters of run 1 (N = 240, Z = 50, G = a) were identified as the best, with an overall average error of 1.22 (Table 1). In this run, the renal cancer panel was the best-predicted panel, with an average error of 1.18. The full results are reported in Supporting Information S5.
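The run selection described above amounts to computing an average absolute deviation per run and keeping the run with the smallest value. The sketch below illustrates that arithmetic under assumed data structures (nested lists of structures by cell lines, with None marking unavailable values); the function names and the toy matrices are hypothetical and do not reproduce the data in Table 1.

```python
def mean_abs_deviation(predicted, experimental):
    """Average |predicted - experimental| over entries where both values exist."""
    deviations = [
        abs(p - e)
        for predicted_row, experimental_row in zip(predicted, experimental)
        for p, e in zip(predicted_row, experimental_row)
        if p is not None and e is not None
    ]
    if not deviations:
        raise ValueError("no overlapping data to compare")
    return sum(deviations) / len(deviations)


def best_run(runs, experimental):
    """Name of the run whose predictions deviate least from experiment on average."""
    return min(runs, key=lambda name: mean_abs_deviation(runs[name], experimental))


# Toy matrices: 2 structures x 2 cell lines (invented values).
experimental = [[5.0, None], [4.0, 6.0]]
runs = {
    "run_a": [[5.5, 4.0], [4.5, 7.0]],
    "run_b": [[5.0, 4.0], [4.0, 6.5]],
}
print(best_run(runs, experimental))
```

Entries without experimental data are skipped rather than counted as zero error, so sparse cell lines do not artificially lower the average.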
External test validation: A set of 99 molecules was collected from NCI2016DB (Figure 2B, see Methods Section for database selection and Supporting Information S6). Their known GI50 values were compared with the predicted ones, which were calculated using the optimized DRUDIT parameters (run 1, see above). The output matrix gave an overall average |DTV(GI50)| of 0.87, with excellent predictions for structures with low activity (GI50 > 4 for at least 40 cell lines). On the other hand, significant errors were recorded for structures with high experimental GI50 values (GI50 above or close to 8). This evidence confirmed that the protocol predicts GI50 values better for low-activity molecules. The average errors for each cancer cell line were also calculated, and they showed excellent prediction for the breast cancer panel (average error of 0.77 for the panel), especially against MDA-MB-231-ATCC (average error for the cell line of 0.64).
Analysis of the |DTV(GI50)| values for all selected databases, grouped by ranges, showed that the protocol assigned a value within ±1 of the experimental one for 65% of the data (Figure 5). A matrix that compares the experimental GI50 values with the calculated ones for all structures is reported in Supporting Information S7, in addition to the complete data analysis.
To demonstrate the contribution of the CL module to the prediction, we also compared the experimental GI50 values with the calculated GI50(FP) values alone. The total average |DTV(GI50)| of 0.95, which is higher than the 0.87 obtained by combining the predicted GI50 values given by both modules, shows that the CL module improves accuracy and leads to better predictions (Supporting Information S8).

Parameter Optimization for Cell Line/Subpanel Activity Prediction
Tuning the DRUDIT parameters (N, Z, and G) could also allow the optimization of the prediction for a specific cell line or subpanel (all the following results are shown in Supporting Information S9).
To this end, the average |DTV(GI50)| values obtained for the cancer cell lines in all runs were analyzed (Supporting Information S5).
Regarding the optimization of parameters for each cancer cell line, SK-MEL-28 (melanoma panel) gave the best prediction, with an average error of 0.97 for parameter combinations 2 and 3. The full results are shown in Table 2.
To identify the best parameter combination for each panel, the average |DTV(GI50)| values of cell lines obtained in each run were grouped by panel, and the average |DTV(GI50)| for the entire panel was calculated for runs 1-18. The results of the best combinations of parameters for each panel are given in Table 3. The renal cancer panel gave the best prediction of all, with an average error of 1.15 using parameter combination 10.
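The per-panel selection can be pictured as a group-and-average step followed by a minimum search. The following Python sketch shows one way to do this; the data layout, the function name `best_run_per_panel` and the error values are assumptions for illustration and are not taken from Tables 2 and 3.

```python
from collections import defaultdict


def best_run_per_panel(errors, panel_of):
    """For each panel, find the run with the lowest panel-averaged error.

    errors   -- maps run id -> {cell line: average |DTV(GI50)|}
    panel_of -- maps cell line -> panel name
    Returns {panel: (run id, panel average)}.
    """
    best = {}
    for run, per_cell_line in errors.items():
        grouped = defaultdict(list)
        for cell_line, error in per_cell_line.items():
            grouped[panel_of[cell_line]].append(error)
        for panel, values in grouped.items():
            panel_mean = sum(values) / len(values)
            if panel not in best or panel_mean < best[panel][1]:
                best[panel] = (run, panel_mean)
    return best


# Invented error values for three cell lines and two runs.
panel_of = {"SK-MEL-28": "melanoma", "MALME-3M": "melanoma", "786-0": "renal"}
errors = {
    1: {"SK-MEL-28": 1.0, "MALME-3M": 1.5, "786-0": 1.25},
    2: {"SK-MEL-28": 0.5, "MALME-3M": 1.5, "786-0": 1.5},
}
print(best_run_per_panel(errors, panel_of))
```

The same function applied with each cell line mapped to itself would give the best parameter combination per cell line instead of per panel.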

Application of the AAP Tool for the Virtual Screening of an In-House Structure Database
The predictive capability of the AAP tool was exploited in the analysis of an in-house small molecule database to select compounds to be submitted to the NCI-DTP screening program. Recently, a few curcumin-like compounds were designed and biologically evaluated for neuroprotective and anti-Alzheimer properties, showing interesting results (Figure 6) [34][35][36]. Since curcumin has also been extensively studied in recent decades for its significant anticancer activity against various malignant cell types [39], the evaluation of curcumin-like compounds as antiproliferative agents against the NCI60 panel may be of considerable interest. Indeed, many curcumin analogues have already been studied as antiproliferative agents, and some of them (e.g., EF24, UBS109, and CDF) showed higher activities and improved drug-like properties compared to curcumin (Figure 6) [40][41][42].
In detail, the selected in-house database included three different subclasses of curcumin-like derivatives, as reported in Figure 7: 1,2-diones (1a-o); 1,2,4-oxadiazoles (2a-h); and 1,3,4-oxadiazoles (3a-h). They were developed by replacing the symmetrical β-diketone core of curcumin, which is responsible for unfavorable physicochemical properties and a weak pharmacokinetic profile [43][44][45][46][47][48], with stacked moieties such as heterocycles or α-diketones; these replacements have been shown to improve stability, solubility, oral absorption, and bioavailability [49].
After selecting the molecule database, the next step was to tune the parameters of the AAP protocol. As mentioned earlier, the GI50 calculation can be targeted to a specific cell line or class of compounds by optimizing the parameters N, G, and Z. In this light, curcumin, tested by NCI (NSC code 32982) and included in the training set (NCI2014DB), was selected as a reference compound to determine the best combination of parameters for the CL module. The tuning was performed with the parameters in the ranges 250 < N < 800 and 50 < Z < 100 while considering a, b, or c for the G function.
Eighteen runs were started, following the procedure described above for the internal validation (the combinations are reported in Table 1). Consequently, the total absolute deviations from the experimental GI50 values were calculated for each run, and the set of N = 760, Z = 50, and G = c was identified (total |DTV(GI50)| = 0.44) and applied to the AAP tool for the selected database (see Supporting Information S10).
The output of the AAP tool with the calculated GI50 values of the 39 in-house compounds is listed in Supporting Information S11. The AAP GI50 values of the curcumin-like analogues were compared with the experimental ones determined by the NCI for the curcumin lead compound (Supporting Information S11). The analysis of the GI50 mean of each compound for the full panel highlighted several curcumin-like molecules with predicted antiproliferative activities better than that of the reference curcumin (average GI50 value of 5.17), such as 1a (5.47), 1e (5.65), and 1m (5.82) for the diones class; 2a (5.49), 2d (5.57), and 2g (5.44) for the 1,2,4-oxadiazole class; and 3a (5.72), 3e (5.49), 3g (6.27), and 3h (5.47) for the 1,3,4-oxadiazole class. To test the consistency of the AAP protocol, both compounds classified as active and compounds classified as inactive (with a GI50 mean of less than 4.5) were synthesized and subsequently proposed to the NCI for the in vitro evaluation of the antiproliferative activity against the NCI60 human tumor cell lines, assuming that a reliable protocol must be able to identify both active and inactive compounds.

Chemistry
The three classes of compounds of types 1, 2, and 3 were synthesized as previously described in the literature [35,36,50]. Cinnamils (1,6-diarylhexa-1,5-diene-3,4-diones) 1a-o were obtained by the aldol condensation of aromatic aldehydes 4a-o with diacetyl 5, leading to the formation of the double bonds, both with E geometry (Scheme 1) [51].

Scheme 1. Synthesis of cinnamils 1a-o.
The 1,2,4-oxadiazole derivatives 2a-e were synthesized following the conventional amidoxime synthetic strategy [52], starting from the esters 6 and the amidoximes 7 (Scheme 2).
Scheme 2. Synthesis of 1,2,4-oxadiazole derivatives 2a-e.
The 1,3,4-oxadiazoles 3a-h were obtained through the one-pot synthesis described in Scheme 3. Diacylhydrazine intermediates were obtained by the reaction of the cinnamic acid analogues 8 and hydrazine. The subsequent cyclization led to the isolation of the regio-isomers (E) 3a-h in good overall yields [34] (synthetic details and spectroscopic characterization for all compounds are reported in Supporting Information S12) [34][35][36].

Biological Assays: NCI60 Human Tumor Cell Lines Screen Selected Compounds
All synthesized curcumin-like compounds were submitted to NCI cell-line-based in vitro screening for anticancer drugs. As described in the Methods Section (compound selection guidelines paragraph), the NCI applied specific criteria for compound selection. In the case of analogues, the selected compounds were those that were the most representative of the series and had significant structural novelty compared to the NCI collection.

One-Dose Antiproliferative Assay
The NCI screening protocol consisted of a preliminary one-dose assay (concentration of 10 µM) against the full NCI60 panel. Compounds that met the NCI selection criteria and had a significant growth-inhibitory effect on a minimum number of cell lines proceeded to the five-dose screening (experimental details are described in the Methods Section). The results are expressed as the percent of growth (G%) of the treated cells compared to the untreated control cells. This parameter accurately expresses the anticancer potential of the drug. At G% > 100, the compound has no effect on cancer cell proliferation (inactive). In the range between 0 and 100, the compound inhibits cell proliferation by a percentage given by 100 − G%. When G% is <0, the compound is cytotoxic and lethal to the cancer cells. To graphically represent the most sensitive panels/cell lines, a mean growth percentage is also provided.
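The interpretation of G% described above can be summarized in a short sketch. The function names are hypothetical, and the handling of the exact boundary values (G% = 100 and G% = 0) is an assumption of this example rather than part of the NCI protocol; the values 45.15 and −36.84 are the G% values reported in this section for dione 1b, while 105.0 is an arbitrary illustrative value.

```python
def classify_growth_percent(g_percent: float) -> str:
    """Map a one-dose growth percent (G%) to a qualitative outcome."""
    if g_percent >= 100:
        # No reduction in growth relative to the untreated control.
        # Treating exactly 100 as inactive is an assumption of this sketch.
        return "inactive"
    if g_percent >= 0:
        # Growth is reduced by 100 - G% relative to the control.
        return "growth inhibition"
    # Negative G%: described in the text as cytotoxic (lethal) behavior.
    return "cytotoxic"


def inhibition_percent(g_percent: float) -> float:
    """Return 100 - G%, defined here only for 0 <= G% < 100."""
    if not 0 <= g_percent < 100:
        raise ValueError("inhibition percent is only defined for 0 <= G% < 100")
    return 100.0 - g_percent


if __name__ == "__main__":
    for g in (105.0, 45.15, -36.84):
        print(g, classify_growth_percent(g))
```

For instance, a G% of 45.15 corresponds to an inhibition of about 55%, in line with the figures quoted for RPMI-8226.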
The mean G% values of the five selected compounds for each of the nine subgroups of cancer cell lines are shown in Table 4. The full results and the mean graphs from the one-dose screening are reported in Supporting Information S13. Consistent with the AAP-predicted GI50 values for these compounds, the biological data confirmed the dimethoxy-dione 1a (NSC785541) and the dimethoxy-1,3,4-oxadiazole 3e (NSC785543) to be the most active curcumin-like derivatives. They showed remarkable overall average G% values (26. Similarly, dimethoxy-1,3,4-oxadiazole 3e showed a strong antiproliferative effect against the colon cancer panel but with lower toxicity; it exhibited a G% close to 0, implying an arrest of cell growth with low/no lethality against the most sensitive cell lines, HCT-116 and HT-29. Furthermore, the leukemia, melanoma, and CNS cancer panels also showed interesting sensitivity to the tested compound. Therefore, according to the selection criteria of the DTP NCI protocol, these two molecules progressed to the full five-dose assay (see the next section).
The other three compounds tested in the NCI one-dose protocol, 1b, 1c, and 2a, generally exhibited less inhibitory activity, with an average G% close to 100.
These results confirm the AAP in silico data, according to which the compounds 1b and 1c were predicted to be almost inactive, with mean GI50 values of 4.38 and 4.54 (tens of micromolar), respectively.
On the other hand, when analyzed for its antiproliferative effect on specific human cancer cell lines, dione 1b reduced the growth of the RPMI-8226 and MCF-7 cell lines by 55% and 76%, respectively (G% = 45.15 and 23.84), and induced marked cell death in the HCT-116 colon cancer cell line (G% = −36.84). Compound 1c, instead, exhibited a G% of 69.01 against HCT-116 colon cancer cells.
The 1,2,4-oxadiazole 2a, which was predicted to be more active than curcumin, did not exhibit any appreciable anticancer activity against the NCI60 panel, with the exception of HT-29 colon cancer cells (G% of 56.32). This result was unexpected, considering that the isomer 3e was selected for investigation in the five-dose screening. A possible explanation is that the change from the 1,3,4-oxadiazole to the 1,2,4-oxadiazole core affects, in particular, the ability of the compound to interfere with biological targets.
Therefore, both compounds 1a and 3e were selected for the five-dose screening to measure their GI50 values, which permitted us to further validate our tool.

Five-Dose Antiproliferative Assay for the Most Active Derivatives, 1a and 3e
The two selected compounds, 1a and 3e, were tested with the five-dose assay by measuring the percentage of cell growth at five different concentrations (from 10⁻⁸ to 10⁻⁴ M), as described in detail in the Methods Section. For each selected compound, NCI provided the measured GI50, TGI, and LC50 values against the NCI60 cell lines, with the corresponding mean graphs (see Supporting Information S14).
In the first part of this section, attention is focused exclusively on the GI50 values, as these data allowed us to further assess the predictive ability of the proposed AAP protocol.
The comparison of the average predicted GI50 with the experimental values obtained by NCI for the two compounds confirmed that the protocol was able to predict with high accuracy the range of activity of both compounds against the full NCI60 (for 1a, the average predicted GI50 was 5.39, whereas the average experimental GI50 was 5.49; for 3e, the average predicted GI50 was 5.41, whereas the average experimental GI50 was 5.28).
To further analyze the performance of the AAP protocol, the predicted GI50 values were matched to the experimental values; moreover, the |DTV(GI50)| was computed for the two tested compounds. This allowed us to calculate the average absolute error values for the compounds, which were 0.39 and 0.40 for 1a and 3e, respectively (see Supporting Information S15), indicating the capability to assign a GI50 value with an error of less than one order of magnitude.
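The error analysis described above can be sketched as follows. This is a minimal illustration that assumes |DTV(GI50)| is the absolute difference between the predicted and the experimental GI50 for a given cell line; the function names, the placeholder cell line labels, and the numerical values are hypothetical and are not taken from Supporting Information S15.

```python
from statistics import mean


def absolute_deviations(predicted: dict, experimental: dict) -> dict:
    """Return |predicted - experimental| for every cell line present in both inputs."""
    shared = predicted.keys() & experimental.keys()
    return {line: abs(predicted[line] - experimental[line]) for line in sorted(shared)}


def mean_absolute_deviation(predicted: dict, experimental: dict) -> float:
    """Average the per-cell-line absolute deviations for one compound."""
    deviations = absolute_deviations(predicted, experimental)
    if not deviations:
        raise ValueError("no cell line has both a predicted and an experimental value")
    return mean(deviations.values())


# Hypothetical GI50 values (log scale) for three placeholder cell lines.
predicted = {"CELL-A": 5.40, "CELL-B": 5.10, "CELL-C": 6.00}
experimental = {"CELL-A": 5.50, "CELL-B": 5.60, "CELL-C": 5.70}

if __name__ == "__main__":
    print(absolute_deviations(predicted, experimental))
    print(round(mean_absolute_deviation(predicted, experimental), 2))
```

Cell lines lacking either value are skipped rather than counted as zero error, so that missing experimental data do not artificially lower the average.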
In general, considering the mean |DTV(GI50)| for each panel, the AAP returned very low errors in activity prediction against specific panels, for example, the prostate cancer panel for 1a (average |DTV(GI50)| for the panel of 0.07) and the colon cancer panel for 3e (average |DTV(GI50)| for the panel of 0.25). Furthermore, a detailed analysis of specific cell lines revealed that the protocol was able to predict GI50 values for some cell lines with remarkable precision: for 1a, the |DTV(GI50)| against HL-60TB and NCI-H322M was only 0.02; for 3e, a |DTV(GI50)| of only 0.03 was computed for HT-29 and OVCAR-5.
On the other hand, BT-549 (breast cancer panel) gave the worst predictions for both compounds, with |DTV(GI50)| values of 2.21 and 2.41; this evidence is consistent with the results presented in the previous section (tool validation), where this cell line showed the highest error. As previously demonstrated, high prediction error can be attributed to the lack of biological data for selected cell lines. This was confirmed by looking at the GI50(FP) values assigned to both compounds: the structure selected with the best score in the FP module was not tested against BT-549. Thus, the final GI50 relied solely on the CL protocol rather than combining outputs from both modules, which may have drastically affected the quality of the prediction.
Figure 9a reports the two bar graphs comparing the predicted and experimental GI50 values, which illustrate the performance of our protocol; Figure 9b shows the mean error graphs, which highlight the cell lines with the highest and lowest errors.
Table 5 shows the full NCI output data (GI50, TGI, and LC50 values) for compounds 1a, 3e, and curcumin (the mean graphs and the full NCI schedules are reported in Supporting Information S14). The analysis of these values highlighted the noteworthy antiproliferative activity of the two selected curcumin-like compounds compared to curcumin; indeed, the protocol had predicted a higher activity with respect to the parent compound.
The evaluation of the GI50 values (the most diagnostic parameter used to compare antiproliferative potential) showed that the average GI50 values of both tested compounds, in the low micromolar range, were higher than that of curcumin (5.59 for 1a and 5.37 for 3e vs. 5.16 for curcumin), confirming the in silico predictions.
With respect to the tumor subpanels, the most active compound, the dione 1a, proved to be particularly effective against leukemia, colon cancer, and breast cancer. In fact, the average GI50 values calculated for these subpanels (5.87, 6.00, and 5.79, respectively) were all higher than the overall average GI50 value (5.59). In detail, within these subpanels, several cell lines showed remarkable sensitivity to the compound, with excellent GI50 values in the sub-micromolar range: RPMI-8226 (6.41), HCT-116 (6.5, the most sensitive cell line), HCT-15 (6.11), HT-29 (6.12), and MCF-7 (6.29). Moreover, the analysis of TGI values showed that the dione 1a was the most active compound of the series (overall average TGI of 4.81), confirming its selectivity against the aforementioned subpanels and cell lines, especially against colon cancer (average TGI of 5.48). This trend was also confirmed at the LC50 level, with a strong cytotoxic effect against colon cancer cell lines (average LC50 of 4.91 for colon cancer). Notably, the toxicity against RPMI-8226 was very low (LC50 < 4), although this was one of the most sensitive cell lines (GI50 in the sub-micromolar range), demonstrating the high potency and low toxicity of the compound against this cell line.
The curcumin-like compound 3e, although less active than 1a, was more effective than curcumin. Remarkable results were obtained against the leukemia subpanel, with a panel average of 5.57, higher than the average value for the full NCI60 (5.37). Moreover, the oxadiazole derivative exhibited high potency against two cell lines, MDA-MB-435 (melanoma) and A498 (renal cancer), with excellent GI50 values in the sub-micromolar range (6.03 and 6.61, respectively). In terms of TGI, it was slightly less effective than curcumin, but the average LC50 value of 4.01, even against the most susceptible cells, indicated high potency with low cytotoxicity, even at high concentrations.
To further evaluate the prediction capability of the AAP, the pdCSM-cancer tool was selected as a comparative approach. Thus, structures 1a and 3e were submitted to pdCSM-cancer, and the results were compared with those obtained by the AAP. The obtained data are reported in Figure 10. It is noteworthy that the AAP tool showed |DTV(GI50)| values of 0.39 and 0.41 for the compounds 1a and 3e, while the pdCSM-cancer tool showed values of 0.60 and 0.63, respectively. BT-549 was the cell line with the lowest reliability, not only for these compounds but also in the external test screening. A change in dataset could solve this problem.

Computational Studies
The DRUDIT web service runs on four servers that are automatically selected according to the number of jobs and online availability. Each server can support up to 10 simultaneous jobs, and any excess jobs are placed in a queue.
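The scheduling behavior described above can be illustrated with a small sketch. It is not the DRUDIT implementation: the class and method names are hypothetical, and the rule of choosing the least-loaded online server is an assumption made for this example; only the figures of four servers and 10 simultaneous jobs per server come from the text.

```python
from collections import deque
from dataclasses import dataclass, field

MAX_JOBS_PER_SERVER = 10  # limit stated in the text


@dataclass
class Server:
    name: str
    online: bool = True
    running: list = field(default_factory=list)


class Dispatcher:
    """Assign jobs to online servers and queue the excess (illustrative only)."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.queue = deque()

    def submit(self, job):
        """Start the job on the least-loaded online server, or queue it."""
        candidates = [
            s for s in self.servers
            if s.online and len(s.running) < MAX_JOBS_PER_SERVER
        ]
        if not candidates:
            self.queue.append(job)
            return None
        target = min(candidates, key=lambda s: len(s.running))
        target.running.append(job)
        return target.name

    def finish(self, server_name, job):
        """Mark a job as done and start the oldest queued job, if any."""
        server = next(s for s in self.servers if s.name == server_name)
        server.running.remove(job)
        if self.queue:
            return self.submit(self.queue.popleft())
        return None


if __name__ == "__main__":
    dispatcher = Dispatcher(Server(f"server-{i}") for i in range(1, 5))
    placed = [dispatcher.submit(job_id) for job_id in range(41)]
    print(placed[-1], len(dispatcher.queue))
```

With four servers of 10 slots each, the 41st simultaneous job is the first one to wait in the queue.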

Software
DRUDIT consists of several software modules implemented in C and Java and running on macOS Mojave.

Database Selection and Dataset Building
The NCI60 database, containing both antiproliferative and chemical data of thousands of compounds, was selected as a reliable source for the building of the protocol.
In detail, since the presented tool is based on molecular descriptors, the 2D chemical structures of the NCI-tested compounds (.mol files, only available until the June 2016 release) and the corresponding growth inhibition data were retrieved from the NCI website (284,176 chemical structures) [38,53]. Among these thousands of compounds, only those tested with the five-dose assay, which provided GI50 data, were selected to build and validate the model. In particular, the structures were split into two sets: a training set containing more than 38,000 compounds released up to 2014 (NCI2014DB) was used to build the protocol, and a test set containing about 100 compounds that were first released in 2016 (NCI2016DB) was used to validate the AAP tool.
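The temporal split described above can be sketched as follows. The record layout (`Compound`, `release_year`, `gi50`) and the placeholder identifiers are hypothetical and do not reflect the actual NCI file format; compounds released in 2015 are left unassigned here because the text does not say how they were handled.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Compound:
    identifier: str                   # placeholder label, not a real NSC number
    release_year: int
    gi50: Optional[Dict[str, float]]  # cell line -> GI50; None without five-dose data


def split_by_release(compounds: List[Compound]) -> Tuple[List[Compound], List[Compound]]:
    """Keep compounds with GI50 data and split them by release year."""
    training, test = [], []
    for compound in compounds:
        if not compound.gi50:
            continue  # only five-dose compounds provide GI50 values
        if compound.release_year <= 2014:
            training.append(compound)   # corresponds to NCI2014DB
        elif compound.release_year == 2016:
            test.append(compound)       # corresponds to NCI2016DB
    return training, test


if __name__ == "__main__":
    library = [
        Compound("cmpd-1", 2009, {"CELL-A": 5.2}),
        Compound("cmpd-2", 2013, None),
        Compound("cmpd-3", 2016, {"CELL-A": 4.8}),
    ]
    training, test = split_by_release(library)
    print(len(training), len(test))
```

Splitting by release date rather than at random keeps the test compounds strictly newer than anything used to build the protocol.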

MOLDESTO: A New Software for Molecular Descriptor Calculations
MOLDESTO (molecular descriptor tool), as described previously [28], is a software tool implemented in DRUDIT that represents the evolution of our expertise in the calculation/manipulation of molecular descriptors [30]. It is currently able to calculate more than 1000 molecular descriptors (1D, 2D, and 3D) for each input structure (the full list of molecular descriptors calculated by MOLDESTO is reported in Supporting Information S1). The input structures can be drawn directly in the web interface or uploaded in commonly used molecule file formats (e.g., SMILES, SDF, InChI, MDL, and Mol2). The software is provided with a caching system to boost the calculation speed for previously submitted structures.

DRUDIT Settings for Antiproliferative Activity Predictor (AAP) Tool
The AAP tool comprises the fingerprint (FP) and cell line (CL) modules, which work together to assign the predicted GI50 values to an input structure. In each module, the calculation is dynamic: it can be modulated by appropriately tuning the values of the available parameters (three for each module, see below).
The FP module parameters are a choice of biological activity, such as GI50, TGI, LC50, or G% (in this work, only the first choice was considered); N (-b), the best number of the dynamically selected molecular descriptors; Z (-m), the number of descriptors for which |v-m|/m < <value> applies (v: descriptor value, m: target mean); and G (-c), the maximum number of zero percentage values per descriptor. The DRUDIT parameters for the CL module are a choice of biological activity, such as GI50, TGI, LC50, or G% (in this work, only the first choice was considered); N (-b), the best number of dynamically selected molecular descriptors; Z (-m), the maximum number of zero percentage values per descriptor; and G (-f), the Gaussian smoothing function to be used (a, b, or c mode).
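The relative-deviation criterion quoted for the Z parameter of the FP module, |v-m|/m < value, can be illustrated with a short sketch. This is not the DRUDIT code: the function name, the descriptor names, and the numbers are hypothetical, and two choices are assumptions of this example, namely using |m| in the denominator (so that negative means behave sensibly) and skipping descriptors whose target mean is zero.

```python
def count_close_descriptors(values: dict, target_means: dict, threshold: float) -> int:
    """Count descriptors whose relative deviation from the target mean is below threshold."""
    count = 0
    for name, v in values.items():
        m = target_means.get(name)
        if m is None or m == 0:
            continue  # criterion undefined without a non-zero target mean
        if abs(v - m) / abs(m) < threshold:
            count += 1
    return count


# Hypothetical descriptor values for one input structure and one target.
values = {"desc_1": 10.0, "desc_2": 2.0, "desc_3": 5.0}
target_means = {"desc_1": 11.0, "desc_2": 4.0, "desc_3": 0.0}

if __name__ == "__main__":
    print(count_close_descriptors(values, target_means, threshold=0.2))
```

In this toy input only `desc_1` lies within 20% of its target mean, so a single descriptor satisfies the criterion.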

Chemistry
All solvents and reagents were used as received unless otherwise stated. Melting points were determined on a hot-stage apparatus. The ¹H-NMR and ¹³C-NMR spectra were recorded at the indicated frequencies; the residual solvent peak was used as a reference. Chromatography was performed using silica gel (0.040-0.063 mm) and mixtures of ethyl acetate and petroleum ether (fraction boiling in the range of 40-60 °C) in various ratios (v/v). Compounds 1d-o [36], 2a-c [36], 2d [35], 3a-g [36], and 3h [35] were prepared as previously reported. Compounds 1a-c and 2e-h were prepared by adapting previously reported methods. The synthetic details and spectroscopic characterizations of all compounds are reported in Supporting Information S12.

Compound Selection Guidelines
The compounds to be screened were selected according to precise and rigorous guidelines; in general, submission was encouraged for molecules that bring some novelties (novel heterocyclic ring systems and privileged scaffolds) to the NCI collection and compounds that emerged from computer-aided drug design. In addition, in the case of a series of analogues, it was preferred to select only the one that was expected to provide the greatest information. On the other hand, the submission of compounds with the following features was discouraged: excessive flexibility; the presence of non-drug-like functional groups (nitro, nitroso, diazo, imine, etc.); and the presence of chemical portions that could affect the reliability of the assays (PAINS) [54].

One-Dose Assay
All compounds submitted to NCI were first assayed against the NCI60 panel in a one-dose screen (concentration of 10⁻⁵ M); this kind of assay aims to determine the G% (percent of growth) of the considered cells treated with the compounds. The results were plotted in a one-dose graph showing the G% of the single compound against the 60 cell lines. Only the most promising compounds (those satisfying predetermined threshold criteria) passed this first assay and proceeded to the five-dose screen (for further experimental details about the standardized assay procedures, see [55,56]).

Five-Dose Assay
The most active compounds were submitted to a multiple-dose screen using five different concentrations (ranging from 10⁻⁸ to 10⁻⁴ M). The dose-response curves obtained from this assay permitted the extrapolation of the GI50 (the molar concentration of the compound that inhibits 50% of cell growth), TGI (the molar concentration of the compound leading to total inhibition of cell growth), and LC50 (the molar concentration of the compound that induces 50% cell death) values of the selected compounds against each cancer cell line. For each of the mentioned parameters, a mean graph midpoint (MG_MID) was calculated, providing an average activity parameter over all cell lines (for further experimental details about the standardized assay procedures, see [55,56]).
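Because these parameters are defined as molar concentrations while the values discussed in the text are of the order of 4-7, a conversion helper may be useful. The sketch below assumes that the reported numbers are negative base-10 logarithms of the molar concentration, as is conventional for NCI60 data; the function names are hypothetical.

```python
import math


def p_value_to_micromolar(p_value: float) -> float:
    """Convert -log10(molar concentration) to a concentration in micromolar."""
    return 10 ** (-p_value) * 1e6


def micromolar_to_p_value(concentration_um: float) -> float:
    """Convert a concentration in micromolar to -log10(molar concentration)."""
    if concentration_um <= 0:
        raise ValueError("concentration must be positive")
    return -math.log10(concentration_um * 1e-6)


if __name__ == "__main__":
    # 4 and 8 are the limits of the five-dose window (10^-4 and 10^-8 M).
    for p in (4.0, 5.0, 6.0, 8.0):
        print(p, p_value_to_micromolar(p), "uM")
```

Under this assumption, a value of 5 corresponds to 10 µM and a value of 6 to 1 µM, so higher values indicate greater potency, and a value reported as below 4 means that the endpoint was not reached at the highest tested concentration (100 µM).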

Conclusions
In the field of antitumor drug discovery, in vitro antiproliferative assays still represent one of the most important tools for identifying new small molecules against cancer. In recent years, numerous projects and online databases have been launched aiming to collect both drug response and cellular data, with NCI60 undoubtedly being the best-known and most complete [1].
To avoid the high failure rate and the enormous amount of resources required to perform such intensive in vitro screenings, computational chemistry and biology have sought, in the last few years, to develop in silico techniques that are able to predict the antiproliferative potential of new small molecules in the early stages of the drug discovery pipeline.
In this light, we have presented the Antiproliferative Activity Predictor (AAP), a new molecular-descriptor-based tool capable of reliably predicting the anticancer potential (expressed as GI50, the most diagnostic parameter to measure anticancer drug responses) of small molecule libraries against the full NCI60 panel.
Using both the structural and antiproliferative data (GI50) of thousands of compounds stored in the NCI DB, we applied our expertise in manipulating molecular descriptors to build two convergent modules, the FP (fingerprint) and the CL (cell line); as shown, these operate synergistically to assign GI50 values to the input structures.
Both internal and external validation were performed on the CL module and the entire tool, demonstrating the reliability and the robustness of the proposed protocol. Interestingly, the possibility to appropriately tune the available parameters (N, Z, and G) allows researchers to direct the screening toward a specific subpanel/cell line or class of compounds. An important aspect that should be highlighted is the strong dependence of the quality of the prediction on the availability of biological data. When sufficient GI50 data are lacking, as for one cell line (M19-MEL) or for some ranges of activity (structures with GI50 values in the range of 7-8), the prediction is negatively affected.
Moreover, the application of the AAP tool to quickly screen an in-house structure database of curcumin-like compounds permitted us to further corroborate the encouraging results already obtained: compounds 1a and 3e, predicted to be highly active by the protocol, were found to be significantly antiproliferative against several human cancer cell lines in the five-dose NCI screen. This further result confirmed that the AAP could be of great help in the in silico design of new effective anticancer small molecules, permitting the selection of the most promising molecules in the first phases of the drug design process.
It is worth noting that the integration of the AAP tool into our free and easy-to-use DRUDIT web service (available online at https://www.drudit.com (accessed on 15 November 2022)) provides an interesting tool for the entire medicinal chemistry community.
In the future, we plan to continuously update the training set with new NCI-tested compounds to cover an increasingly broad chemical space and more activities and cell lines. Finally, the extension of the AAP to the calculation of more parameters, such as TGI or LC50, could also be interesting. The possibility to calculate not only the GI50 but also these parameters (evaluated in vitro by the NCI) could allow the identification and eventually the elimination of small molecules with toxicity profiles that are unacceptable for further development.

Conflicts of Interest:
The authors declare no conflict of interest.