Application of Multivariate Adaptive Regression Splines (MARSplines) for Predicting Antitumor Activity of Anthrapyrazole Derivatives

An approach using multivariate adaptive regression splines (MARSplines) was applied for quantitative structure–activity relationship studies of the antitumor activity of anthrapyrazoles. At the first stage, the structures of anthrapyrazole derivatives were subjected to geometrical optimization by the AM1 method using the Polak–Ribiere algorithm. In the next step, a data set of 73 compounds was coded over 2500 calculated molecular descriptors. It was shown that fourteen independent variables appearing in the statistically significant MARS model (i.e., descriptors belonging to 3D-MoRSE, 2D autocorrelations, GETAWAY, burden eigenvalues and RDF descriptors), significantly affect the antitumor activity of anthrapyrazole compounds. The study confirmed the benefit of using a modern machine learning algorithm, since the high predictive power of the obtained model had proven to be useful for the prediction of antitumor activity against murine leukemia L1210. It could certainly be considered as a tool for predicting activity against other cancer cell lines.


Introduction
Anthrapyrazoles are synthetic anticancer drugs, synthesized in order to retain high levels of the wide spectrum of antitumor activity in anthracyclines (e.g., doxorubicin), while at the same time, diminishing cardiotoxicity by reducing the potential to generate semiquinone free radicals in cardiac cells [1,2]. Although there was a broad range of antitumor activity in model tumors [1,3], they revealed diversified activity in doxorubicin-resistant cells [4]. The action mechanism of these planar compounds is based on DNA intercalation, topoisomerase II inhibition of DNA synthesis, and DNA strand breaks [2]. Structurally, anthrapyrazoles are similar to mitoxantrone, but their structure has to be modified to reduce the abovementioned side effect. Attempts to reduce the toxicity of anthracyclines have led to the development of various anthrapyrazole derivatives, including teloxantrone (Ci-937, DUP-937, molecule a-60, which is studied in this work), piroxantrone (CI-942, DUP-492, molecule a-58, which is studied in this work), and finally losoxantrone (CI-941, DUP-941), with reduced side effects and increased efficacy in patients with breast cancer. Those three anthrapyrazoles even underwent clinical trials, and in phase II trials, they exhibited significant response rates in women with metastatic breast cancer [3]. Losoxantrone has shown impressive cytotoxic activity on a wide range of tumor cell lines (virtually the same spectrum of antitumor activity as mitoxantrone) with predicted potential to replace anthracyclines through a more favorable therapeutic index [1]. What is more, is that a response rate of 63% in women with metastatic breast cancer was observed in the study conducted by Talbot et al. [5].
The multivariate adaptive regression splines (MARSplines) were presented by Friedman as a method for flexible regression modeling of high dimensional data [6]. This modern machine learning algorithm was successfully applied in a quantitative structure-activity relationship (QSAR), and a quantitative structure-retention relationship (QSRR) modeling approach was applied in studies for drug activity prediction. A MARSplines procedure was used for the development of predictive QSAR models of various compounds with diverse pharmacological activities, such as antitrypanosomal 4-thiazolidinones [7], antispasmodial artemisin compounds [8], pyridine N-oxide derivatives against human severe acute respiratory syndrome [9], or anticancer acridone derivatives [10]. The advantages of the MARS technique were shown, among others, in the case of artemisinin compounds. Namely, it was found that QSAR models determined by the MARS procedure are the most satisfactory predictive models in comparison with some other methods such as multiple linear regression [8]. For abovementioned reasons, the MARSplines algorithm was chosen as a promising tool for a prediction of the antitumor activity of anthrapyrazoles in the present study.
A large set of anthrapyrazole compounds (about 119 derivatives, 73 of which have been studied in the present work) was tested against L1210 murine leukemia in vitro, and P388 leukemia in vivo, by Hollis Showalter et al. [11] In subsequent studies, some of the abovementioned compounds were tested in eight different mouse tumor systems [1]. Moreover, it was found in another study that 12 different anthrapyrazole derivatives inhibited the growth of K562 and K/VP.5 cells [12]. In light of the constant need to develop new anticancer drugs, as well as the high potential of such a large group of anthrapyrazole derivatives studied in the present work, structure-activity studies using modern machine learning algorithms may contribute to achieving better levels of predictivity, thus indicating a potential candidate for further research. The goal of the present work is to create a model predicting the antitumor activity of 73 anthrapyrazole derivatives, as well as to evaluate the usefulness of the MARSplines procedure for QSAR studies.

Results
More than 2500 molecular descriptors were obtained using Hyperchem and Dragon software, which were used as independent variables to create a model predicting the antitumor activity of 73 anthrapyrazole derivatives ().

Geometry Optimization
Molecular modeling was performed with 73 derivatives, which were first geometrically optimized. Examples of three-dimensional particle structures with defined geometries are shown in Figure 1. even underwent clinical trials, and in phase II trials, they exhibited significant response rates in women with metastatic breast cancer [3]. Losoxantrone has shown impressive cytotoxic activity on a wide range of tumor cell lines (virtually the same spectrum of antitumor activity as mitoxantrone) with predicted potential to replace anthracyclines through a more favorable therapeutic index [1]. What is more, is that a response rate of 63% in women with metastatic breast cancer was observed in the study conducted by Talbot et al. [5].
The multivariate adaptive regression splines (MARSplines) were presented by Friedman as a method for flexible regression modeling of high dimensional data [6]. This modern machine learning algorithm was successfully applied in a quantitative structure-activity relationship (QSAR), and a quantitative structure-retention relationship (QSRR) modeling approach was applied in studies for drug activity prediction. A MARSplines procedure was used for the development of predictive QSAR models of various compounds with diverse pharmacological activities, such as antitrypanosomal 4-thiazolidinones [7], antispasmodial artemisin compounds [8], pyridine N-oxide derivatives against human severe acute respiratory syndrome [9], or anticancer acridone derivatives [10]. The advantages of the MARS technique were shown, among others, in the case of artemisinin compounds. Namely, it was found that QSAR models determined by the MARS procedure are the most satisfactory predictive models in comparison with some other methods such as multiple linear regression [8]. For abovementioned reasons, the MARSplines algorithm was chosen as a promising tool for a prediction of the antitumor activity of anthrapyrazoles in the present study.
A large set of anthrapyrazole compounds (about 119 derivatives, 73 of which have been studied in the present work) was tested against L1210 murine leukemia in vitro, and P388 leukemia in vivo, by Hollis Showalter et al. [11] In subsequent studies, some of the abovementioned compounds were tested in eight different mouse tumor systems [1]. Moreover, it was found in another study that 12 different anthrapyrazole derivatives inhibited the growth of K562 and K/VP.5 cells [12]. In light of the constant need to develop new anticancer drugs, as well as the high potential of such a large group of anthrapyrazole derivatives studied in the present work, structure-activity studies using modern machine learning algorithms may contribute to achieving better levels of predictivity, thus indicating a potential candidate for further research. The goal of the present work is to create a model predicting the antitumor activity of 73 anthrapyrazole derivatives, as well as to evaluate the usefulness of the MARSplines procedure for QSAR studies.

Results
More than 2500 molecular descriptors were obtained using Hyperchem and Dragon software, which were used as independent variables to create a model predicting the antitumor activity of 73 anthrapyrazole derivatives ().

Geometry Optimization
Molecular modeling was performed with 73 derivatives, which were first geometrically optimized. Examples of three-dimensional particle structures with defined geometries are shown in Figure 1.

Model Construction and Prediction of pIC 50 Values
The MARS model, using a considerable set of descriptors as possible predictors, was developed using a training set to describe the antitumor activity denoted as a negative logarithm of the half maximal inhibitory concentration (pIC 50 ). The degree of interaction was set at 3, which led to linear, second, and third order splines being incorporated into the model, whereas the maximum number of basis functions was set at 40. Finally, the optimal MARS model was selected on the basis of three validation parameters (R 2 ,Q 2 and MAE). All fourteen descriptors incorporated into the model are characterized in Table 2. The MARS model is based on several interactions between molecular properties. All of the abovementioned molecular descriptors treated as predictor variables appear in 38 basis functions, which form 23 splines (high-order basis functions) (B m ). The model starts with the constant function B 1 , and then, in subsequent steps, functions giving the best learning system fit for the current residual are added to the model according to Equation (1): The  Table 3. As an example of a linear basis function, B 9 can be considered: What this means, is that the ninth term of Equation (1) is-5.67335 (0.42100 − Mor19m) when Mor19m is lower than 0.42100, and zero when it is smaller than 0.42100. As for an exemplary two-order interaction between molecular properties, B 14 may be reviewed: What this means is that the fourteenth term of Equation (1) is-8.31766 (MATS8e − 0.07400) (RDF135e − 7.04700) when MATS8e is higher than 0.07400 and RDF135e is higher than 7.04700, but otherwise it is zero.
As mentioned above, 14 of more than 2500 descriptors were incorporated into the MARS model. Their relevance to the MARS model expressed as the number in the basis functions, as well as in their definition, block, and dimensionality which are presented in Table 2. Descriptors describing the molecule's 3-D geometrical properties (3D-MoRSE descriptors, GETAWAY descriptors, RDF descriptors) emerge in the foreground in the present molecular modeling. The other descriptors are two-dimensional burden eigenvalues and autocorrelations, namely ATS descriptors, which describe how a property is distributed along the topological structure. Out of all the descriptors present in the model, Mor05s and Mor19m descriptors belong to the class of 3D-MoRSE descriptors, and contribute the most to the model, as they appear in the basis functions nine and six times, respectively. The next descriptor is MATS8e, which appears four times in the model, and belongs to the class of 2D autocorrelations, and finally, H1e presents three times as a representative of GETAWAY descriptors. Other descriptors of minor importance for the model (i.e., occurring twice), include ATSC7v, ATSC1e, SpMax8_Bh(s), and R5p. The contributions of ATSC1s, ATSC8s, RDF135e, and HATS5s are much less significant.

Validation of Models and Selection of the Optimal One for Prediction
Using the Multivariate Adaptive Regression Splines nonparametric procedure, 11 QSAR models were created using a different degree of interactions, as well as a different maximum number of basis functions. The coefficients included in the models were determined on the basis of the training group (see Table 1). Following the calculated validation parameters of all models, an optimal model was selected showing the structure-activity relations (degree of interaction 3, number of basis functions 38) with the highest determination coefficient (R 2 ) (a perfect correlation was obtained), a cross-validated R 2 (Q 2 ) threshold greater than 0.5 (checking R 2 for internal validation), and the lowest mean absolute error (MAE). The values of the aforementioned parameters are presented in Table 4. Table 4. Values of validation parameters of models obtained with the MARSplines procedure (the optimal model marked in yellow). Moreover, for the optimal MARS model, the extended validation procedure that is typical for QSAR models was applied according to Roy et al. [13] (see Table 5) Considering the above characteristics, the reasonably high predictive power of the established MARS model should be emphasized.

Degree of Interaction
It measures the variation of observed data with the predicted ones.
A measure of correlation between the observed and predicted data of the test set.
Almost equal or closer values of Q 2 (F2) and Q 2 (F1) infer that the training set mean lies in the close propinquity to that of the test set.
Concordance correlation coefficient (CCC) measures both precision and accuracy, detecting the distance of the observations from the fitting line and the degree of deviation of the regression line from that passing through the origin, respectively. and parameters r 2 and r 2 0 are denoted as follows: The terms k and k are explained as follows:

Values of Predicted Data
Values of pIC 50 (pIC50 calc ) obtained on the basis of the constructed model were compared with the experimental data (pIC50 exp ) (see Table S1) and in the scatter plot, where a strong positive relationship is shown (see Figure 2). Moreover, analysis of residuals showed that the residual plot represents a normal distribution (see Figure 3). An elaborated MARS model was also employed for the prediction of antitumor activity against murine leukemia L1210 out of the seven other anthrapyrazole derivatives. This external set was adopted from the literature [1]. It should be noted that the antitumor activity against murine leukemia L1210 has not been reported so far. For more details, see Table S2.

Discussion
On the basis of the abovementioned validation parameters, namely R 2 , Q 2 , and MAE [13], the optimal predictive and applicative model was selected from the eleven proposed MARS models elaborated in this study, differing in terms of independent variables included, as well as the degree of interactions, and maximum number of basis functions. The interpretation of obtained results begins with a focus on the number and the nature of molecular descriptors present in the model. Fourteen selected descriptors appear in 38

Discussion
On the basis of the abovementioned validation parameters, namely R 2 , Q 2 , and MAE [13], the optimal predictive and applicative model was selected from the eleven proposed MARS models elaborated in this study, differing in terms of independent variables included, as well as the degree of interactions, and maximum number of basis functions. The interpretation of obtained results begins with a focus on the number and the nature of molecular descriptors present in the model. Fourteen selected descriptors appear in 38 basis functions, which form 23 splines. Predictive descriptors can be divided into the following groups: 3D-MoRSE descriptors, 2D autocorrelations, GETAWAY descriptors, Burden eigenvalues, and RDF descriptors. Descriptors derived from the three-dimensional structure of anthrapyrazole compounds have the highest frequency in repetition (and, in this way, the largest share) in the model (over 68%). This class of geometrical descriptors, which is calculated based on optimized molecular geometry that is obtained by the method of computational chemistry in the current study, comprises 3D-MoRSE descriptors and GETAWAY descriptors. The remaining 32% of the descriptors are calculated from the 2D structure of a molecule (molecular topology).
The 3D-MoRSE Molecular Representation of Structures' (based on electronic diffraction) descriptors, which have contributed the most to the model percentage-wise (50%), and they comprise the most prominent block of descriptors in the present study. The 3D-MoRSE structure was introduced in 1996 by J.H. Schuur, P. Selzer, and J. Gasteiger in order to encode the 3D structure of a molecule by a fixed number of variables. Each representative of this descriptor block combines the information about the whole molecule structure and its final value, which is derived mostly from short-distance atomic pairs [14]. The 3D-MoRSE descriptors, which are representations of the 3D structure of a molecule, encode features such as molecular weight, van der Waals volume, electronegativities and polarizabilities. In this study, 3D-MoRSE descriptors, weighted by I-state, weighted by mass, and weighted by Sanserson electronegativity, are distinguished. The 3D-MoRSE descriptors cannot describe complex atomic groups or regions with a high or low electron density, or some quantum-chemical properties, but they result in a good model performance when activity variation coincides with variation in interatomic distances due to changes to the bonds' order and the introduction of new atoms [14].
It was shown that other important factors in predicted antitumor activity (MATS8e, ATSC7v, ATSC1e, ATSC1s, ATSC8s) are 2D autocorrelation descriptors. In general, they explain how the considered property is distributed along the topological structure. An autocorrelation descriptor is a topological descriptor encoding both the molecular structure and physicochemical properties of a molecule [15,16]. The 2D autocorrelations have a share in the optimal model with a percentage of 26.30%.
The next important variables selected belong to GETAWAY (Geometry, Topology, and Atom-Weights Assembly) descriptors (H1e, R5p and HATS5s), which are the block of descriptors that contribute 15.80%. GETAWAY tries to match the 3D molecular geometry provided by the molecular influence matrix and atom relatedness, using topology and chemical information, with the use of various atomic weighting schemes [15].
Another important variable, which is representative of burden eigenvalues with two repetitions in the MARS model, is denoted as SpMax8_Bh(s), which occurs two times in the elaborated MARS model. It belongs to the block of molecular descriptors based on the assumption that the lowest eigenvalues contain contributions from all atoms, and thus, they reflect topology of the molecule [15].
The last parameter, which has been used for modeling and has the smallest frequency, represents RDF (Radial Distribution Function) descriptors, which are based on a radial distribution function. It can be interpreted as the probability distribution of finding an atom in a spherical volume of a radius [15,17].
Efforts to establish mathematical equations for the prediction antitumor activity of anthrapyrazoles also prompt a closer examination and understanding of the mechanism action of these compounds. First of all, anthrapyrazoles with their planar structure can intercalate into DNA. It is well known that compounds that intercalate into DNA stabilize the DNA double helix and increase the temperature at which the DNA is denatured. It is worth noticing that, for a small set of anthrapyrazoles examined in the study, some anthrapyrazole compounds were bound to DNA even more strongly than doxorubicin, the drug hoped, as mentioned before, to be replaced with anthrapyrazoles due to its cardiotoxicity. Anthrapyrazoles not only target DNA, but also interfere with one of the enzymes processing DNA. More specifically, anthrapyrazoles inhibit the decatenation activity of human topoisomerase IIα. This enzyme alters DNA topology by catalyzing the passing of an intact DNA double helix through a transient double-stranded break, which is made in a second helix. Topoisomerase IIα activity is critical for relieving torsional stress that occurs during replication and transcription, and for daughter-strand separation during mitosis. Not only anthrapyrazoles, but also most of the currently used anticancer agents, such as anthracyclines (for instance doxorubicin, mitoxantrone, and etoposide), act as topoisomerase II inhibitors, and their cytotoxicity is a result of the stabilization of a covalent topoisomerase II-DNA intermediate (the cleavable complex). Finally, docking studies on several compounds revealed that the inhibitory activity of anthrapyrazoles is due, in part, to their ability to bind to DNA and structurally similar anthrapyrazoles that can be docked into the doxorubicin-binding pocket on DNA. Moreover, increased binding is associated with increased anthrapyrazole-DNA van der Waals interactions [12].
In the study by Showalter et al. [11] which incorporates the activity data subjected to the current study, the antitumor activity against murine L1210 leukemia in vitro, as well as against P388 leukemia in vivo, was tested over one hundred anthrapyrazole derivatives. Findings of the study indicate that basic side chains at N-2 and C5two to three carbon spaces between proximal and distal nitrogen atoms of the side chain, and A-ring hydroxylations, especially at C-7, contribute to the activity against P388 leukemia growth [11]. Those findings were confirmed by Hartley et al. [18] but the obtained results were not always consistent. On the one hand, the side chains had a greater effect on DNA binding, but on the other hand, the intercalation was affected more by hydroxylation of the A-ring. DNA binding was increased by hydroxylation at C-7 and decreased by hydroxyl groups at any position on the A-ring [18]. Interestingly, in the study by Begleiter et al. [3] anthrapyrazole derivatives showed a broad range of activity for inhibiting topoisomerase II decatenation activity; however, there was no significant correlation with the cytotoxic activity observed. All of the anthrapyrazole analogues examined in this study inhibited the growth of the four cell lines with IC 50 values that ranged from 0.1 to 45.2 µM, but losoxantrone was the most potent molecule. Structure-activity studies revealed an increase in the cytotoxic activity with the presence of a tertiary amine in the basic side chain at N-2, in comparison with a secondary amine in the same position for the majority of examined derivatives, but only in the case of the absence of a basic side chain at the C-5 position. Other structural alternations, such as a chlorine substituent on the basic side chain at N-2, moving the position of a chlorine substituent from C-5 to C-7, or introducing a basic side chain at C-5, did not have a consistent effect on cytotoxic activity. The authors of this study suggested that the ability of the analogues to bind to DNA by alkylation does not contribute significantly to the antitumor activity of the anthrapyrazoles [3]. A study by Liang et al. [12] confirmed the abovementioned results. Namely, cell growth inhibition by anthrapyrazoles was not wellcorrelated with the inhibition of topoisomerase IIα catalytic activity, which suggests that the anthrapyrazole derivatives examined in this study did not act solely by inhibiting the catalytic activity of topoisomerase II. Moreover, the authors showed that hydrogen-bond donor interactions and electrostatic interactions with the protonated amino side chains of the anthrapyrazoles led to high cell growth inhibitory activity [12].
The abovementioned studies demonstrated that structural changes on the basic side chain at N-2, and at C-5, C-7, can have a considerable impact on the cytotoxic activity of anthrapyrazoles as well as on topoisomerase II inhibition. Those results that are still inconsistent, may even, to small extent, help to understand the role of descriptors incorporated into the optimal MARS model. In the present study, descriptors derived from the three-dimensional structure of anthrapyrazole compounds comprise the largest share in the model, alongside the most prominent 3D-MoRSE descriptors, with values that are very sensitive to any conformational change in the molecule, and GETAWAY descriptors encoding information about the influence that each atom has in determining the whole shape of the molecule. In this light, the antitumor activity difference of the anthrapyrazoles studied, is presumably a result of the interatomic distances' changes, or the introduction of new atoms at N-2 and C-5, C-7. The obtained data indicate that parameters based on the molecular geometry and physicochemical properties, the reflection of molecular topology, and finally, the distance distribution of the compounds, are of the greatest importance for the antitumor activity of the anthrapyrazole derivatives.
It should be emphasized that the MARS model has been expanded upon in the present study, in order to predict the antitumor activity of 73 anthrapyrazole compounds, so that it is able to describe more than 96% of the variance in the experimental activity. Good predictive properties of the model were confirmed by an extensive validation procedure, which is characteristic of QSAR models. Several validation parameters were calculated. Among others, cross-validated R 2 was checked for internal validation, mean absolute error was calculated, and predictability, as well as precision and accuracy, were also assessed. It should be noted that all tested parameters met the acceptance criteria [13] listed in Table 5. Moreover, the MARS model that was created may be successfully employed for the prediction of the antitumor activity of anthrapyrazole compounds. Its applicative value was confirmed by an external set of seven molecules with the predicted pIC 50 listed in Table S2.
Searching the literature, it is still easier to come across QSAR analysis based on multiple linear regression than multivariate adaptive regression splines. Nevertheless, MARSplines procedure is one of the modern machine learning algorithms with numerous advantages that are emphasized in this work. Its usefulness was confirmed for QSAR studies for predicting the antimalarial activity of dihydroartemisinin derivatives by Nguyen-Cong et al. [8], antitumor activity of acridone derivatives by Koba and Bączek [10], or anti-HIV activities of thiazolylthiourea derivatives by Alamdari et al. [19]. The present study strongly supports the idea of promoting the MARSplines technique in QSAR analysis; however, it should be considered that, given the multitude of possible datasets and descriptors available, various options for MARSplines analysis, as well as other modern machine learning algorithms with their numerous advantages, in a particular case or other regression procedure, may show a better performance. This trend is visible in the study by Kryshchyshyn et al. [7] where four machine learning algorithms, namely, Random forest regression, Stochastic gradient boosting, Multivariate adaptive regression splines, and Gaussian processes regression, were studied to reach better levels of predictivity. Finally, in the case of predicting the antitrypanosomal activity of 4-thiazolidinones, a model developed only with the Random forest and Gaussian processes regression algorithms had good predictive ability. In light of this, there is no universal regression method, but different studies prove that modern machine algorithms are worth exploring.
In sum, the obtained model can be successfully used for in silico studies in order to find new compounds with promising antitumor activity. On the one hand, it can be assumed that the presented approach has some limitations as a restriction to the chemical domain of the training set, especially A-ring hydroxylation at C-7,10, and a necessity to follow whole procedure of geometry optimization and descriptor calculation; however, on the other hand, it should be taken into consideration that there are a multitude of combinations of different possible substituents at N-2 and C-5 in the anthrapyrazole ring (some of them were tested so far and some of them were considered promising). What is more, is that the semi-empirical method AM1 for geometry optimization and the overall process of descriptor generation are fast, which speaks for a routine application of a presented MARSplines approach for QSAR studies. For the abovementioned reasons, the expanded MARSplines procedure may become a part of the process of drug design, largely as it may be useful in the selection of the new anticancer compounds of anthrapyrazoles for the synthesis and in vitro testing on various cancer cell lines.

Anthrapyrazole Derivatives
The conducted analyses were founded on 73 compounds of anthrapyrazole derivatives (Anthra [1,9-cd]pirazol-6(2H)-on), differing in both chemical structure and antitumor activity, as shown in Table 6. The data concerning the antitumor activity of anthrapyrazoles against the L1210 murine leukemia cell line, tested in vivo, and expressed as IC 50 , were obtained from the literature [11]. Table 6. Chemical structures and antitumor activity of the anthrapyrazoles studied.

Geometry Optimization and Structural Descriptors
The initial optimization of the geometric structures of the analyzed particles, with the use of specialized HyperChem Release 8.0 (Hypercube Inc., Gainesville, FL, USA) software, was performed using the built-in Molecular Mechanic Force Field (MM+) procedure, taking into account the adequacy of the principles of quantum mechanics. In the next step, the proper optimization was achieved using the Semi-Empirical Molecular Method AM1, with the utilization of the Polak-Ribiere algorithm. The gradient norm limit applied for the calculations was 0.01 kcal (Å·mol) −1 , and the maximum possible number of cycles was set to 32,000. Finally, HyperChem, as well as Dragon 7 (Talete, Milano, Italy) software, were used to obtain molecular descriptors for all studied structures, using previously optimized molecules. In total, 2554 descriptors were calculated, mainly by using Dragon software. In the next stage of the study, the obtained descriptors were subjected to MARSplines analysis. Descriptors calculated by Dragon include 29 logical molecular descriptor blocks: constitutional indices, ring descriptors, topological indices, walk and path counts, connectivity indices, information indices, 2D matrix-based descriptors, 2D autocorrelations, burden eigenvalues, P_VSA-like descriptors, ETA indices, edge adjacency indices, geometrical descriptors, 3D matrix-based descriptors, 3D autocorrelations, RDF descriptors, 3D-MoRSE descriptors, WHIM descriptors, GETAWAY descriptors, randic molecular profiles, functional group counts, atom-centered fragments, atom-type E-state indices, CATS 2D, 2D atom pairs, 3D atom pairs, charge descriptors, molecular properties, and drug-like indices [20].

Statistical Analysis
Statistical analysis was carried out using Statistica 13.3 software (StatSoft, Cracow, Poland), introducing the data obtained in the previously performed molecular modeling. The analysis used the following variables: descriptors describing molecular properties of a particle, and the values of the negative decimal logarithm of the IC 50 describing the biological activity against the L1210 murine leukemia cell line tested in vitro, obtained from the literature data. The whole group of compounds was divided into a training and test set on the basis of random sample selection in STATISTICA 13.3 Data Miner (StatSoft, Cracow, Poland). Raw data, consisting of 2554 descriptors (independent variables) and negative decimal logarithm values of the IC 50 (pIC 50 , dependent variable), were subjected to a process of standardization and pre-selection. The selection consisted of removing the variables that did not show variability. The analyses were performed at the 5% significance level (α = 0.05). The multivariate adaptive regression splines procedure was used to build eleven different models. Pearson's correlation coefficient was used in the analysis of the correlation of variables. J. Guilford's classification was used for the interpretation of the results. The analysis of three validation parameters, providing minimal but sufficient information about model performance (R 2 , Q 2 , MAE) [13,21], and is explained in Section 4.4, allowed for the selection of an optimal theoretical model aimed at predicting the pIC 50 value for each of the considered derivatives.

MARSplines Analysis
Multivariate Adaptive Regression Splines (MARSplines), performed with the use of Statistica 13.3, is an adaptive procedure for regression. The specification of MARSplines analysis is shown in Table 7. It is very useful, especially to solve high-dimensional problems, such as a large number of inputs. Moreover, it is used to solve both regression and classification problems, and does not require assumptions about the functional relationship between independent (input) and dependent (output) data. This relationship is modeled with the use of base functions and a set of coefficients generated solely on the basis of data [6,22]. The basis functions in MARS model are single truncated spline functions, or an interaction of a few spline functions, and they consist of a left-sided and right-sided segment (reflected pair) (Equation (4)). The subscript "+" means the positive part, thus: Each function is piecewise linear, with a knot at value t (the so-called linear spline). Those reflected pairs are formed for each input X j , with knots at each observed value x ij of that input. That is why the collection of basis functions is as follows: If all of the input values are distinct, there are 2Np basis functions altogether. Although each basis function depends only on a single X j , it is still considered as a function over the entire input space IR p .
The model-building approach is similar to a forward stepwise linear regression, but instead of using the original inputs, functions from set C and their products are used, so the model is as follows: where each h m (X) is a function in C, or a product of two or more such functions. During each iteration, the best reflected pair is chosen and all possible predictors, as well as corresponding knot locations, are evaluated. As a result of each iteration, the so-called interactions may be introduced if this improves the model. The building process stops when a user-defined maximum number of basis functions is reached; however, it should be emphasized that, during model building, a global model usually overfits the training data. That is why, in the next step, a pruning procedure based on generalized cross-validation (GCV) is applied, which leads to exclusion functions that receive the lowest contribution from the model. The GCV parameter, comprising a penalty for the model complexity, is an adjusted residual sum of squares used to prevent the occurrence of an excessive number of spline functions in the final model.

Model Validation
Elaborated models underwent a process of validation in the terms of the determination coefficient, cross validated determination coefficient, and mean absolute error, in order to select the optimal MARS model suitable for the prediction of the antitumor activity of the anthrapyrazoles studied [13].
The determination coefficient R 2 (Equation (7)) measures the variation of observed data with the predicted data. A perfect correlation is observed when the R 2 reaches the maximum possible value (i.e., 1. Y obs denotes the observed response values for the training set, and Y calc denotes the calculated response values for the training set of compounds. Y training is the mean observed response of the training set compounds [13]).
Cross-validated R 2 (Q 2 ), presented in Equation (8), is checked for internal validation. Y obs (training) is the observed response, and Y pred (training) is the predicted response of the training set molecules based on the leave-one-out (LOO) technique. The generally accepted threshold value of Q 2 is 0.5 [13].
The mean absolute error (MAE) (Equation (9)) is also recognized as the average absolute error (AAE). Generally, it is regarded as a superior index of errors in the context of predictive modeling studies. Due to the involvement of the squared term of the prediction errors in the expression of RMSE, the variance of errors may be influenced by a set of data. That is because squaring the higher prediction error values have more weight than the lower errors in the formalism of the root mean square error (RMSE), whereas MAE provides an equal weight to all errors; thus, MAE is considered to be a simpler and more straightforward determinant of prediction errors [13].

Conclusions
A quantitative structure-activity relationship study was applied to a large set of anthrapyrazole compounds presenting antitumor activity against murine leukemia L1210. The approach of MARSplines was employed for prediction purposes, and was able to describe more than 96% of the variance in the experimental activity. This study has shown that fourteen parameters appearing in the statistically significant and extensively validated MARS model (i.e., descriptors belonging to 3D-MoRSE, 2D autocorrelations, GETAWAY, burden eigenvalues and RDF descriptors) significantly affect the antitumor activity of anthrapyrazole compounds. Moreover, this study confirmed the benefit of using the modern machine learning algorithm, namely, the MARSplines procedure, because the elaborated flexible model was also used in the prediction of antitumor activity against murine leukemia L1210 using an external set of seven anthrapyrazole compounds. Finally, in light of the potential laying in such a large set of anthrapyrazole compounds, which still may be tested on various cell lines, and the high predictive power of the MARS model, the MARSplines procedure may be useful in the selection of the anticancer compounds of anthrapyrazoles for future clinical studies.