Rational Design of a Low-Data Regime of Pyrrole Antioxidants for Radical Scavenging Activities Using Quantum Chemical Descriptors and QSAR with the GA-MLR and ANN Concepts

A series of pyrrole derivatives and their antioxidant scavenging activities toward the superoxide anion (O2•−), hydroxyl radical (•OH), and 1,1-diphenyl-2-picryl-hydrazyl (DPPH•) served as the training data sets of a quantitative structure–activity relationship (QSAR) study. The steric and electronic descriptors obtained from quantum chemical calculations were related to the three O2•−, •OH, and DPPH• scavenging activities using the genetic algorithm combined with multiple linear regression (GA-MLR) and artificial neural networks (ANNs). The GA-MLR models resulted in good statistical values; the coefficient of determination (R2) of the training set was greater than 0.8, and the root mean square error (RMSE) of the test set was in the range of 0.3 to 0.6. The main molecular descriptors that play an important role in the three types of antioxidant activities are the bond length, HOMO energy, polarizability, and AlogP. In the QSAR-ANN models, a good R2 value above 0.9 was obtained, and the RMSE of the test set falls in a similar range to that of the GA-MLR models. Therefore, both the QSAR GA-MLR and QSAR-ANN models were used to predict the newly designed pyrrole derivatives, which were developed based on their starting reagents in the synthetic process.


Introduction
Free radicals in organisms can be defined as unstable and highly reactive groups or molecules with unpaired electrons that are constantly produced through intracellular metabolism [1]. They are harmful to the human body, not only aggravating the aging process but also causing a multitude of diseases, including Parkinson's disease, Alzheimer's disease, Huntington's disease, depression, cardiovascular disease, and cancers [2][3][4][5]. In addition to their spontaneous production in an organism, free radicals can emerge abruptly due to several exogenous factors, such as exposure to UV light, alcohol addiction, and excessive smoking [6]. Under physiological conditions, these free radicals usually include oxidizing substances, such as hydroxyl and superoxide anion radicals, hydrogen peroxide, singlet oxygen, nitric oxide, and nitroso peroxide. Many of the conditions caused by these radicals can be prevented by effective antioxidant mechanisms that regulate their presence.
Molecules 2023, 28, x FOR PEER REVIEW 3 of 15

Figure 1 depicts the three ROS scavenging activities of the 15 pyrrole derivatives, where the scavenging activities toward •OH and O2•− are relatively close, indicating a similar range of ROS reactivities. The values for •OH scavenging, expressed as (percent scavenging activity)/10, lie in the range of 6.365 to 9.151, and Cpd.1, 2, 3, 12, 13, 14, and 15 show more than 80% •OH scavenging activity. The values for O2•− scavenging, on the same scale, are in the range of 6.203 to 8.644, with Cpd.3 and 12 showing high O2•− antioxidant activity. Regarding DPPH• scavenging, Cpd.11 has the lowest value, at 13.48%, while the highest activity is 76.04% for Cpd.7. It is worth noting that Cpd.2, 7, 12, 13, and 15 exhibit good scavenging activities for all the ROS types. More details about the pyrrole derivatives are given in the Methods section. Subsequently, to investigate the relationship between the ROS scavenging activities and the structural properties of the pyrrole derivatives, QSAR mathematical models were applied, and both linear and non-linear models were developed for all three free radical scavenging activities.

Fifteen pyrrole derivatives with their •OH, O2•−, and DPPH• activities, expressed as (percent scavenging activity)/10 (data collected from reported work [30]), were used as the training set in the QSAR study.

QSAR GA-MLR Models
Before carrying out the QSAR modeling, we first calculated the pairwise correlation coefficients, with the antioxidant activities defined as the dependent variables and the molecular descriptors as the independent variables. The values of the pairwise correlation coefficient, r, lie between −1 and 1; the correlation coefficients are all presented in the correlation matrix heatmap (Figure 2). Some of the descriptors show similar importance for Y1 and Y2; the most significant is the C4-C11 bond length, whose coefficients with Y1 and Y2 are −0.73 and −0.71, respectively. This indicates that a shorter C4-C11 bond length is associated with better antioxidant activity in scavenging •OH and O2•− free radicals. The charges of O8 and O12 are the next two important features, both of which correlate negatively with the antioxidant activities, with the absolute values of the correlation coefficients exceeding 0.5. Therefore, more negative charges on these two oxygen atoms help improve the ROS scavenging efficiency of the pyrrole derivatives.

Examining the correlation coefficients between the descriptors and the DPPH• scavenging activity, Y3, shows that the N1-C13 bond length is highly correlated with Y3, with a value of 0.75, but has little effect on the other two types of free radical scavenging activities. AlogP, which reflects molecular hydrophobicity, is the second most important descriptor for Y3, with an r value of −0.49. Thus, hydrophilicity and longer distances between the R(d) substituent and the pyrrole ring of the antioxidants are preferred for achieving a higher DPPH• quenching ability. Table 1 presents the QSAR model results using the GA-MLR method.
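The pairwise screening described above can be illustrated with a short sketch. It computes Pearson's r for every pair of variables in a small table; the function names and the toy values are hypothetical and are not the data of this study, and the authors' own tooling for the heatmap is not stated in the text.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Pairwise r for a dict mapping variable name -> list of values."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical toy values (NOT the data of this study):
# one bond-length descriptor and two activity columns.
toy = {
    "bond_length": [1.46, 1.47, 1.48, 1.49, 1.50],
    "Y1": [9.0, 8.6, 8.1, 7.4, 6.9],
    "Y2": [8.5, 8.3, 7.6, 7.1, 6.4],
}
matrix = correlation_matrix(toy)
```

In this toy table the activities fall as the bond length grows, so the corresponding entries of `matrix` are negative, which is the kind of sign pattern the heatmap is read for.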
QSAR GA-MLR of •OH model. Equation (1) (see Table 1) is the QSAR model obtained from the GA-MLR method for •OH scavenging activity. The model yields good statistical values, with R2 = 0.848 and R2(CV) = 0.711. The regression is significant since F > Fcr. Equation (1), Y1 (•OH scavenging activity) = −90.879 × X17 (bond C2-R(b)) − 47.988 × X19 (bond C4-C11) + 0.016 × X20 (polarizability) + 207.384, implies that the steric structural properties (C2-R(b) and C4-C11 bonds) play an important role in the •OH scavenging activity, while the electronic polarizability is a minor property. Further, the negative coefficients of X17 and X19 suggest that shorter C2-R(b) and C4-C11 bond distances are favorable for increasing the •OH scavenging activity.

QSAR GA-MLR of O2•− model. In Equation (2) (see Table 1), X19 has a negative coefficient, implying that a shorter C4-C11 bond distance (see the position in Figure 5) is favorable for increasing the O2•− scavenging activity, similar to the case in Equation (1) for the •OH scavenging activity. Regarding the HOMO energy, pyrrole derivatives with higher values are preferable for increasing the O2•− scavenging activity. Further, Y1 (•OH scavenging activity) and Y2 (O2•− scavenging activity) have a high Pearson correlation coefficient of 0.84, and among the molecular properties related to these ROS, X19 and X20 are common to both models.
QSAR GA-MLR of DPPH• model. The QSAR model for DPPH• scavenging activity developed using the GA-MLR method is given in Equation (3) (see Table 1). The statistical results include R2 and R2(CV) values of 0.810 and 0.559, respectively, which are lower than those of the •OH and O2•− scavenging activity models. The DPPH• scavenging activity (Y3) is related to the N1-C13 bond, AlogP, and the Connolly surface area, as shown in Equation (3): Y3 = 61.220 × X16 (N1-C13 bond) − 1.240 × X26 (AlogP) + 0.052 × X30 (Connolly surface area) − 102.072. The N1-C13 bond distance (see the position in Figure 1) has the highest (positive) coefficient, suggesting that a longer bond distance would support the DPPH• scavenging activity, which corresponds well with the experimental discussion [30]. The second notable feature is that molecules with lower AlogP values are preferred; that is, antioxidants need to be more hydrophilic to have a higher quenching ability toward radicals. In addition, higher Connolly surface area values are beneficial for the DPPH• scavenging activity; however, this property plays only a minor role.
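To make the sign arguments for Equations (1) and (3) concrete, the sketch below evaluates the two regression equations exactly as printed in the text. The function names and the input values are hypothetical placeholders; real descriptor values would come from Table S2.

```python
def predict_oh(x17, x19, x20):
    """Equation (1): Y1 from the C2-R(b) bond (X17), C4-C11 bond (X19), polarizability (X20)."""
    return -90.879 * x17 - 47.988 * x19 + 0.016 * x20 + 207.384

def predict_dpph(x16, x26, x30):
    """Equation (3): Y3 from the N1-C13 bond (X16), AlogP (X26), Connolly surface area (X30)."""
    return 61.220 * x16 - 1.240 * x26 + 0.052 * x30 - 102.072

# Hypothetical descriptor values, chosen only to show the direction of each effect.
gain_from_shorter_c4_c11 = predict_oh(1.45, 1.47, 200.0) - predict_oh(1.45, 1.48, 200.0)
gain_from_lower_alogp = predict_dpph(1.40, 2.0, 300.0) - predict_dpph(1.40, 3.0, 300.0)
```

Because the models are linear, each difference depends only on the coefficient and the size of the change: shortening X19 by 0.01 raises the predicted Y1 by 0.01 × 47.988, and lowering AlogP by one unit raises the predicted Y3 by 1.240.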
The molecular descriptors used in Equations (1)-(3) of the QSAR GA-MLR models consist of steric and electronic properties (Table 2). The important descriptors from the QSAR models correspond well with the Pearson correlation coefficients of the •OH, O2•−, and DPPH• scavenging activities. Figure 3a-c depict the linear relationship between the experimental and predicted •OH, O2•−, and DPPH• scavenging activities, and the predicted results are listed in Table S1. Based on the residuals between the predicted and actual activities, the RMSE values of the training set (RMSE train) for the •OH, O2•−, and DPPH• scavenging activities were calculated as 0.368, 0.269, and 0.958, respectively, as shown in Figure 3a-c. To evaluate the feasibility of the QSAR models, the •OH, O2•−, and DPPH• scavenging activities of the test-set compounds (Cpd.6, 8, and 11) were predicted using Equations (1)-(3), respectively. The RMSE values of the test set (RMSE test) for the •OH and DPPH• scavenging activities (Figure 3a,c) are lower than the RMSE train values, while the RMSE test for O2•− scavenging is slightly higher than the RMSE train (Figure 3b). In summary, the QSAR GA-MLR technique helps build mathematical models for limited data sets. Furthermore, the obtained models could be used for further predictions of newly designed pyrrole compounds to demonstrate their predictive power in external evaluations.

QSAR ANN Models
ANN models originate from artificial intelligence; an ANN is an interconnected assembly of simple processing elements, known as artificial neurons, that mimic the function of human neurons. The input of each neuron is one or more weighted variables, and the output is a linear or nonlinear function of the weighted inputs. The neurons learn by adjusting the weights of the input variables so as to minimize the error between the neuron's predicted output and the measured output value. We therefore applied the QSAR-ANN technique to the •OH, O2•−, and DPPH• scavenging activities and the selected molecular descriptors (Table 2). The ANN architecture was set as 3-3-1, with one input layer (three neurons), one hidden layer (three neurons), and one output layer (one neuron), using the descriptors found in Equations (1)-(3) as inputs for the •OH, O2•−, and DPPH• scavenging activities, respectively. The optimal ANN models of the three ROS are given in Table 3. The R2 values of the ANN models are in the range of 0.920-0.965, which is higher than that of the GA-MLR models (R2 values in the range of 0.810-0.863). The plots of the experimental versus predicted •OH, O2•−, and DPPH• scavenging activities from the ANN models are displayed in Figure 3d-f, respectively. The RMSE train values of all ANN models fall in the range of 0.175-0.427, which is much lower than that of the GA-MLR models. However, the RMSE test values of the ANN models are comparable to those of the GA-MLR models. In summary, as shown in Table 3, the ANN models fit the antioxidant activities more closely on the training set, approximating the experimental data with small differences (see the predicted data in Table S1). Therefore, the three ANN models can be used to predict the antioxidant activity of the newly designed pyrrole compounds.
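The 3-3-1 architecture described above amounts to a very small computation, sketched below as a single forward pass. The tanh hidden activation, the linear output neuron, and every weight and bias value are assumptions made for illustration; the trained parameters of the models in Table 3 are not given in the text.

```python
import math

def mlp_3_3_1(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: 3 inputs -> 3 tanh hidden neurons -> 1 linear output neuron."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out

# Hypothetical parameters (NOT the trained values of this study).
W_HIDDEN = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5], [-0.7, 0.4, 0.2]]
B_HIDDEN = [0.1, -0.1, 0.0]
W_OUT = [1.2, -0.6, 0.9]
B_OUT = 0.3

# Three hypothetical, already-scaled descriptor values for one compound.
y_hat = mlp_3_3_1([0.2, -0.4, 0.7], W_HIDDEN, B_HIDDEN, W_OUT, B_OUT)
```

With 3 × 3 + 3 hidden parameters and 3 + 1 output parameters, such a network has 16 adjustable values, more than the 12 training compounds, which is one reason regularized training matters in a data set of this size.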

Newly Designed Compounds with Predicted ROS Activities
Scheme 1 provides insight into the newly designed pyrrole antioxidants. Firstly, 4-hydroxycoumarin (1a) and 2-hydroxy-1,4-naphthoquinone (1b) were selected for substitution at the R(a) position since they appear in Cpd.2, 12, and 15, which gave superior ROS activities. Secondly, at the R(b) substituent position, a thiophene ring was chosen as it shortens the C2-R(b) bond distance, as observed in the X17 property of Cpd.3, 8, and 15 (Table S2). Next, at the R(c) position, the substitution of cyclohexanone or benzoyl groups has a prevailing impact on the scavenging activity, as seen in Cpd.1, 2, 12, and 14. In addition, this position is related to X19 (bond C4-C11); the shorter the bond distance, the better the antioxidant activity. Lastly, regarding the R(d) position, which is related to the X16 property (bond N1-C13), a longer N1-C13 bond distance would support improved antioxidant activity. Thus, three functional groups (4-methoxyphenyl, n-butylamine, and phenylethylamine) were selected for substitution at the R(d) position. In total, there are 17 new compounds, Cpd.16 to Cpd.32 (Scheme 1). All new structures were built and optimized at the same computational level as the training set compounds (see the Methods section). The structural and electronic properties needed to predict the •OH, O2•−, and DPPH• scavenging activities of all newly designed molecules were then collected and are presented in Table S3.

The •OH, O2•−, and DPPH• scavenging activities of the newly designed pyrrole compounds were then predicted using both the QSAR GA-MLR and QSAR-ANN models (see Table S4). The predicted •OH, O2•−, and DPPH• scavenging activities of each compound are plotted in Figure 4. The •OH and O2•− predictions obtained from the two models share a similar trend, while the predicted DPPH• scavenging activities from GA-MLR and ANN show some differences. It is worth noting that seventeen and nine of the new pyrrole compounds were predicted to achieve more than 80% •OH and O2•− scavenging activity, respectively (Table S4). Predicted DPPH• scavenging activities above 70% were found for Cpd.26, 27, 31, and 32 (Figure 6). In addition, these four new compounds were also predicted to have high •OH and O2•− scavenging activities.

In summary, our newly designed pyrrole compounds (Cpd.16-Cpd.32), based on the QSAR molecular descriptors, showed a tendency toward higher •OH, O2•−, and DPPH• scavenging activities than the training data of pyrrole derivatives (see Figure S1).

Methods
Experimental activities data. In the current QSAR study, we employed 15 pyrrole derivatives and obtained their experimental radical scavenging activities (Figure S2 and Table S5) from Tania et al. [30]. Figure 5 depicts the template of a pyrrole ring with four substitution positions. Three types of radical scavenging activities (against •OH, O2•−, and DPPH•) were measured for all the pyrrole derivatives (Figure 6) and used as the data sets in this work. We represented the •OH, O2•−, and DPPH• radical scavenging activities with three dependent variables: Y1, Y2, and Y3, respectively.

Molecular Features. To obtain the electronic and steric molecular descriptors, all the pyrrole derivatives were built and optimized using the Hartree-Fock (HF) method and the 6-31G(d,p) basis set, which includes polarization functions on all atoms in the structure. The optimizations were performed with the Gaussian 16 program [31]; the optimized structures were analyzed, and 23 of their molecular properties, denoted as X1-X23, were recorded (Table S2). Furthermore, additional molecular properties, denoted as X24-X33 (Table S2), were obtained using the Materials Studio software [32], leading to a total of 33 independent variables in this work.
Data sets. The data of the pyrrole derivatives were divided into a training set (80%, 12 compounds) and a test set (20%, 3 compounds) according to the Kennard-Stone algorithm [33] using the Python package kennard-stone 1.1.2. Based on this algorithm, Cpd.6, 8, and 11 were selected as the test-set compounds, that is, as the external test set used to evaluate the generalization performance of the regression models.
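The Kennard-Stone split mentioned above can be sketched in a few lines. This is a from-scratch illustration of the max-min selection rule under the assumption of Euclidean distances in descriptor space; it is not the code of the kennard-stone package, and the function name and toy points are hypothetical.

```python
import math

def kennard_stone_split(X, n_train):
    """Return (train_idx, test_idx) chosen with the Kennard-Stone max-min rule.

    Assumes n_train >= 2 and Euclidean distance between descriptor vectors.
    """
    n = len(X)
    dist = [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]
    # Start from the two most distant samples.
    i0, j0 = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [i0, j0]
    remaining = [k for k in range(n) if k not in selected]
    while len(selected) < n_train:
        # Add the candidate that is farthest from its nearest selected sample.
        nxt = max(remaining, key=lambda k: min(dist[k][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected, remaining

# Hypothetical one-descriptor toy data (NOT the descriptors of this study).
points = [[0.0], [1.0], [2.0], [10.0], [5.0]]
train_idx, test_idx = kennard_stone_split(points, n_train=3)
```

The rule fills the training set with the most spread-out samples, so the compounds left for the test set lie inside the region already covered by the training data. For the 15 compounds here, `n_train=12` would leave three test compounds.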
GA-MLR method. Inspired by natural genetics and evolution, genetic function approximation (GFA) is an approach to optimizing model building. The GFA method has been used in the development of QSAR models and has demonstrated the ability to elucidate the relationship between the desired molecular activity and chemical identity [34,35]. It automatically selects variables and effectively discovers combinations of features that take advantage of correlations between multiple features. The maximum number of variables is established by fixing the preferred model length. Additionally, the GFA algorithm can work flexibly with or without spline curves [36]; splines increase the complexity of the model at the expense of its interpretability. The expression of the output equation without splines is the same as that of the MLR model. Therefore, the MLR model based on the GFA algorithm (GA-MLR) can provide an "understanding" of the molecular characteristics that are important for the activity of compounds. One notable feature of the GFA is that it can generate a set of models, rather than a single model, at once. The workflow of the GA can be summarized as repeated genetic selection: after crossover and mutation operations, a new generation of models is produced, and each new model is then scored according to a specific fitness criterion.
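The selection loop described above can be sketched as a toy genetic algorithm that searches for a fixed-length subset of descriptors and scores each subset by the R2 of its least-squares fit. This is a simplified stand-in, not the GFA implementation in Materials Studio: the crossover and mutation operators, the helper names, the small population, and the synthetic data are all assumptions for illustration.

```python
import random

def ols_fit(X, y):
    """Least-squares coefficients (intercept first) from the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    # Augmented matrix [X'X | X'y], solved by Gauss-Jordan elimination.
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda k: abs(A[k][c]))
        A[c], A[piv] = A[piv], A[c]
        for k in range(p):
            if k != c:
                f = A[k][c] / A[c][c]
                A[k] = [a - f * b for a, b in zip(A[k], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]

def r_squared(X, y, subset):
    """R2 of an MLR model built on the descriptor columns listed in `subset`."""
    Xs = [[row[j] for j in subset] for row in X]
    coef = ols_fit(Xs, y)
    pred = [coef[0] + sum(c * v for c, v in zip(coef[1:], r)) for r in Xs]
    mean_y = sum(y) / len(y)
    rss = sum((a - b) ** 2 for a, b in zip(y, pred))
    tss = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - rss / tss

def ga_select(X, y, length=3, pop_size=30, generations=40, p_mut=0.1, seed=0):
    """Toy GA: evolve descriptor subsets of fixed length, fitness = R2."""
    rng = random.Random(seed)
    n_desc = len(X[0])
    fitness = lambda s: r_squared(X, y, s)
    pop = [tuple(sorted(rng.sample(range(n_desc), length))) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]            # keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            genes = list(set(a) | set(b))          # crossover: pool both parents
            child = rng.sample(genes, length)
            if rng.random() < p_mut:               # mutation: swap in an unused descriptor
                unused = [j for j in range(n_desc) if j not in child]
                child[rng.randrange(length)] = rng.choice(unused)
            children.append(tuple(sorted(child)))
        pop = parents + children
    best = max(pop, key=fitness)
    return best, fitness(best)

# Synthetic data (NOT from this study): 12 samples, 6 descriptors,
# with the response depending on descriptors 1, 3, and 4 only.
_rng = random.Random(1)
X_toy = [[_rng.uniform(0.0, 1.0) for _ in range(6)] for _ in range(12)]
y_toy = [2.0 * r[1] - 3.0 * r[3] + 0.5 * r[4] + 1.0 for r in X_toy]
best_subset, best_r2 = ga_select(X_toy, y_toy)
```

The defaults mirror the equation length of 3 and mutation probability of 0.1 reported for this work, while the population and generation counts are deliberately much smaller than the reported 1000 and 500 to keep the sketch quick.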
The regression analysis was developed using the GFA module in Materials Studio. Initially, the training data were fully imported, with the maximum equation length set at 3 and the population and maximum generations set to 1000 and 500, respectively. The mutation probability was 0.1. The fitness of a GFA model was measured using the R-squared (R2) value, which reflects the fraction of the total variance of the dependent variable, y, that is explained by the model; the larger the R2 value, the better the model.

ANN method. ANN is a nonlinear-function mapping technique originally developed to simulate the structure and computations of the brain. In the field of cheminformatics, it has been widely used to study the complex nonlinear relationship between the biological activity of molecules and their structural characteristics [37][38][39][40]. In this study, one of the most popular neural networks, the multilayer perceptron (MLP) ANN, which serves as a function approximation method, was used to model the antioxidant activities and structural properties data [41]. The MLP network designed herein is based on the backpropagation algorithm and was optimized using the Levenberg-Marquardt technique to reduce the error [42]. Generally, the MLP network includes three types of neural layers: an input layer, one or more hidden layers, and an output layer.
The running script was generated using the neural network fitting tool in the MATLAB program [43], and a multilayer ANN structure composed of an input layer (three neurons), a hidden layer (three neurons), and an output layer (one neuron) was constructed. In our ANN regression task, the Bayesian regularization backpropagation algorithm was used as the training function; it updates the weights and biases according to Levenberg-Marquardt optimization. The algorithm minimizes a combination of the squared errors and the weights and then determines the combination that generates a network model with a good generalization ability. This process is known as Bayesian regularization.
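The "combination of the squared errors and the weights" can be written as a single objective, sketched below in its commonly used form, F = β·E_D + α·E_W. The function name and the fixed values of α and β are assumptions for illustration; in Bayesian regularization these two factors are re-estimated during training rather than set by hand.

```python
def regularized_objective(errors, weights, alpha, beta):
    """F = beta * sum(e^2) + alpha * sum(w^2): data misfit plus a weight penalty."""
    e_d = sum(e * e for e in errors)    # sum of squared prediction errors
    e_w = sum(w * w for w in weights)   # sum of squared network weights
    return beta * e_d + alpha * e_w

# Hypothetical residuals and weights, with hand-picked alpha and beta.
objective = regularized_objective([1.0, -2.0], [0.5, 0.5], alpha=0.1, beta=1.0)
```

A larger α relative to β penalizes large weights more strongly, which pushes the network toward smoother responses and helps limit overfitting on small training sets.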
Evaluation of statistical terms. This section presents the equations used to evaluate the prediction reliability of the QSAR models. The coefficient of determination (R2) measures how well a statistical model predicts an outcome. It is the proportion of variance in the dependent variable that is explained through the model; the closer the value is to 1.0, the better the genetic function approximation equation explains the dependent variable. The expression for R2 is given in Equation (1):

$$R^2 = \frac{\mathrm{ESS}}{\mathrm{TSS}} \tag{1}$$

where ESS is the explained sum of squares, and TSS is the total sum of squares of y. The variation in y not explained through the regression equation (the residual sum of squares, RSS) is the sum of the squares of the differences between the actual values ($y_i$) and the predicted values ($\hat{y}_i$), as given in Equation (2):

$$\mathrm{RSS} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \tag{2}$$

The total variation in y (the total sum of squares, TSS) is the sum of the squares of the differences between the observed y values ($y_i$) and their mean ($\bar{y}$). It can also be described as the mean-corrected sum of squares of the responses over the entire data set. The TSS is given in Equation (3):

$$\mathrm{TSS} = \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2 \tag{3}$$

The TSS is also expressed as in Equation (4):

$$\mathrm{TSS} = \mathrm{ESS} + \mathrm{RSS} \tag{4}$$

The variation in y explained through the regression equation (ESS) is the sum of the squares of the differences between the predicted y values ($\hat{y}_i$) and the mean ($\bar{y}$), as given in Equation (5):

$$\mathrm{ESS} = \sum_{i=1}^{n} \left( \hat{y}_i - \bar{y} \right)^2 \tag{5}$$

The F test is a standard statistical test to assess the equality of the variances of two populations with normal distributions. Here, it was used to test whether the variance in the data that is explained through the regression is significantly larger than the remaining variance due to errors. If this is the case, the model is considered significant rather than one that simply fits the noise.
The significance-of-regression (SOR) F value is defined in Equation (6):

$$F = \frac{\mathrm{ESS} / (p - 1)}{\mathrm{RSS} / (n - p)} \tag{6}$$

where n is the number of data points from which the model is built, and p is the number of parameters in the regression model (including the intercept, when present). The calculated F value was compared with the tabulated values of the F distribution for the given values of n and p. The critical SOR F value is the critical point of the F distribution with p − 1 and n − p degrees of freedom, evaluated for a probability of 0.05 (a 95% confidence level). The regression is significant if F is greater than the tabulated value, Fcr, or SOR F value (95%).
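As a worked illustration of the quantities defined above, the sketch below computes R2 and the SOR F value from observed and fitted responses. The function name and the five-point example are hypothetical; the critical value Fcr would still have to be read from an F table or a statistics library, which is not done here.

```python
def regression_stats(y, y_pred, p):
    """Return (R2, F) for a least-squares fit with p parameters, intercept included."""
    n = len(y)
    mean_y = sum(y) / n
    rss = sum((a - b) ** 2 for a, b in zip(y, y_pred))   # residual sum of squares
    tss = sum((a - mean_y) ** 2 for a in y)              # total sum of squares
    ess = tss - rss   # holds for least-squares fits that include an intercept
    r2 = ess / tss
    f_value = (ess / (p - 1)) / (rss / (n - p))
    return r2, f_value

# Hypothetical example: y observed at x = 1..5 and the least-squares line
# y_hat = -0.4 + 1.2 * x fitted to those five points (p = 2 parameters).
y_obs = [1.0, 2.0, 3.0, 4.0, 6.0]
y_fit = [0.8, 2.0, 3.2, 4.4, 5.6]
r2, f_value = regression_stats(y_obs, y_fit, p=2)
```

Here RSS = 0.4 and TSS = 14.8, so R2 = 14.4/14.8 ≈ 0.973 and F = 14.4/(0.4/3) = 108 with 1 and 3 degrees of freedom.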
The cross-validated R2, or R2(CV), is the cross-validated equivalent of R2 and constitutes a crucial measure of a model's predictive power; the closer the value is to 1.0, the better the predictive power. For a good model, R2(CV) should be reasonably close to R2. R2(CV) is expressed as in Equation (7):

$$R^2(\mathrm{CV}) = 1 - \frac{\mathrm{PRESS}}{\mathrm{TSS}} \tag{7}$$

The cross-validation involved excluding a subset of the data, performing the principal component analysis (PCA) on the remaining data, and calculating the prediction error for the excluded data using the model generated from the retained data. This process was repeated until each observation had been excluded once. The prediction error sum of squares (PRESS) is calculated as in Equation (8):

$$\mathrm{PRESS} = \sum_{i=1}^{n} \left( y_i - \hat{y}_{(i)} \right)^2 \tag{8}$$

where $\hat{y}_{(i)}$ is the value predicted for observation i by the model built without that observation. The root mean square error (RMSE) is used alongside R2 to assess, from a statistical perspective, whether a model has predictive ability. The RMSE is the square root of the mean of the squared differences between the actual and predicted values over the N observations, as given in Equation (9):

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2} \tag{9}$$

It measures deviations from the true values and is sensitive to outlying data.
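The cross-validation and RMSE definitions above can be sketched as follows, assuming a leave-one-out scheme and, for brevity, a regression on a single descriptor rather than the three-descriptor models of this work. The function names and example values are hypothetical.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = a + b * x (one descriptor, for brevity)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b

def r2_cv(x, y):
    """Leave-one-out cross-validated R2 = 1 - PRESS / TSS."""
    press = 0.0
    for i in range(len(x)):
        # Refit without observation i, then predict it.
        a, b = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (a + b * x[i])) ** 2
    mean_y = sum(y) / len(y)
    tss = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - press / tss

def rmse(y, y_pred):
    """Root mean square error between observed and predicted values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_pred)) / len(y))

# Hypothetical example values (NOT the data of this study).
q2_perfect = r2_cv([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
q2_noisy = r2_cv([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 2.0, 3.0, 4.0, 6.0])
error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

Because each prediction in PRESS comes from a model that never saw the observation, R2(CV) is normally lower than the fitted R2, and a large gap between the two, such as the 0.810 versus 0.559 reported for the DPPH• model, signals weaker predictive stability.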

Conclusions
The QSAR concept was applied to understand the influence of substitutions on pyrrole derivatives and their •OH, O2•−, and DPPH• scavenging activities. Both the GA-MLR and ANN techniques were applied to establish quantitative relationships between the three types of antioxidant activities of the pyrrole derivatives and their molecular descriptors, which were determined from quantum chemical calculations. In the QSAR GA-MLR models, the coefficient of determination, R2, was greater than 0.8, while the QSAR-ANN models yielded higher R2 values (greater than 0.9). The RMSE of the test set was introduced to evaluate the prediction reliability of all QSAR models; the RMSE values were in the range of 0.3-0.6, which implies that the models can be used for further predictions. However, the test-set RMSE of the ANN models was not substantially better than that of the GA-MLR models. Thus, in this case, the resulting GA-MLR models have prediction reliability equivalent to that of the ANN models. In the obtained QSAR GA-MLR models, both steric (bond lengths and Connolly surface area) and electronic (HOMO energy and polarizability) properties play an important role in the equations for the three types of antioxidant activities. Finally, based on the QSAR GA-MLR and QSAR-ANN models, most of the newly designed pyrrole compounds were predicted to have higher •OH, O2•−, and DPPH• scavenging activities than the training set pyrrole derivatives. Based on our findings, the newly designed compounds Cpd.26, 27, 31, and 32 were predicted by both the GA-MLR and ANN models to be potent and effective antioxidants against •OH, O2•−, and DPPH•, which makes them useful candidates for further experimental synthesis and testing.