A Quantitative Structure-Property Relationship Model Based on Chaos-Enhanced Accelerated Particle Swarm Optimization Algorithm and Back Propagation Artificial Neural Network

A quantitative structure-property relationship (QSPR) model is proposed to explore the relationship between the pKa of various compounds and their structures. Through QSPR studies, the relationship between the structure and properties can be obtained. In this study, a novel chaos-enhanced accelerated particle swarm algorithm (CAPSO) is adopted to screen molecular descriptors and optimize the weights of back propagation artificial neural network (BP ANN). Then, the QSPR model based on CAPSO and BP ANN is proposed and named the CAPSO BP ANN model. The prediction experiment showed that the CAPSO algorithm was a reliable method for screening molecular descriptors. The five molecular descriptors obtained by the CAPSO algorithm could well characterize the molecular structure of each compound in pKa prediction. The experimental results also showed that the CAPSO BP ANN model exhibited good performance in predicting the pKa values of various compounds. The absolute mean relative error, root mean square error, and square correlation coefficient are respectively 0.5364, 0.0632, and 0.9438, indicating the high prediction accuracy. The proposed hybrid intelligent model can be applied in engineering design and the prediction of physical and chemical properties.


Introduction
In quantitative structure-property relationship (QSPR) modeling, mathematical and artificial intelligence methods are used to explore the chemical and physical properties of various substances. These methods, including mathematical statistics, machine learning, and artificial intelligence methods, can reflect the relationship between the activity and structure of compounds. Through QSPR studies, the relationship between the structure and activity of compounds can be mined [1,2]. The QSPR model can be used to predict the activity of unknown materials and to discover key factors influencing the activity of related substances, such as the groups or substituents that determine the activity of the molecular structure [3,4]. Nowadays, QSPR has been applied in the fields of computer science, chemistry, materials science, medical science, and life sciences [5,6]. The establishment of a QSPR model mainly involves the following steps: acquisition of experimental data, construction and optimization of the molecular structure, calculation and screening of molecular descriptors, and establishment and verification of the model.
Variable selection is important in many fields, such as spectroscopy [7,8], QSPR [9,10], and other fields [11,12]. The selection of molecular descriptors largely determines the quality of the QSPR model [13][14][15]. Molecular descriptor screening aims to retain as much structural information as possible while excluding noisy descriptors. Many methods have been developed to screen molecular descriptors, and they can be divided into two main categories [16][17][18]. The first category includes the common methods, such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and forward/backward/bi-directional stepwise multiple linear regression (MLR). The second category includes the modern search algorithms, such as the genetic algorithm (GA), simulated annealing (SA), the ant colony algorithm (AC), particle swarm optimization (PSO), and other swarm intelligence algorithms [7,11,19,20,21]. The common methods are simple and efficient, but they perform poorly on complex nonlinear problems. The modern search algorithms, which are based on an optimization strategy, can search for optimal variables and handle large, complex data sets.
Model establishment is also important in QSPR studies. Commonly used QSPR models include two-dimensional (2D), three-dimensional (3D), and four-dimensional (4D) models [22][23][24]. According to the modeling idea, these methods can be divided into linear and nonlinear QSPR methods. Linear methods mainly include multiple linear regression (MLR), partial least squares (PLS), and principal component regression (PCR) [25]. Nonlinear methods include support vector regression (SVR) and artificial neural network (ANN) methods [26][27][28][29][30].
However, QSPR studies based on artificial intelligence algorithms also have some shortcomings, such as high computational cost [31]. Therefore, it is necessary to develop a QSPR model with high accuracy, high efficiency, and good stability.
The pKa value is a key parameter of some compounds, but the experiments needed to determine it are cumbersome. Therefore, it is important to develop a pKa prediction model with high accuracy. Polanski et al. [32] proposed a model based on ANN and PLS to predict the pKa of aromatic acids and alkyl acids. Luan et al. [33] developed a model with a radial basis function artificial neural network (RBF ANN) and the heuristic method (HM) and obtained better performance in pKa prediction. These studies showed that ANN has outstanding performance in pKa prediction. However, the performance of ANN is sensitive to its parameters and training algorithm. Many artificial intelligence algorithms, including various evolutionary algorithms, are applied in ANN training. However, evolutionary algorithms have their own shortcomings, such as a tendency to fall into local extrema and a slow convergence rate in the later stage, which lead to unsatisfactory results in QSPR modeling based on evolutionary algorithms [34][35][36][37]. In this paper, a novel QSPR model is proposed based on BP ANN and a chaos-enhanced version of the accelerated particle swarm algorithm (APSO) reported in recent years [38]. The improved APSO is applied to the screening of molecular descriptors and to the optimization of the weights of BP ANN. Then, combined with the artificial neural network, the QSPR model is used to predict the pKa values of various compounds.

Chaos-Enhanced Accelerated Particle Swarm Optimization Algorithm
Particle swarm optimization (PSO) was proposed by Eberhart and Kennedy in 1995 [39], but the performance of the standard PSO algorithm was not high enough and showed some defects, such as parameter sensitivity, premature convergence, and slow local search. In recent years, a PSO variant called accelerated PSO (APSO) has attracted wide attention from scholars [38,40,41,42]. Although APSO improves the convergence speed, it may also lead to premature convergence and miss some extreme values. Therefore, in this study we propose a new chaos-enhanced accelerated particle swarm optimization algorithm (CAPSO) by integrating chaos theory into the APSO algorithm.
In the APSO algorithm, the influence of the inertial weight factor and the cognitive factor on the particle is not considered, and the algorithm is improved only through the global exploration factor [43]. The main idea of the algorithm is to attribute the search power entirely to the variable that is responsible for global search and to update the particle with the exploration factor. Throughout the search process, the particle is constrained only by the global extreme value. The position update formula is:

$$X_{i,d}^{k+1} = (1 - C_2)\,X_{i,d}^{k} + C_2\,p_{g,d}^{k} + C_1 r \qquad (1)$$

where $C_1$ and $C_2$ are learning factors; $r$ is a random number between 0 and 1; $X_{i,d}^{k+1}$ is the position of particle $i$ in the $d$-th dimension at iteration $k+1$; and $p_{g,d}^{k}$ is the position of the global extremum of the whole population in the $d$-th dimension at iteration $k$.
Compared with the standard PSO algorithm, APSO adds two parameters, $C_1$ and $C_2$, to reduce the randomness in the iterative process. In this paper, $C_1$ is a monotonically decreasing function, $C_1 = \delta^{t}$, where $0 < \delta < 1$ and $t$ is the current iteration number. Therefore, the performance of the APSO algorithm is mainly affected by the parameter $C_2$. For common problems, $C_2$ takes values in (0.2, 0.7). When $C_2$ is 1, the particle converges immediately to the current global value and does not change any more, even though the current global value may not be the true global optimum. When $C_2$ is 0, the global search speed of the algorithm is extremely slow. Therefore, the optimization of $C_2$ is important in the APSO algorithm.
A chaotic system is a deterministic system that exhibits seemingly random, irregular movements, whose behavior is uncertain, unrepeatable, and unpredictable. In a chaotic system, a slight change in the initial conditions is continuously amplified and leads to a greatly different state. In the APSO algorithm, the value of the learning factor $C_2$ is uncertain and unpredictable and has partially chaotic characteristics. Therefore, in order to simulate the chaotic characteristics of $C_2$, the classical logistic equation is used in this paper to realize the evolution of chaotic variables and optimize the parameter. The iterative formula is:

$$X_{i}^{k+1} = \mu\,X_{i}^{k}\,(1 - X_{i}^{k}) \qquad (2)$$

where $\mu$ is the control parameter of the logistic equation. When $\mu = 4$ and $0 < X_{i}^{k} < 1$, the logistic equation is in a completely chaotic state.
The CAPSO algorithm involves the following steps:
Step 1: Initialize the particle group. The particles in the PSO algorithm are initialized. The optimal value of the individual extremum is selected as the global optimal value to generate chaotic values.
Step 2: Calculate the adaptive value of the group particles.
Step 3: Compare the adaptive value of each particle with that of the particle at the best position. If the adaptive value is better, the best position is updated.
Step 4: Obtain the learning factor $C_2$ from the chaotic sequence (generated by Equation (2)) and update the position of the particle with Equation (1).
Step 5: If the end condition of the algorithm is satisfied, the global optimum position is the optimal solution. The result is saved and the algorithm is completed. Otherwise, return to Step 2.
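The five steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `logistic_map` and `capso_minimize`, the search bounds, the population size, the value of `delta`, the initial chaotic value 0.7, and the use of the chaotic value directly as $C_2$ are all choices made here for illustration. The update follows the APSO form with $C_1 = \delta^{t}$ and a uniform random number $r$ in [0, 1), as described in the text.

```python
import numpy as np


def logistic_map(x, mu=4.0):
    """One iteration of the logistic equation; mu = 4 is the fully chaotic case."""
    return mu * x * (1.0 - x)


def capso_minimize(objective, dim, n_particles=20, n_iter=100,
                   delta=0.9, lower=-5.0, upper=5.0, seed=0):
    """Minimize `objective` with a chaos-driven APSO-style update (illustrative)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize the particle group and the global best.
    x = rng.uniform(lower, upper, size=(n_particles, dim))
    fitness = np.array([objective(p) for p in x])
    best_idx = int(np.argmin(fitness))
    g_best, g_val = x[best_idx].copy(), float(fitness[best_idx])
    chaos = 0.7  # assumed initial chaotic value in (0, 1)

    for t in range(1, n_iter + 1):
        # Step 4: draw C2 from the chaotic sequence, let C1 decay as delta**t.
        chaos = logistic_map(chaos)
        c2 = chaos
        c1 = delta ** t
        r = rng.uniform(0.0, 1.0, size=x.shape)
        x = (1.0 - c2) * x + c2 * g_best + c1 * r
        x = np.clip(x, lower, upper)  # keep particles inside the assumed bounds
        # Steps 2-3: evaluate and update the global best if improved.
        fitness = np.array([objective(p) for p in x])
        best_idx = int(np.argmin(fitness))
        if fitness[best_idx] < g_val:
            g_best, g_val = x[best_idx].copy(), float(fitness[best_idx])
    # Step 5: the stopping rule here is simply a fixed iteration budget.
    return g_best, g_val
```

A call such as `capso_minimize(lambda p: float(np.sum(p ** 2)), dim=2)` returns the best position found and its objective value. Mapping the chaotic value into a narrower interval such as (0.2, 0.7) would be an equally plausible design choice.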

QSPR Model Based on the Hybrid Intelligent Method
The back propagation artificial neural network (BP ANN) is one of the most important network models. It generally consists of an input layer, a hidden layer, and an output layer [44][45][46]. The implementation of BP ANN mainly consists of two processes: a learning process and a working process [47,48].
In a three-layer BP ANN, each layer consists of several nodes. The input layer receives the input information of the network. Then, the input information is processed and sent to the hidden layer.
The relationship between the input and output of a node can be expressed as:

$$\text{Input: } net = \sum_{i=1}^{n} w_i x_i$$

$$\text{Output: } y = f(net)$$

where $x_1, x_2, \ldots, x_n$ are the inputs of the network; $w_1, w_2, \ldots, w_n$ are the connection weights for each input; $f$ is the activation function; and $y$ is the output of the network.
In the BP ANN model, the nonlinear relationship between input and output is established by determining the weights and deviations (biases) between the layers. Structurally, the nonlinear relationship between the input and output can be understood as $y = f(w_{ih}, w_{ho}, b_o)$, where $w_{ih}$, $w_{ho}$, and $b_o$ are, respectively, the weight vector between the input layer and the hidden layer, the weight vector between the hidden layer and the output layer, and the deviation vector of the hidden layer. The performance of the network depends on these three main parameters ($w_{ih}$, $w_{ho}$, $b_o$).
To improve the BP algorithm, a prediction model based on CAPSO and BP ANN, called CAPSO BP ANN, is proposed, in which the BP ANN parameters are optimized with the CAPSO algorithm. The CAPSO BP ANN model makes full use of the strong global search capability of the PSO algorithm and the fast local search capability of the BP algorithm, thus improving the prediction speed and accuracy of the model. In CAPSO BP ANN, the CAPSO algorithm is used to optimize the BP ANN parameters ($w_{ih}$, $w_{ho}$, $b_o$). Therefore, each particle is designed as a structure containing the weight vector $w_{ih}$, the weight vector $w_{ho}$, and the deviation vector $b_o$:

$$\mathrm{particle}(i) = [\,w_{ih},\ w_{ho},\ b_o\,]$$

The implementation of the CAPSO BP ANN model can be described as follows:
Step 1: Model initialization. The connection weights, deviations, and population parameters of the model are initialized randomly.
Step 2: Model training. The particle structure is designed and the CAPSO algorithm is used to optimize the parameters of BP ANN.
Step 3: Parameter adjustment. Based on the error of the output, the parameters are adjusted until the number of iterations reaches the set value or the error satisfies the set condition.
Step 4: Output. After training, the model outputs each parameter and then the trained model is tested.
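The particle structure described above can be illustrated with a short Python sketch that flattens the network parameters into one vector and scores it by the mean square error. Several details are assumptions made for illustration only: the 5-7-1 layer sizes are taken from the model structure reported later in the paper, the hidden activation is taken to be `tanh`, `b_o` is treated as a bias added at the output, and the names `decode`, `forward`, and `mse_fitness` are hypothetical.

```python
import numpy as np

# Layer sizes of the 5-7-1 structure (input, hidden, output).
N_IN, N_HID, N_OUT = 5, 7, 1
# Length of one particle: all entries of w_ih, w_ho and b_o.
PARTICLE_DIM = N_IN * N_HID + N_HID * N_OUT + N_OUT


def decode(particle):
    """Split a flat particle into (w_ih, w_ho, b_o)."""
    split1 = N_IN * N_HID
    split2 = split1 + N_HID * N_OUT
    w_ih = particle[:split1].reshape(N_IN, N_HID)
    w_ho = particle[split1:split2].reshape(N_HID, N_OUT)
    b_o = particle[split2:]
    return w_ih, w_ho, b_o


def forward(particle, X):
    """Network output for inputs X of shape (n_samples, N_IN)."""
    w_ih, w_ho, b_o = decode(particle)
    hidden = np.tanh(X @ w_ih)          # assumed hidden activation
    return (hidden @ w_ho + b_o).ravel()


def mse_fitness(particle, X, y):
    """Fitness of a particle: mean square error on (X, y); lower is better."""
    return float(np.mean((forward(particle, X) - y) ** 2))
```

With this encoding, a swarm optimizer only needs to search a vector of length `PARTICLE_DIM` and call `mse_fitness` as its objective; the best particle can then serve as the starting point for BP fine-tuning.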

Model Evaluation
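The evaluation of the model is mainly based on its stability and reliability [49]. In this paper, the evaluation indices of prediction accuracy, including the absolute average relative deviation (AARD) and the root mean square error of prediction (RMSEP), are defined as follows:

$$\mathrm{AARD} = \frac{1}{N}\sum_{i=1}^{N}\left|\frac{\hat{y}_i - y_i}{y_i}\right|$$

$$\mathrm{RMSEP} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2}$$

The squared correlation coefficient ($R^2$) reflects the correlation between predicted values and experimental values and is defined as follows:

$$R^2 = \frac{\left[\sum_{i=1}^{N}\left(\hat{y}_i - \hat{y}_{ave}\right)\left(y_i - y_{ave}\right)\right]^2}{\sum_{i=1}^{N}\left(\hat{y}_i - \hat{y}_{ave}\right)^2\,\sum_{i=1}^{N}\left(y_i - y_{ave}\right)^2}$$

In these formulas, $N$ is the number of samples; $\hat{y}_i$ is the predicted or calculated value of the model; $y_i$ is the actual value obtained in experiments; $y_{ave}$ is the average of the actual values of the samples; and $\hat{y}_{ave}$ is the average of the predicted values.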
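As a hedged illustration, the three indices can be computed with NumPy as below. The function names are hypothetical, AARD is returned as a fraction rather than a percentage, and $R^2$ is computed as the squared Pearson correlation between predicted and actual values, which is the form implied by the two averages defined in the text.

```python
import numpy as np


def aard(y_true, y_pred):
    """Absolute average relative deviation (as a fraction, not a percentage)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_pred - y_true) / y_true)))


def rmsep(y_true, y_pred):
    """Root mean square error of prediction."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))


def r_squared(y_true, y_pred):
    """Squared correlation coefficient between predicted and actual values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    dt = y_true - y_true.mean()
    dp = y_pred - y_pred.mean()
    return float(np.sum(dt * dp) ** 2 / (np.sum(dt ** 2) * np.sum(dp ** 2)))
```

Note that AARD is undefined when an actual value is zero, so samples with $y_i = 0$ would need separate handling.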

Experimental Data
The comprehensive performance of the model was verified by prediction experiments on the pKa values of various compounds. The experimental database was obtained from References [50][51][52] and is shown in Table 1. Table S1 lists the compound families used for the QSPR modeling in this paper. The database consists of 268 records. The largest organic molecules contain up to 50 non-hydrogen atoms, eight aromatic rings, and 11 heteroatoms. In order to obtain a more reasonable prediction model, the database is randomly divided into three subsets: a training set, a validation set, and a testing set [53]. The training set is used to establish the model. The validation set is used to optimize and validate the model. The testing set is used to test the performance of the model, and the tested performance directly reflects the comprehensive performance of the model.
In this paper, 70% of the data are used for training, and the validation set and testing set each account for 15%. The numbers of experimental data in the training set, validation set, and testing set are 188, 40, and 40, respectively.
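A random split with the subset sizes stated above (188, 40, and 40 out of 268 records) can be sketched as follows. The function name `split_indices` and the fixed seed are illustrative assumptions; the paper does not describe how the random division was implemented.

```python
import numpy as np


def split_indices(n_samples=268, n_train=188, n_val=40, seed=0):
    """Randomly partition sample indices into training, validation and testing sets."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    train = perm[:n_train]
    val = perm[n_train:n_train + n_val]
    test = perm[n_train + n_val:]   # the remaining samples
    return train, val, test
```

The three index arrays are disjoint by construction, so no record is shared between the training, validation, and testing sets.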

Screening of Molecular Descriptors
The molecular descriptors are generated by the following methods:

• Construction of molecular structure. This is performed using ChemDraw Ultra 7.0 software.
• Optimization of molecular structure. The molecular structure is further optimized in HyperChem 7.5 software.
• Calculation of molecular descriptors. The optimized molecular structure is imported into CODESSA software and the corresponding molecular descriptors are obtained by calculation.
With the molecular descriptor computing software, 733 molecular descriptors are generated, some of which are closely correlated with each other. When modeling, it is necessary to filter this large number of calculated descriptors in order to select those most closely related to the research question. The quality of the QSPR model depends to a large extent on how the molecular descriptors are selected.
In this study, the CAPSO algorithm is used to screen the large number of calculated molecular descriptors. The process of filtering molecular descriptors with CAPSO is described as follows:
Step 1. Population initialization. To set the population size and initialize each individual of the population as a molecular descriptor; to set the number of iterations and the maximum number of iterations.
Step 2. Adaptive evaluation. To calculate the fitness of all the molecular descriptors of the population.
Step 3. Molecular descriptor selection. To select the next generation of molecular descriptors based on individual fitness values.
Step 4. Population renewal. To iterate the molecular descriptors in the population and obtain the next generation of the molecular descriptor population.
Step 5. Re-evaluation of individual adaptive values. To calculate the fitness of all of the molecular descriptors of the population through iterative evolution and re-evaluate the merits and demerits of the individuals.
Step 6. Iteration. To judge whether the iteration condition is satisfied. If it is satisfied, the evolution is ended; otherwise, return to Step 3 and continue the iteration.
Finally, five molecular descriptors were selected through the CAPSO search (Table 2). The five descriptors belong to four types: constitutional descriptors, topological descriptors, electrostatic descriptors, and quantum descriptors.
The relative number of N atoms (molecular descriptor 1 (MD1)) is a constitutional descriptor and is usually proportional to the density of the electron cloud. When the polarity of the positive and negative charge of a molecule increases, its pKa value decreases. The relative number of N atoms can be used to characterize the composition of the molecular structure.
The Randic index (order 3) (MD2) is a topological descriptor for molecular size, shape, branching degree, and dispersion force. As the molecular dispersion increases, the molecular volume increases, leading to a decrease of the pKa value. The Randic index (order 3) can represent the topological structure of molecules.
RNCG relative negative charged (QMNEG/QTMINUS) (Quantum-Chemical PC) (MD3) and RNCS relative negative charged SA (SAMNEG * RNCG) (Zefirov's PC) (MD4) are electrostatic descriptors, which depend on the distribution of the charges on the molecule. The negative coefficient of the relative negative charge is inversely proportional to the pKa value and the probability that positive ions replace protons is inversely proportional to the contact area and the pKa value of the negative atomic solvent. The relative negative charge and its surface area can be used to characterize the electrostatic parameters of molecules.
The maximum net atomic charge (MD5) is a quantum chemical descriptor, which is proportional to pKa and related to the largest net atom. It can be used to characterize the quantum chemical structure of molecules.
In conclusion, the molecular descriptors selected by the CAPSO algorithm can objectively characterize the molecular structure and reflect the relationship between the pKa value and the molecular structure. The CAPSO algorithm can provide a reference for the selection of molecular descriptors in QSPR modeling.

Model Structure
The CAPSO BP ANN model was established with the molecular descriptors selected by CAPSO. The model adopted a three-layer structure composed of an input layer, a hidden layer, and an output layer. The input layer includes five input parameters representing the five selected molecular descriptors: relative number of N atoms, Randic index (order 3), RNCG relative negative charged (QMNEG/QTMINUS) (Quantum-Chemical PC), RNCS relative negative charged SA (SAMNEG * RNCG) (Zefirov's PC), and maximum net atomic charge. The output layer has one output parameter representing the corresponding pKa value.
In this paper, the number of hidden layer neurons is first estimated with the formula 2 × sqrt(m × n) + 1, where m and n are the numbers of nodes in the input and output layers, and the optimal number of hidden layer neurons is then determined by the heuristic method. The model contains five input nodes and one output node, so the number of hidden layer neurons is estimated to be 5. Hidden layer sizes from 3 to 15 neurons were then tested. Figure 1 shows the prediction error as a function of the number of hidden layer neurons. As shown in Figure 1, as the number of hidden layer neurons increases, the mean square error (MSE) first decreases and then increases. When the number is 7, the training MSE is the lowest and the structure of the prediction model is optimal. The model structure is therefore 5-7-1.
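The estimate and the subsequent scan over hidden layer sizes can be written out as a small Python sketch. The function names are hypothetical, and truncating the estimate to an integer is an assumption made here so that five inputs and one output give 5, matching the value stated in the text (2 × sqrt(5) + 1 ≈ 5.47).

```python
import math


def estimate_hidden_neurons(m, n):
    """Initial estimate 2 * sqrt(m * n) + 1, truncated to an integer."""
    return int(2 * math.sqrt(m * n) + 1)


def candidate_sizes(low=3, high=15):
    """Hidden layer sizes to be tested, inclusive of both ends."""
    return list(range(low, high + 1))


def select_hidden_size(mse_by_size):
    """Pick the hidden layer size with the lowest MSE from a {size: mse} mapping."""
    return min(mse_by_size, key=mse_by_size.get)
```

In use, one would train a network for each size in `candidate_sizes()`, record its MSE in a dictionary, and pass that dictionary to `select_hidden_size`.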

Results and Discussion
A three-layer (5-7-1) CAPSO BP ANN prediction model was established to predict the pKa values of the compounds. MSE values were adopted as performance metrics for the model. To ensure the generalization ability, the model was run 10 times. The optimized CAPSO BP ANN parameters used in this paper are summarized in Table 3.
In the training set, the values predicted by the model are distributed around the actual values, indicating a high degree of agreement. From the vertical distance between the prediction data points and the line, we can see that the prediction error of the model is small and that the prediction accuracy is high. In the validation set, the prediction results are significantly better than those in the training set, indicating that the training effect of the model is good.
Figure 3 shows the correlation between the actual value and the predicted value of the model in the testing set. In the testing set, the predicted value of the model is also consistent with the actual value. Table 4 shows the results of the model in the training set, validation set, and testing set. The prediction results of the model in each subset are good and the prediction error is small, indicating good comprehensive performance in terms of both prediction accuracy and correlation.
In this paper, the partial derivative (PaD) method [13,54] was adopted to assess the sensitivity of the output to slight changes in the five molecular descriptors used as inputs. Figure 4 shows the contributions of the five input variables (five molecular descriptors).
Quantitatively, the Randic index (order 3) (MD2) contributes the most; the relative number of N atoms (MD1) and the maximum net atomic charge (MD5) contribute roughly the same proportion (about 20%). The contributions of RNCG relative negative charged (QMNEG/QTMINUS) (Quantum-Chemical PC) (MD3) and RNCS relative negative charged SA (SAMNEG * RNCG) (Zefirov's PC) (MD4) are relatively small, but both belong to the electrostatic descriptors. Among the four types of descriptors, electrostatic descriptors contribute the most, followed by topological descriptors and constitutional descriptors, and the quantum descriptors contribute the least (Figure 4).
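For readers unfamiliar with the PaD method, the sketch below shows one common way to compute it for a single-hidden-layer network: the partial derivative of the output with respect to each input is evaluated at every sample, the squared derivatives are summed per input, and the sums are normalized to relative contributions. The `tanh` hidden activation, the linear output, and the names `pad_contributions`, `w_ih`, `b_h`, and `w_ho` are assumptions for illustration; the paper does not give these details.

```python
import numpy as np


def pad_contributions(X, w_ih, b_h, w_ho):
    """Relative input contributions by the partial derivative (PaD) approach.

    X    : (n_samples, n_inputs) input matrix
    w_ih : (n_inputs, n_hidden) input-to-hidden weights
    b_h  : (n_hidden,) hidden biases
    w_ho : (n_hidden,) or (n_hidden, 1) hidden-to-output weights
    """
    z = X @ w_ih + b_h                       # hidden pre-activations
    dh = 1.0 - np.tanh(z) ** 2               # derivative of tanh at each sample
    # d[j, i] = sum_h w_ih[i, h] * dh[j, h] * w_ho[h]
    d = (dh * np.ravel(w_ho)) @ w_ih.T
    ssd = np.sum(d ** 2, axis=0)             # sum of squared derivatives per input
    return ssd / ssd.sum()                   # fractions that sum to 1
```

An input whose weights to the hidden layer are all zero receives a contribution of exactly zero, which is a quick sanity check for the implementation.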
Moreover, three artificial intelligence models, BP ANN, SVM, and PSO BP ANN, were selected as the comparison models. In addition, Jensen et al. [50] used PM6, PM7, PM3, AM1, and DFTB3 methods to predict the pKa values of some amine groups and indicated that PM3/COSMO was the best pKa prediction method. Therefore, in order to verify the performance of the model, the PM3/COSMO model [50] was also selected as a comparison model in this study. Figure 5 shows the correlation and residual curve between the experimental values and the predicted values of each model in the testing set. As shown in Figure 5a, the vertical distance between the prediction data points and the experimental data indicates that the prediction data of the CAPSO BP ANN model are close to the experimental values. The prediction performance of the method proposed in this study is better than that of the other methods. It can be seen from the residual curve that the error of the model proposed in this study is close to 0 (Figure 5b). Apart from some prediction points that have large errors, the prediction errors are generally smaller than those of the comparison models. Table 5 shows the evaluation results of each model. In order to verify the performance of each comparison model, the confidence interval (C.I.) of RMSEP in the testing set was calculated [49][50][51][52][53][54][55] (Table 6).
To verify the stability and robustness of the models, an applicability domain study was carried out, as shown in Figure 6. The critical leverage is 0.213. The CAPSO BP ANN model has eight outliers (four outliers from the training set, two outliers from the validation set, and two outliers from the testing set) and six influential values. The PSO BP ANN model has 11 outliers, including six outliers from the training set, three outliers from the validation set, and one outlier from the testing set. The SVM model has nine outliers and seven influential values, while the BP ANN model has 10 outliers and six influential values. All of the other values are within the applicability domain. Although all models show good performance, the CAPSO BP ANN proved its superiority, and the highest number of its respective observations was found to be within the warning limits of the defined applicability domain.
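The leverage values used in an applicability domain study of this kind are commonly computed from the descriptor matrix as $h_i = x_i^{T}(X^{T}X)^{-1}x_i$. The sketch below follows that common definition and is not taken from the paper: the function names are hypothetical, the descriptor matrix is used without centering or an intercept column, and the critical leverage is passed in as a parameter (for example the reported value 0.213) rather than derived, since the paper does not state how it was obtained.

```python
import numpy as np


def leverages(X_train, X_query):
    """Leverage of each query row with respect to the training descriptor matrix."""
    xtx_inv = np.linalg.pinv(X_train.T @ X_train)
    # h_i = x_i^T (X^T X)^{-1} x_i for every row x_i of X_query
    return np.einsum("ij,jk,ik->i", X_query, xtx_inv, X_query)


def outside_domain(h, h_star):
    """Boolean mask of samples whose leverage exceeds the critical value h_star."""
    return np.asarray(h) > h_star
```

Samples flagged by `outside_domain` correspond to the influential values in a Williams plot; outliers are identified separately from the standardized residuals.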

Conclusions
In this study, in order to solve the problems of molecular descriptor selection and model establishment in QSPR research, a novel chaos-enhanced accelerated particle swarm optimization algorithm (CAPSO) was proposed. The algorithm was applied to the selection of molecular descriptors and to QSPR modeling, and a prediction model called CAPSO BP ANN was obtained. From the prediction experiments on the pKa values of compounds, the following conclusions are drawn:
The CAPSO algorithm can be applied to the selection of molecular descriptors. Prediction experiments showed that the five molecular descriptors selected by the CAPSO algorithm could well represent the molecular structures of various compounds in the prediction of the pKa value and provide a basis for the selection of molecular descriptors.
The CAPSO BP ANN model, based on the CAPSO algorithm and BP ANN, exhibited good performance in the prediction of the pKa values of various compounds and achieved high prediction accuracy and correlation. The experimental results showed that the CAPSO BP ANN model could provide a basis for QSPR modeling.


Figure 1. Optimization comparison diagram of the number of hidden layer neurons. MSE: Mean square error.

First, 188 sets of data from the training set and 40 sets of data from the validation set were used for model training and validation, respectively. Figure 2 compares the experimental and predicted values in the training and validation sets. The circles and squares represent the predicted values of the model in the training set and the validation set, respectively. The vertical distance between each predicted data point and the line represents the absolute error between the predicted and experimental values.

Figure 2. Comparison between the predicted and experimental values in the training and validation sets.

Figure 3. Comparison between the predicted and experimental values in the testing set.

Table 4. Statistics of the model prediction performance. AARD: Absolute average relative deviation. RMSEP: Root mean square error of prediction. R²: Squared correlation coefficient.

Figure 4. Contributions of the five molecular descriptors.

Figure 5. Comparison of the testing results of each model.

Figure 6. Applicability domain of each model.

Table 1. Statistical table of experimental data.

Table 2. Molecular descriptors selected by the chaos-enhanced accelerated particle swarm optimization (CAPSO) algorithm.

Table 5. Statistical results of each model. AARD: Absolute average relative deviation. RMSEP: Root mean square error of prediction. R²: Squared correlation coefficient.

Table 6. Confidence intervals of RMSEP for each model. C.I.: Confidence interval.

Table 5 shows that the CAPSO BP ANN model has clear advantages in accuracy and correlation, with the lowest RMSEP and the highest R². The performances of PM3/COSMO and SVM are comparable to that of the PSO BP ANN model. Table 6 shows that the CAPSO BP ANN model has the narrowest C.I. at each of the 90%, 95%, and 99% confidence levels. Taken together, the tables indicate that the CAPSO BP ANN model, with the lowest RMSEP and the narrowest C.I., is superior to the other models.
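For reference, the three statistics used in the model comparison can be computed as in the following Python sketch. The formulas are assumptions based on common definitions, since the exact expressions are not restated in this section: AARD is taken as the mean absolute relative deviation expressed in percent, RMSEP as the root of the mean squared prediction error, and R² as the squared Pearson correlation between experimental and predicted values. The function names `aard`, `rmsep`, and `r_squared` and the toy values are hypothetical.

```python
import math


def aard(y_true, y_pred):
    """Absolute average relative deviation in percent (assumed convention).

    Experimental values must be nonzero, since each error is divided by one.
    """
    n = len(y_true)
    return 100.0 / n * sum(abs((p - t) / t) for t, p in zip(y_true, y_pred))


def rmsep(y_true, y_pred):
    """Root mean square error of prediction."""
    n = len(y_true)
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Squared Pearson correlation between experimental and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov * cov / (var_t * var_p)


# Toy values for illustration only; these are not data from this study.
experimental = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
summary = {
    "AARD": aard(experimental, predicted),
    "RMSEP": rmsep(experimental, predicted),
    "R2": r_squared(experimental, predicted),
}
```

Because the squared correlation coefficient is insensitive to a constant offset or scale factor in the predictions, it is best read together with RMSEP, which penalizes such systematic errors.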