Predictive Models of Gas/Particulate Partition Coefficients (KP) for Polycyclic Aromatic Hydrocarbons and Their Oxygen/Nitrogen Derivatives

Polycyclic aromatic hydrocarbons (PAHs) and their oxygen/nitrogen derivatives released into the atmosphere can alternate between a gas phase and a particulate phase, further affecting their environmental behavior and fate. The gas/particulate partition coefficient (KP) is generally used to characterize such partitioning equilibrium. In this study, the correlation between log KP of fifty PAH derivatives and their n-octanol/air partition coefficient (log KOA) was first analyzed, yielding a strong linear correlation (R2 = 0.801). Then, Gaussian 09 software was used to calculate quantum chemical descriptors of all chemicals at M062X/6-311+G (d,p) level. Both stepwise multiple linear regression (MLR) and support vector machine (SVM) methods were used to develop the quantitative structure-property relationship (QSPR) prediction models of log KP. They yield better statistical performance (R2 > 0.847, RMSE < 0.584) than the log KOA model. Simulation external validation and cross validation were further used to characterize the fitting performance, predictive ability, and robustness of the models. The mechanism analysis shows intermolecular dispersion interaction and hydrogen bonding as the main factors to dominate the distribution of PAH derivatives between the gas phase and particulate phase. The developed models can be used to predict log KP values of other PAH derivatives in the application domain, providing basic data for their ecological risk assessment.


Introduction
Polycyclic aromatic hydrocarbons (PAHs) are typical persistent organic pollutants (POPs) that are widely found in the environment [1]. Exposure to PAHs may lead to atherosclerosis, hypertension and myocardial infarction, and increase the risk of skin, lung, pancreas, stomach, intestinal and other cancers [2][3][4][5][6]. PAHs can further undergo photochemical reactions or be oxidized by atmospheric oxidants, such as O 3 , OH radicals and NO x , to generate oxygen/nitrogen derivatives, including oxidized PAHs (O-PAHs), nitro PAHs (N-PAHs) and azaarenes (AZAs) [7][8][9]. In addition, the incomplete combustion of fuels during human activities, vehicle and ship exhaust emissions, and industrial waste emissions also lead to the generation of PAHs and their oxygen/nitrogen derivatives [10][11][12][13]. In recent years, PAHs and their oxygen/nitrogen derivatives have been detected not only in the atmosphere, but also in soil, water, sediment and other environmental media and organisms [14][15][16][17][18]. Minero et al. [19] detected N-PAHs in atmospheric particles of Antarctica in the year 2010, indicating the global presence of PAH derivatives. PAH derivatives are generally trace compounds with concentrations of about one-tenth or even one-hundredth of the parent level in environmental media [20,21]. However, most PAH derivatives are direct mutagens or potential carcinogens [22], and some N-PAHs are even 10 times more car-cinogenic or 100,000 times more mutagenic than their parent compounds [23][24][25], bringing high risk to human health and arousing widespread concern.
When PAHs and oxygen/nitrogen derivatives are released into the atmosphere from various sources, they can partly exist in gaseous form or partly combine with atmospheric particles to migrate over long distances, and finally move to the ground surface through atmospheric deposition [26]. Studying the distribution of these compounds between the atmosphere and particulate matter has great implications for understanding their environmental behavior and fate. The gas/particulate partition coefficient, K P , is often used to characterize such distribution equilibrium of organic pollutants and calculated by [27].
Here, C P represents the concentration of organic matter in the atmospheric particle phase and C A represents the concentration in the air phase, with the unit of ng/m 3 ; TSP refers to the concentration of total suspended particulate matter, in µg/m 3 . Determining the K P values is time-consuming, laborious, and limited by standard samples of target compounds. Therefore, establishing a predictive model for K P can provide an important method to study the gas/particulate distribution behavior of pollutants and supply basic data for their ecological environment safety and health risk assessment.
Previous studies have shown that the n-octanol/air partition coefficient (K OA ) can predict the K P values of organic pollutants such as PAHs, polychlorinated biphenyls (PCBs), polychlorinated naphthalenes (PCNs), and DDT [28,29]. However, the low prediction accuracy and the lack of K OA values for some compounds restricts the application of this method. A quantitative structure-property relationship (QSPR) model can be used to establish a quantitative relationship between compound properties, environmental behavior parameters and molecular structure feature through mathematical methods, and can further predict the properties and environmental behavior of similar compounds which lack experimental data [30][31][32][33][34].
Therefore, the goals of this study are first to analyze the correlation between log K P and log K OA of PAHs and oxygen/nitrogen derivatives, and then establish a QSPR prediction model for log K P by multiple linear regression (MLR) and support vector machine (SVM) methods. The model performance will be validated and evaluated, and the relevant mechanism and application domain will be discussed to further understand the partitioning process and predict more chemicals.

Log K P Experimental Values
In this study, the experimental C A , C P and TSP values of 50 PAHs and oxygen/nitrogen derivatives were obtained from the previous study [35], including 22 parent and alkyl PAHs, 15 O-PAHs, 9 N-PAHs and 4 AZAs. Then, the log K P value for every chemical is calculated by Equation (1). Information about all compounds as well as log K P data are listed in Table 1. a α (a.u.) represents the average molecular polarizability; V s.min (eV) represents the most negative electrostatic potential on the molecular surface; b as the validation set in simulated external validation.

Descriptors
The intermolecular interactions, such as van der Waals forces (e.g., dispersion, dipoledipole, dipole-induced dipole Interactions) and specific polarization (e.g., hydrogen bonding), are important factors to determine the distribution of organic chemicals between gas and particulate phases [36,37]. In this study, the molecular volume (V, cm 3 /mol), the dipole moment (d, Debye), the square of dipole moment (d 2 , Debye) and the average molecular polarization (α, a.u.) are selected to characterize dispersion, dipole-dipole and dipole-induced dipole interactions. The frontier molecular orbitals (E LUMO and E HOMO , eV), hardness (η, eV), softness (σ, eV), chemical potential (µ, eV) and electrophilic index (ω, eV) are used to quantify the ability of molecules to receive or provide electrons. Three charge descriptors, the most positive electrostatic charge of hydrogen atom (qH + , a.c.u.), the most positive electrostatic charge of carbon atom (qC + , a.c.u.) and the most negative electrostatic charge of carbon atom (qC − , a.c.u.), are employed to characterize the charge information of the compounds. All 13 descriptors were obtained from the output files of molecular configuration optimization, which was carried out by using Gaussian 09 software [38] at M062X/6-311 + G (d, P) level. In addition, the parameters that characterize the electrostatic potential on the molecular surface are selected, including the most positive/negative electrostatic potential on the molecular surface (V s.max /V s.min , eV), the average value of the positive/negative electrostatic potential on the molecular surface (V face (τ). These parameters were further calculated by GsGrid (Verison 1.7) software [39] based on the Gaussian output files. Moreover, the log K OA values of the compounds were calculated using EPI Suite TM v4.11 software [40].

Model Construction and Verification
The stepwise regression method in IBM SPSS 21.0 software [41] was used to screen variables and build the MLR model. The regression performance, predictive capability and robustness of the model was assessed according to the OECD guidelines [42], the square of the correlation coefficient (R 2 ), the square of the prediction correlation coefficient (Q 2 ), the root-mean-square error (RMSE), the mean absolute error (MAE), the maximum positive error (MPE), the maximum negative error (MNE), and the systematic error (BIAS) were calculated to evaluate the fitting ability of the model. Then the original data set was randomly divided into a training set (70%) and a test set (30%) for simulated external authentication to evaluate the predictive ability. The leave-one-out cross-validation was further implemented by Weka 3.8.0 software [43], and the mean cross-validation correlation coefficient (Q 2 CV ) and the mean root-mean-square error (RMSE CV ) were obtained to evaluate the robustness of the QSPR model.
In order to further improve the model's performance, an SVM model was constructed using R language by the descriptors employed in the MLR model. The kernel function of the SVM method is radial basis function. The complexity and prediction error of the model were determined by searching for the optimal combination of hyperparameters (γ and C), and the optimal model is obtained based on it. In this process, the range of the combination of γ and C was set as 10 −2~1 0 4 , and the grid search method was used to find the optimal combination. The ten-fold cross validation method was used to evaluate the performance of the SVM model. At the same time, the counter map of log γ and log C was drawn to visualize the combination of the hyperparameters.

Define the Application Domain
The model application domain was defined using a Williams diagram [33]. If the absolute value of the standardized residual (StdR) of a compound is smaller than 3, it is considered to be well predicted. If the leverage value h i of a compound is larger than the threshold h * (h * = 3 × p/n, p represents the number of molecular structure descriptors, n represents the number of modeling data), this compound may have extreme descriptors that can influence the model construction, so it is identified as a high-influence compound. It should be noted that if the absolute values of StdR of the high-influence compounds are less than 3, this indicates the model has great generalization capability.

Model Establishment and Verification
The linear correlation between log K P and log K OA was obtained: As shown in Figure 1, log K P positively correlates with log K OA (p < 0.05). This indicates log K OA can be used to roughly predict log K P values. The predictive values are listed in Table 1 and the 95% confidence intervals are provided in Supplementary Materials,  Table S1. However, the moderate correlation coefficient (R 2 = 0.801) may lead to inaccurate predictive results. QSPR models are further developed. The linear correlation between log Kp and log KOA was obtained: log KP = (0.643 ± 0.046) × log KOA + (−8.287 ± 0.410) As shown in Figure 1, log KP positively correlates with log KOA (p < 0.05). This indicates log KOA can be used to roughly predict log KP values. The predictive values are listed in Table 1 and the 95% confidence intervals are provided in Supplementary Materials,  Table S1. However, the moderate correlation coefficient (R 2 = 0.801) may lead to inaccurate predictive results. QSPR models are further developed.
This model has two molecular descriptors α and Vs.min, both of which have VIF values less than 10 (see Table 2), and there is no multicollinearity in the model (p < 0.05). The predictive log KP values as well as the calculated values of the employed descriptors are shown in Table 1, and the 95% confidence interval of the predictive log KP values can be found in Table S1. (2) MLR model The QSPR model was established by stepwise MLR method based on quantum chemical descriptors: This model has two molecular descriptors α and V s.min , both of which have VIF values less than 10 (see Table 2), and there is no multicollinearity in the model (p < 0.05). The predictive log K P values as well as the calculated values of the employed descriptors are shown in Table 1, and the 95% confidence interval of the predictive log K P values can be found in Table S1. The statistical performance of MLR model based on quantum chemical descriptors has been significantly improved: R 2 = 0.847, Q 2 = 0.847, and RMSE = 0.584 (Table 3), indicating the model has a good fitting performance. It can be seen from Table 3  (log K P vs. residuals) = 0.029, smaller than 0.5; (2) after removing the two highest residual values (5%), the MAE (0.370) is smaller than 0.1 × log K P range of training set (5.21) and MAE + 3σ (standard deviation of the absolute error, 0.234) is very close to 0.2 × log K P range of training set. These results show that the developed MLR model has no systematic error and good predictive ability. In the leave-one-out cross-validation, the average Q 2 CV and RMSE CV is 0.906 and 0.625, respectively, which further proves the robustness of the developed MLR model [46]. The fitting plot of the experimental log K P values and the predicted log K P values by the MLR model ( Figure 2) shows they have great agreement. Figure 3 shows that the predictive errors of log K P are randomly distributed, and they have no dependence on the experimental value. This conclusion can also be verified by the BIAS = 0.000 of the MLR model (Table 3).      In order to check whether the machine learning method could improve the statistical performance of the model, SVM model is further established basing on the descriptors (α and V s.min ) that are screened by MLR. The contour map of the combination of hyperparameter γ and penalty factors C is shown in Figure 4. It shows that the smallest predictive errors of the model (<0.50) exist in the brown area, and the largest predictive errors appear in the gray area. The optimal combination is γ = 0.1, C = 10, which yields the following evaluation parameters (N represents the number of data points in the data set): The model also has good fitting ability and robustness, as shown by the high R 2 (0.908) and Q 2 (0.853) values. In external validation, both R 2 (0.813) and Q 2 (0.818) values are greater than 0.8, further indicating a good predictive ability. Figure 5 also shows a good agreement between the experimental log KP values and the predictive values calculated by the SVM model. The R 2 values of the log KOA model, the MLR model and the SVM model are all greater than 0.8, indicating every model has good fitting ability. In comparison, the MLR model has better performance than the log KOA model. The training set of SVM model obtains the highest R 2 value among the three models; however, the R 2 of its validation set is relatively lower than that of MLR model. As a black-box model, the prediction of the SVM model is The model also has good fitting ability and robustness, as shown by the high R 2 (0.908) and Q 2 (0.853) values. In external validation, both R 2 (0.813) and Q 2 (0.818) values are greater than 0.8, further indicating a good predictive ability. Figure 5 also shows a good agreement between the experimental log K P values and the predictive values calculated by the SVM model.  The model also has good fitting ability and robustness, as shown by the high R 2 (0.908) and Q 2 (0.853) values. In external validation, both R 2 (0.813) and Q 2 (0.818) values are greater than 0.8, further indicating a good predictive ability. Figure 5 also shows a good agreement between the experimental log KP values and the predictive values calculated by the SVM model.  The R 2 values of the log K OA model, the MLR model and the SVM model are all greater than 0.8, indicating every model has good fitting ability. In comparison, the MLR model has better performance than the log K OA model. The training set of SVM model obtains the highest R 2 value among the three models; however, the R 2 of its validation set is relatively lower than that of MLR model. As a black-box model, the prediction of the SVM model is an opaque process which cannot provide more information, such as the relationship between the molecular descriptors and the target endpoint under study, thus limiting its application. Furthermore, the MLR model based on molecular structure descriptors avoids the difficulties of experimental measurement, and is a visual model, making it simpler and more convenient for practical application. According to the comprehensive comparison, the MLR model is considered as the optimal predictive log K P model for the following analysis. an opaque process which cannot provide more information, such as the relationship between the molecular descriptors and the target endpoint under study, thus limiting its application. Furthermore, the MLR model based on molecular structure descriptors avoids the difficulties of experimental measurement, and is a visual model, making it simpler and more convenient for practical application. According to the comprehensive comparison, the MLR model is considered as the optimal predictive log KP model for the following analysis.

Mechanism Analysis
The MLR model contains two descriptors, the average molecular polarization α and the most negative electrostatic potential on the molecular surface Vs.min. α has the highest correlation with log KP with the correlation coefficient R of 0.839, indicating the great importance of average molecular polarizability in affecting the distribution of PAHs and oxygen/nitrogen derivatives between gas phase and atmospheric particle phase. α characterizes the dispersive interaction between the molecules, and a larger α corresponds to a stronger intermolecular dispersive effect [47,48]. Because of the great distance between molecules in the air, the dispersion interaction mainly occurs between atmospheric particles and chemical molecules. Therefore, a larger α leads to stronger dispersion interactions between chemical molecules and particles, and further results in a larger log KP. Thus, α yields a positive coefficient (0.031) in the model. The second descriptor, Vs.min, shows negative correlation with log KP (the coefficient of −24.453). Vs.min reflects the contribution of molecular electrostatic hydrogen bonds; that is, it reflects the ability of molecules to accept protons to form hydrogen bonds. The smaller Vs.min value indicates a higher electron

Mechanism Analysis
The MLR model contains two descriptors, the average molecular polarization α and the most negative electrostatic potential on the molecular surface V s.min . α has the highest correlation with log K P with the correlation coefficient R of 0.839, indicating the great importance of average molecular polarizability in affecting the distribution of PAHs and oxygen/nitrogen derivatives between gas phase and atmospheric particle phase. α characterizes the dispersive interaction between the molecules, and a larger α corresponds to a stronger intermolecular dispersive effect [47,48]. Because of the great distance between molecules in the air, the dispersion interaction mainly occurs between atmospheric particles and chemical molecules. Therefore, a larger α leads to stronger dispersion interactions between chemical molecules and particles, and further results in a larger log K P . Thus, α yields a positive coefficient (0.031) in the model. The second descriptor, V s.min , shows negative correlation with log K P (the coefficient of −24.453). V s.min reflects the contribution of molecular electrostatic hydrogen bonds; that is, it reflects the ability of molecules to accept protons to form hydrogen bonds. The smaller V s.min value indicates a higher electron density and a stronger ability to accept protons to form hydrogen bonds [49,50]. Therefore, PAHs and oxygen/nitrogen derivatives with smaller V s.min values are more likely to combine with atmospheric particulates which have complex compositions.

Discussion
Yuan et al. [51] constructed a temperature-dependent QSPR model for predicting the log K P values of 10 PAHs compounds based on molecular structure descriptors and ambient temperature (T). The model included also the descriptor α as well as variable T; however, its statistical performance is not satisfactory: R 2 = 0.624, Q 2 = 0.624, and RMSE = 0.395. Sun et al. [52] established a Theoretical Linear Solution Energy Relationship (TLSER) model for some organic compounds, including alkanes, alkalic acids, PAHs, O-PAHs and N-PAHs (Table 4), in which K P1 and K P2 represent K P values measured by 190 m 3 and 25 m 3 smoke chambers, respectively. These models show that dispersion and hydrogen bonding are important factors affecting K P values, which is consistent with the results of this study. However, the TLSER models contain fewer PAH, O-PAH and N-PAH data. Furthermore, the number of descriptors used in this study is less, which makes it easier to apply.

Conclusions
In this study, the correlation between the log K P and log K OA of PAHs and their oxygen/nitrogen derivatives is first analyzed, and then QSPR models for log K P prediction are constructed based on quantum chemical descriptors by MLR and SVM algorithms. The QSPR models have better fitting performance, predictive ability and robustness. The mechanism analysis shows that the major factors affecting the distribution of PAHs, O-PAHs, N-PAHs and AZA in the gas and particle phases are intermolecular dispersion and hydrogen bonding. Although the SVM model is slightly superior to the MLR model, it is a black-box model with poor transparency and is dependent on the descriptor screening of MLR process, limiting its further application. In contrast, the MLR model has simple and visualized mathematical expression, bringing convenience to the analysis of the important factors that affect the partitioning of these chemicals between gas and atmospheric particulate phases according to the chemical information carried by the quantum chemical descriptors. Thus, the MLR model can be used to predict the log K P values of other PAHs and oxygen/nitrogen derivatives, with the average molecular polarization within 280.623 and 99.753 and the most negative electrostatic potential on the molecular surface V s.min within −2.498 and −9.535. The log K P values can provide basic data for their environmental fate and ecological risk assessment.

Supplementary Materials:
The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/molecules27217608/s1, Table S1: The lower and upper limit of 95% confidence interval of log KOA model, MLR model and SVM model.

Conflicts of Interest:
The authors declare no competing interests.