Developing a QSPR Model of Organic Carbon Normalized Sorption Coefficients of Perfluorinated and Polyfluoroalkyl Substances

Perfluorinated and polyfluoroalkyl substances (PFASs) are known for their long-distance migration, bioaccumulation, and toxicity. The transport of PFASs in the environment has been a source of increasing concerned. The organic carbon normalized sorption coefficient (Koc) is an important parameter from which to understand the distribution behavior of organic matter between solid and liquid phases. Currently, the theoretical prediction research on log Koc of PFASs is extremely limited. The existing models have limitations such as restricted application fields and unsatisfactory prediction results for some substances. In this study, a quantitative structure–property relationship (QSPR) model was established to predict the log Koc of PFASs, and the potential mechanism affecting the distribution of PFASs between two phases from the perspective of molecular structure was analyzed. The developed model had sufficient goodness of fit and robustness, satisfying the model application requirements. The molecular weight (MW) related to the hydrophobicity of the compound; lowest unoccupied molecular orbital energy (ELUMO) and maximum average local ionization energy on the molecular surface (ALIEmax), both related to electrostatic properties; and the dipole moment (μ), related to the polarity of the compound; are the key structural variables that affect the distribution behavior of PFASs. This study carried out a standardized modeling process, and the model dataset covered a comprehensive variety of PFASs. The model can be used to predict the log Koc of conventional and emerging PFASs effectively, filling the data gap of the log Koc of uncommon PFASs. The explanation of the mechanism of the model has proven to be of great value for understanding the distribution behavior and migration trends of PFASs between sediment/soil and water, and for estimating the potential environmental risks generated by PFASs.


Introduction
Perfluorinated and polyfluoroalkyl substances (PFASs), as a class of synthetic aliphatic compounds [1,2], have be widely used in industrial production and daily consumer products because of their hydrophobicity, oleophobicity, thermal stability, and chemical stability [3,4]. Up to now, PFASs and their precursors have been found in water, atmosphere, soil, sediment, and other environmental media [5][6][7][8]. Generally, the concentration of PFAS in these environmental media is within the scale range of ng·L −1 , but in some areas with serious pollution (i.e., around fluorine chemical plants), the concentration of PFAS in water can reach the scale range of mg·L −1 [9]. PFASs can enter the human body and accumulate through drinking water, the food chain, and in other ways [10,11], and when a Molecules 2022, 27, 5610 2 of 11 certain threshold is reached, they will produce corresponding toxic effects, such as neurological, reproductive, liver, and endocrine toxicity, which can seriously endanger human health [12][13][14][15][16].
Sediment, soil, and water are important sinks for PFASs [1,10,17]. The accurate measurement of the organic carbon normalized sorption coefficient (K oc ) of PFASs can reflect their distribution behavior between sediment/soil and water [18][19][20], which is crucial for their environmental fate and risk assessment. So far, many studies have been undertaken on the distribution behavior of traditional PFASs, such as perfluorinated carboxylic acids (PFCAs) and perfluoroalkyl sulfonic acids (PFSAs), but there are few studies on that of emerging PFASs. PFASs are composed of a carbon skeleton and hydrophilic groups, where the hydrogen atoms connected to the carbon skeleton are partially or completely replaced by fluorine atoms [1,2]. From a structural perspective, both conventional and emerging PFASs are dominated by the carbon skeleton. However, the substituents on the skeleton and the functional groups at the ends have an important impact on the environmental transport characteristics of the compounds. Previous studies have shown that the migration of PFASs in different environmental media is closely related to their structural factors, such as their carbon chain length, substituents, and functional groups [17,21]. For instance, for the twophase medium of sediment/soil and water, there is a linear relationship between the K oc of PFASs and the number of perfluorinated carbon (CF). In general, the K oc increases with the number of CF, while PFASs with a sulfonic group have a larger K oc than similar compounds containing a carboxyl group [21]. The lack of research on the environmental migration law of emerging PFASs results from the variety of PFASs, the fact that their derivatives appeared one after another, and the insufficient understanding of their physicochemical properties. At the same time, the experimental measurement of K oc is not only cumbersome and costly but may also pose environmental pollution and human health risks in large-scale experiments. However, it is possible to quickly fill the data gap syrrounding the log K oc of PFASs at low experimental cost by constructing a mathematical model based on the structural characteristics of PFASs to predict their K oc .
The quantitative structure-property relationship (QSPR) model is a theoretical prediction tool with a rapid development and a wide application range. It establishes a functional relationship between the molecular structure of compounds and their properties to effectively predict the compounds' properties [22,23]. The QSPR model can be used to predict the partition coefficient of various organic pollutants with high efficiency, such as the partition coefficient of polycyclic aromatic hydrocarbons (PAHs) between polydimethylsiloxane (PDMS) and water [24], the partition coefficient of polychlorinated biphenyls (PCBs) between low-density polyethylene and water [25], and the partition coefficient of PFASs between gas and particles [2]. To date, few studies have employed PFASs as a unique research object to construct QSPR models for predicting their log K oc [26,27]. A previous study reported the log K oc of 824 organic compounds predicted by a QSPR model, but only a few PFASs were included [26]. Due to the limitation of its data set, the application scope of the model was narrow for PFASs Another limitation of prior work on the model's establishment was that the modeling process did not fully follow the five guidelines for QSPR model construction [27,28]. Generally, for the log K oc prediction of PFASs by QSPR, it is necessary to improve the applicability and accuracy of the model. Meanwhile, the standardization of modeling also need further investigation. Owing to the issues with the above models, there is still a knowledge gap in the overall analysis of PFASs distribution mechanism between sediment/soil and water at the molecular level.
This study developed an optimal log K oc prediction model for PFASs based on the QSPR model construction guidelines. A comprehensive verification and evaluation of the model was undertaken to ensure the integrity and standardization of the modeling process and achieve a reasonable prediction of the log K oc of PFASs. The model used 22 PFASs from eight different classes as the modeling dataset, equipped with strong pertinence which greatly improves the applicability of the model for PFASs. Molecular descriptors with clear definitions were included in the model, which identified the potential mechanism of PFASs distribution between sediment/soil and water from the perspective of the molecular level quickly and accurately, facilitating a better understanding of the distribution behavior of PFASs between the two phases. This combination of effects has significant practical implications for enriching the migration theory of PFASs with different structures between sediment/soil and water, and provides a reference for predicting the deposition concentration of emerging PFASs in environmental media.

Model Construction and Validation
After stepwise linear regression, the optimal QSPR model (Equation (1)) was obtained. The model contained four molecular descriptors: molecular weight (MW), dipole moment (µ), lowest unoccupied molecular orbital energy (E LUMO ), and maximum average local ionization energy on the molecular surface (ALIE max ).  [29], coefficient of determination (R 2 ) > 0.8, multiple correlation coefficient of leave-one-out cross-validation (Q 2 LOO ) > 0.5, external validation indicators (Q 2 F1 , Q 2 F2 , and Q 2 F3 ) > 0.5, indicating that the model has sufficient goodness of fit, robustness, and predictive ability and meets the requirements of the QSPR model construction guidelines [28]. In addition, the R 2 − Q 2 LOO value of this model is less than 0.3, indicating that there is no overfitting phenomenon in this model [30]. Q 2 LOO , Q 2 F1 , Q 2 F2 , and Q 2 F3 were calculated according to the method in a previous study [29], and the calculation formulas of these parameters are presented in the supplementary materials; R 2 was obtained using SPSS 26 software (IBM SPSS Inc., Chicago, IL, USA). Notes: n: the number of data points; R 2 : coefficient of determination; Q 2 LOO : multiple correlation coefficient of leave-one-out cross-validation; RMSE: root mean square error; F: variance ratio; p: significance index; when p < 0.05, this indicates that the model is significant; Q 2 F1 , Q 2 F2 , and Q 2 F3 : external validation indicators. Figure 1 shows the error distribution of the model prediction. The prediction errors of the PFASs were randomly distributed on both sides of the baseline, and there was no obvious regularity, indicating that the built model had no obvious systematic errors. Table 2 lists the significance index (p) and variance inflation coefficient (VIF) values of the molecular descriptors contained in the model. When p < 0.05, this indicates that the molecular descriptor was significant; when VIF < 10, this indicates that there was no multicollinearity among the molecular descriptors [31]. It can be seen from the table that all molecular descriptors in the model were key descriptors, and there was no collinearity among them. p and VIF were obtained using SPSS 26 software (IBM SPSS Inc., Chicago, IL, USA).   As shown in Figure 2, the good consistency between the predicted value and t perimental value indicates that the established model has high prediction accuracy f log Koc of the PFASs.    As shown in Figure 2, the good consistency between the predicted value and th perimental value indicates that the established model has high prediction accuracy fo log Koc of the PFASs.

Application Domain
According to the Williams plot (

Application Domain
According to the Williams plot (Figure 3), none of the standardized residuals of the log K oc of the PFASs obtained from the QSPR model exceeded the thresholds (−3, 3) [1]. In the test set, leverage values (h) > warning leverage value (h*) (h* = 0.83) of 6:2 chlorinated polyfluorinated ether sulfonate (6:2 Cl-PFAES), indicating that this substance was structurally quite different from most of the PFASs in the training set; 6:2 Cl-PFAES is an emerging PFAS that has been widely used in industry as a substitute for traditional PFASs (such as perfluorooctane sulfonic acid (PFOS)) [32,33]. The QSPR model predicts that the standardized residual of its log K oc value is 0.454, which does not exceed the threshold. It can be seen that the QSPR model has a wide range of applications and strong generalization ability, which can successfully predict not only traditional PFASs but also emerging PFASs. standardized residual of its log Koc value is 0.454, which does not exceed the thresho can be seen that the QSPR model has a wide range of applications and strong genera tion ability, which can successfully predict not only traditional PFASs but also emer PFASs.

Mechanistic Interpretation of the Model
The standardized regression coefficient refers to the regression coefficient whe variables are expressed in standardized form. Since the same measurement unit is u the independent variables are more comparable [34]. The standardized regression co cients of MW, μ, ELUMO, and ALIEmax in the QSPR model were 1.048, −0.219, −0.495, −0.362, respectively. Based on a comparison of their absolute values, it can be seen tha influence of the four molecular descriptors on log Koc was MW > ELUMO > ALIEmax > μ.
MW is related to the molecular size and hydrophobicity and can reflect the effe molecules on the formation and destruction of holes in water [35]. When MW incre this increases the PFASs' molecular size and strengthens their hydrophobicity [36,37 this time, the energy of the adsorbate required for the formation of holes between w molecules will lead to stronger hydrophobic repulsion of water on the surface of the P molecules [38], thus driving the adsorption of PFASs in solid media (such as sedime soil) [17]. In this study, the QSPR model showed that MW was positively correlated log Koc. When MW increased, the log Koc of PFSAs showed an increasing trend, and log Koc of PFCAs roughly showed an increasing trend (except for perfluorobutanoic (PFBA); the possible explanation is given below). This result is consistent with the p ously reported results that the log Koc of the same type of PFASs usually increases wit increase in the carbon chain [39]. PFASs of the same class have similar structures (wit same functional groups). With the increase in the carbon chain length, its MW and hy phobicity increased, which promoted the adsorption of PFASs in solid medium an creased their log Koc. In addition, the size of MW explained the change of log Koc of d ent types of PFASs to a certain extent. For example, perfluorododecanoic acid (PFDo 6:6 perfluoroalkyl phosphinic acid (6:6 PFPiA), and n-ethyl perfluorooctane su amidoacetic acid (N-EtFOSAA) have the same carbon chain length (12 carbon ato their MW are 6:6 PFPiA > PFDoDA > N-EtFOSAA; their μ are N-EtFOSAA > PFDoD

Mechanistic Interpretation of the Model
The standardized regression coefficient refers to the regression coefficient when all variables are expressed in standardized form. Since the same measurement unit is used, the independent variables are more comparable [34]. The standardized regression coefficients of MW, µ, E LUMO , and ALIE max in the QSPR model were 1.048, −0.219, −0.495, and −0.362, respectively. Based on a comparison of their absolute values, it can be seen that the influence of the four molecular descriptors on log K oc was MW > E LUMO > ALIE max > µ.
MW is related to the molecular size and hydrophobicity and can reflect the effect of molecules on the formation and destruction of holes in water [35]. When MW increases, this increases the PFASs' molecular size and strengthens their hydrophobicity [36,37]. At this time, the energy of the adsorbate required for the formation of holes between water molecules will lead to stronger hydrophobic repulsion of water on the surface of the PFAS molecules [38], thus driving the adsorption of PFASs in solid media (such as sediment or soil) [17]. In this study, the QSPR model showed that MW was positively correlated with log K oc . When MW increased, the log K oc of PFSAs showed an increasing trend, and the log K oc of PFCAs roughly showed an increasing trend (except for perfluorobutanoic acid (PFBA); the possible explanation is given below). This result is consistent with the previously reported results that the log K oc of the same type of PFASs usually increases with the increase in the carbon chain [39]. PFASs of the same class have similar structures (with the same functional groups). With the increase in the carbon chain length, its MW and hydrophobicity increased, which promoted the adsorption of PFASs in solid medium and increased their log K oc . In addition, the size of MW explained the change of log K oc of different types of PFASs to a certain extent. For example, perfluorododecanoic acid (PFDoDA), 6:6 perfluoroalkyl phosphinic acid (6:6 PFPiA), and n-ethyl perfluorooctane sulfonamidoacetic acid (N-EtFOSAA) have the same carbon chain length (12 carbon atoms); their MW are 6:6 PFPiA > PFDoDA > N-EtFOSAA; their µ are N-EtFOSAA > PFDoDA > 6:6 PFPiA; their E LUMO are PFDoDA > 6:6 PFPiA > N-EtFOSAA; their ALIE max are N-EtFOSAA > 6:6 PFPiA > PFDoDA, and their log K oc are 6:6 PFPiA > PFDoDA > N-EtFOSAA, which shows the same ranking as MW.
E LUMO reflects the ability of molecules to receive electrons [40]. When E LUMO is more positive, it is more difficult for molecules to obtain electrons from the external environment. ALIE max is the average energy required to ionize electrons at any point in molecular space [41], which is related to their molecular electrostatic potential, electronegativity, hardness, and other properties [42]. When ALIE max is smaller, the electron activity is stronger, and electrophilic and free radical reactions are more likely to occur [43]. According to the QSPR model, E LUMO and ALIE max were negatively correlated with log K oc , which reflects the electrostatic interaction between molecules. These two molecular descriptors were used to explain why the molecular weights of perfluoroheptanoic acid (PFPeA) and perfluorohexanoic acid (PFHxA) were both larger than that of PFBA but their log K oc was smaller than that of PFBA. From the perspective of E LUMO , it is more difficult for PFBA to obtain electrons from the external environment than PFPeA and PFHxA, but the difference between the compounds is small. From the perspective of ALIE max , PFBA has stronger electronic activity than PFPeA and PFHxA and is more prone to electrophilic reactions, and the difference between the compounds is larger. Compared with PFPeA and PFHxA, PFBA may be more capable of electrostatic interaction with the external environment, increasing its log K oc . This explanation is consistent with the results reported previously that electrostatic interactions may be the main factor affecting the adsorption of short-chain PFASs in solid-phase media [44].
µ is often used to describe the intermolecular dipole-dipole interactions in QSPR studies [45]. As µ increases, ionic compounds are more easily solvated in liquid phases (water) [46], and PFASs are adsorbed with greater difficulty in solid phases (aqueous aerosol) [1]. It can be seen from the QSPR model that µ was negatively correlated with log K oc . The larger the µ value, the easier it is to solvate in water, so it is more difficult for PFASs to be adsorbed in solid media. For example, N-EtFOSAA is the precursor of perfluorooctane sulfonamide (PFOSA) [47]; the two have similar structures and they have the same carbon chain length; only the terminal functional groups are different. The MW of N-EtFOSAA is larger than that of PFOSA, and the ALLE max is smaller, but the log K oc of PFOSA is much larger than that of N-EtFOSAA. In addition to the influence of E LUMO, it is mainly due to the µ of N-EtFOSAA being relatively large (nearly threefold).

Model Comparison
The QSPR model with high R 2 was developed using the molar volume as a single molecule descriptor [27], but the robustness and external prediction ability of the model were not verified in the process of model construction, and the log K oc prediction ability of PFASs with molar volume less than 160 cm 3 ·mol −1 was limited. In this study, the robustness and external prediction ability of the developed QSPR model were verified through the Q 2 LOO and Q 2 F1 , Q 2 F2 , and Q 2 F3 . The results show that the model has good robustness and external prediction ability. For example, the model built by Brusseau had prediction errors of 1.4 and 0.5 for PFBA and PFPeA (two PFASs with molar volumes of less than 160 cm 3 ·mol −1 ) [27], while the QSPR model developed in this study had prediction errors of 0.52 and −0.13 for PFBA and PFPeA, respectively. The log K oc prediction ability of this model is better for PFASs with a smaller molar volume.
A QSPR model was developed using nine molecular descriptors based on the structural properties of 824 compounds [26], but only six PFASs were included in these compounds. The QSPR model established by Wang et al. has a lower prediction accuracy of log K oc for PFASs than the QSPR model, which only takes PFASs as the modeling data set [26]. The QSPR model developed in this study uses more PFAS modeling datasets and fewer molecular descriptors to obtain better model statistical parameters, and the log K oc prediction accuracy for PFASs is higher. For example, the QSPR model developed by Wang et al. has prediction errors of −0.38 and 0.18 for traditional PFASs (such as perfluorooctanoic acid (PFOA) and PFOS) [26], while the prediction errors of this model for PFOA and PFOS are only −0.29 and 0.07, respectively.
To sum up, compared with the models developed in previous studies (Table 3), the data set used in this study has more types and numbers of PFASs, the developed QSPR model has higher prediction accuracy, and the model has a wider application field. It covers a large number of structurally diverse PFASs, including not only traditional PFASs but also emerging PFASs. In addition, the modeling process of this study is completely based on the standard framework of the QSPR model [28], which not only establishes a simple linear relationship between the structural properties of PFASs and their log K oc but also analyzes the distribution behavior of PFASs between sediment/soil and water in detail through the mechanical interpretation of the QSPR model.

Data Collection and Processing
The log K oc values were collected from the research literature on the adsorption of PFASs in sediments and soils [17,21,[48][49][50][51][52][53], including 11 PFCAs, 5 PFSAs, 1 perfluoroalkane sulfonamide (FOSAs), 1 perfluoroalkyl phosphinic acid (PFPiAs), and 4 other PFASs. Since the experimental data in the previous studies were measured by different experimenters and under different experimental environments, in order to ensure the reliability of the data, this study first removed the outliers that obviously deviated from the overall data samples from the multiple log K oc experimental values of the same PFASs collected and then calculated the average value to develop the QSPR model.
The log K oc values of 22 PFASs ranged from 1.54 to 5.04, the span range was 3.50, the average value (mean) was 3.22, and the corresponding standard deviation (SD) was 1.11. All data fell within the interval of (mean −3SD, mean +3SD) and did not require further processing [54]. In total, 80% of the data in the dataset were randomly selected as the training set (18 PFASs) for developing the QSPR model; the remaining 20% of the data were the test set (4 PFASs) for external validation of the model. Details about PFASs, experimental values and references pertaining to the modelling and external validation sets are given in Table S1 in the Supplementary Materials.

Calculation of Molecular Descriptors
The B3LYP/6-31G* algorithm in the Gaussian program package (ver. G09W, Michael F, Wallingford, CT, USA) was used to optimize the molecular structure of the PFASs in the neutral electron ground state, and the stable molecular configuration of the PFASs with the lowest energy was obtained. The Multiwfn program (ver. 3.8, Lu T, Beijing, China) [55] calculated the optimized molecular structure of the PFASs and obtained 62 molecular descriptors, including the molecular structure features, orbital energy levels, electronegativity, atomic charge, polarity, and other physical and chemical information about the PFASs. The multiple physicochemical properties of the PFASs were successfully predicted using these molecular descriptors [1,56].

Model Development and Validation
Firstly, correlation analysis was performed between all molecular descriptors. For molecular descriptors with a correlation coefficient (R) higher than 0.9, only one molecular descriptor with a high correlation coefficient with log K oc was retained. Secondly, based on SPSS 26 software (IBM SPSS Inc., Chicago, IL, USA), the retained molecular descriptors were taken as independent variables with log K oc as the dependent variable to perform a stepwise linear regression to obtain the QSPR models containing different numbers of molecular descriptors. Lastly, the optimal QSPR model with the largest adjusted coefficient of determination (R 2 adj ) and the smallest root mean square error (RMSE) was selected as the final model in this study. R, R 2 adj and RMSE were obtained by SPSS 26 software (IBM SPSS Inc., Chicago, IL, USA).
According to the QSPR model construction guidelines [28], the QSPR model should have sufficient goodness of fit, robustness, and predictive ability. In this study, the R 2 was used to evaluate the goodness of fit of the QSPR model, the Q 2 LOO was used to evaluate the robustness of the QSPR model, the test set was used to externally validate the QSPR model, and the Q 2 F1 , Q 2 F2 , Q 2 F3 were used to evaluate the prediction ability of the QSPR model. In addition, in order to further verify the reliability of the developed model, the error distribution of the model prediction was used to evaluate whether the model had systematic errors; the p and VIF of the molecular descriptors contained in the QSPR model were used to determine whether each molecular descriptor was significant and whether there was multicollinearity among the molecular descriptors.

Application Domain
A Williams diagram [57] was used to characterize the application domain of the QSPR model, evaluate its scope of application, and determine whether there were outliers in the modeling samples. The composition of the Williams diagram and its calculation method are described in the Supplementary Materials.

Conclusions
In this study, we successfully developed an optimal QSPR model to predict the log K oc of PFASs. The dataset of this model includes 22 PFASs in eight different categories, covering the common PFASs in current industrial production and daily life, and the model has a wide range of applicability. The comprehensive verification and evaluation of the model show that the developed model has sufficient goodness of fit, robustness, and predictive ability and can accurately predict the log K oc of PFASs (within the application field defined by the model).
Through the mechanistic interpretation of the model, we found that the MW, E LUMO , ALIE max, and µ of PFASs are the main structural factors affecting the partitioning behavior between the solid and liquid phases, and the order of influence is MW > E LUMO > ALIE max > µ. Specifically, MW reflects the hydrophobic property of the compound, µ reflects the polarity of the compound, while E LUMO and ALIE max are related to the electrostatic interaction between molecules. The partitioning behavior of PFASs between the two phases is the result of the joint influence of multiple mechanisms. The hydrophobic interaction, electrostatic interaction, and dipole-dipole interaction play key roles in determining the partitioning of PFASs between the two phases. PFASs that can produce a strong hydrophobic interaction tend to be distributed in solid-phase media. The results of this study are of great significance to understanding the migration behavior and environmental fate of PFASs between sediment/soil and water, providing basic data for further environmental risk assessment.
In future research, the relationship between the log K oc of PFASs and other partition coefficients can be compared and analyzed in order to illustrate the transport of PFASs in multiphase media as well. Moreover, new PFASs with less impact on the environment can be designed based on the structural factors that affect the distribution behavior of PFASs to reduce the environmental load caused by this kind of compound.
Supplementary Materials: The following supporting information can be downloaded at https: //www.mdpi.com/article/10.3390/molecules27175610/s1. 1. The five guidelines for QSPR construction; 2. Descriptor selection; 3. Model validation 4. The application domain; Table S1. The values of log K oc for PFASs; Table S2. Description of the descriptors generated from Multiwfn; Table S3. Statistical parameters of QSPR model; Table S4. The observed and predicted log K oc values of PFASs. The values of the descriptors used in the QSPR models.