Computational Modeling of Human Serum Albumin Binding of Per- and Polyfluoroalkyl Substances Employing QSAR, Read-Across, and Docking

Per- and polyfluoroalkyl substances (PFAS) are synthetic chemicals in widespread use that have been shown to be toxic to wildlife and humans. Human serum albumin (HSA) is a known transport protein that binds PFAS at various sites, leading to bioaccumulation and long-term toxicity. In silico tools like quantitative structure-activity relationship (QSAR), read-across, and quantitative read-across structure-property relationship (q-RASPR) are proven techniques for modeling chemical toxicity based on experimental data which can be used to predict the toxicity of untested and new chemicals, while at the same time, help to identify the major features responsible for toxicity. Classification-based and regression-based QSAR models are employed in the present study to predict the binding affinities of 24 PFAS to HSA. Regression-based QSAR models revealed that the packing density index (PDI) and quantitative estimation of drug-likeness (QED) descriptors were both positively correlated with higher binding affinity, while the classification-based QSAR model showed the average connectivity index of order 4 (X4A) descriptor was inversely correlated with binding affinity. Whereas molecular docking studies suggested that PFAS with the highest binding affinity to HSA create hydrogen bonds with Arg348 and salt bridges with Arg348 and Arg485, PFAS with lower binding affinity either showed no interactions with either amino acid or only interactions with Arg348. Among the studied PFAS, perfluoroalkyl acids (PFAA) with large carbon chain length (>C10) have one of the lowest binding affinities, compared to PFAA with carbon chain length ranging from 7 to 9, which showed the highest affinity to HSA. Generalized Read-Across (GenRA) was used to predict toxicity outcomes for the top five highest binding affinity PFAS based on 10 structural analogs for each and found that all are predicted as being chronic to sub-chronically toxic to HSA. The developed in silico models presented in this work can provide a framework for designing PFAS alternatives, screening compounds currently in use, and for the study of PFAS mixture toxicity, which is an area of intense research.

After combining training and test data sets, the model classified 11 PFAS as having 'high' (H) binding affinity and 13 PFAS as having 'low' (L) binding affinity with HSA. Of the 16 PFAS in the training set, 9 were classified as having H binding affinity and 7 were classified as having L binding affinity. Among the remaining 8 PFAS in the test set, 4 were classified as having H binding affinity and 4 were classified as having L binding affinity ( Table 1)  Detailed validation metrics results are illustrated in Table 2. Values above the potency threshold are classified as L and values below the potency threshold are classified as H. As all three descriptors in the equation contribute positively, higher values for these descriptors will most likely result in L classification and lower values are more likely to result in H classification. As can be seen in the equation, X4A has the highest contribution, meaning it is the most important factor in determining the binding affinity in this model. X4A refers to the average connectivity index of order 4, while Eig12_AEA(bo) refers to the eigenvalue number12 from the augmented edge adjacency matrix weighted by bond order and DECC refers to eccentric topological indices. The receiver operating characteristic (ROC) curves in Figure 1 illustrate the perfect classification ability of the developed model (Equation (1)), where training and test ROC curves achieved 0.97 and 1 value, respectively. Examining the contribution plot in Figure 2, we are certain that X4A is the most important discriminating feature between higher and lower binding affinity of PFAS towards HSA. From the plot, it is quite evident that the higher the X4A value of a PFAS, the lower the binding affinity. Importantly, an additional two features: Eig12_AEA(bo) and DECC are equally important for the development of the model, but considering the discrimination between H and L binding groups, they don't have much clear distinction.  The receiver operating characteristic (ROC) curves in Figure 1 illustrate the perfect classification ability of the developed model (Equation (1)), where training and test ROC curves achieved 0.97 and 1 value, respectively. Examining the contribution plot in Figure  2, we are certain that X4A is the most important discriminating feature between higher and lower binding affinity of PFAS towards HSA. From the plot, it is quite evident that the higher the X4A value of a PFAS, the lower the binding affinity. Importantly, an additional two features: Eig12_AEA(bo) and DECC are equally important for the development of the model, but considering the discrimination between H and L binding groups, they don't have much clear distinction.

Regression-Based 'Small Dataset QSAR' Model for Undivided Dataset
Two PFAS had no binding affinity values; therefore, for the development of the regression-based QSAR model, we considered only 22 PFAS. As the number is quite low, we decided to take the whole dataset for modeling purposes employing 'SmallDataModeler', which is an approved modeling tool when there is not enough data to utilize training and test data sets separately. We developed two models employing multiple linear regression (MLR) and partial least squares (PLS) chemometric tools. Based on the goodness-of-fit and internal validation metrics, it was determined that the PLS model (Model 2) was slightly better than the MLR model (Model 1). The quality of both models is depicted in Table 3.

Regression-Based 'Small Dataset QSAR' Model for Undivided Dataset
Two PFAS had no binding affinity values; therefore, for the development of the gression-based QSAR model, we considered only 22 PFAS. As the number is quite lo we decided to take the whole dataset for modeling purposes employing 'SmallDataMo eler', which is an approved modeling tool when there is not enough data to utilize train and test data sets separately. We developed two models employing multiple linear regr sion (MLR) and partial least squares (PLS) chemometric tools. Based on the goodnessfit and internal validation metrics, it was determined that the PLS model (Model 2) w slightly better than the MLR model (Model 1). The quality of both models is depicted   The developed equation for Model 2 is as follows: The regression-based Small Dataset MLR model consists of 4 descriptors: MATS8m, GATS8v, PDI, and QED. The PDI descriptor refers to packing density index, defined as the ratio between the McGowan volume (Vx) and the total surface area from P_VSA-like descriptors (SAtot), or PDI = Vx SAtot . The MATS8m descriptor refers to Moran autocorrelation of lag 8 weighted by mass. The GATS8v descriptor refers to Geary autocorrelation of lag 8 weighted by van der Waals volume. The QED descriptor refers to the quantitative estimation of drug-likeness. This descriptor is based on 8 other molecular properties, which are used to obtain a set of desirability functions through asymmetric double sigmoidal ability of the property, w i is the weight applied to each function, and n is the number of desirability functions. All four descriptors in the equation have a negative contribution to the equation, meaning that higher values for each descriptor will lead to a lower EC 50 or higher binding affinity. To obtain a PFAS with low binding affinity to HSA and, therefore, lower toxicity, ideally the values for all four descriptors should be low, resulting in a more positive EC 50 . The scatter plot employing the best PLS-based model in Figure 3 (Left) shows that experimental binding affinities in terms of EC 50 are well correlated with the predicted affinities. The scatter plot shows that the points fell close to the line of perfect fit, which further supports the predictive efficacy of the developed QSAR model. The variable significance plot in Figure 3  , where di is the desirability of the property, wi is the weight applied to each function, and n is the number of desirability functions.
All four descriptors in the equation have a negative contribution to the equation, meaning that higher values for each descriptor will lead to a lower EC50 or higher binding affinity. To obtain a PFAS with low binding affinity to HSA and, therefore, lower toxicity, ideally the values for all four descriptors should be low, resulting in a more positive EC50. The scatter plot employing the best PLS-based model in Figure 3 (Left) shows that experimental binding affinities in terms of EC50 are well correlated with the predicted affinities. The scatter plot shows that the points fell close to the line of perfect fit, which further supports the predictive efficacy of the developed QSAR model. The variable significance plot in Figure 3

Read-Across Results
Based on the top 10 ToxPrint Chemotype analogues for each PFAS, GenRA predicted the toxic effects of S5, C6, C4, C5, and C12 (top five highest binding affinity to HSA in the present study). Notably, all 5 PFAS were predicted to be sub-chronically toxic to albumin (ACT scores of 1, 0.818, 0.812, 0.812, and 0.816, respectively), with S5 and C12 having ACT scores of 1 and 1, respectively were also predicted to be chronically toxic. However, it should be noted that the predictions were validated in combination with AUC values and p-values. All AUC values for each prediction were 0, and p-values ranged from 0.73 to 1. Ideally, the AUC should be greater than 0.7 and p-value should be less than 0.1. While the predictions have high ACT scores, the validation metrics show that the predictions are not that reliable due to low AUC and high p-values. One of the major reasons is we are just challenged by the coverage of PFAS toxicity data and the extent to which we can quantify the performance for target chemicals with any degree of robustness. Figure 4 depicts the top 10 ToxPrint Chemotype analogues for 6:2 fluorotelomer sulfonic acid (6:2 FTSA), the PFAS with the highest binding affinity to HSA in the studied dataset. The top 10 ToxPrint

Read-Across Results
Based on the top 10 ToxPrint Chemotype analogues for each PFAS, GenRA predicted the toxic effects of S5, C6, C4, C5, and C12 (top five highest binding affinity to HSA in the present study). Notably, all 5 PFAS were predicted to be sub-chronically toxic to albumin (ACT scores of 1, 0.818, 0.812, 0.812, and 0.816, respectively), with S5 and C12 having ACT scores of 1 and 1, respectively were also predicted to be chronically toxic. However, it should be noted that the predictions were validated in combination with AUC values and p-values. All AUC values for each prediction were 0, and p-values ranged from 0.73 to 1. Ideally, the AUC should be greater than 0.7 and p-value should be less than 0.1. While the predictions have high ACT scores, the validation metrics show that the predictions are not that reliable due to low AUC and high p-values. One of the major reasons is we are just challenged by the coverage of PFAS toxicity data and the extent to which we can quantify the performance for target chemicals with any degree of robustness.  Four of the structural analogues were common between all five molecules and contributed to the albumin toxicity prediction with known data. Potassium perfluorobutanesulfonate, an analogue for all five molecules, had subchronic toxic albumin effects at 600 mg/kg/day. N-Ethylperfluorooctanesulfonamide, an analogue for all five molecules, had subchronic toxic albumin effects at 10.1 mg/kg/day. 1,1,2,2-Tetrachloroethane, an analogue for S5, C6, C4, and C5, had subchronic toxic albumin effects at 40 mg/kg/day. Chlorethoxyfos, an analogue for S5 and C12, had subchronic toxic albumin effects at 1.25 mg/kg/day and chronic toxic albumin effects at 1.86 mg/kg/day. Toxicity data and predictions for all five sets of GenRA can be viewed in Supplementary Materials.

Docking Results
Docking interaction diagrams for the three highest binding affinities (C6, C4, and C5) and three lowest binding affinities (C1, C9, and E1) show the crucial amino acids involved in binding ( Figure 5). Interestingly, all three of the highest affinity PFAS form a hydrogen bond with Arg348 and two salt bridges between Arg485 and Arg348, as well as some hydrophobic and polar interactions. In contrast, C9 and E1 do not form any hydrogen bonds or salt bridges, only hydrophobic and polar interactions, leading to their lowest binding affinity to HSA among the studied PFAS. C1 forms two hydrogen bonds with Arg348, but no salt bridges, indicating that the salt bridges are essential to high binding affinity to HSA. It is interesting to note that despite C9 containing a carboxylate group like C4, C5 and C6, it still has one of the lowest binding affinities, likely due to its large carbon chain length (>C10), which is corroborated by our earlier work [11]. PFAA with carbon chain length ranging from 7 to 9 (C4, C5 and C6) show the highest affinity to HSA. Four of the structural analogues were common between all five molecules and contributed to the albumin toxicity prediction with known data. Potassium perfluorobutanesulfonate, an analogue for all five molecules, had subchronic toxic albumin effects at 600 mg/kg/day. N-Ethylperfluorooctanesulfonamide, an analogue for all five molecules, had subchronic toxic albumin effects at 10.1 mg/kg/day. 1,1,2,2-Tetrachloroethane, an analogue for S5, C6, C4, and C5, had subchronic toxic albumin effects at 40 mg/kg/day. Chlorethoxyfos, an analogue for S5 and C12, had subchronic toxic albumin effects at 1.25 mg/kg/day and chronic toxic albumin effects at 1.86 mg/kg/day. Toxicity data and predictions for all five sets of GenRA can be viewed in Supplementary Materials.

Docking Results
Docking interaction diagrams for the three highest binding affinities (C6, C4, and C5) and three lowest binding affinities (C1, C9, and E1) show the crucial amino acids involved in binding ( Figure 5). Interestingly, all three of the highest affinity PFAS form a hydrogen bond with Arg348 and two salt bridges between Arg485 and Arg348, as well as some hydrophobic and polar interactions. In contrast, C9 and E1 do not form any hydrogen bonds or salt bridges, only hydrophobic and polar interactions, leading to their lowest binding affinity to HSA among the studied PFAS. C1 forms two hydrogen bonds with Arg348, but no salt bridges, indicating that the salt bridges are essential to high binding affinity to HSA. It is interesting to note that despite C9 containing a carboxylate group like C4, C5 and C6, it still has one of the lowest binding affinities, likely due to its large carbon chain length (>C10), which is corroborated by our earlier work [11]. PFAA with carbon chain length ranging from 7 to 9 (C4, C5 and C6) show the highest affinity to HSA.

Dataset
Binding affinities of PFAS to HSA in terms of half maximal effective concentrations (EC50 in mM) for 24 diverse PFAS including perfluoroalkyl sulfonic acids (C4-C8), perfluoroalkyl carboxylic acids (C4-C12), mono-and polyether perfluoroalkyl ether acids, and polyfluoroalkyl fluorotelomer were collected from the literature [7]. Out of the 24 PFAS, EC50 were determined for 22 PFAS. The EC50 values were obtained from the concentration-response curves using a 4-parameter variable slope model. EC50 quantifies binding affinity, specifically quantifies the concentration of ligand at which half of the target protein is bound. EC50 also measures the concentration of ligand necessary to induce half of the maximum possible effect. Out of the 24 PFAS, the two fluorotelomer alcohols (4:2 FTOH and 6:2 FTOH) did not bind to PFAS as per experimental study; therefore, they can be classified as 'non-toxic' among the studied PFAS. Thus, all 24 PFAS were used in the development of a classification-based QSAR model and 22 PFAS were used in the regression-based QSAR model (Table 1). PFAS with EC50 of 1.45 mM or lower were classified as having H binding affinity to HSA, while the rest were classified as L binding affinity to HSA including the 2 fluorotelomer alcohols with no interactions for the classification QSAR.

Descriptor Calculation
The chemical structures for the 24 PFAS were uploaded in alvaDesc 2.0.16 [13] for the descriptors calculation. For the classification-based QSAR model, a pool of 753 2D descriptors were calculated for the 24 PFAS. For the regression-based QSAR model, a pool of 736 2D descriptors were calculated as 4:2 FTOH and 6:2 FTOH were excluded due to non-interactions with HSA. Both descriptor pools included the following types of

Dataset
Binding affinities of PFAS to HSA in terms of half maximal effective concentrations (EC 50 in mM) for 24 diverse PFAS including perfluoroalkyl sulfonic acids (C4-C8), perfluoroalkyl carboxylic acids (C4-C12), mono-and polyether perfluoroalkyl ether acids, and polyfluoroalkyl fluorotelomer were collected from the literature [7]. Out of the 24 PFAS, EC 50 were determined for 22 PFAS. The EC 50 values were obtained from the concentrationresponse curves using a 4-parameter variable slope model. EC 50 quantifies binding affinity, specifically quantifies the concentration of ligand at which half of the target protein is bound. EC 50 also measures the concentration of ligand necessary to induce half of the maximum possible effect. Out of the 24 PFAS, the two fluorotelomer alcohols (4:2 FTOH and 6:2 FTOH) did not bind to PFAS as per experimental study; therefore, they can be classified as 'non-toxic' among the studied PFAS. Thus, all 24 PFAS were used in the development of a classification-based QSAR model and 22 PFAS were used in the regression-based QSAR model (Table 1). PFAS with EC 50 of 1.45 mM or lower were classified as having H binding affinity to HSA, while the rest were classified as L binding affinity to HSA including the 2 fluorotelomer alcohols with no interactions for the classification QSAR.

Descriptor Calculation
The chemical structures for the 24 PFAS were uploaded in alvaDesc 2.0.16 [13] for the descriptors calculation. For the classification-based QSAR model, a pool of 753 2D descriptors were calculated for the 24 PFAS. For the regression-based QSAR model, a pool of 736 2D descriptors were calculated as 4:2 FTOH and 6:2 FTOH were excluded due to non-interactions with HSA. Both descriptor pools included the following types of descriptor classes: constitutional indices, ring descriptors, topological indices, walk and path counts, connectivity indices, information indices, 2D matrix-based descriptors, 2D autocorrelations, Burden eigenvalues, P_VSA-like descriptors, ETA indices, drug-like indices, MDE descriptors, chirality descriptors, atom-centered fragments, atom-type E-state indices, charge descriptors, edge adjacency indices, functional group counts, molecular properties, pharmacophore descriptors, and 2D atom pairs.

QSAR Modeling
The classification-based QSAR model was developed to identify major discriminatory features between the higher to lower binding affinity classes of PFAS, using genetic algorithm-linear discriminant analysis (GA-LDA) with the 'ClassificationBasedQSAR v1.0.0 tool by DTC Lab Software [14]. The classification dataset was divided into a training and a test set for modeling with random method at a ratio of 2:1, which resulted in n test = 8 and n training = 16. Due to the small number of data points, in case of the regressionbased QSAR modeling, the 'Small Dataset Modeler' tool from DTC Lab Software [14] was used to utilize the whole data set for the exhaustive double cross-validation approach, which does not require dataset division. SmallDataModeler v1.0 by DTC Lab Software was employed using 3 compounds in the validation set by genetic algorithm-multiple linear regression (GA-MLR) selection. The SmallDataModeler employs an exhaustive double cross-validation approach and a set of optimal model selection techniques including consensus predictions for performing the small-dataset QSAR modelling. It performs four basic steps, i.e., (i) data pre-treatment, (ii) model development using exhaustive double cross-validation approach, (iii) selection of optimal model, and (iv) model validation (both internal and external).

Validation, Applicability Domain, and Randomization
Accuracy, Sensitivity, Specificity, Precision and F-measure, geometric means (G-means), Cohen's kappa, Matthew's correlation coefficient (MCC), and receiver operating characteristic (ROC) were evaluated as validation metrics for classification-based QSAR [15]. The area under the ROC curve (AUROC) evaluates the performance of a diagnostic variable, where an AUROC of 1 is ideal and an AUROC of 0.5 indicates a random guess, while the regression-based QSAR model and q-RASAR models were validated using goodness-of-fit (R 2 ), internal validation metric leave-one-out cross-validation (Q 2 LOO ), r m 2 metrics, and mean absolute error (MAE (95%) ). The mathematical definition of major statistical metrics is listed in Table 4.
Here, W g = cross-product matrix for within group variance, B g = cross-product matrix for between group variance, σ 1 is the standard deviation of population 1, s 1 is the standard deviation of the sample drawn from population 1, σ 2 is the standard deviation of population 2, s 2 is the standard deviation of the sample drawn from population 2, λ i = eigen value, det = determinant of a matrix, TP = true positive, FN = false negative, TN = true negative, FP = false positive, λ = Wilk's lambda; P r (a): relative observed agreement between the predicted classification of the model and the known classification; P r (e): hypothetical probability of chance agreement, Y obs = observed response, Y calc = calculated response, Y = mean response of the analyzed set (any), Y training = mean response of training set, Y obs(training) = observed response of the training set, Y pred(training) = predicted response of the training set, r 2 = squared correlation coefficient, and r 0 2 = squared correlation coefficient with zero intercept.

Docking Study
Molecular docking was performed to identify important interactions between the 24 PFAS under study and the protein structure of HSA in complex with PFOA (PDB: 7AAI). The ligands and protein were prepared with the LigPrep [16] and Protein Preparation tools on Maestro module of Schrodinger 2023. The Receptor Grid Generation tool generated a grid file around the top docking site, with an enclosed size of 20 Å. The ligands were docked using the extra precision (XP) Glide module [17] of Schrodinger 2023. The docking method was validated by redocking the co-crystallized ligand PFOA and calculating the RMSD between the docked ligand and the original, which was 0.092 Å. Once the docking protocol and binding active site were validated, we docked all 24 PFAS. Later glide docking energies were considered for correlations between the binding affinity of PFAS to HSA. Table 4. Mathematical formula of statistical validation metrics employed in the present classificationand regression-based QSAR models.

Metrics
The parameters r 2 and r 0 2 are defined as follows: The terms k and k' are defined as: The Y obs and Y pred values have been scaled at the beginning using the following formula: Scaled r m 2 metrics for internal predictivity

Read-Across
Chemical read-across (RA) is one of the major in silico approaches to fill data gaps when there is not enough data available. The available data of a particular substance (referred to as the source) are utilized to forecast the corresponding endpoint(s) for another substance (known as the target) that lacks data but is deemed 'similar' in certain aspects, such as structural similarity among chemicals. A disadvantage of the RA approach is that it is subjective and driven by an expert's specific approach, which can lead to issues of reproducibility and scalability. An alternative to RA is Generalized Read-Across (GenRA) [18], which is an algorithmic developed by the US EPA that enables more objective and reproducible predictions of in vivo toxicity and in vitro bioactivity. In the present study, we used GenRA to predict the toxicity effects of the top five highly bound PFAS to HSA from the experimental data: S5, C6, C4, C5, and C12. For each molecule, the top 10 structural analogues were identified with pairwise similarity metrics, based on ToxPrint Chemotypes. A data matrix was generated for each set of the 10 analogues, displaying toxicity effects from available data. This approach predicted the toxicity effects of each of the five molecules under various categories including chronic, multigenerational, developmental, subacute, and subchronic effects. Based on the GenRA, we predicted a total of 307 chronic and subchronic toxicity endpoints for HSA resulting from PFAS. Results were further validated by similarity-weighted activity scores (ACT), AUC, and p-values. The complete workflow employed in the present study is depicted in Figure 6.

Overview and Conclusions
We have employed multiple in silico modeling approaches like QSAR, RA, and docking to model HSA binding affinity of PFAS. We found that PFAS with long carbon chains (>C10) have lower binding affinities with HSA compared to shorter chain PFAS (C7 to C9), which interacted with HSA with the highest affinity. The RA study also predicted and confirmed that the top five highest binding affinity PFAS as per studied data are also chronic to sub-chronically toxic to HSA. The developed models are not only important to predict the binding affinities of the 25 PFAS tested, but also can be efficiently employed to predict new and untested PFAS' binding to HSA considering applicability domain in

Overview and Conclusions
We have employed multiple in silico modeling approaches like QSAR, RA, and docking to model HSA binding affinity of PFAS. We found that PFAS with long carbon chains (>C10) have lower binding affinities with HSA compared to shorter chain PFAS (C7 to C9), which interacted with HSA with the highest affinity. The RA study also predicted and confirmed that the top five highest binding affinity PFAS as per studied data are also chronic to sub-chronically toxic to HSA. The developed models are not only important to predict the binding affinities of the 25 PFAS tested, but also can be efficiently employed to predict new and untested PFAS' binding to HSA considering applicability domain in mind. The docking study also offered major insights on amino acid interactions between PFAS and HSA, which will aid in the identification of potentially hazardous PFAS. Additionally, these models can be used to evaluate the interactions of PFAS mixtures, which is what humans and other biota are exposed to in the real world, and is an area of intense research.