Application of Machine Learning Methods to Predict the Air Half-Lives of Persistent Organic Pollutants

Persistent organic pollutants (POPs) are ubiquitous and bioaccumulative, posing potential and long-term threats to human health and the ecological environment. Quantitative structure–activity relationship (QSAR) studies play a guiding role in analyzing the toxicity and environmental fate of different organic pollutants. In the current work, five molecular descriptors are utilized to construct QSAR models for predicting the mean and maximum air half-lives of POPs, including specifically the energy of the highest occupied molecular orbital (HOMO_Energy_DMol3), a component of the dipole moment along the z-axis (Dipole_Z), fragment contribution to SAscore (SAscore_Fragments), subgraph counts (SC_3_P), and structural information content (SIC). The QSAR models were achieved through the application of three machine learning methods: partial least squares (PLS), multiple linear regression (MLR), and genetic function approximation (GFA). The determination coefficients (R2) and relative errors (RE) for the mean air half-life of each model are 0.916 and 3.489% (PLS), 0.939 and 5.048% (MLR), 0.938 and 5.131% (GFA), respectively. Similarly, the determination coefficients (R2) and RE for the maximum air half-life of each model are 0.915 and 5.629% (PLS), 0.940 and 10.090% (MLR), 0.939 and 11.172% (GFA), respectively. Furthermore, the mechanisms that elucidate the significant factors impacting the air half-lives of POPs have been explored. The three regression models show good predictive and extrapolation abilities for POPs within the application domain.


Introduction
With the continuous development of the chemical industry, there are an increasing number of persistent organic pollutants (POPs) that are synthesized due to human activities.They are transmitted through specific environmental media, which not only pose potential risks to ecosystems but also pose potential threats to human health [1].POPs are unavoidable byproducts generated during the processes of using pesticides, industrial chemicals, urban waste, wood combustion, and automobile emissions.POPs are hydrophobic organic compounds that are difficult to degrade and have low water solubility and high toxicity [2].The severity and persistence of POP pollution have always been important issues affecting global and human health.The Seveso chemical pollution incident in Italy and the Love Canal pollution incident in the United States, both occurring in the 20th century, were environmental events caused by POP contamination.However, with the signing of the Stockholm Convention in the 21st century, countries around the world have begun to actively strengthen their management and control efforts to identify, reduce, and eliminate the release and use of POPs [3].Due to the global nature of environmental pollution caused by POPs, it is urgently necessary to determine the physicochemical properties required for comprehensive risk assessment of POPs [4].This will further reveal their environmental behavior, ecological risks, and bioaccumulative effects.
Performing experiments to obtain the necessary data of POPs can be a costly and time-consuming task, and handling hazardous or reactive chemicals can pose difficulties [5].The quantitative structure-activity relationship (QSAR) is currently the most widely used computational method that predicts the biological activity, properties, and toxicity of compounds by deriving descriptors from their chemical structures [6,7].This method is particularly suitable for dangerous, toxic, and unstable compounds.It is based on existing experimental data and utilizes techniques such as artificial intelligence, machine learning, and mathematical modeling to establish the relationship between the structural characteristics of POPs and their properties, behavior, and toxicity, thereby developing predictive models.QSAR studies have demonstrated several characteristics such as comprehensiveness, theoretical basis, intelligence, programmability, and practicality [8][9][10].Therefore, they are considered cost-effective tools for preventing human health and environmental safety hazards caused by toxic substances [11][12][13].Molecular fingerprint QSAR (MF-QSAR) is simpler and more efficient approach to QSAR modeling, where molecular fingerprints encode the molecular features of compounds as binary vectors, with the positions and values of the vectors representing the structural information of the compounds.Compared to molecular descriptors, QSAR models based on molecular fingerprints have advantages such as faster computation speed, smaller prediction errors, and a more comprehensive representation of molecular structures [14][15][16].
In previous literature, there have been reports on the use of various theoretical descriptors to construct QSAR models for predicting the degradation rates of POPs in air [17].Compared to the previous studies, Khan et al. utilized the partial least squares (PLS) regression method to establish a QSAR model for the average half-life of 302 different POPs in air [7].They employed a genetic algorithm (GA) to screen molecular descriptors and obtained the following parameter results: the optimal model consisted of six descriptors, with an R 2 value of 0.663 and Q 2 value of 0.640.Gramatica et al. utilized more sophisticated 3D descriptors to construct QSAR models for the air half-life of 60 POPs [17].Although the models achieved an R 2 of 0.850 and Q 2 of 0.80, they did not validate the predictive quality of unknown compounds with any testing or external datasets.Zhu and Tao employed seven different machine learning methods to construct 13 quantitative structure-property relationship (QSPR) models for predicting the polydimethylsiloxane-air partition coefficient (K PDMS-air ) [18].They found that the gradient boosting decision tree (GBDT) model demonstrated superior predictive accuracy and interpretability, with parameter results of R 2 adj = 0.995 and Q 2 BOOT = 0.980.In the current work, the average and maximum half-lives of 60 POPs in air were obtained from previous literature and their molecular structures are listed in Figure S1 [17].The main differences among these pollutants are their degree of chlorination and the position of carbon-chlorine bonds [19].To enhance the accuracy of the model predictions, six QSAR models were built to explore correlations between the mean and maximum half-life in air with selected molecular descriptors and molecular fingerprints.The model is built using PLS, genetic function approximation (GFA), and multiple linear regression (MLR) techniques.The robustness and predictive stability of the constructed models are evaluated using internal and external validation methods [20].The resulting QSAR model provides predictive values and characterization of the outcomes.

Results
In the present study, we employed the stepwise regression (SR) method for variable selection.The SR method requires setting a significance level for variable entry and variable removal to assess the significance of the dependent variable.If a variable significantly improves the explanatory power of the model, it is included in the model.Conversely, if a variable does not contribute significantly to the model, it is removed.Ultimately, we obtained a regression model that includes the optimal combination of variables.
In the process of developing QSAR models, selecting suitable molecular descriptors can provide information about the molecular structure and properties and enhance the accuracy and stability of the model predictions.To determine the optimal combination of descriptors, we employed stepwise regression analysis to construct QSAR models for POPs with mean and maximum half-lives in air and performed analyses under different threshold conditions.Figure 1 illustrates the relationship between the model coefficient R 2 and different threshold values F for three regression methods.It can be observed that when the threshold value F is between 0.05 and 0.15, the R 2 values for all three regression methods reach their maximum.Specifically, PLS has an R 2 value of 0.793.MLR has an R 2 value of 0.818, and GFA has an R 2 value of 0.792.In addition, five descriptors were identified as most closely related to the air mean and maximum half-life of POPs.Table 1 presents the selected molecular descriptors for different threshold conditions.
Molecules 2023, 28, x FOR PEER REVIEW 3 of 18 versely, if a variable does not contribute significantly to the model, it is removed.Ultimately, we obtained a regression model that includes the optimal combination of variables.
In the process of developing QSAR models, selecting suitable molecular descriptors can provide information about the molecular structure and properties and enhance the accuracy and stability of the model predictions.To determine the optimal combination of descriptors, we employed stepwise regression analysis to construct QSAR models for POPs with mean and maximum half-lives in air and performed analyses under different threshold conditions.Figure 1 illustrates the relationship between the model coefficient R 2 and different threshold values F for three regression methods.It can be observed that when the threshold value F is between 0.05 and 0.15, the R 2 values for all three regression methods reach their maximum.Specifically, PLS has an R 2 value of 0.793.MLR has an R 2 value of 0.818, and GFA has an R 2 value of 0.792.In addition, five descriptors were identified as most closely related to the air mean and maximum half-life of POPs.Table 1 presents the selected molecular descriptors for different threshold conditions.HOMO_Energy_DMol3, ES_Sum_aasC, Jurs_FNSA_3, SC_3_P, ALogP98 F6 = 0.8-0.9 Jurs_FNSA_3, Num_RingFusionBonds, SC_3_P, ALogP98 F7 = 0.9-0.99ALogP98, HOMO_Energy_DMol3, Num_AtomClasses, Jurs_FNSA_3, BIC

Dividing the Dataset
The structure of a molecule is responsible for its biological activity and chemical reactivity, so compounds with similar physical and chemical properties or structures tend to have similar activities [21].We used an unsupervised learning method, principal component analysis (PCA), to classify a dataset containing 59 compounds, dividing the dataset into a training set and a test set according to the aggregation of these compounds in the principal component space with the ratio of 4:1.By observing the spatial distribution of each compound, we divided the dataset into seven categories, where G1-G7 rep-  HOMO_Energy_DMol3, ES_Sum_aasC, Jurs_FNSA_3, SC_3_P, ALogP98 F5 = 0.7-0.8 HOMO_Energy_DMol3, ES_Sum_aasC, Jurs_FNSA_3, SC_3_P, ALogP98 F6 = 0.8-0.9 Jurs_FNSA_3, Num_RingFusionBonds, SC_3_P, ALogP98 F7 = 0.9-0.99ALogP98, HOMO_Energy_DMol3, Num_AtomClasses, Jurs_FNSA_3, BIC

Dividing the Dataset
The structure of a molecule is responsible for its biological activity and chemical reactivity, so compounds with similar physical and chemical properties or structures tend to have similar activities [21].We used an unsupervised learning method, principal component analysis (PCA), to classify a dataset containing 59 compounds, dividing the dataset into a training set and a test set according to the aggregation of these compounds in the principal component space with the ratio of 4:1.By observing the spatial distribution of each compound, we divided the dataset into seven categories, where G1-G7 represent cycloalkanes, tricyclic compounds, chlorinated biphenyls, hexachlorocyclohexanes, mixtures of chlorinated biphenyls and alkynes, insecticides, and dibenzofurans and their derivatives, respectively.In a two-dimensional graph, the arrows represent the principal component loadings, which reflect the correlation coefficients between the original variables and the principal components.The values of the projections of the arrows onto the coordinate axes represent the positive or negative correlation and the magnitude between the variables and the principal components.The large circles represent confidence ellipses (usually 95% confidence intervals), and the points outside the circles are not statistically significant.The small dots represent the sample points, and the distances between the lines connecting the sample points reflect the similarity between the variables.If the distances between the lines connecting the sample points of each group are short, it indicates that there is less variability between the samples.Based on the information shown in Figure 2 and Table S1, it can be observed that the cumulative variance contribution of PC1 and PC2 is 70.62%, and all eigenvalues are greater than 1.PC1 and PC2 explained 40.42% and 30.20% of the variance, respectively.Therefore, it can be concluded that the first three principal components effectively capture information on various indicators of persistent pollutants.Among them, there is a strong correlation between the mean and maximum air half-lives of POPs and the descriptor HOMO_Energy_DMol3.Two-dimensional PCA plots show the importance of each variable in the two principal components: log air half-lives (mean and max) and HOMO_Energy_DMol3 in the first principal component, while Dipole_Z, SA_score_Fragments, and SC_3_P were mainly indicated in the second principal component.
resent cycloalkanes, tricyclic compounds, chlorinated biphenyls, hexachlorocyclohex anes, mixtures of chlorinated biphenyls and alkynes, insecticides, and dibenzofuran their derivatives, respectively.In a two-dimensional graph, the arrows represent the principal component loadings, which reflect the correlation coefficients between the original variables and the principal components.The values of the projections of the rows onto the coordinate axes represent the positive or negative correlation and the magnitude between the variables and the principal components.The large circles re sent confidence ellipses (usually 95% confidence intervals), and the points outside th circles are not statistically significant.The small dots represent the sample points, an distances between the lines connecting the sample points reflect the similarity betwe the variables.If the distances between the lines connecting the sample points of each group are short, it indicates that there is less variability between the samples.Based the information shown in Figure 2 and Table S1, it can be observed that the cumulat variance contribution of PC1 and PC2 is 70.62%, and all eigenvalues are greater than PC1 and PC2 explained 40.42% and 30.20% of the variance, respectively.Therefore, be concluded that the first three principal components effectively capture informatio various indicators of persistent pollutants.Among them, there is a strong correlation between the mean and maximum air half-lives of POPs and the descriptor HO-MO_Energy_DMol3.Two-dimensional PCA plots show the importance of each vari in the two principal components: log air half-lives (mean and max) and HO-MO_Energy_DMol3 in the first principal component, while Dipole_Z, SA_score_Fragments, and SC_3_P were mainly indicated in the second principal com ponent. .

Developing QSAR Models
We employed PLS, MLR, and GFA to establish quantitative relationship models between molecular descriptors and air half-lives.The distributions of experimental and predicted values for mean and maximum half-lives are shown in Figures 3 and 4. All data points are scattered around the best-fit line, indicating the consistency between the predicted and experimental values of the constructed models and further confirming the acceptability of the models [22].To visually represent the parameters of each model, we appropriately normalized the data and provided the metrics R 2 , R 2 text , RE, MAE test , and RMSE test for three regression models (Figure 5).The fitting effects, hyperparameters (Table 2), and statistical parameters (Table 3) of the three regression models are shown below: Molecules 2023, 28, x FOR PEER REVIEW 6 of 19       According to Table 3, the statistics of the three models show that they perform satisfactorily in terms of internal and external predictive ability.The parameters of the three models constructed in this study meet the criteria [23] (except for GFA), with R 2 > 0.70 and Q 2 > 0.60, and R 2 − Q 2 < 0.3, indicating good model fit, robustness, and predictive performance.By comparing the comprehensive performance parameters of the three models, it can be seen that the MLR model performs the best in all aspects among the three models, followed by the PLS model, and finally the GFA model.The R 2 and R 2 adj of the MLR model are 0.931 and 0.915, respectively.R 2 represents the goodness of fit to the observed values, and R 2 adj takes into account the influence of model complexity, which can better evaluate the predictive ability of the model.The RMSE test and MAE test values of the MLR model are all lower than the values of the other two models.This indicates that the MLR model has smaller prediction errors and more stable prediction performance.The fitting graph of the model also demonstrates a strong correlation between the experimental and predicted values in both the training and validation datasets, further confirming the good quality of the model.The MLR method was chosen to predict 10 unknown POP compounds for external validation to make our prediction results more convincing [20], and the determination coefficients (R 2 ) and relative errors (RE) for half-life are 0.919 and 3.156, further indicating that our model has good predictive power (Tables S3 and S4).

Discussion
By interpreting and analyzing the molecular descriptors of the model, insights into the molecular structures related to the mean and maximum half-lives of POPs in air are provided.In this study, a total of five molecular descriptors, namely HOMO_Energy_DMol3, Dipole_Z, SAscore_Fragments, SC_3_P, and SIC, were selected to build QSAR models based on PLS, MLR, and GFA.According to the regression equation of the model, it can be inferred that the descriptors HOMO_Energy_DMol3 and SAscore_Fragments have significant contributions to the prediction of the air half-life of POPs.From the regression coefficient in the formula, it can be seen that the values of the mean and maximum half-lives of air increase with the values of Dipole_Z, SAscore_Fragments, SC_3_P, and SIC and decrease with the value of HOMO_Energy_DMol3.
HOMO_Energy_DMol3 is a molecular descriptor that utilizes the DMol3 code to measure the highest occupied molecular orbital energy level of a molecule.HOMO represents the highest occupied molecular orbital and is used to determine the energy level of molecular orbitals as an indicator of electron-releasing capacity.Certain chlorinated organic compounds, such as DDT, DDE, and DDD, have lower HOMO energies.Since they are chlorinated compounds with high electronegativity, they attract surrounding electrons and reduce their electron-giving ability, making them less reactive and susceptible to electron transfer reactions, thus affecting the degradation rate of the compounds.On the contrary, some compounds that do not contain chlorine atoms, such as acenaphthene and dibenz[a,h]anthracene, have higher HOMO energies.These compounds contain only carbon, hydrogen, and other atoms and therefore have fewer electrons in the electron cloud.They are relatively more reactive and are more readily degraded or oxidized, thus shortening the degradation half-life of the compounds in air.
Dipole_Z is used to measure the dipole nature of a compound.Due to the geometry of the molecule and the inhomogeneity of the distribution of the electron cloud, a partial separation of positive and negative charges between the carbon and chlorine atoms occurs, forming a complex dipole.When the position of the carbon-chlorine bond is closer to the center of the molecule, the compound has a smaller difference in electronegativity, a more uniform charge distribution within the molecule, and a lower dipole.On the contrary, when the position of the carbon-chlorine bond is closer to the end of the molecule, the compound has a larger difference in electronegativity, a higher energy of chemical bonding, and a higher dipole.In general, organic pollutants with higher Dipole_Z values will have the carbon-chlorine bond position closer to the center of the molecule, which will be more easily protected by the surrounding groups.The compounds will be in a more stable state and therefore may be more difficult to degrade.Organic pollutants with lower Dipole_Z values have carbon and chlorine bonds closer to the end of the molecule, which are more susceptible to photolysis, oxidation, etc. and form larger dipole molecules, resulting in an uneven distribution of charge and a larger dipole moment of the compounds, which exacerbate the degradation of POPs.SAscore_Fragments, on the other hand, is a molecular synthetic accessibility score based on fragment contributions that describe the surface area of the molecule at a macroscopic level and reflect the exposure of the molecule to the environment.The main definition of the score is based on the similarity to structural features observed in the PubChem subset, as well as estimates of synthetic accessibility to unusual loops and many stereocenters.Its value generally lies between 1 and 10, and the molecules with a high SAscore are more difficult to synthesize, while molecules with low SAscore values are easier to synthesize.The synthesis of compounds is subject to some variability and complexity due to various factors, including reaction conditions, reaction by-products, and chemical stability.For example, 2,2 ,3,4,5-pentachlorobiphenyl contains more chlorine atoms and requires multiple chlorination reactions, which may produce some non-target products and increase the difficulty of reaction and purification.In contrast, 3-chlorobiphenyl contains only one chlorine atom, and its molecular structure is simpler and relatively unstable, so it is easier to carry out the chlorination reaction.Therefore, the synthesis of 2,2 ,3,4,5-pentachlorobiphenyl was relatively more difficult than the synthesis of 3-chlorobiphenyl.It was proven that the SAscore value of 2,2 ,3,4,5-pentachlorobiphenyl (0.81594) was significantly higher than the SAscore value of 3-chlorobiphenyl (0.64936), and the air average and maximum half-life values of 2,2 ,3,4,5-pentachlorobiphenyl were higher than those of 3-chlorobiphenyl.
SC_3_P is a descriptor for the number and type of structural units in a molecule, which has an important influence on the properties and reactivity of the molecule.A larger value of SC_3_P indicates that the compound has a more diffuse molecular shape, a larger number of structural units, and a higher degree of volatility and lipophilicity, which results in a shorter degradation of the POPs in the atmosphere.DDT, DDE, and DDD, for example, are organochlorine compounds with a single structure (containing only benzene rings and chlorine atoms) and a regular molecular shape that are mainly removed from the atmosphere by gas-phase wet deposition and rainfall.In contrast, acenaphthene and dibenz[a,h]anthracene are polycyclic aromatic hydrocarbon compounds with flat structures and conjugated systems, which are less stable and photostable, more susceptible to photochemical and oxidative reactions, and therefore have faster degradation rates in air.It should be noted that although the SC_3_P values of DDT, DDE, and DDD were 0.50923, 0.52037, and 0.57366, respectively, which were significantly lower than the SC_3_P values of acenaphthene (0.76754) and dibenz[a,h]anthracene (0.65625), acenaphthene and dibenz[a,h]anthracene had smaller air mean and maximum half-life values than DDT, DDE, and DDD.
SIC is a metric used to characterize the structural complexity of a compound.The structural complexity of a compound is evaluated by considering several factors, such as the number and type of chemical bonds as well as the arrangement of atoms in the molecule.A higher SIC value indicates a more complex structure of the compound.For example, DDT, DDE, and DDD are larger, single molecules with relatively looser spaces and fewer chemical bonds within the molecule.As a result, they are more susceptible to attack and decomposition by oxygen, water, or other chemicals in the environment.In contrast, acenaphthene and dibenz[a,h]anthracene are large molecules composed of multiple aromatic rings.Due to the smaller space inside the molecule, the atoms in the molecule are close to each other, possessing more chemical bonds and ways of connecting the atoms.Therefore, they are relatively stable and difficult to decompose in the environment.In summary, higher SIC values are usually associated with slower and more difficult degradation of compounds in air.
According to the OECD guidelines, a practical QSAR model should have a clear applicability domain (AD) [24].In this study, we used the leverage value to quantitatively measure the extent of extrapolation and evaluate the applicability range of the model.According to Figure 6a, most of the compounds in the training set and validation set are within an acceptable range, except for the dieldrin and p, p'-DDE compounds.This indicates that the training data for the model are representative.From Figure 6b, it can be observed that there are a total of four substances in the dataset with δ > 3, which are defined as outliers.Among them, the training set contains three compounds (2,3 ,4,4tetrachlorobiphenyl, benzo[b]fluoranthene, and benzo[a]pyrene) with δ > 3, indicating that these substances have a significant impact on the construction of the model.Furthermore, the test set includes one compound (endrin) with h < h* and δ > 3, suggesting that the structure of these two compounds significantly differs from the compounds in the training set used to build the model.

Data Collection and Partitioning
In order to make the experimental data conform to a normal distribution [25][26][27][28], the logarithmic numerical forms of the air mean half-life and maximum half-life of the 60 POPs were collected from previous studies (Table 4) [17].The air mean half-life values range from −0.14 to 7.33 and the air maximum half-life values range from 0.04 to 7.62.The 60 POPs include polycyclic aromatic hydrocarbons (PAHs), polychlorinated biphenyls (PCBs), halogenated benzenes (HBs), dioxins, furans, and pesticides.The atmospheric half-life, which is the time required for the content of POPs to decrease to half of the initial content after entering the atmosphere, was used to study the degree of degradation and dispersion pattern of these pollutants and is important for assessing their ecological risks.The division of the dataset into training and test sets is an important step in performing QSAR modeling.The training set is used for model development, while the test set compounds are used for model validation [29].Based on OECD guidelines and rules of thumb, the entire dataset is usually randomly divided into a training set (containing 47 compounds, with mirex excluded for its high impact on the model) and a test set (containing 12 compounds) in a certain ratio (usually 4:1).However, it is difficult to ensure the diversity of compounds in the training set using the randomized division method, and thus there is some uncertainty.Therefore, in order to improve the generalization ability and prediction performance of the model, this study introduces suitable molecular descriptors by setting different threshold ranges, and then randomly divides the training set and test set according to the ratio of 4:1. Figure 3 shows the comparison results of the correlation coefficient R 2 of the QSAR models constructed by the three regression methods under different threshold ranges.It is found that the R 2 values of the three QSAR models are the largest in the range of threshold F from 0.05 to 0.15, which initially indicates that the models are better fitted.Finally, we selected the molecular descriptors introduced with the threshold F in the range of 0.05 to 0.15 and used PCA to convert the properties of the compounds (molecular descriptors and half-life properties) into a set of principal component scores.The aggregation of the compounds on the spatial axes of the principal components can be used to compare the similarity between different compounds, thus identifying compounds with similar properties.Therefore, not only was the best combination of molecular descriptors selected in this study but the diversity of the training and test sets was also ensured.

Data Collection and Partitioning
In order to make the experimental data conform to a normal distribution [25][26][27][28], the logarithmic numerical forms of the air mean half-life and maximum half-life of the 60 POPs were collected from previous studies (Table 4) [17].The air mean half-life values range from −0.14 to 7.33 and the air maximum half-life values range from 0.04 to 7.62.The 60 POPs include polycyclic aromatic hydrocarbons (PAHs), polychlorinated biphenyls (PCBs), halogenated benzenes (HBs), dioxins, furans, and pesticides.The atmospheric half-life, which is the time required for the content of POPs to decrease to half of the initial content after entering the atmosphere, was used to study the degree of degradation and dispersion pattern of these pollutants and is important for assessing their ecological risks.The division of the dataset into training and test sets is an important step in performing QSAR modeling.The training set is used for model development, while the test set compounds are used for model validation [29].Based on OECD guidelines and rules of thumb, the entire dataset is usually randomly divided into a training set (containing 47 compounds, with mirex excluded for its high impact on the model) and a test set (containing 12 compounds) in a certain ratio (usually 4:1).However, it is difficult to ensure the diversity of compounds in the training set using the randomized division method, and thus there is some uncertainty.Therefore, in order to improve the generalization ability and prediction performance of the model, this study introduces suitable molecular descriptors by setting different threshold ranges, and then randomly divides the training set and test set according to the ratio of 4:1. Figure 3 shows the comparison results of the correlation coefficient R 2 of the QSAR models constructed by the three regression methods under different threshold ranges.It is found that the R 2 values of the three QSAR models are the largest in the range of threshold F from 0.05 to 0.15, which initially indicates that the models are better fitted.Finally, we selected the molecular descriptors introduced with the threshold F in the range of 0.05 to 0.15 and used PCA to convert the properties of the compounds (molecular descriptors and half-life properties) into a set of principal component scores.The aggregation of the compounds on the spatial axes of the principal components can be used to compare the similarity between different compounds, thus identifying compounds with similar properties.Therefore, not only was the best combination of molecular descriptors selected in this study but the diversity of the training and test sets was also ensured.

Calculation and Filtering of Molecular Descriptors
Molecular descriptors are properties of chemical structures that are measured experimentally or derived theoretically [30].They are highly related to the fitting performance and predictive ability, so the selection of molecular descriptors is crucial in QSAR model research.When selecting molecular descriptors, the following principles should be followed: (1) the chosen molecular descriptors should have interpretable physical and chemical significance and be relevant to the research question; (2) the chosen molecular descriptors should be as diverse as possible to cover multiple aspects of molecular structure and properties; (3) the chosen molecular descriptors should be associated with the target property or activity; (4) the chosen molecular descriptors should be computationally stable and have good repeatability.
First, simplified molecular input line entry system (SMILES) structures for 60 POPs were retrieved from PubChem (https://pubchem.ncbi.nlm.nih.gov(accessed on 23 November 2022)) [31].Second, the chemical structures of 60 POPs were built with Maestro and processed with Ligprep in Schrödinger.Discovery Studio (DS) 3.5.software was then used to calculate approximately 586 molecular descriptors.These descriptors included about 10 2D, 3D, structural, thermodynamic, topological, and spatial descriptors [32].Finally, the construction of the QSAR model was continued using the PLS, MLR, and GFA regression methods built into the commercial DS 3.5 molecular simulation software.For descriptors, we removed constant terms or descriptors that were close to constant and processed descriptors with missing values and outliers.To reduce redundancy in the descriptor data matrix [33], we used the stepwise multiple regression method to eliminate some collinear descriptors.The selected molecular descriptors should meet the inclusion threshold of F = 0.05 and exclude the threshold of F = 0.15.Additionally, the variance inflation factor (VIF) should be less than 10, and the significance level (p) should be less than 0.001.Finally, a Pearson correlation analysis was conducted by using Origin 2020 software.Following the specified requirements, five optimal descriptors were obtained.There is a good correlation between these parameters and the half-life of POPs [34].The meanings of each descriptor are shown in Table 5.The correlation analysis among the five descriptors is illustrated in Figure 7, and the descriptor correlation matrix is presented in Table S2.
Table 5.The names and meanings of the five descriptors used in this study.

HOMO_Energy_DMol3
The energy of the highest occupied molecular orbital

Species Designation Description
Molecular descriptors

HOMO_Energy_DMol3
The energy of the highest occupied molecular orbital Dipole_Z Component of the dipole moment along the z-axis SAscore_Fragments Fragment contribution to SAscore SC_3_P Subgraph counts SIC Structural information content.Graph-theoretical info content descriptor which differentiates molecules according to their size, degree of branching, and flexibility ECFP is an advanced chemical fingerprinting technique that can rapidly characterize the structural features of molecules, regardless of their specific molecular family [35].This technique utilizes standardized resources and is capable of evaluating an individual's exposure level by detecting the metabolites of various pollutants in the body.It achieves this by transforming the structural features of molecules into a series of bit strings.Molecular fingerprints can provide a better representation of the functional groups, chemical bonds, and chemical structures of compounds, making it easier to explain the chemical properties of compounds.To improve the quality of the prediction model, this study further introduces molecular fingerprints based on the use of molecular descriptors.Figure S2 shows the five typical molecular fingerprint patterns selected in ECFP is an advanced chemical fingerprinting technique that can rapidly characterize the structural features of molecules, regardless of their specific molecular family [35].This technique utilizes standardized resources and is capable of evaluating an individual's exposure level by detecting the metabolites of various pollutants in the body.It achieves this by transforming the structural features of molecules into a series of bit strings.Molecular fingerprints can provide a better representation of the functional groups, chemical bonds, and chemical structures of compounds, making it easier to explain the chemical properties of compounds.To improve the quality of the prediction model, this study further introduces molecular fingerprints based on the use of molecular descriptors.Figure S2 shows the five typical molecular fingerprint patterns selected in this study.

Machine Learning Methods
Due to the presence of multicollinearity between the sets of independent variables and dependent variables, PLS was used to establish a regression model.PLS is a modified version of PCA regression that can analyze data with missing values, strong collinearity, noise, and numerous predictor variables [36].PLS can identify the molecular descriptors that have a high contribution to the atmospheric half-lives of POPs and can also assist in establishing regression equations and models for these descriptors.
where X is a matrix of size n × p, n is the number of samples, and p is the number of independent variables.Y is a matrix of size n × q, and q is the number of dependent variables.PLS aims to find a new set of synthetic variables such that the covariance between them is maximized.T can explain the change in X and Y well.MLR is a widely used statistical method that applies to situations involving multiple input variables.Compared to simple linear regression, MLR has a more complex structure and provides a more direct interpretation [37,38].In MLR, we can incorporate multiple explanatory variables to measure the explanation of individuals in different aspects across multiple dimensions.The main purpose is to use the relevant model for interpretation and prediction to gain deeper insights and understanding.Thus, MLR is a powerful tool that provides us with a reliable means to solve practical problems and achieve research outcomes.
where Y is an n × 1 column vector with n being the number of samples, representing the observations of the dependent variable Y; X is an n × (p + 1) matrix with p being the number of independent variables; β is a (p + 1) × 1 column vector representing the regression coefficients; and ε is an n × 1 column vector representing the error terms.GFA is a combination algorithm that combines a genetic algorithm (GA) with multiple adaptive regression (MAR), and it can optimize parameters.GFA identifies several independent variables that have a strong correlation with the dependent variable from a large number of independent variables.The quality of the model is evaluated using evaluation methods such as LOF.The higher the numerical value, the higher the model score [39].
where SSE is the standard error; c is the number of GFA models; d is the smoothing parameter of the model; p is the number of eigenvalues in the model; n is the number of compounds in the training set.

Model Evaluation and Validation
In this study, we evaluate the goodness of fit, stability, and predictive ability of the QSAR model using modeling statistical parameters and internal and external validation metrics [40][41][42].The stability of the model under interference was evaluated using leaveone-out cross-validation (LOO CV ).Leave-one-out cross-validation involves repeatedly training the model on the remaining samples after leaving out one sample from the training set and then uses the trained model to predict the properties of the left-out sample.The internal validation metrics for the regression model are as follows: R 2 ranges from 0 to 1.The closer R 2 is to 1, the more information about the dependent variable the model can explain.
RMSE indicates the dispersion of the random error.

RMSE =
MAE is used to measure the error between the predicted and experimental values.
Q 2 CV is the value of the cross-validation of the extraction method, which reflects the predictive power of the QSAR model.
RE is the relative error.
where Y exp and Y pred denote the experimental values and the predicted values of the models.Y mean denotes the mean of the experimental values of the compounds, and N denotes the number of compounds in the model External validation is a crucial and effective step in evaluating the predictive capability of the model.It involves checking the prediction results and the actual results of molecules in the test set using the QSAR model that was built using the training set.This helps further validate the true predictive ability of the model.Common validation parameters include where the X exp , X pred , and X mean denote the experimental, predicted, and mean values of the prediction set, respectively.According to the OECD guidelines, application domain characterization is an important part of assessing the reliability and accuracy of the model to determine the range of applicability of the model.Standard residuals (δ), leverage values (h), and Williams plots were used for model application domain characterization.h and δ were calculated as follows:

Figure 1 .
Figure 1.The relationship between the model's coefficient of determination (R 2 ) and different threshold values (F) by PLS, MLR, and GFA.

Figure 1 .
Figure 1.The relationship between the model's coefficient of determination (R 2 ) and different threshold values (F) by PLS, MLR, and GFA.

Figure 2 .
Figure 2. A two-dimensional plot of principal component analysis.Figure 2. A two-dimensional plot of principal component analysis.

Figure 2 .
Figure 2. A two-dimensional plot of principal component analysis.Figure 2. A two-dimensional plot of principal component analysis.

Figure 3 .
Figure 3. Plot of experimental and predicted values for mean air half-life modeled by PLS (a), MLR (b), and GFA (c).Figure 3. Plot of experimental and predicted values for mean air half-life modeled by PLS (a), MLR (b), and GFA (c).

Figure 3 .
Figure 3. Plot of experimental and predicted values for mean air half-life modeled by PLS (a), MLR (b), and GFA (c).Figure 3. Plot of experimental and predicted values for mean air half-life modeled by PLS (a), MLR (b), and GFA (c).

Figure 4 .
Figure 4. Plot of experimental and predicted values for maximum air half-life modeled by PLS (a), MLR (b), and GFA (c).Figure 4. Plot of experimental and predicted values for maximum air half-life modeled by PLS (a), MLR (b), and GFA (c).

Figure 4 . 18 Figure 4 .
Figure 4. Plot of experimental and predicted values for maximum air half-life modeled by PLS (a), MLR (b), and GFA (c).Figure 4. Plot of experimental and predicted values for maximum air half-life modeled by PLS (a), MLR (b), and GFA (c).

Figure 5 .
Figure 5. Radar model diagram of three models and five indicators (PLS, MLR, GFA) for air half-life.Figure 5. Radar model diagram of three models and five indicators (PLS, MLR, GFA) for air half-life.

Figure 5 .Table 2 .
Figure 5. Radar model diagram of three models and five indicators (PLS, MLR, GFA) for air half-life.Figure 5. Radar model diagram of three models and five indicators (PLS, MLR, GFA) for air half-life.

Molecules 2023 , 18 Figure 6 .
Figure 6.The Williams plot of the mean and maximum half-lives in air by MLR.

Figure 6 .
Figure 6.The Williams plot of the mean and maximum half-lives in air by MLR.

18 Figure 7 .
Figure 7. Pearson correlation of the five descriptors in this study.

Figure 7 .
Figure 7. Pearson correlation of the five descriptors in this study.

h * = 3 (m + 1 )
x i denotes the number of the ith compound descriptor, X denotes the matrix composed of compound descriptors.m denotes the number of molecular descriptors in the model; n tr denotes the number of training sets; y expi denotes the experimental values of compounds; and y predi denotes the predicted values of compounds.

Table 3 .
Comparison of parameters between the training set and test set in three regression models.

Table 4 .
Comparison of three regression methods for air half-life data.

Table 5 .
The names and meanings of the five descriptors used in this study.