Abstract
The scarcity of reliably determined short-term exposure limits (STELs) for many chemicals impedes occupational health risk assessment. To address this gap, this study develops and validates a set of quantitative structure–activity relationship (QSAR) models that predict STELs for hydrocarbons and their derivatives. A dataset of 60 compounds was partitioned into training and test sets using Affinity Propagation clustering, and the partition was checked with Tanimoto similarity analysis and Uniform Manifold Approximation and Projection (UMAP). A genetic algorithm selected four molecular descriptors that reflect molecular size and spatial configuration. These descriptors served as inputs to a linear model (multiple linear regression, MLR) and three nonlinear models: support vector machine (SVM), back-propagation artificial neural network (BP-ANN), and extreme gradient boosting (XGBoost). All models were validated according to OECD principles. XGBoost performed best, with \(R^{2}\), \(Q_{\text{loo}}^{2}\), and \(Q_{\text{ext}}^{2}\) all exceeding 0.9. Interpretability analysis using SHAP (SHapley Additive exPlanations) showed that the molecular size and symmetry descriptors (E3u, G2m) correlate positively with STEL, whereas the degree of unsaturation (n = CHR) has a significant negative influence, which offers mechanistic insight into the structure–toxicity relationship. Notably, 96% of the predictions fell within the defined applicability domain, supporting the model’s reliability. The resulting model therefore offers a rapid, accurate, and interpretable computational tool that could inform occupational health and safety decision-making, especially for novel or data-poor chemicals.
Keywords:
quantitative structure–activity relationship (QSAR); short-term exposure limit (STEL); hydrocarbons and their derivatives; molecular descriptors; multiple linear regression (MLR); support vector machine (SVM); back-propagation artificial neural network (BP-ANN); extreme gradient boosting (XGBoost)
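
To illustrate the kind of Tanimoto similarity check mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes each compound is represented by a set of "on" fingerprint bit indices. The compound names, the fingerprints, and the helper `nearest_training_similarity` are hypothetical and chosen only for illustration. The idea is that a test compound whose nearest training neighbour has a reasonably high Tanimoto similarity is structurally represented in the training set.

```python
from typing import Dict, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient |A ∩ B| / |A ∪ B| for two sets of on-bits."""
    union = a | b
    if not union:
        # Assumption: two empty fingerprints are treated as dissimilar (0.0).
        return 0.0
    return len(a & b) / len(union)


def nearest_training_similarity(
    test_fps: Dict[str, Set[int]], train_fps: Dict[str, Set[int]]
) -> Dict[str, float]:
    """For each test compound, return its highest similarity to any training compound."""
    return {
        name: max(tanimoto(fp, train_fp) for train_fp in train_fps.values())
        for name, fp in test_fps.items()
    }


# Hypothetical fingerprints (sets of on-bit indices), not data from the study.
train = {"compound_A": {1, 2, 3, 4}, "compound_B": {2, 3, 5}}
test = {"compound_C": {1, 2, 3}}

coverage = nearest_training_similarity(test, train)
print(coverage)
```

In practice the fingerprints would come from a cheminformatics toolkit, and the distribution of these nearest-neighbour similarities could be inspected alongside a low-dimensional projection such as UMAP.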