3.2. Model Development and Validation
- (i)
Best model results comprise the basic information for the selected model (GA-LDA): the fitness score value, the qualitative validation metrics (for both the training and test sets), and tables (Tables S2 and S3, found in Supplementary Material File S4) containing information about the training and test sets.
The final GA-LDA model resulted in a fitness score (using the training set only) of −0.6946, with a Wilks’ Lambda (train) of 0.8473 and a Wilks’ Lambda (test) of 0.8239. The validation metrics for the training and test sets are presented in Table 1, while the training and test set results are shown in Figure 2A,B and Figure 3A,B, respectively. The data from Figure 2 are presented in the Supplementary Material File S4 (Tables S2–S4).
In the training set, both the Positive and Negative polymer classes were correctly classified in 60% of cases, indicating balanced performance.
For both series (training and test), performance is balanced across the polymers tested, and similar results were obtained. The confusion matrix gives values of 0.60 and 0.50 for Positive/Positive (True Positive) in the training and test series, respectively. The other Real Positive condition (False Negative) was 0.40 and 0.50, which is the fraction of misses/underestimations. For the Real Negative conditions (False Positive and True Negative), the values obtained were 0.40 and 0.60 for the training series and 0.38 and 0.62 for the test series, respectively. The former condition corresponds to a false alarm/overestimation, and the latter to a correct rejection.
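The rates quoted above are row-normalized entries of the confusion matrix, so the two values for each real class sum to 1. A minimal sketch of that calculation follows; the function name `normalized_confusion` and the ten example labels are hypothetical and were chosen only so that the output reproduces the training-set rates.

```python
from collections import Counter


def normalized_confusion(y_true, y_pred, positive="P", negative="N"):
    """Return confusion-matrix rates normalized within each real class."""
    counts = Counter(zip(y_true, y_pred))
    n_pos = sum(1 for y in y_true if y == positive)
    n_neg = sum(1 for y in y_true if y == negative)
    return {
        "TP": counts[(positive, positive)] / n_pos,  # hit
        "FN": counts[(positive, negative)] / n_pos,  # miss / underestimation
        "FP": counts[(negative, positive)] / n_neg,  # false alarm / overestimation
        "TN": counts[(negative, negative)] / n_neg,  # correct rejection
    }


# Hypothetical labels (not the study data): 5 real P and 5 real N polymers.
y_true = ["P"] * 5 + ["N"] * 5
y_pred = ["P", "P", "P", "N", "N", "P", "P", "N", "N", "N"]
rates = normalized_confusion(y_true, y_pred)
```

With these illustrative labels, `rates` holds TP = 0.60, FN = 0.40, FP = 0.40, and TN = 0.60.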
In Figure 2 and Figure 3B, each point represents a polymer, ordered according to the predicted probability on the x-axis, while the y-axis indicates the posterior probability P(P|descriptors). The green points correspond to correctly classified samples, and the red points indicate incorrect classifications. The dashed line at 0.5 represents the decision threshold that separates classes P and N. It can be observed that classification errors are concentrated near this threshold, while samples with more extreme probabilities exhibit greater confidence in the prediction, highlighting the uncertainty behavior of the model in boundary regions between classes [22].
- (ii)
Compare top models presents all the top GA-LDA models selected in the final generation/iteration. The redundant models are removed so the final number of top models may vary with each run.
Table 2 presents only the top five models. All models are presented in the
Supplementary Material File S3.
The descriptors’ category significances are described below [
21,
23]:
E2 (E2u and E2s) refers to a specific category of descriptors which encode angular and radial information about atomic configuration, usually in relation to two atoms.
TDB (TDB10u, TDB10e, TDB10s) describes the relationship between the average three-dimensional (Euclidean) distance and the topological distance (path length, or number of bonds) between possible atom pairs in a molecule.
RDF (RDF25i, RDF65u, RDF25u) describes the density of atoms at different distances from a reference atom, capturing information about the local structure of the molecule.
For further development, only Model Number 1 (the best model obtained) was considered. For Model Number 1, the training and test sets are presented in Tables S3 and S4. Both tables list the response class (P or N) and the numerical values of the descriptors from Model Number 1 (Table 2). Tables S3 and S4 can be found in the Supplementary Material File S4.
The descriptors’ significances are described below [
21,
23]:
E2u: second component of accessibility directional WHIM index/unweighted.
TDB10u: 3D topological distance-based autocorrelation descriptor at lag 10, i.e., computed over atom pairs separated by a topological distance of 10 bonds. The suffix “u” indicates the unweighted version of the descriptor.
RDF25i: three-dimensional (3D) radial distribution function (RDF) descriptor that encodes information about the spatial distribution of atoms in a molecule, weighted by a specific type of atomic property. The suffix “i” indicates that the descriptor calculation is weighted by the first ionization potential of the atoms involved. The number “25” refers to a specific data point or “lag” (distance interval) within the series of calculated RDF descriptors (typically in increments of 0.5 Å, but the exact value refers to a position in the series, not necessarily 25 Å). Radial distribution function 025, weighted by relative first ionization potential.
Figure 4 shows the LDA score plot based on DF1 (discriminant function 1) and an auxiliary DF2, which was derived from PCA (Principal Component Analysis) and is used for visualization purposes only, since binary LDA by nature generates only one true discriminant function. There is a clear trend of separation between the N and P classes along the DF1 axis. The class P polymers (blue) cluster predominantly in the positive DF1 region, whereas class N polymers (orange) concentrate mostly at negative DF1 values. In the center, the two classes overlap around the decision boundary.
The auxiliary DF2 axis reflects internal structural variability not directly related to class differentiation. A small number of samples appear as vertical outliers, indicating polymers with distinct 3D descriptor profiles. Overall, the plot confirms that DF1 captures the essential structural information responsible for class discrimination in the GA-LDA model.
To understand which structural features drive class separation in the GA–LDA model, the coefficients (loadings) of the discriminant function DF1 were analyzed. Since the system contains only two response classes (P and N), the LDA generates a single discriminant function, meaning that DF1 concentrates all discriminative information.
The loadings reveal a strong dominance of the descriptor E2u, which has a coefficient of −7.10. This indicates that the discrimination between the two classes is governed primarily by the 3D size, shape, and atomic-distribution characteristics captured by this WHIM descriptor. In contrast, the remaining descriptors selected by the genetic algorithm (TDB10u and RDF25i) show negligible coefficients (−3.0 × 10⁻⁵ and 5.0 × 10⁻⁶, respectively), suggesting that these variables provide little incremental structural information to the classification.
Such behavior is common in binary LDA systems, where only one discriminant direction exists, and a single descriptor often dominates the model [
24]. The results indicate that DF1 is essentially a function of E2u, which aligns with the trends observed in the score plot and with the structural patterns present in
Tables S5 and S6.
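Written out with the loadings reported above, the discriminant function would take the following form. This is only a sketch: the intercept is not given in the text and is left as a symbolic constant c (an assumption), and the text does not state whether the coefficients refer to raw or standardized descriptors.

```latex
\mathrm{DF1} = -7.10\,\mathrm{E2u} \;-\; 3.0\times10^{-5}\,\mathrm{TDB10u} \;+\; 5.0\times10^{-6}\,\mathrm{RDF25i} \;+\; c
```

If the coefficients apply to unscaled descriptors, their relative magnitudes also reflect the numerical range of each descriptor, so comparing them is most informative when the descriptors are on comparable scales.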
To further characterize the internal behavior of the GA-LDA model, the distribution of DF1 scores for classes P and N was analyzed (Figure 5). As already seen in Figure 4, class P polymers tend to appear at positive DF1 scores and class N polymers at negative ones. Class P exhibited a compact distribution, which indicates structural homogeneity among the polymers associated with this class. Class N presented a wider dispersion and a lower DF1 median, possibly reflecting greater structural variability.
Although the GA-LDA revealed partial structural separation between low- and high-Tg polymers, the moderate discrimination performance suggested that simplified binary classification based exclusively on 3D descriptors was insufficient to fully represent the complexity of Tg behavior across chemically heterogeneous polymer systems. Consequently, continuous regression-based QSPR models using graph-based 2D descriptors were subsequently developed to preserve the full quantitative information associated with Tg.
3.3. Robustness Testing and Applicability Domain
The training and test results of the standard approach, as well as the Y-randomization results, are presented in Tables S7 and S8 and Table 3, respectively.
Figure 6 presents the Applicability Domain (AD) of the GA–LDA model, evaluated using Hotelling’s T² method based on the first two principal components (PC1 and PC2) calculated from the training set. The green dashed ellipse represents the 95% confidence region that defines the chemical space learned by the model during training. Training polymers are shown as blue points, whereas test polymers are represented by red hollow markers.
Most of the test samples fall inside the confidence ellipse, indicating that they lie within the chemical space defined by the training descriptors and therefore belong to the valid AD of the model. A few samples appear close to the boundary or slightly outside the ellipse, suggesting limited structural extrapolation. These cases warrant more cautious interpretation but still provide valuable insight into the robustness of the GA–LDA model.
Tables S7 and S8 present the three descriptors (Model 1 from
Table 2) with the respective values as well as the response and predicted class (P or N) and the status (Applicability Domain) (inside or outlier). A sample is considered inside the Applicability Domain if its characteristics (features or descriptors) are similar to the data used to train the model, ensuring the prediction is a form of interpolation. Conversely, a sample is an outlier (or out-of-domain) if it falls outside these boundaries, meaning its prediction would be an extrapolation and thus less reliable.
Inside the AD (Inlier): Data points within the AD are similar to the training data in the feature space. The model is expected to provide accurate and reliable predictions for these points because it has “seen” similar examples during training.
Outside the AD (Outlier): Data points outside the AD are considered outliers in the prediction context. Predictions for these samples are generally less reliable and should be treated with caution, as the model may not have learned the underlying relationships for these novel conditions.
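The inside/outlier decision can be sketched as a Hotelling's T² check on the two principal-component scores. The sketch below assumes the PC scores are already computed and mean-centered on the training set, and it uses one common F-distribution control limit for new observations; the text does not say which limit was actually applied. The function names and score values are hypothetical.

```python
def f_quantile_2(q, d2):
    """Quantile of the F(2, d2) distribution (closed form when the first dof is 2)."""
    return (d2 / 2.0) * ((1.0 - q) ** (-2.0 / d2) - 1.0)


def t2_limit(n_train, q=0.95):
    """T^2 control limit for a new sample, p = 2 components, n_train training samples."""
    p = 2
    f_crit = f_quantile_2(q, n_train - p)
    return p * (n_train - 1) * (n_train + 1) / (n_train * (n_train - p)) * f_crit


def hotelling_t2(score, variances):
    """T^2 of one sample: squared PC scores scaled by the training variance of each PC."""
    return sum(t * t / v for t, v in zip(score, variances))


def ad_status(train_scores, query_scores, q=0.95):
    """Label each query sample as 'inside' or 'outlier' relative to the T^2 ellipse."""
    n = len(train_scores)
    # Sample variance of each PC over the training set (scores assumed mean-centered).
    variances = [sum(s[j] ** 2 for s in train_scores) / (n - 1) for j in range(2)]
    limit = t2_limit(n, q)
    return ["inside" if hotelling_t2(s, variances) <= limit else "outlier"
            for s in query_scores]


# Hypothetical, mean-centered (PC1, PC2) training scores; not the study data.
train = [(-2.0, -1.0), (-1.0, 1.0), (0.0, 0.0), (1.0, -1.0),
         (2.0, 1.0), (-1.5, 0.5), (1.5, -0.5)]
status = ad_status(train, [(0.0, 0.0), (10.0, 10.0)])
```

In this toy case the first query sits at the center of the training cloud and is labeled inside, while the second lies far from it and is labeled an outlier.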
The Y-randomization test is presented in Figure 7. The Wilks’ Lambda of the original model is 0.8473, the average Wilks’ Lambda of the 50 randomized models is 0.9628, and the difference (original minus randomized average) is −0.1155. This test is a validation technique for assessing whether a model is statistically significant or merely a result of chance. It compares the performance of a model built on the original response (y) with models built on many shuffled versions of y, while keeping the descriptor data (x) intact. If the original model performs significantly better than the randomized models, the model is unlikely to be based on chance correlation.
The Wilks’ Lambda of the original model is lower than those of the 50 randomized models (a lower value indicates better class separation), indicating that the present model was not obtained by chance, as demonstrated by Ambure et al. [
25]. All data can be found in
Supplementary Material File S4 (Tables S3–S5).
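The Y-randomization logic can be sketched as follows, assuming Wilks' Lambda is computed as det(W)/det(T) from the within-class and total scatter matrices of a fixed descriptor set. This is a simplified illustration: it does not repeat the genetic-algorithm descriptor selection for each shuffle, and the data and function names are hypothetical.

```python
import random


def _det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n, det = len(a), 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[pivot][c]) < 1e-12:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            det = -det
        det *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return det


def _scatter(rows):
    """Sum-of-squares-and-cross-products matrix around the mean of the rows."""
    p = len(rows[0])
    mean = [sum(r[j] for r in rows) / len(rows) for j in range(p)]
    return [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows)
             for j in range(p)] for i in range(p)]


def wilks_lambda(X, y):
    """Wilks' Lambda = det(W) / det(T); lower values mean stronger class separation."""
    p = len(X[0])
    within = [[0.0] * p for _ in range(p)]
    for label in set(y):
        s = _scatter([x for x, yi in zip(X, y) if yi == label])
        for i in range(p):
            for j in range(p):
                within[i][j] += s[i][j]
    return _det(within) / _det(_scatter(X))


def y_randomization(X, y, n_models=50, seed=0):
    """Compare the original Lambda with the mean Lambda over shuffled responses."""
    rng = random.Random(seed)
    original = wilks_lambda(X, y)
    shuffled = []
    for _ in range(n_models):
        y_perm = y[:]
        rng.shuffle(y_perm)  # descriptors X stay intact, only y is permuted
        shuffled.append(wilks_lambda(X, y_perm))
    mean_random = sum(shuffled) / n_models
    return original, mean_random, original - mean_random


# Hypothetical two-descriptor data with two well-separated classes.
X = [[0.1, 1.0], [0.2, 0.8], [0.0, 1.2], [0.3, 0.9],
     [2.0, 3.1], [2.2, 2.9], [1.9, 3.0], [2.1, 3.3]]
y = ["N"] * 4 + ["P"] * 4
original, mean_random, delta = y_randomization(X, y)
```

A negative `delta` (original minus randomized average) has the same sign convention as the −0.1155 reported above.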
In this part, the predicted class for the query compounds was obtained using the input training model, together with the Applicability Domain status of every query compound (using the standardization approach) and the posterior probabilities, which help to judge the reliability of each prediction through the Concept of Confidence Estimation approach.
Tables S7 and S8 (found in Supplementary Material File S5) show the query results for the LDA model.
Figure 8 shows the relationship between DF1 and the posterior probability for class P for the polymers without the experimental response class (
Table S7). As expected for a linear discriminant model, samples with positive DF1 values exhibit higher posterior probabilities and are assigned to class P, while samples with negative DF1 values show probabilities below 0.5 and are classified as N.
The clear monotonic trend between DF1 and posterior probability confirms the internal coherence of the GA–LDA model. Samples located outside the Applicability Domain (red-edged markers) appear closer to the decision boundary and display lower confidence, indicating that their predictions should be interpreted cautiously.
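The monotonic relationship follows from the standard form of the two-class LDA posterior when both classes share a covariance matrix. In the sketch below, the scale a > 0 and offset b are assumptions that depend on how DF1 is scaled and on the class priors; neither is reported in the text.

```latex
P(\mathrm{P} \mid \mathbf{x}) = \frac{1}{1 + \exp\!\bigl(-\,(a\,\mathrm{DF1}(\mathbf{x}) + b)\bigr)},
\qquad
P(\mathrm{P} \mid \mathbf{x}) \ge 0.5 \iff a\,\mathrm{DF1}(\mathbf{x}) + b \ge 0
```

A decision boundary at DF1 = 0, as described above, corresponds to b = 0 in this notation.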
E2u: Second component accessibility directional WHIM (Weighted Holistic Invariant Molecular) index/unweighted. WHIM descriptors are molecular descriptors based on statistical indices calculated on the projections of the atoms along principal axes. These descriptors are built in such a way as to capture relevant molecular 3D information regarding molecular size, shape, symmetry and atomic distribution with respect to invariant reference frames.
TDB10u: 3D topological distance-based autocorrelation—lag 10/unweighted.
RDF25i: Radial distribution function/weighted by relative first ionization potential.
A model restricted to structurally similar polymers would be expected to give better predictive power, as reported in refs. [11,12,13,14] (presented in the Introduction section). However, our objective was to use structures as diverse as possible in order to evaluate this new possibility. The results were not favorable with 3D descriptors, but a good fit was obtained with 2D descriptors despite this structural variability. A future study is intended to assess the predictive power for polymers with similar chemical structures.
3.4. Regression Models Training and Evaluation
Four regression algorithms (PLS, SVR, Random Forest, and XGBoost) were evaluated using the same descriptor set and validation strategy.
Table 3 summarizes the predictive performance obtained from five-fold cross-validation on the training set (Q² and RMSECV) and from the external 20% test set (R², RMSE, MAE, and percentage MAE). Importantly, it should be emphasized that this study follows a general, polymer class-agnostic strategy. In other words, the proposed model was developed to predict Tg across a chemically diverse set of polymers rather than be tailored to a single family (e.g., polyolefins, polyesters, or polyamides). While restricting the scope to a specific polymer class would typically yield higher apparent accuracy due to reduced structural variability, the broader formulation adopted here is more consistent with engineering screening scenarios in which the candidate materials span multiple chemistries.
SVR exhibited poor predictive ability (R2_test = 0.101), indicating that the selected descriptor space and/or kernel configuration did not provide a stable mapping for Tg in this dataset. Random Forest achieved intermediate performance (R2_test = 0.693), while PLS provided competitive cross-validated predictivity (Q2_cv = 0.621) but a weaker external test performance compared to XGBoost. The best overall results were obtained with XGBoost, which achieved the highest test-set accuracy (R2_test = 0.825) and the lowest prediction errors (RMSE_test = 31.515 K; MAE_test = 23.400 K; MAE% = 7.141%), while maintaining a strong cross-validated predictivity (Q2_cv = 0.612). Therefore, XGBoost was selected as the final regression model for subsequent diagnostic evaluation and interpretability analysis.
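The test-set metrics quoted above can be computed as in the following sketch. The definition of the percentage MAE is an assumption here (MAE divided by the mean observed value), since the text does not specify it; the function name and Tg values are hypothetical.

```python
import math


def regression_metrics(y_obs, y_pred):
    """R2, RMSE, MAE and percentage MAE for observed vs predicted values."""
    n = len(y_obs)
    mean_obs = sum(y_obs) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in y_obs)
    mae = sum(abs(o - p) for o, p in zip(y_obs, y_pred)) / n
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / n),
        "MAE": mae,
        # Assumption: percentage MAE taken relative to the mean observed value.
        "MAE%": 100.0 * mae / mean_obs,
    }


# Hypothetical Tg values in K (not the study data).
metrics = regression_metrics([300.0, 350.0, 400.0, 450.0],
                             [310.0, 340.0, 410.0, 440.0])
```

For these illustrative values every prediction is off by 10 K, so RMSE and MAE are both 10 K and R2 is 0.968.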
In addition to conventional predictive metrics, the external reliability of the selected regression model (XGBoost) was examined using the Golbraikh–Tropsha criteria, which are widely adopted in QSPR/QSAR studies to verify that the observed–predicted relationship is close to the ideal identity line and that systematic bias is limited. For the external test set, the slopes of the regressions through the origin were close to unity (k = 1.054 and k′ = 0.942), and the deviations associated with origin-constrained fits were small (r₀² = 0.869; r′₀² = 0.846; Δr² = 0.024). These results support that the model predictions are not dominated by scaling artifacts and that the predictive relationship remains stable under stringent external-validation checks, reinforcing the suitability of the proposed approach for engineering-oriented polymer screening across chemically diverse structures.
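The quantities involved in these checks can be sketched as below. Published formulations of the origin-constrained r² differ in detail, so this block uses one self-consistent convention (each variable regressed on the other through the origin) and should not be read as the exact procedure of the study. The 0.85 to 1.15 slope window is the commonly quoted acceptance range; the function name and data are hypothetical.

```python
def golbraikh_tropsha(y_obs, y_pred):
    """Slopes and r^2 values of regressions through the origin (one convention)."""
    sum_xy = sum(o * p for o, p in zip(y_obs, y_pred))
    k = sum_xy / sum(p * p for p in y_pred)       # observed ~ k * predicted
    k_prime = sum_xy / sum(o * o for o in y_obs)  # predicted ~ k' * observed
    mean_obs = sum(y_obs) / len(y_obs)
    mean_pred = sum(y_pred) / len(y_pred)
    r0_sq = 1.0 - (sum((o - k * p) ** 2 for o, p in zip(y_obs, y_pred))
                   / sum((o - mean_obs) ** 2 for o in y_obs))
    r0_prime_sq = 1.0 - (sum((p - k_prime * o) ** 2 for o, p in zip(y_obs, y_pred))
                         / sum((p - mean_pred) ** 2 for p in y_pred))
    return {
        "k": k,
        "k_prime": k_prime,
        "r0_sq": r0_sq,
        "r0_prime_sq": r0_prime_sq,
        "delta_r0_sq": abs(r0_sq - r0_prime_sq),
        # Commonly quoted acceptance window for the slopes.
        "slopes_ok": 0.85 <= k <= 1.15 and 0.85 <= k_prime <= 1.15,
    }


# Hypothetical Tg values in K (not the study data).
gt = golbraikh_tropsha([300.0, 350.0, 400.0, 450.0],
                       [310.0, 340.0, 410.0, 440.0])
perfect = golbraikh_tropsha([300.0, 350.0, 400.0], [300.0, 350.0, 400.0])
```

For perfect predictions both slopes and both r² values equal 1, which is the identity-line limit that the criteria test against.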
3.5. Predictive Performance and Model Interpretability
Based on the screening results, the XGBoost model was selected as the final regression model because it provided the best external test performance together with competitive cross-validated predictivity. The observed versus predicted plot (Figure 9) indicates that the model captures the global Tg trend, while the calibration curve (Figure 10) is used to verify the absence of strong systematic bias across the Tg range.
Figure 9 shows the observed versus predicted glass transition temperatures (Tg) obtained with the final model selected. Overall, the predictions follow the expected trend and cluster around the 1:1 reference line, indicating that the model captures the dominant structure–property signal across the investigated Tg range. In the low-to-intermediate Tg region, the agreement is generally close to the identity line, supporting the suitability of the selected descriptor set and learning strategy for practical screening purposes. A systematic pattern is also observed at the upper end of the Tg range, where the model tends to underpredict very high Tg values, a behavior consistent with a mild “regression-to-the-mean” effect commonly found in data-driven polymer property models. This tendency is typically associated with (i) reduced sample density at extreme Tg values, (ii) higher structural heterogeneity among high-Tg polymers, and (iii) limited representation of the corresponding descriptor space.
Figure 10 shows the calibration curve for the model, which provides an additional assessment of prediction reliability across the Tg range by comparing the mean predicted Tg to the mean observed Tg within binned intervals. Overall, the curve follows the perfect-calibration line with reasonable agreement, indicating that the model is not strongly biased over most of the investigated domain. Local deviations from the identity line are observed in some intermediate bins, which is expected when calibration is computed from a limited number of samples per bin and when the chemical space is heterogeneous. This behavior would be expected to improve with a broader database or a more homogeneous polymer dataset. In the upper Tg region, the curve suggests a tendency toward conservative predictions (i.e., reduced sensitivity at the extremes), consistent with the mild compression observed in the observed–predicted analysis. Despite these bin-level fluctuations, the calibration profile supports the use of the model for engineering screening, particularly within the Tg ranges that are more densely represented in the dataset.
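A binned calibration curve of this kind can be sketched as follows. The binning variable and bin layout are assumptions (equal-width bins over the predicted values), because the text does not state how the intervals were defined; the function name and data are hypothetical.

```python
def calibration_curve(y_obs, y_pred, n_bins=5):
    """Return (mean predicted, mean observed, count) for each non-empty bin."""
    lo, hi = min(y_pred), max(y_pred)
    width = (hi - lo) / n_bins or 1.0  # guard against identical predictions
    bins = [[] for _ in range(n_bins)]
    for o, p in zip(y_obs, y_pred):
        idx = min(int((p - lo) / width), n_bins - 1)  # top edge goes to the last bin
        bins[idx].append((o, p))
    curve = []
    for members in bins:
        if members:
            mean_obs = sum(o for o, _ in members) / len(members)
            mean_pred = sum(p for _, p in members) / len(members)
            curve.append((mean_pred, mean_obs, len(members)))
    return curve


# Hypothetical Tg values in K (not the study data).
y_pred = [200.0, 210.0, 300.0, 310.0, 400.0, 410.0]
y_obs = [190.0, 220.0, 305.0, 295.0, 430.0, 440.0]
curve = calibration_curve(y_obs, y_pred, n_bins=2)
```

Points where the mean observed value exceeds the mean predicted value, as in the upper bin of this toy example, correspond to the conservative (underpredicting) behavior described for the high-Tg region.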
The residual analysis (Figure 11) provides further insight into model adequacy. The residual distribution (observed − predicted) is centered close to zero, indicating no severe global bias; however, a mild positive skewness is observed, with a longer right tail. This behavior suggests that a subset of samples, typically those located at the upper Tg end or at sparsely represented regions of the descriptor space, tend to be underpredicted by the model. The Q–Q plot supports this interpretation: residuals follow the reference line reasonably well in the central quantiles, while noticeable departures occur at the distribution tails, particularly at the upper tail. Such tail deviations are common in polymer QSPR datasets due to chemical-space heterogeneity, limited sample size, and the presence of structurally distinct compounds. Importantly, these diagnostics indicate that prediction errors are largely controlled for the majority of samples, while larger deviations are concentrated in a limited number of extreme cases, which should be considered when interpreting model uncertainty in high-Tg regimes.
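The two residual diagnostics discussed here, the center of the distribution and its skewness, can be sketched as below. The moment-based skewness coefficient is one standard choice and is an assumption about how skewness would be quantified; the function name and data are hypothetical.

```python
def residual_summary(y_obs, y_pred):
    """Mean and moment-based skewness of the residuals (observed minus predicted)."""
    res = [o - p for o, p in zip(y_obs, y_pred)]
    n = len(res)
    mean = sum(res) / n
    m2 = sum((r - mean) ** 2 for r in res) / n
    m3 = sum((r - mean) ** 3 for r in res) / n
    return mean, m3 / m2 ** 1.5


# Hypothetical Tg values in K: the last, high-Tg sample is underpredicted.
res_mean, res_skew = residual_summary([300.0, 310.0, 320.0, 330.0, 420.0],
                                      [302.0, 309.0, 321.0, 329.0, 380.0])
```

A single strongly underpredicted high-Tg sample is enough to produce a positive skewness, that is, a longer right tail.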
To enhance interpretability and support an engineering-oriented structure–property discussion, SHAP (SHapley Additive exPlanations) values were computed for the final XGBoost model (Figure 12). The SHAP summary plot ranks descriptors according to their average absolute contribution to the predicted Tg and simultaneously indicates how low versus high descriptor values shift the model output.
Overall, the most influential variables are predominantly graph-based (2D) descriptors associated with molecular connectivity patterns (e.g., autocorrelation-type descriptors such as ATS2*), topological/shape-related indices (TSC* family), and ring-related terms. This attribution profile is consistent with the expected physicochemical determinants of the glass transition: polymer repeat units with higher structural rigidity, constrained connectivity, and increased ring/unsaturation content typically exhibit reduced segmental mobility and, therefore, higher Tg.
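The ranking by average absolute contribution can be illustrated with exact Shapley values on a toy model. This sketch enumerates feature subsets against a fixed baseline, which is only feasible for a handful of features; it is not the tree-based algorithm that SHAP implementations use for XGBoost models. The model, descriptor names, and values are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline)."""
    n = len(x)

    def value(subset):
        # Features in `subset` take their real value; the rest stay at the baseline.
        return model([x[i] if i in subset else baseline[i] for i in range(n)])

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for combo in combinations(others, size):
                s = set(combo)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def mean_abs_ranking(model, samples, baseline, names):
    """Rank features by mean absolute Shapley value, as in a summary plot."""
    sums = [0.0] * len(names)
    for x in samples:
        for i, v in enumerate(shapley_values(model, x, baseline)):
            sums[i] += abs(v)
    means = [s / len(samples) for s in sums]
    return sorted(zip(names, means), key=lambda t: -t[1])


def toy_model(v):
    """Hypothetical linear Tg model in three generic descriptors."""
    return 300.0 + 40.0 * v[0] + 5.0 * v[1] - 10.0 * v[2]


phi = shapley_values(toy_model, [1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
ranking = mean_abs_ranking(toy_model, [[1.0, 2.0, 3.0], [2.0, 0.0, 1.0]],
                           [0.0, 0.0, 0.0], ["d1", "d2", "d3"])
```

For a linear model each Shapley value reduces to the coefficient times the deviation from the baseline, and the values sum to the difference between the prediction and the baseline prediction. The sign of each value shows the direction in which the descriptor shifts the output.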
Representative polymer repeat units associated with the dominant descriptor trends are illustrated in Figure 13. In more detail, the influential graph-based (2D) descriptors are associated with molecular connectivity patterns, atomic-property autocorrelation, topological complexity, and cyclic/aromatic structural motifs. Among these, descriptors from the ATS2 family (e.g., AATS2v and ATS2*) encode autocorrelation relationships between atomic properties distributed across the molecular graph at short topological distances, reflecting how steric and electronic effects propagate through the repeat unit. Similarly, GATS2 descriptors are related to Geary autocorrelation functions and capture local topological heterogeneity and connectivity organization.
Polymers containing aromatic rings and rigid cyclic fragments, such as polystyrene-derived structures and phenylene-containing repeat units (Figure 13), generally exhibited higher contributions from ring-related and autocorrelation descriptors. These structural motifs restrict rotational freedom along the polymer backbone, increase local rigidity, and reduce segmental mobility, which is consistent with elevated Tg behavior. In contrast, polymers containing flexible ether linkages or long aliphatic side chains, such as poly(ethylene glycol)-type systems and long-chain alkyl acrylates (Figure 13), tended to present lower contributions from rigidity-associated descriptors and increased flexibility-related topological patterns. Such architectures favor conformational freedom and chain mobility, typically resulting in lower Tg values.
Descriptors associated with molecular topology and graph complexity, including ATSC*, BCUT, and ring-count-related terms (e.g., n6Ring), also contributed significantly to model predictions. Higher values for these descriptors were frequently associated with densely connected repeat units, aromaticity, and sterically constrained structures, all of which are classical physicochemical factors known to increase Tg. Methacrylate-based polymers (Figure 13) presented intermediate behavior, where bulky pendant groups increase local steric hindrance and partially restrict chain mobility without reaching the rigidity levels observed in highly aromatic systems. This intermediate structural organization is reflected by moderate SHAP contributions from autocorrelation and topological descriptors.
Importantly, SHAP does not imply direct physicochemical causality; however, it provides a transparent attribution framework that helps identify which molecular connectivity patterns and structural motifs most strongly influence Tg predictions within the investigated chemical space.