1. Introduction
Conventional plastics are widely used in industries such as packaging, healthcare, and manufacturing owing to their durability, versatility, and low production costs. However, they also contribute directly to environmental pollution, as plastic waste is discarded into oceans, rivers, and landfills [1]. Current methods of reducing pollution, such as recycling, remain ineffective because they eliminate only a small fraction of the waste [2]. As a result, a significant proportion of plastic waste continues to pollute the environment, taking decades to degrade into microplastics (MPs) and nanoplastics (NPs) that endanger ecosystems and human health [3]. Given these inadequate plastic waste management processes, there is an urgent need to transition to a circular economy (CE) that provides a sustainable strategic framework governing the use of plastics.
The redesign of the production, use, and end-of-life management of plastics is a promising pathway to eradicate plastic pollution and its negative environmental impacts [4]. Additionally, developing and adopting sustainable, bio-based, and biodegradable plastic products (BBpPs) is an effective alternative [5,6]. Polyhydroxyalkanoates (PHAs), a family of bio-based and bio-renewable aliphatic polyesters, are an example of BBpPs: they biodegrade in different natural environments and are biosynthesized by microorganisms through fermentation under nutrient-limiting conditions [7,8]. Bacteria synthesize PHAs intracellularly as carbon and energy storage polymers composed of hydroxyalkanoate (HA) units. However, a core challenge of PHAs is their high production cost [9,10,11], which limits their use in commercial settings.
PHAs are categorized into three groups based on the length of the alkyl side chain in their repeating units: short-chain-length PHAs (scl-PHAs) with three to five carbon atoms, medium-chain-length PHAs (mcl-PHAs) with six to 14 carbon atoms, and long-chain-length PHAs (lcl-PHAs) with 15 or more carbon atoms [12]. The physicochemical and thermal behavior of PHAs is governed by factors such as average molecular weight, monomer composition, side chain length, and crystallinity, which can be modified to tailor the performance of the material. This tunability, together with their biocompatibility and biodegradation in natural environments, positions PHAs as strong candidates to replace traditional petroleum-derived plastics [11]. In this context, PHAs can support the development of a circular bioeconomy.
Poly(3-hydroxybutyrate) (PHB) and its copolymer poly(3-hydroxybutyrate-co-3-hydroxyvalerate) (PHBV) are the most widely studied scl-PHAs. These polymers have side chains attached at the third carbon atom of the backbone and exhibit relatively high crystallinity and melting temperatures. The incorporation of comonomers such as 3-hydroxyvalerate into PHB disrupts crystal regularity, reducing the melting point and altering chain mobility [4]. As a result, PHBV shows improved thermal processability relative to PHB, with a lower melting temperature (Tm), crystallization temperature (Tc), and glass transition temperature (Tg), leading to a more elastic and flexible texture at ambient temperature and easier processing.
Blending scl-PHAs with mcl-PHAs or adding additives such as plasticizers also modifies intermolecular interactions and crystallinity, altering thermal behavior. While these approaches enhance performance and processability, the resulting structure–property relationships are often nonlinear and strongly coupled, which makes rational design by conventional trial-and-error methods difficult. In contrast to scl-PHAs, mcl-PHAs contain longer side chains that hinder chain packing, leading to lower crystallinity, lower melting points, and elastomeric behavior [13,14]. These structural differences give rise to distinct crystallization mechanisms and thermal transitions, marking scl-PHAs as a chemically and physically distinct polymer class. The thermal properties of scl-PHAs, especially Tg, Tm, and Tc, are highly sensitive to molecular structure and formulation. PHB, the most widely industrialized PHA, is brittle owing to its high crystallinity and has poor mechanical properties and low thermal stability; because its melting temperature (Tm) is close to its degradation temperature (Tdeg), thermal processing (e.g., extrusion, injection molding) is challenging [15,16].
Despite their advantageous properties (high crystallinity, high melting temperatures, and easier processing), PHAs are inherently limited by factors such as brittleness, narrow processing windows, sensitivity of thermal properties to small compositional changes, and high production costs [15]. Such challenges are particularly pronounced for scl-PHAs, where modest variations in comonomer content, blend composition, or additive concentration can lead to significant shifts in Tg, Tm, and Tc [14]. Consequently, predictive tools that efficiently link formulation parameters to thermal properties are essential to accelerate the development of application-ready scl-PHA- and PHB-based materials with targeted properties while reducing experimental cost and time. Machine learning (ML) methods and polymer informatics have emerged as a promising approach to predicting polymer properties and guiding materials design [17,18,19]. However, the success of ML models is constrained by the lack of high-quality and chemically consistent datasets. For PHAs, such datasets are not yet available: even in large polymer databases such as Polymer Genome and PoLyInfo, PHAs represent only a small fraction of entries, and thermal property data are often incomplete or inconsistently reported [20,21].
Moreover, these databases do not clearly distinguish between scl- and mcl-PHAs, and they fail to account for formulation effects such as additive incorporation and blending. Currently, only a limited number of ML studies have addressed thermal property prediction within the PHA family. Previous studies have demonstrated the feasibility of predicting Tg or Tm for PHA homopolymers and copolymers based on small, literature-curated datasets, while more recent multitask deep learning models have included PHAs as a minor subset within broad collections of polymers. Such approaches either rely on heterogeneous PHA datasets, which obscure scl-specific physicochemical trends, or prioritize general polymer screening rather than focused modeling of scl-PHA formulations [22,23,24,25,26]. A gap therefore remains: high-quality PHA datasets that capture formulation effects related to blending and additive incorporation are lacking.
The aim of the current study was to develop a curated and standardized dataset of PHB and PHBV materials compiled from the literature and in-house experiments and to apply ML models to predict their thermal features. The objectives of the study were:
To curate a structured data library of PHB/PHBV-based materials incorporating various additives and building blocks from the literature and in-house experiments.
To implement XGBoost and Random Forests to predict the thermal features of scl-PHAs from the curated dataset.
To critically investigate the restrictions of the ML models and the curated database, thereby outlining future research directions.
The novelty of this study lies in its presentation of a curated dataset dedicated exclusively to scl-PHAs with side chains at the 3-position of the polymer backbone, including PHB, PHBV formulations, additive-containing systems, and blends with mcl-PHAs. This contrasts with broad polymer databases such as Polymer Genome and PoLyInfo, in which PHAs represent only a small subset of entries, and with previous PHA datasets that primarily focus on homopolymers and copolymers. The dataset therefore allows the relationship between formulation composition and thermal behavior to be systematically explored. Experimental Tg, Tc, and Tm values, weight-average molecular weight (Mw), number-average molecular weight (Mn), polydispersity index (PDI), and compositional information were systematically obtained from the literature and critically curated to ensure data consistency and reliability. Narrowing the scope to these specific materials made it possible to construct a thoroughly curated dataset and aided the development of predictive models that capture structure–thermal property relationships and additive effects on Tg, Tc, and Tm, which are inaccessible in broader PHA datasets. Ultimately, the research aims to establish a foundational data infrastructure for scl-PHA informatics and to demonstrate the value of targeted curation for advancing predictive polymer design and the optimization of scl-PHA formulations with tailored thermal performance.
3. Results and Discussion
3.1. Data Curation
A total of 109 peer-reviewed studies were systematically mined and integrated with an in-house experimental dataset. In this manner, a structured data library of PHB/PHBV-based materials incorporating various additives and building blocks was manually curated. The curated raw dataset used in the data preprocessing stage is publicly available in the associated data repository (AUA Zenodo repository) [32]. An overview of the compositional, molecular, physicochemical, and thermal descriptors and features that were extracted from the referenced studies is provided in Table A2 in Appendix A.2.
3.2. QSPR Model Performance
3.2.1. Sensitivity Analysis and Optimal Model Selection
Before the selection of the final model, a systematic sensitivity analysis was conducted to identify the optimal modeling configuration for each thermal property. Two cases were investigated: the composition of the input feature space and the effect of outlier exclusion on model generalization. All configurations were evaluated using both RF and XGBoost regressors, tuned via cross-validation (five folds), and compared using cross-validated (CV) R2 mean and standard deviation (SD). Models exhibiting signs of overfitting (training R2 > 0.99) or poor generalization (CV R2 < 0.70) were excluded from further analysis. These thresholds were applied as dataset-specific quality control filters rather than universal statistical criteria, reflecting the heterogeneity and limited sample sizes of literature-derived polymer datasets. The cross-validated R2 scores for all evaluated configurations are summarized in Table 8, Table 9 and Table 10, with the selected best-performing configuration highlighted in bold. The optimal models are then presented in detail in Section 3.2.2, Section 3.2.3 and Section 3.2.4.
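The two exclusion rules above can be expressed as a short screening routine. The following is a minimal sketch, not the authors' code: the configuration names, training scores, and per-fold scores are hypothetical and are not taken from Tables 8–10, and ranking the eligible configurations by mean CV R2 is an assumption about how the best-performing one was chosen.

```python
from statistics import mean, stdev


def summarize_cv(fold_scores):
    """Return the mean and sample standard deviation of per-fold R2 scores."""
    return mean(fold_scores), stdev(fold_scores)


def passes_filter(train_r2, cv_r2_mean, overfit_limit=0.99, min_cv_r2=0.70):
    """Apply the two thresholds from the text: training R2 > 0.99 flags
    overfitting, and mean CV R2 < 0.70 flags poor generalization."""
    overfit = train_r2 > overfit_limit
    poor_generalization = cv_r2_mean < min_cv_r2
    return not (overfit or poor_generalization)


# Hypothetical candidates (five folds each); values are illustrative only.
candidates = {
    "RF / basic": {"train_r2": 0.995, "folds": [0.80, 0.78, 0.82, 0.79, 0.81]},
    "RF / basic + thermal": {"train_r2": 0.94, "folds": [0.84, 0.79, 0.80, 0.83, 0.82]},
    "XGBoost / basic": {"train_r2": 0.97, "folds": [0.70, 0.62, 0.66, 0.71, 0.60]},
}

eligible = {}
for name, cfg in candidates.items():
    cv_mean, cv_sd = summarize_cv(cfg["folds"])
    if passes_filter(cfg["train_r2"], cv_mean):
        eligible[name] = (cv_mean, cv_sd)

# Assumed ranking rule: highest mean CV R2 among the eligible configurations.
best = max(eligible, key=lambda name: eligible[name][0])
print(best, eligible[best])
```

In this toy example the first candidate is rejected for overfitting and the third for poor generalization, so only the second remains eligible.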
For Tm (Table 8), multiple configurations across both algorithms and feature spaces satisfied the selection criteria, indicating that Tm prediction is robust to different modeling choices and that outlier exclusion consistently improved cross-validation stability. In contrast, Tc and Tg showed considerably fewer eligible configurations, reflecting greater sensitivity to feature space composition and preprocessing strategy; see Table 9 and Table 10, respectively. For Tc, only two configurations met the criteria, with the best performance achieved by XGBoost using the basic feature set alone, suggesting that Tc is best predicted from molecular descriptors without additional thermal inputs. For Tg, RF with all features and outlier exclusion yielded the best result. The diversity of the selected models (two RF and one XGBoost; two configurations with two additional thermal descriptors and one with basic features alone) reflects the target-dependent nature of thermal property prediction in PHB/PHBV systems and underlines the importance of systematic sensitivity analysis before final model selection. For Tg and Tm, models using only compositional features (basic set) represent a more realistic scenario for new or unseen samples, but performance improved when other thermal properties were included as inputs. This does not indicate classical data leakage but rather learning driven by the strong relationships between the thermal properties of polymers. A limitation is that these additional properties are often unknown in real-world prediction settings. Nevertheless, expanding and improving the dataset is expected to further enhance model performance and robustness, even without relying on additional thermal input features.
3.2.2. Tm Prediction Model Performance
The optimal Tm prediction model was identified as the tuned RF regressor trained on the basic feature set augmented with Tc and Tg as additional thermal descriptors, with outlier exclusion applied to the training set. The nine features used for model development are summarized in the supplementary data in Table 11. The dataset comprised 118 samples and was split into training and test sets using an 80/20 ratio, resulting in 94 samples for training and 24 samples for testing.
The optimal hyperparameter values obtained via cross-validated (CV) grid search are reported in Table 12. The performance of the tuned model is reported in Table 13.
This configuration achieved a cross-validated R2 of 0.817 and a training R2 of 0.935, representing the most favorable balance between predictive performance and generalization stability among all evaluated Tm models. The relatively modest train–CV gap, combined with the low CV standard deviation, confirmed that this configuration was the most robust and reliable for Tm prediction within the constraints of the available dataset.
A comparison of the predicted and the true Tm for the tuned RF model is presented in Figure 6.
SHAP analysis was further undertaken to assess feature importance by examining how much each input variable contributed to the model’s predictions. The resulting SHAP summary plots (Figure 7) provided a global view of feature effects, revealing both the magnitude and direction of each feature’s influence on the model output. The most influential features identified by SHAP analysis were associated with crystal formation and the stability of semicrystalline polymers. In particular, higher HV content was related to lower Tm values. This trend is consistent with the literature reporting that HV incorporation can disrupt chain regularity and reduce crystallinity in PHB/PHBV systems. Similarly, lower Mw was associated with reduced Tm values, which may reflect the effect of shorter chain length and a higher density of chain ends on crystal stability.
In contrast, higher Tc values were associated with higher Tm, which may reflect the formation of thicker and more perfect crystals. Tg was also positively associated with Tm. This behavior may be explained by the reduced mobility of the amorphous phase of the polymers, which has been reported to promote crystal formation and a higher Tm, as more energy is required to melt the ordered crystalline structures. The analysis further indicated that the type of additive and its percentage were also important features in the prediction of Tm. This observation may reflect the influence of additives on the crystallization behavior and chain mobility of the final biopolymer formulation and thus on its Tm. However, these effects depend on the nature and function of each additive used.
3.2.3. Tc Prediction Model Performance
Based on the results, the best Tc prediction model was the tuned XGBoost regressor trained on the basic physicochemical and compositional feature set alone without outlier exclusion; see Table 14. The dataset consisted of 201 samples and was split into a training set with 160 samples and a test set with 41 samples using an 80/20 ratio.
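The split sizes reported for the 118-sample and 201-sample datasets can be reproduced with simple arithmetic. The sketch below assumes that the test-set size is rounded up and the remainder is assigned to training; the text does not state the rounding rule or the splitting tool, so this is only one convention that happens to match the reported numbers. The function name `split_sizes` is hypothetical.

```python
def split_sizes(n_samples, test_percent=20):
    """Return (n_train, n_test) for a percentage-based hold-out split.

    Assumption: the test size is rounded up (integer ceiling) and the
    remaining samples go to the training set.
    """
    n_test = (n_samples * test_percent + 99) // 100  # integer ceiling
    return n_samples - n_test, n_test


for n in (118, 201):
    n_train, n_test = split_sizes(n)
    print(f"{n} samples -> {n_train} train / {n_test} test")
```

With this convention, 118 samples give 94/24 and 201 samples give 160/41, in agreement with the sizes stated in the text.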
The hyperparameters used in the tuned model are summarized in Table 15.
This configuration achieved a cross-validated R2 of 0.762 ± 0.054 and a training R2 of 0.972, as shown in Table 16. Unlike for Tm and Tg, the optimal Tc feature space contained no additional thermal descriptors. This is physically plausible because crystallization temperature is a kinetic parameter influenced by the interplay between chain mobility and nucleation, which is partially reflected in formulation descriptors such as HV content and Mw. Including additional thermal properties as input features did not improve Tc prediction and, in several cases, increased the variance, consistent with this physical interpretation.
A comparison between the predicted and the true Tc is presented in Figure 8.
The SHAP summary plot for the XGBoost model is presented in Figure 9. “Mw_PHBV” had the highest impact on model output, followed by “HB_ratio,” “HV_ratio,” and the additive1 percentage. These features are relevant to the prediction of Tc, which is commonly associated in the literature with chain mobility, regularity, and intermolecular interactions. Higher Mw values were associated with lower predicted Tc, which may reflect reduced chain mobility due to longer polymer chains and more entanglements that hinder crystal formation. Increasing HV content had the same effect, which may be linked to crystallization becoming more difficult as chain regularity is disrupted. High HB monomer content was associated with higher Tc predictions, which may reflect the formation of more stable and organized crystals in the polymer matrix and higher structural rigidity. Although SHAP values represent feature importance and statistical associations within the model, the identified features are consistent with known factors influencing Tc, which supports the reliability of the quantitative structure–property relationship (QSPR) model.
3.2.4. Tg Prediction Model Performance
The optimal Tg prediction model was the tuned Random Forest regressor trained on the basic feature set augmented with both Tm and Tc as additional thermal descriptors, with outlier exclusion applied to the training set. The features employed in the Tg model are listed in Table 17. The dataset comprised the same 118 samples as for the Tm model, since the same feature space was used.
The hyperparameter values selected for the Tg model are presented in Table 18.
This configuration achieved a cross-validated R2 of 0.765 and a training R2 of 0.960 (Table 19). The benefit of including the remaining thermal properties as input features is consistent with the well-established thermodynamic interdependencies governing glass transition behavior in semicrystalline polymers. Tg is influenced by chain mobility and free volume, which are directly linked to the degree of crystallinity encoded in Tc and the melting behavior captured by Tm.
A comparison between the predicted and true Tg values of the tuned model is presented in Figure 10.
The SHAP analysis (Figure 11) indicated that the most influential features of the RF model included Tm, Tc, Mw, and the HV_ratio. The glass transition temperature (Tg) reflects how easily polymer chains move in the amorphous phase. Specifically, samples with higher Tm and Tc values tended to be associated with higher predicted Tg values. This association may reflect the relationship between stronger intermolecular interactions, greater chain regularity, and the development of crystalline structures, which are often linked to restricted molecular motion in amorphous regions. Similarly, high Mw values were associated with high Tg values, which can be explained by increased chain entanglement that reduces chain mobility in the amorphous phase. In contrast, an increasing HV_ratio was associated with lower predicted Tg values. This trend is consistent with reports that HV units can increase chain irregularity, reduce crystallinity, and enhance chain flexibility. In summary, the identified feature importance and the observed patterns suggest that the model captures relationships that are physically meaningful and relevant to chain mobility in the amorphous phase.
3.3. Comparison of Predictive Performance Across Thermal Properties
A comparison of predictive performance across the three thermal properties showed both similarities and significant differences between the targets and their optimal configurations. The systematic sensitivity analysis of feature space composition and outlier exclusion effects helped identify the best-performing configuration for each thermal property. The Random Forest algorithm performed best for both Tm and Tg prediction, while XGBoost proved preferable for Tc, reflecting the target-dependent nature of thermal property modeling in PHB/PHBV systems.
All three optimal models achieved training R2 values below 0.990, confirming that the selected configurations avoided severe overfitting while retaining sufficient model complexity to capture nonlinear structure–property relationships. Further, the generalization performance varied by target: the Tm model achieved the highest cross-validated R2 of 0.817, followed by Tg at 0.765 and Tc at 0.762. The high scores underlined the generalizability of the ML models adopted in the research. As such, XGBoost and RFs could be applied across different datasets to predict the thermal features of scl-PHAs. Despite ranking third in terms of CV R2, the Tc model achieved the lowest cross-validation standard deviation among the three properties, indicating particularly stable generalization despite relying exclusively on the basic feature set without outlier exclusion.
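The train–CV gap used informally above as an overfitting indicator can be tabulated directly from the reported scores. The snippet below is a small worked example using only the training and cross-validated R2 values quoted in Sections 3.2.2–3.2.4; the variable names are illustrative.

```python
# Training and cross-validated R2 values as reported in the text.
reported = {
    "Tm": {"train_r2": 0.935, "cv_r2": 0.817},
    "Tc": {"train_r2": 0.972, "cv_r2": 0.762},
    "Tg": {"train_r2": 0.960, "cv_r2": 0.765},
}

# Gap between training and cross-validated R2 for each property.
gaps = {prop: round(v["train_r2"] - v["cv_r2"], 3) for prop, v in reported.items()}

# Properties ordered from highest to lowest cross-validated R2.
ranking = sorted(reported, key=lambda prop: reported[prop]["cv_r2"], reverse=True)

# Check of the overfitting threshold (training R2 must not exceed 0.99).
below_threshold = all(v["train_r2"] <= 0.99 for v in reported.values())

for prop in ranking:
    print(f"{prop}: CV R2 = {reported[prop]['cv_r2']:.3f}, train-CV gap = {gaps[prop]:.3f}")
```

The gaps are 0.118 for Tm, 0.210 for Tc, and 0.195 for Tg, so the Tm model shows the smallest difference between training and cross-validated performance.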
The Tg model exhibited the lowest absolute error metrics among the three properties, with a cross-validated RMSE of 4.05 °C and an MAE of 2.68 °C. This outcome reflects both the performance of the model and the narrower range and lower variability in Tg values in the dataset, which naturally leads to smaller absolute errors independent of predictive strength. In contrast, the Tc model showed higher absolute errors but demonstrated the most stable generalization behavior across folds. Overall, these results highlight that predictive performance should be interpreted jointly with dataset-dependent property distributions, as error magnitudes alone may not directly reflect model quality across different thermal properties.
Feature importance analysis via SHAP revealed consistent trends: descriptors such as polymer Mw and the HB and HV formulation ratios were among the most influential features across all three thermal properties. These findings suggest that the models assigned higher importance to variables that are known to influence the thermal behavior of PHB/PHBV systems. The SHAP trends generally agree with relationships reported in the literature, but they should be viewed as model-based associations and measures of feature importance rather than a direct physical interpretation of the features. Further insights are expected to emerge as the dataset grows and more data become available.
Nevertheless, this study has some limitations that should be considered. The thermal property data were collected from different literature sources, where differences in differential scanning calorimetry (DSC) conditions and in the definitions of transition temperatures (e.g., onset, midpoint, or peak) can lead to significant shifts in the reported thermal transition temperatures. Since processing and cooling conditions can significantly affect Tc, their absence from the dataset may further limit the physical meaning of the Tc model. Additionally, characterization protocols for compositional information were not consistently reported or standardized, leading to heterogeneity in the dataset, including differences in sample preparation histories and molecular weight determination methods. In some cases, values had to be manually extracted from published graphs using PlotDigitizer, which may introduce additional noise and uncertainty.
A recurring limitation across all three models was the constrained dataset size, which ranged from 118 to 201 samples depending on the target property. This challenge is inherent to the field of polymer informatics, where experimental thermal characterization is resource-intensive and published datasets are sparser than in other materials domains.
For Tg and Tm prediction, models based only on compositional features represent a more realistic scenario for predicting the properties of new materials. Although better performance was achieved when other thermal properties were included as inputs, this is mainly due to the strong relationships between thermal properties rather than data leakage. However, such information is often unavailable for unknown samples.
Despite these limitations, this work represents a first step toward a data-driven framework for predicting PHA thermal properties. Future studies with larger datasets and more standardized experimental data and testing conditions are expected to further improve model accuracy and robustness. The study was also limited in that it compared only XGBoost and RF models. Future work should consider a broader range of ML models and compare their performance in predicting the thermal features of scl-PHAs.
3.4. Model Validation, Generalization and Applicability Domain Analysis
3.4.1. Study-Wise Splitting and External Validation
To evaluate generalization across independent studies, a source-wise (leave-one-study-out) splitting strategy was initially applied. However, due to the limited number of available studies in several test folds, this approach resulted in highly unstable performance estimates. Although cross-validation performance remained relatively high (CV R2 ≈ 0.73–0.78), test set performance showed large variability and, in several cases, negative R2 values. These results highlight the limitations of strict study-wise splitting under extreme data scarcity rather than intrinsic model failure. In particular, the imbalance in study sizes and the heterogeneity in experimental conditions led to test sets that were not statistically representative of the training distribution.
To address this constraint, an alternative external validation strategy was introduced based on chemically meaningful hold-out sets, where selected high-temperature observations from studies containing more than five instances were excluded from training. This approach resulted in more balanced and interpretable evaluation of generalization performance. Pearson correlation is particularly advantageous for evaluating unseen instances, as it captures the strength of the linear relationship between predicted and experimental values independent of scale and systematic bias.
Under this scheme, the models achieved improved and more stable predictive behavior, with Pearson correlation coefficients ranging from 0.71 to 0.91 across properties, as shown in Table 20. Figure 12 shows parity plots for training, test, and external validation splits. Nevertheless, validation against fully independent literature sources remains an important direction for further research.
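The difference between the Pearson correlation and R2 on held-out data, including how R2 can become negative, can be illustrated with a toy example. The sketch below uses hypothetical temperature values, not data from this study, and implements both metrics from their standard definitions.

```python
from math import sqrt


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / sqrt(var_t * var_p)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical held-out temperatures (in degrees C) and predictions that
# follow the trend perfectly but carry a constant +20 degree offset.
y_true = [150.0, 155.0, 160.0, 165.0, 170.0]
y_pred = [t + 20.0 for t in y_true]

print("Pearson r:", pearson_r(y_true, y_pred))
print("R2:", r2_score(y_true, y_pred))
```

Here the Pearson coefficient equals 1 while R2 is strongly negative, because R2 penalizes the systematic offset and the Pearson coefficient does not. The same property means that a high Pearson coefficient alone cannot rule out biased predictions, which is why it is best read together with error metrics.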
3.4.2. Effect of Molecular Weight Reconstruction on Model Performance
To evaluate the influence of molecular weight reconstruction on predictive performance, additional models were trained using datasets in which all reconstructed molecular weight descriptors were excluded. In these cases, the effective dataset size was substantially reduced; note that for Tm and Tg, the instances were fewer than 100. Across all thermal properties and model types, the resulting models exhibited mean cross-validation R2 values below 0.60, accompanied by large variance (>0.15) across folds.
This degradation in cross-validated performance is attributed mainly to the reduced datasets, amplifying the impact of individual studies and experimental conditions. Although reconstruction introduces approximate values, the resulting models consistently achieved higher and more stable cross-validated performance, indicating an improved bias–variance balance. These results highlight that, for sparse polymer datasets, reconstruction is essential for obtaining statistically robust and generalizable models. However, this procedure may introduce additional uncertainty, which should be taken into account when interpreting the results.
3.4.3. Baseline Models and Linearity Assessment
To provide a baseline for performance comparison, several commonly used linear and kernel-based regression models were evaluated, including ordinary linear regression (LR), Ridge regression, Lasso regression, and support vector regression (SVR). These models were trained and validated using the same descriptors, data splits, and cross-validation protocol employed for the ensemble models.
Across all three target properties, linear models exhibited limited predictive capability, with mean cross-validation R2 values of 0.67, 0.34 and 0.38 for Tm, Tc and Tg, respectively. Regularization through Ridge and Lasso did not yield substantial improvements, indicating that the observed limitations were not due to overfitting or multicollinearity but rather to an inability of linear formulations to capture the underlying structure–property relationships.
SVR models showed improved performance relative to linear regressions, achieving cross-validation R2 values of approximately 0.80 for Tm (acceptable) and around 0.51–0.57 for Tc and Tg. However, SVR performance remained consistently poorer than that of the tree-based ensemble models, particularly in terms of robustness and variance across folds. The superior performance of RF and XGBoost indicated that the relationships between molecular descriptors, molecular weight characteristics, and thermal properties of PHB and PHBV were inherently nonlinear and involved higher-order feature interactions that cannot be adequately represented by linear models.
3.4.4. Effect of Data Splitting Strategy
To assess whether the use of random train/test splitting leads to optimistic performance estimates, an additional evaluation was performed using the Kennard–Stone (KS) algorithm [49] to generate a more uniform and space-filling partition of the descriptor space. This approach is commonly used to mitigate sampling bias in small, heterogeneous datasets and provides a more stringent test of model robustness compared to random splitting.
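For readers unfamiliar with the Kennard–Stone procedure, the following is a minimal pure-Python sketch of the classic algorithm: it starts from the two most distant samples and then repeatedly adds the sample whose nearest already-selected neighbor is farthest away. Using Euclidean distance on pre-scaled descriptors and treating the selected samples as the training set are assumptions of this sketch; the implementation and settings used in the study are not specified in the text, and the function name `kennard_stone` is hypothetical.

```python
from itertools import combinations
from math import dist


def kennard_stone(points, n_select):
    """Return indices of n_select samples chosen by the Kennard-Stone rule.

    points: list of equal-length numeric tuples (descriptor vectors,
    assumed to be scaled beforehand). The selected indices are typically
    used as the training set and the remaining ones as the test set.
    """
    n = len(points)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of points")

    # Step 1: seed with the pair of samples that are farthest apart.
    first, second = max(
        combinations(range(n), 2),
        key=lambda pair: dist(points[pair[0]], points[pair[1]]),
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in (first, second)]

    # Distance from each remaining sample to its nearest selected sample.
    nearest = {
        i: min(dist(points[i], points[first]), dist(points[i], points[second]))
        for i in remaining
    }

    # Step 2: repeatedly add the sample farthest from the selected set.
    while len(selected) < n_select:
        nxt = max(remaining, key=lambda i: nearest[i])
        selected.append(nxt)
        remaining.remove(nxt)
        del nearest[nxt]
        for i in remaining:
            nearest[i] = min(nearest[i], dist(points[i], points[nxt]))
    return selected


# Toy two-descriptor example with hypothetical, already scaled values.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (5.0, 0.0), (10.0, 0.0)]
train_idx = kennard_stone(points, n_select=3)
test_idx = [i for i in range(len(points)) if i not in train_idx]
print(train_idx, test_idx)
```

Because the training set is built from the extremes inward, the held-out samples lie inside the region spanned by the training data, so the resulting test set probes interpolation within the descriptor space rather than extrapolation.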
Across all three thermal properties, the KS-based results confirmed that the predictive performance of the models was generally stable with respect to the splitting strategy, as shown in Table 21. For Tm and Tg, model performance remained comparable to that obtained using random splitting, indicating that the originally reported results were not artificially inflated. In contrast, Tc exhibited a moderate decrease in predictive performance under the KS split, suggesting a higher sensitivity to data heterogeneity and partitioning effects for this property. The observed differences were property-dependent rather than systematic, indicating that model performance was not primarily driven by sampling bias.
3.4.5. Domain of Applicability for Model Prediction
The domain of applicability (DoA) of the developed models was defined using the observed ranges of the numerical input descriptors in the training dataset (Table 22). This analysis identifies the regions of chemical space in which the developed models provide trustworthy predictions; outside these ranges, predictions should be treated with caution since they correspond to extrapolation.
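A range-based domain check of this kind can be sketched in a few lines. In the example below, the descriptor names echo feature names mentioned in the SHAP discussion, but all numerical values are hypothetical and do not reproduce Table 22; the helper names `fit_ranges` and `out_of_domain` are likewise illustrative.

```python
def fit_ranges(training_rows):
    """Compute the (min, max) range of every numerical descriptor."""
    keys = training_rows[0].keys()
    return {
        key: (min(row[key] for row in training_rows),
              max(row[key] for row in training_rows))
        for key in keys
    }


def out_of_domain(sample, ranges):
    """Return the descriptors of a sample that fall outside the training ranges."""
    return [
        key for key, (low, high) in ranges.items()
        if not low <= sample[key] <= high
    ]


# Hypothetical training descriptors (not the values of Table 22).
training_rows = [
    {"HV_ratio": 0.0, "Mw_PHBV": 200.0},
    {"HV_ratio": 12.0, "Mw_PHBV": 450.0},
    {"HV_ratio": 5.0, "Mw_PHBV": 320.0},
]
ranges = fit_ranges(training_rows)

inside = {"HV_ratio": 8.0, "Mw_PHBV": 300.0}
outside = {"HV_ratio": 20.0, "Mw_PHBV": 300.0}

print(out_of_domain(inside, ranges))   # empty list: within the domain
print(out_of_domain(outside, ranges))  # lists the extrapolated descriptors
```

A per-descriptor range check is simple and transparent, but it treats descriptors independently, so a sample can lie within every individual range and still represent an unusual combination of values.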
4. Conclusions
This study demonstrated the successful development of an ML framework for predicting thermal properties of PHB- and PHBV-based materials. After a literature search, 109 relevant studies were selected, and their findings were integrated with an in-house experimental dataset to construct a raw dataset comprising a total of 572 instances. The curated version of the dataset was employed for ML modeling, using 118 data points for Tg, 201 data points for Tc, and 118 data points for Tm, enabling reliable model development while minimizing overfitting and capturing nonlinear structure–property relationships. This reduction reflects the requirement that only instances with complete descriptor information were used for model training and evaluation; the reported model performance therefore reflects learning from these reduced datasets.
ML QSPR models based on RF and XGBoost were developed, tuned, and validated for the prediction of the three thermal properties. Following hyperparameter optimization, the best-performing configurations achieved cross-validated R2 values of 0.817, 0.762, and 0.765 for Tm, Tc, and Tg, respectively. RF outperformed XGBoost for Tm and Tg prediction, while XGBoost proved preferable for Tc. To examine generalization, a chemically informed external validation strategy was introduced. By holding out high-temperature data points from studies with sufficient sample sizes, an alternative split was performed. Under this setting, all thermal models achieved stable predictive behavior with strong Pearson correlation coefficients and consistent error metrics on unseen data.
Beyond predictive performance, the models provided interpretable structure–property insights and highlighted the importance of data quality and descriptor selection in polymer informatics. When data were limited and the database was relatively small, the combination of balanced performance metrics and explainability tools such as SHAP provided a practical and effective approach for ML modeling.
Integrating datasets with interpretable ML approaches provided physical insight, supporting rational design of sustainable polymers with tailored properties, while reducing experimental time and cost and enabling application-specific optimization. In future work, the expansion of the dataset, incorporation of more detailed molecular and processing descriptors, and evaluation of alternative ML models will further validate QSPR predictive robustness. Larger datasets will enhance the predictive capability and capture more accurately the physical meaning of the features selected, revealing the importance of open-access and organized data libraries following FAIR data principles across the scientific community. The adoption of standardized testing and characterization protocols, including DSC procedures, experimental conditions, molecular weight determination and monomer composition characterization methods, would improve data consistency and comparability across studies, reducing dataset heterogeneity and further strengthening predictive performance.