1. Introduction
The production of petroleum-derived plastics has grown exponentially over the last few decades. This production increased from 1.5 million tons in 1950 to 359 million tons in 2018, and the amount of plastic waste increased correspondingly. Currently, households account for over 60% of the plastic waste from consumer sources. This mainly comprises single-use food packaging materials made from petroleum-derived plastics. Additionally, the consumption of plastic materials in both households and industries has exceeded the global production of plastics by up to 400 Mt/year [1]. As such, the manufacture of plastics from petroleum to meet current consumption demands poses several environmental concerns.
To mitigate environmental pollution in plastics production, polyhydroxyalkanoates (PHAs), a biobased, biodegradable material class, have emerged. Among short-chain-length PHAs (scl-PHAs), poly(3-hydroxybutyrate-co-3-hydroxyvalerate) (PHBV) has received considerable research interest due to its production by microbial fermentation processes [2], its bio-based nature, and its excellent combination of barrier properties and biodegradability, making it suitable for packaging and related applications. However, the use of PHBV in real-world applications is restricted by several factors: the properties and performance of the material; its varying rates of biodegradation in different environments; and differences in geometry/thickness and formulation, especially when it is compounded with plasticizers, fillers, fibers, and bioactive additives to suit specific application needs [3].
A key challenge in PHBV research and product development is that biodegradability is not a fixed parameter but an outcome of multiple factors such as environmental conditions (i.e., temperature, moisture, oxygen availability, microbial community, nutrient content), the polymer composition, molecular weight, crystallinity, chemical structure, presence of additives, material format (film, plaque, multilayer, composite), reduction potential, hydrophilicity, and breakdown products. However, the extent of the effects of some of these factors remains unclear. Biodegradation is influenced by the susceptibility of the polymer carbon backbone to microbial attack. Research on PHBV biodegradation can be divided into two categories: studies that report the degree of biodegradation, and studies that report the mass or weight loss over the study duration. The latter is more prevalent in the literature because it is easier to measure; however, according to ASTM standards, mass loss alone is not adequate to determine the degree of biodegradability of a polymer. Furthermore, the long duration of biodegradation makes complete studies following ASTM standards rare.
Consequently, laboratory studies often employ standardized test methods to ensure reproducibility and comparability of the results. For instance, ISO 14855 [4] measures the ultimate biodegradability of PHBV in controlled composting environments and determines the extent of biodegradability in terms of CO2 evolution, while ASTM D5338 [5] measures biodegradability in aerobic environments under controlled composting at thermophilic temperatures. These test methods are essential to the development of biodegradable plastics such as PHBV and to the standardization of their properties and performance. However, the test conditions, although standardized, can be quite different from the conditions in the “real environment” of use, and the biodegradability of PHBV observed there can differ from the results of the standardized tests.
Indeed, the existing literature on PHBV documents substantial variation across environments and formulations. For example, neat PHBV has been found to display low rates of mass loss in soil environments, whereas the incorporation of natural fibers may significantly enhance the degradation rates. This is consistent with the different water absorption rates and degradation mechanisms reported [6]. Similarly, composite materials containing lignocellulosic fillers such as wood flour have shown improved degradation rates in soil burial tests [7]. In aquatic environments, PHBV films containing phenolic additives such as catechin, ferulic acid, and vanillin were shown by respirometry to degrade, indicating that the degradation kinetics may depend on the medium [8]. Even in composting tests conducted under nominally controlled conditions, disintegration and degradation rates can vary with the multilayer structure or the blend composition, underlining the complexity of extrapolating laboratory results to real-world scenarios [9].
Such heterogeneity poses a practical challenge for lab testing, while field testing in a real-world environment is expensive and time-consuming. However, decisions regarding formulation, processing pathways, and end-of-life need to be made in early-stage testing. Hence, there is significant interest in developing a data-driven approach that can learn relationships between material descriptors, additives, and environmental factors to predict biodegradation. Recent work within the larger domain of polymer biodegradation has demonstrated the ability of machine learning (ML) methods to predict biodegradation endpoints such as percent biodegradation from input features [10]. This presents a clear path forward for faster testing and hypothesis generation. Beyond polymer-specific datasets, interpretable ML frameworks have been proposed for enhancing prediction and understanding of primary vs. ultimate biodegradation endpoints, highlighting the importance of both prediction and mechanistic insight [11]. Similar work has utilized rank-based learning methods to overcome issues and biases inherent in degradability data, where experimental conditions and reporting are highly varied [12]. ML techniques have also been used to predict biodegradation behavior in aquatic environments using meta-analytic datasets that include both material properties and experimental conditions [13].
However, several issues specific to the predictive modeling of PHBV biodegradability remain to be investigated. First, PHBV is often blended with plasticizers such as citrate esters, natural fibers and fillers such as cellulose and wood flour, mineral fillers such as calcium carbonate, and functional additives such as antimicrobials and antioxidants, all of which can influence water sorption, crystallinity, interfacial microstructure, and microbial degradability [3]. Second, the results of PHBV biodegradability are expressed in diverse forms, such as ultimate biodegradation (CO2 evolution), disintegration, fragmentation, mass loss, changes in molecular weight, and surface/morphological changes [14]. Third, scaling up from small laboratory experiments to medium/large laboratory tests, and then to the field, is complicated by non-linear effects, e.g., oxygen transfer, temperature gradients, and moisture heterogeneities, which make it difficult to extrapolate the results from standardized conditions.
In this context, this study presents for the first time a computational framework that automates the process from literature-based data collection to the prediction of weight loss in scl-PHA biopolymers. The model accurately predicts physical disintegration across various experimental scales and environments, including soil, marine, freshwater, and compost settings. Initially, a comprehensive dataset on degradation-related weight and mass loss of PHBV-based formulations was manually curated by systematically collecting, assembling, and harmonizing data on material characteristics, environmental exposure conditions, and biodegradation behavior. The dataset included a combination of numerical and categorical variables. The mixed data types and potential nonlinear relationships motivated the use of advanced ML methods combined with appropriate preprocessing strategies. Subsequently, multiple regression-based QSAR models were developed and rigorously validated for predicting the biodegradation behavior of the investigated formulations. The models achieved high predictive accuracy, indicating robust structure-degradation relationships. However, given the complexity and heterogeneity of polymer structures and degradation processes, the models’ ability to generalize across different material–environment combinations should be further examined. Finally, the most important features governing biodegradation-induced weight loss were identified, and their effect on the predictions was examined. Degradation environment, exposure time, and hydroxybutyrate (HB) ratio were revealed as key factors in weight loss: although weight loss increased with exposure duration and temperature, soil and polymer conditions (such as high crystallinity) hindered it significantly.
The final predictive model was implemented as a user-friendly web application on the Jaqpot computational platform (https://jaqpot.org/, accessed on 2 April 2026) and is openly accessible to the scientific community through the ANIPH virtual organization.
2. Materials and Methods
2.1. Workflow of Model Development
The overall methodological workflow adopted for this study is presented in Figure 1 and involves distinct phases of data curation, preprocessing, and data-informed modeling. The raw data on weight loss were collected manually from the literature and underwent an initial curation step, including consistency checking, quality control, and time-series alignment. The next phase involved data preprocessing, including feature engineering, normalization, and handling of missing values, to yield the processed dataset. The processed dataset was split into the training and test sets, followed by outlier detection on the training set. When outliers were detected, factor analysis of mixed data (FAMD) [15] and interquartile range (IQR) analysis [16] were performed to reduce their effects. One-hot encoding of the categorical variables and feature importance analysis were performed. Finally, the regression models were developed using the ensemble learning algorithms of Random Forest [17] and Extreme Gradient Boosting (XGBoost v. 2.1.3) [18] to predict the weight or mass loss of PHBV-based formulations under different environmental conditions and test scales. Both models are well-suited for polymer informatics applications due to their ability to capture nonlinear relationships and interactions among descriptors. Model performance was evaluated using the coefficient of determination (R2), root mean squared error (RMSE), and mean absolute error (MAE) for the training, test, and validation sets. To further enhance performance, grid search–based hyperparameter optimization was conducted using cross-validation on the training data. Optimal hyperparameters were selected by minimizing the prediction error, and the final models were retrained before evaluation on the test set.
2.2. Biodegradation-Induced Weight/Mass Loss Database Construction for PHBV-Based Formulations
The first stage in constructing the biodegradation-induced weight or mass loss database of PHBV-based formulations containing different additives or building blocks was data collection. The quality of the data collected was critical to the overall study. Data were obtained through a comprehensive literature-mining process that included peer-reviewed journal articles, theses, and published experimental studies addressing PHBV degradation under diverse environmental conditions. The selected studies investigated the degradation behavior of the studied formulations in soil, compost, freshwater, and marine environments, considering medium- and large-scale laboratory tests as well as real environmental exposure conditions. Emphasis was placed on identifying studies that reported quantitative degradation metrics based on “weight or mass loss”, used as an indicator of biodegradation-induced disintegration, quantifying the physical loss of material mass over time. These studies also revealed associated environmental parameters and material characteristics of the investigated formulations.
2.2.1. Literature Search Strategy
In accordance with the PRISMA 2020 guidelines [19,20], a literature search was carried out in the Scopus database [21]. The search identified relevant literature on the properties associated with biodegradation-induced weight or mass loss of PHBV-based formulations, including natural and synthetic additives such as plasticizers, fillers, stabilizers, antioxidants, bio-based compounds, and other components.
The literature search used a combination of Boolean operators, with a focus on the Title, Abstract, and Keywords fields (TITLE-ABS-KEY). Three concept blocks were used with the AND operator, and each block was used to identify literature on the following: (1) the copolymer PHBV as the biodegradable material, (2) weight/mass loss properties associated with the biodegradation of the material, (3) a set of additives/components used with the PHBV-based formulations to enhance their biodegradability.
The final search query in the Scopus database was structured as follows:
TITLE-ABS-KEY ((“PHBV” OR “polyhydroxybutyrate-co-valerate” OR “poly(3-hydroxybutyrate-co-3-hydroxyvalerate)”) AND (“biodegradation” OR “decomposition” OR “microbial degradation”) AND (“weight loss” OR “mass loss” OR “weight reduction” OR “mass reduction” OR “disintegration” OR “fragmentation” OR “erosion” OR “biodegradation rate” OR “percentage mass loss” OR “weight loss %”) AND (“lignin” OR “lignins” OR “citric acid” OR “citric ester” OR “acetyl tributyl citrate” OR “ATBC” OR “triacetin” OR “triethyl citrate” OR “TEC” OR “epoxidized soybean oil” OR “epoxidized cottonseed oil” OR “epoxidized natural rubber” OR “ENR” OR “soybean oil” OR “Vish-E filler” OR “starch” OR “starch-based fillers” OR “cornstarch” OR “alginate” OR “alginic acid” OR “pure cellulose” OR “cellulose fibers” OR “wood flour” OR “woodflour” OR “WF” OR “wheat straw fibre” OR “wheat straw fiber” OR “lignocellulosic” OR “miscanthus” OR “olive pomace” OR “propionylated abaca fiber” OR “catechin” OR “ferulic acid” OR “vanillin” OR “polylactic acid” OR “PLA” OR “poly(ε-caprolactone)” OR “PCL” OR “polybutylene adipate-co-terephthalate” OR “PBAT” OR “polyethylene oxide” OR “PEO” OR “flax fibers” OR “flax fibres” OR “calcium carbonate” OR “CaCO3” OR “halloysite” OR “modified halloysite” OR “lignin-coated cellulose nanocrystals” OR “boron nitride” OR “quercetin” OR “DDGS” OR “distillers dried grains with solubles” OR “posidonia oceanica” OR “gallic acid” OR “ammonium quaternary salts” OR “castor oil” OR “limonene” OR “thymol” OR “oregano essential oil” OR “sorbitol” OR “maltodextrin” OR “dicumyl peroxide” OR “Licowax”)).
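The structure of the query above (OR-joined synonyms inside each block, AND between blocks, all within one TITLE-ABS-KEY field) can be sketched as a small helper. This is only an illustration: the function names `or_block` and `build_scopus_query` are hypothetical, the term lists are abbreviated, and straight quotes are used in place of the typographic quotes printed above.

```python
def or_block(terms):
    """Join quoted search terms with OR and wrap them in parentheses."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def build_scopus_query(blocks):
    """AND the OR-blocks together inside a single TITLE-ABS-KEY field."""
    return "TITLE-ABS-KEY (" + " AND ".join(or_block(b) for b in blocks) + ")"


# Abbreviated term lists; the full lists are those of the query above.
polymer_terms = ["PHBV", "polyhydroxybutyrate-co-valerate"]
process_terms = ["biodegradation", "microbial degradation"]
outcome_terms = ["weight loss", "mass loss"]
additive_terms = ["lignin", "wood flour", "starch"]

query = build_scopus_query(
    [polymer_terms, process_terms, outcome_terms, additive_terms]
)
print(query)
```

Keeping each block as a plain list makes it easy to audit which synonyms were searched and to rerun the search when a term is added.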
2.2.2. Study Selection and Screening
The records identified through the database search were further exported for evaluation. Titles and abstracts were initially screened to assess their relevance in relation to the biodegradation of PHBV-based formulations containing diverse additives under laboratory or natural environmental conditions. No restrictions were placed on the publication year to ensure a comprehensive historical record was obtained. Additionally, only articles published in English were included. During the screening phase, review articles, conference papers, editorials, notes, and book chapters were excluded. Full-text articles were subsequently assessed in detail according to pre-defined inclusion and exclusion criteria.
The inclusion criteria comprised studies that:
- (i)
Investigated PHBV or PHBV-based composites;
- (ii)
Examined biodegradation, microbial degradation, or disintegration processes under laboratory or real environment conditions;
- (iii)
Reported quantitative, time-resolved degradation metrics based on weight or mass loss (%);
- (iv)
Evaluated the incorporation of additives such as fillers, plasticizers, nucleating agents, antioxidants, bio-derived materials, or other functional compounds within the polymer matrix.
The exclusion criteria eliminated studies that examined other types of polymers apart from PHBV, studied degradation mechanisms unrelated to the process of biodegradation or microbial activity, solely employed gas evolution-based mineralization methods (e.g., measurement of evolved CO2), which failed to include the measurement of mass or weight loss, or examined the addition of compounds that were not incorporated into the matrix of the polymer.
Initially, 56 publications from 1991 to March 2025 were identified from the Scopus database’s advanced search tool using the predefined search query. After removing duplicates and an initial screening, a full-text evaluation was conducted for all publications, following specific eligibility criteria. Only publications relevant to the specific scope of this study, namely biodegradation-induced weight or mass loss of PHBV-based formulations containing specific additives, conducted under soil, compost, freshwater, marine, or other environmentally relevant conditions, including medium- to large-scale experiments conducted in the laboratory or under natural environmental conditions, were considered. Further, only publications containing primary experimental data were eligible. Although excluding non-English publications may have eliminated relevant findings from regions with significant research activity (e.g., Japan and China), this criterion was applied because English is the primary language of international scientific communication. Further, only full-text publications were considered to ensure access to all necessary details on the methodologies and experiments used. A total of 17 peer-reviewed publications [6,7,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36] formed the final database used for this study’s analysis and prediction of biodegradation-induced weight or mass loss for PHBV-based formulations containing different additives and building blocks. The study selection approach used in this study is illustrated in the PRISMA 2020 flow diagram [19] depicted in Figure 2.
2.2.3. Data Extraction and Synthesis
For each eligible study, information related to PHBV formulation parameters and physicochemical properties was systematically extracted, including polymer composition, molecular and structural characteristics, and the type and concentration of additives or bio-derived compounds incorporated into the matrix. In addition, details regarding the biodegradation conditions, such as exposure environment, test duration, specimen size or geometry, and experimental setup, were collected with the reported quantitative degradation outcomes based on weight or mass loss (%). These data were used to conduct qualitative analysis to establish trends in biodegradation-induced disintegration behavior associated with different additive categories and environmental conditions.
All available published data, including numeric values, polymer names, additive descriptors, and physicochemical property data, were organized in tabular format (Microsoft Excel files) to facilitate transfer and subsequent analysis. When property data were only available in graphical form, numerical values were manually extracted and curated from figures, plots, and digitized tables reported in the main text using the Plotdigitizer software (version 3) [
37].
2.2.4. Data Curation
Following data extraction, curated databases were constructed in a structured spreadsheet format (Microsoft Excel) to systematically capture the multivariate aspects of biodegradation-induced weight or mass loss in PHBV-based systems. The extracted data from all eligible studies were integrated into a unified database composed of five worksheets, each corresponding to a distinct data category, as summarized below in Table 1.
This structured organization ensured that uniform representation of material descriptors, additive properties, environmental exposure conditions, sample characteristics, and quantified degradation results was achieved.
The worksheets were developed to ensure data harmonization and cross-referencing of heterogeneous data collected from different experimental procedures and data representation styles. In addition, a worksheet was created to record all abbreviations, symbols, and definitions used for data variables to ensure proper understanding during subsequent data processing and modeling.
Each study was assigned a unique identification number (study ID), and all corresponding data were subsequently entered as individual rows in the database. Where a study reported more than one physicochemical property or experimental condition, distinct data instances were created to represent the individual conditions. All of these data instances still carried the study ID of their original study. In total, 17 distinct study IDs were included, corresponding to 226 independent instances in “Worksheet_1_Materials_features” and 133 independent instances in “Worksheet_2_Environmental_features”, “Worksheet_3_Biodegradation_features”, and “Worksheet_4_Time_points_features”, collectively yielding 1546 time-resolved weight or mass loss observations. This structure enabled both within-study and between-study analyses, allowing variability to be examined at the instance level as well. In Table 2, we summarize the distribution of instances and data pairs (time point, weight or mass loss percentage) per study. Note that the study IDs were non-sequential as they were extracted from a larger data library.
Table A1, Table A2, Table A3 and Table A4 in Appendix A outline a comprehensive summary of all input (feature) and output (target) variables used to generate the PHBV-based formulations’ biodegradation-induced weight or mass loss data library. Table A1 lists 46 materials descriptors (23 categorical and 23 numerical) associated with PHBV-based formulations’ composition, additives, molecular and physicochemical properties, morphological characteristics, and surface characteristics. Table A2 outlines 29 environmental features categorized into 17 numerical and 9 categorical variables and describing a range of biochemical environment variables, physical medium, temperature, pH, moisture, salinity, nutrient composition, solids composition, and standardized testing protocols. Table A3 details the biodegradation-induced weight or mass loss response variables included, and Table A4 summarizes the degradation time-series output variables. Each variable is described based on its definition, measurement unit, range of values, and type (categorical or numerical). The “Worksheet_3_Biodegradation_features” and “Worksheet_4_Time_points_features” datasets were organized in a hierarchical format comprising multiple study- and instance-level observations, and representing repeated measurements recorded under the same experimental conditions; see, for example, the first five rows in Table 3 and Table 4, respectively. Note that not all instances have the same number of data points.
A comprehensive raw data library for PHBV biodegradation-induced weight or mass loss, including all variables across the four categories, is available in the AUA Zenodo repository [38].
Due to manual data extraction from heterogeneous literature sources, some additional noise may be present in the dataset. This was mitigated through iterative quality checks using descriptive statistics. Thus, model predictions should be interpreted in terms of long-term degradation trends rather than short-term ones, where high numerical precision is required.
2.2.5. Data Pre-Processing
To facilitate statistical analysis, the dataset “Worksheet_3_Biodegradation_features” was transformed into a long format. As a result, each weight loss value was melted into a separate row, such that every row represented a single weight loss observation associated with its corresponding “Study_id”, “Instance”, and “Sample_name”. This restructuring preserved the hierarchical relationships in the data while enabling instance-level analysis of weight loss outcomes.
The resulting dataset was combined with the corresponding melted “Worksheet_4_Time_points_features” dataset. This integration aligned each weight loss observation with its associated time point, while retaining the study-, instance-, and sample-level identifiers. The final merged “t_y_values” dataset comprised 1546 rows and 5 columns, with each row representing a single degradation time–weight loss % observation pair.
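The wide-to-long restructuring and the alignment of time points with weight loss values described above can be sketched with the standard library alone. The sketch assumes, purely for illustration, that each worksheet holds one row per instance with measurement columns named "p1", "p2", and so on, and that empty cells are stored as None; the output column names "Time_days" and "Weight_loss_%" and the helper names are likewise assumptions, not the authors' code.

```python
ID_COLS = ("Study_id", "Instance", "Sample_name")


def melt(rows):
    """Map (identifiers..., point column) -> value, skipping empty cells."""
    long = {}
    for row in rows:
        ids = tuple(row[c] for c in ID_COLS)
        for col, value in row.items():
            if col not in ID_COLS and value is not None:
                long[ids + (col,)] = value
    return long


def merge_time_and_loss(time_rows, loss_rows):
    """Inner-join melted time points and weight-loss values on their shared key."""
    t_long, y_long = melt(time_rows), melt(loss_rows)
    merged = []
    for key in sorted(t_long.keys() & y_long.keys()):
        study, instance, sample, _point = key
        merged.append({
            "Study_id": study, "Instance": instance, "Sample_name": sample,
            "Time_days": t_long[key], "Weight_loss_%": y_long[key],
        })
    return merged


# Toy worksheets: the second instance has only two measurements.
time_rows = [
    {"Study_id": 24, "Instance": 1, "Sample_name": "A", "p1": 0, "p2": 30, "p3": 60},
    {"Study_id": 24, "Instance": 2, "Sample_name": "B", "p1": 0, "p2": 45, "p3": None},
]
loss_rows = [
    {"Study_id": 24, "Instance": 1, "Sample_name": "A", "p1": 0.0, "p2": 2.5, "p3": 6.0},
    {"Study_id": 24, "Instance": 2, "Sample_name": "B", "p1": 0.0, "p2": 9.0, "p3": None},
]
t_y_values = merge_time_and_loss(time_rows, loss_rows)
```

Each output row carries the three identifiers plus one time–weight loss pair, which matches the five-column layout described for the merged dataset; instances with fewer measurements simply contribute fewer rows.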
The melted and merged “t_y_values” dataset was subsequently integrated with the “Properties” dataset, which arose from merging the “Worksheet_1_Materials_features” and “Worksheet_2_Environmental_features” datasets. This final merge enriched each observation with the corresponding physicochemical and environmental properties, resulting in a comprehensive dataset comprising 20 columns, as summarized in
Table 5 below. Features with more than 80% missing values were removed.
The final merged dataset consisted of two identifier columns (“Study_id” and “Instance”), one target variable, “Weight_loss_%”, and 17 feature columns. Of the feature variables, seven were numerical, and ten were categorical, capturing a diverse set of formulation, material, and experimental characteristics. This combination of variable types enabled comprehensive modeling of weight loss behavior.
To reduce sparsity and improve interpretability, categorical features with many small or semantically similar categories were consolidated. For instance, the “Degradation_Environment” feature originally contained 12 categories, some with very few observations (e.g., “vermicompost” with 4 instances). These categories were mapped into four broader, domain-relevant groups: Soil, Marine/Aquatic, Compost/Organic Fertilizer, and Laboratory/Mineral Media. This approach preserved experimental context while simplifying the dataset to facilitate visualization, modeling, and interpretation.
Table 6 illustrates the mapping of original to merged categories for “Degradation_Environment”. The same procedure was applied to the features: “Degradation_mechanism”, “Sample_shape/Morphology”, “PHA_degrading_microbes”, and “Additive_type_1”. In
Table 7, a summary of all features affected by this categorical dimensional reduction is detailed.
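As a minimal sketch of this kind of category consolidation, the mapping can be held in a dictionary and applied label by label. The four target groups are those named above; the raw labels other than "vermicompost" are hypothetical placeholders, since the actual mapping is the one given in Table 6.

```python
from collections import Counter

# Hypothetical raw labels (only "vermicompost" is named in the text);
# the target groups are the four merged categories.
ENVIRONMENT_GROUPS = {
    "garden soil": "Soil",
    "farm soil": "Soil",
    "seawater": "Marine/Aquatic",
    "river water": "Marine/Aquatic",
    "vermicompost": "Compost/Organic Fertilizer",
    "industrial compost": "Compost/Organic Fertilizer",
    "mineral salt medium": "Laboratory/Mineral Media",
}


def consolidate(label, mapping=ENVIRONMENT_GROUPS):
    """Return the merged category, failing loudly on an unmapped label."""
    key = label.strip().lower()
    if key not in mapping:
        raise KeyError(f"No merged category defined for {label!r}")
    return mapping[key]


raw_labels = ["Garden soil", "seawater", " Vermicompost ", "farm soil"]
merged_labels = [consolidate(label) for label in raw_labels]
group_sizes = Counter(merged_labels)
```

Raising an error for unmapped labels, rather than assigning a default group, keeps a newly added study from being silently placed in the wrong environment.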
Following the categorical consolidation, missing values (18.3%) in the numerical feature “T_deg” (degradation temperature) were addressed using a targeted imputation strategy. Since degradation temperature was primarily determined by the degradation environment, missing “T_deg” values were filled using the mean temperature of comparable conditions from the same study id. For example, missing values of a given study id in the “Marine/Aquatic” environment were imputed using the mean temperature from all other instances in the same environment. This approach preserved the environmental context of each observation while ensuring that the dataset remained complete for subsequent analysis.
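One possible reading of this imputation rule is a group mean over the rows that share both the study ID and the merged degradation environment. The sketch below follows that reading; the function name `impute_t_deg` is hypothetical, and rows whose group has no recorded temperature are deliberately left missing.

```python
from collections import defaultdict
from statistics import mean


def impute_t_deg(rows):
    """Fill missing T_deg with the mean of its (study, environment) group."""
    groups = defaultdict(list)
    for row in rows:
        if row["T_deg"] is not None:
            groups[(row["Study_id"], row["Degradation_Environment"])].append(row["T_deg"])

    imputed = []
    for row in rows:
        new_row = dict(row)  # leave the input rows unchanged
        key = (row["Study_id"], row["Degradation_Environment"])
        if new_row["T_deg"] is None and key in groups:
            new_row["T_deg"] = mean(groups[key])
        imputed.append(new_row)
    return imputed


rows = [
    {"Study_id": 24, "Degradation_Environment": "Marine/Aquatic", "T_deg": 20.0},
    {"Study_id": 24, "Degradation_Environment": "Marine/Aquatic", "T_deg": 24.0},
    {"Study_id": 24, "Degradation_Environment": "Marine/Aquatic", "T_deg": None},
    {"Study_id": 7, "Degradation_Environment": "Soil", "T_deg": None},
]
filled = impute_t_deg(rows)
```

In this toy input the third row receives 22.0, while the last row stays missing because its group has no observed temperature; such rows would then be dropped by the complete-case filter described next.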
Finally, rows containing missing values in any of the feature columns were excluded, resulting in a final dataset of 1467 complete observations. The distributions of all features in the final dataset are summarized in two figures: Figure 3 shows the distributions of numerical features, while Figure 4 presents the distributions of categorical features, highlighting the range and skewness of each variable.
2.3. Biodegradation QSAR Model Development and Validation
The dataset was partitioned into training and testing subsets to enable robust and unbiased evaluation of model performance. The split was performed before dimensionality reduction via Factor Analysis of Mixed Data (FAMD), outlier removal, feature encoding, and feature importance-based selection, which ensured that the test set remained unseen during these steps. The train-test split was applied to individual observations rather than to whole instances, so time and weight loss data from the same instance could appear in both sets.
To rigorously evaluate model performance on held-out data, entire instances were excluded from training and testing. From study IDs with more than two instances, one instance with more than five data points was randomly selected and reserved in a validation set. This approach enabled assessment of the model’s ability to predict polymer weight or mass loss behavior across different properties.
Figure 5 shows a representative example for Study id = 24, which consists of 4 instances. Nevertheless, as the held-out instances were derived from the same experimental studies, some correlation related to shared properties and measurement conditions may remain, and the validation results should therefore be interpreted as an initial indication of model generalization rather than a fully independent assessment.
The unseen (validation) set comprised 167 observations (11.4%). Then, 20% (260) of the remaining observations were held out as the test set, while the remaining 80% (1040) were used for training.
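The hold-out rule and the subsequent 80/20 split can be sketched as follows, assuming one dictionary per observation with "Study_id" and "Instance" keys. The function names, the seed, and the toy data are illustrative assumptions; the sketch only mirrors the stated rule (one instance with more than five data points from each study with more than two instances).

```python
import random
from collections import defaultdict


def pick_unseen_instances(rows, rng):
    """From each study with more than two instances, pick one instance
    that has more than five observations."""
    counts = defaultdict(int)
    for r in rows:
        counts[(r["Study_id"], r["Instance"])] += 1
    by_study = defaultdict(list)
    for (study, instance), n_points in sorted(counts.items()):
        by_study[study].append((instance, n_points))
    picked = set()
    for study, instances in by_study.items():
        eligible = [inst for inst, n_points in instances if n_points > 5]
        if len(instances) > 2 and eligible:
            picked.add((study, rng.choice(eligible)))
    return picked


def three_way_split(rows, test_fraction=0.2, seed=0):
    """Return (train, test, unseen) lists of observations."""
    rng = random.Random(seed)
    picked = pick_unseen_instances(rows, rng)
    unseen = [r for r in rows if (r["Study_id"], r["Instance"]) in picked]
    rest = [r for r in rows if (r["Study_id"], r["Instance"]) not in picked]
    rng.shuffle(rest)  # observation-level shuffle of the remaining rows
    n_test = round(test_fraction * len(rest))
    return rest[n_test:], rest[:n_test], unseen


# Toy data: study 24 has four instances, study 7 has two, six points each.
rows = [
    {"Study_id": s, "Instance": i, "Time_days": 10 * k}
    for s, n_inst in ((24, 4), (7, 2))
    for i in range(1, n_inst + 1)
    for k in range(6)
]
train, test, unseen = three_way_split(rows)
```

With the reported sizes, the same arithmetic gives 1467 − 167 = 1300 remaining observations, of which 20% is 260 and 80% is 1040.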
FAMD was used to find the most informative features. This method is suitable for datasets containing both numerical and categorical variables, as it simultaneously captures variance in continuous features and associations among categorical features. We selected fewer components for the final transformation, keeping only components with eigenvalues above 1%, ensuring that each retained component contributed meaningfully to explaining the overall variance. The original feature matrix was transformed into a lower-dimensional representation, preserving the essential structure of the data and facilitating subsequent modeling and analysis.
To determine the influence of extreme values on subsequent analyses, we employed an outlier detection procedure based on the interquartile range (IQR) for each FAMD component. Observations falling outside the range [Q1 − α IQR, Q3 + α IQR] were flagged as outliers and excluded from the dataset. The effect of this procedure will be discussed in the following section, where we demonstrate that polymer informatics models benefit from data diversity rather than aggressive statistical filtering.
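The IQR rule can be illustrated on a small matrix of component scores. The value of α is not stated in the text, so the conventional 1.5 is used here purely as an assumption; the quartile method ("inclusive") and the function names are also choices made for this sketch.

```python
from statistics import quantiles


def iqr_bounds(values, alpha=1.5):
    """Return [Q1 - alpha*IQR, Q3 + alpha*IQR]; alpha=1.5 is an assumed default."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - alpha * iqr, q3 + alpha * iqr


def flag_outliers(scores, alpha=1.5):
    """Flag a row if any of its component scores falls outside that
    component's IQR bounds."""
    n_components = len(scores[0])
    bounds = [
        iqr_bounds([row[j] for row in scores], alpha) for j in range(n_components)
    ]
    return [
        any(not (lo <= row[j] <= hi) for j, (lo, hi) in enumerate(bounds))
        for row in scores
    ]


# Toy scores on two components; the last row is extreme on the first one.
scores = [
    [0.0, 1.0], [1.0, 1.2], [2.0, 0.9],
    [3.0, 1.1], [4.0, 1.0], [50.0, 1.05],
]
flags = flag_outliers(scores)
kept = [row for row, is_outlier in zip(scores, flags) if not is_outlier]
```

In line with the workflow in Section 2.1, such bounds would be computed on the training set only, so that the test set plays no part in deciding which observations are removed.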
After removing the outliers, categorical features were encoded using one-hot encoding to enable compatibility with tree-based learning algorithms. Feature importance analysis was also performed using a Random Forest (RF)-based model to reduce dimensionality and improve interpretability. Features with positive importance values were retained for further modeling, while features with negligible contributions were excluded. This importance-driven feature selection reduced the complexity of the model and the risk of overfitting, while preserving the variables most relevant to polymer weight loss behavior.
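A minimal sketch of these two steps is given below. The category levels are learned from the training rows only and then applied to any other rows, and features are kept when their importance is strictly positive. The importance values shown are made-up placeholders standing in for the output of a fitted Random Forest, and all function names are hypothetical.

```python
def fit_levels(train_rows, categorical):
    """Collect the category levels seen in the training rows."""
    return {c: sorted({r[c] for r in train_rows}) for c in categorical}


def one_hot(rows, levels):
    """Replace each categorical column with one 0/1 column per known level.
    Levels not seen during fitting encode as all zeros."""
    encoded = []
    for r in rows:
        out = {k: v for k, v in r.items() if k not in levels}
        for column, column_levels in levels.items():
            for level in column_levels:
                out[f"{column}_{level}"] = 1 if r[column] == level else 0
        encoded.append(out)
    return encoded


def keep_positive(importances):
    """Keep the features whose importance is strictly positive."""
    return [name for name, score in importances.items() if score > 0]


train_rows = [
    {"Time_days": 30, "Degradation_Environment": "Soil"},
    {"Time_days": 60, "Degradation_Environment": "Marine/Aquatic"},
]
levels = fit_levels(train_rows, ["Degradation_Environment"])
encoded = one_hot(train_rows, levels)

# Placeholder importances (not results from the study).
importances = {
    "Time_days": 0.41,
    "Degradation_Environment_Soil": 0.22,
    "Degradation_Environment_Marine/Aquatic": 0.0,
}
selected = keep_positive(importances)
```

Fitting the levels on the training rows keeps the encoded column set fixed, so test and unseen rows are mapped onto exactly the columns the model was trained with.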
2.4. Model Explanation & Information Extraction
We were interested in predicting the biodegradation-induced weight or mass loss percentage (%) based on the degradation time point (in days) and polymer properties. However, weight loss did not increase monotonically over time (Figure 6). While the correlation matrix (see Figure 7) confirmed that degradation time and weight or mass loss were positively correlated (0.46), the relationship was moderate rather than strong, indicating the influence of additional factors and justifying the use of a multivariate modeling approach.
RF and XGBoost models were initially trained and evaluated using the complete dataset without outlier exclusion to evaluate the robustness of the proposed polymer informatics framework. The final dataset consisted of 43 selected descriptors, with 1040 samples in the training set and 260 samples in the test set.
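Model performance was evaluated using the coefficient of determination:
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
where $y_i$ is the true value, $\hat{y}_i$ is the predicted value, and $\bar{y}$ is the mean of the true values.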
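The quantities named above (true values, predicted values, and the mean of the true values) are the ingredients of the coefficient of determination reported in the results. A minimal sketch of its computation follows; the function name and the toy weight-loss values are hypothetical and are not taken from the dataset.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    ss_tot = sum((y - mean_true) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical weight-loss percentages (not data from the study).
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 39.0]
score = r2_score(y_true, y_pred)
print(round(score, 3))
```

Here the residual sum of squares is 18 and the total sum of squares is 500, giving a score of 0.964. A model that always predicts the mean of the true values scores 0, and a perfect model scores 1.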
4. Conclusions
This study presented an ML framework for predicting biodegradation-induced weight (mass) loss behavior of PHBV-based formulations with different additives as a function of degradation time, using a variety of heterogeneous descriptors that capture material composition, environmental conditions, and experimental parameters. A heterogeneous dataset, comprising repeated measurements from 17 independent studies, was systematically preprocessed and transformed to maintain its hierarchical structure and support data analysis at the instance level. This was achieved through a series of data operations, including categorical consolidation, conditional mean imputation based on degradation environment, and FAMD, which enabled robust and meaningful management of mixed numerical and categorical data.
Using the engineered dataset, both ML models (RF and XGBoost) showed strong predictive performance for polymer biodegradation-induced weight loss (%). Without outlier exclusion, the tuned RF and XGBoost models each achieved a test coefficient of determination of R² = 0.930, indicating that about 93% of the variance in the experimental weight-loss data could be explained by the selected descriptors. Cross-validation further supported model reliability, with mean CV R² values of 0.887 (SD = 0.039) for RF and 0.884 (SD = 0.040) for XGBoost.
When evaluated on an unseen dataset, XGBoost exhibited greater robustness and generalization capability than RF. Specifically, XGBoost achieved an unseen R² of 0.829, whereas RF achieved a lower unseen R² of 0.716. This indicates that XGBoost generalized better than RF and can serve as a tool for predicting polymer degradation, within the domain of applicability imposed by the structure of the dataset used in this study.
Additionally, feature importance analysis showed model-specific prioritization of degradation drivers. In RF, soil environment was the most important feature, followed by adjusted H/B ratio. In contrast, the XGBoost model emphasized degradation time, followed by soil environment. Overall, both models consistently identified temporal and environmental descriptors as the dominant factors.
Outlier analysis showed that statistically extreme data points often corresponded to meaningful polymer degradation mechanisms rather than experimental noise. Although outlier exclusion improved internal model stability, as evidenced by reduced cross-validation SD for both RF (from 0.039 to 0.015) and XGBoost (from 0.040 to 0.014), it decreased test-set performance (RF: R² = 0.911; XGBoost: R² = 0.897). In contrast, unseen-dataset performance improved modestly after outlier exclusion (RF: R² increased to 0.741; XGBoost to 0.856), indicating a trade-off between internal stability and predictive accuracy.
This study provides a reliable data-driven framework for predicting the degradation behavior of PHBV-based biodegradable materials and indicates that ML models can support the assessment of their long-term degradation and biodegradability. However, the predictive capability of the proposed models is inherently constrained by the variability and heterogeneity of the literature-derived data, and extrapolation to real-world environmental conditions should be approached with caution. Although the current ML framework predicts degradation-induced weight loss with high accuracy, weight loss reflects physical disintegration rather than complete mineralization. It measures the empirical loss of physical mass and is therefore a precursor to, but not a definitive measure of, complete conversion to CO₂ and biomass. Future versions of the framework should account for the possible formation of microplastics during disintegration. Aligning predictive frameworks with ultimate biodegradation standards (such as CO₂ evolution) remains essential to ensure that biodegradability claims meet rigorous environmental safety requirements and do not leave persistent micro-scale polymer fragments in the environment.
In future work, the data-driven polymer informatics framework can be applied to a larger and more heterogeneous dataset that includes additional physically meaningful descriptors such as polymer crystallinity. We also plan to expand the dataset with new experimental data, which will allow further validation of the predictive robustness of the QSAR models. Similarly, other ML models can be explored to improve predictive accuracy and generalization.