Article

A Data-Driven Framework for Predicting PHBV Biodegradation-Induced Weight Loss Based on Laboratory and Real-Environment Condition Tests

by Marianna I. Kotzabasaki, Leonidas Mindrinos, Nikolaos P. Sotiropoulos, Konstantina V. Filippou and Chrysanthos Maraveas *
Department of Natural Resources Development and Agricultural Engineering, Agricultural University of Athens, Iera Odos 75, 11855 Athens, Greece
* Author to whom correspondence should be addressed.
Polymers 2026, 18(7), 897; https://doi.org/10.3390/polym18070897
Submission received: 13 February 2026 / Revised: 2 April 2026 / Accepted: 3 April 2026 / Published: 7 April 2026
(This article belongs to the Special Issue Advances in Modeling and Simulations of Polymers)

Abstract

Polyhydroxyalkanoates (PHAs) have emerged as promising biodegradable polymers for sustainable applications, yet predicting their biodegradation behavior under different environmental conditions remains challenging. In this study, we propose a novel data-driven computational framework for predicting biodegradation-induced weight/mass loss in PHA-based materials. A comprehensive database of poly(3-hydroxybutyrate-co-3-hydroxyvalerate) (PHBV)-based formulations was manually curated by systematically collecting and harmonizing material descriptors, environmental parameters, and experimental biodegradation outcomes from laboratory- and large-scale studies conducted in soil, marine, freshwater, and compost environments. Multiple regression-based quantitative structure–activity relationship (QSAR) models were developed and rigorously validated, demonstrating high predictive performance and strong correlations between polymer structure, environmental conditions, and degradation behavior. “Exposure time”, “degradation environment”, and “hydroxybutyrate (HB) ratio” were identified as the most important features for weight loss. Finally, the predictive model was integrated into the Jaqpot computational platform, enabling open access and facilitating data-driven assessment and design of biodegradable polymer systems.

1. Introduction

The production of petroleum-derived plastics has grown exponentially over the last few decades. This production increased from 1.5 million tons in 1950 to 359 million tons in 2018, and correspondingly, the amount of plastic waste increased. Currently, households account for over 60% of the plastic waste from consumer sources. This mainly comprises single-use food packaging materials made from petroleum-derived plastics. Additionally, the consumption of plastic materials in both households and industries has exceeded the global production of plastics by up to 400 Mt/year [1]. As such, the manufacture of plastics from petroleum to meet current consumption demands poses several environmental concerns.
To mitigate the environmental pollution associated with plastics production, polyhydroxyalkanoates (PHAs), a class of bio-based, biodegradable materials, have emerged. Among short-chain-length PHAs (scl-PHAs), poly(3-hydroxybutyrate-co-3-hydroxyvalerate) (PHBV) has received considerable research interest due to its production by microbial fermentation processes [2], its bio-based nature, and its excellent combination of barrier properties and biodegradability, making it suitable for packaging and related applications. However, the use of PHBV in real-world applications is restricted by several factors: the properties and performance of the material; its varying rates of biodegradation in different environments; and the diversity of geometries, thicknesses, and formulations, especially when PHBV is compounded with plasticizers, fillers, fibers, and bioactive additives to suit specific application needs [3].
A key challenge in PHBV research and product development is that biodegradability is not a fixed parameter but the outcome of multiple factors, such as environmental conditions (i.e., temperature, moisture, oxygen availability, microbial community, nutrient content), polymer composition, molecular weight, crystallinity, chemical structure, presence of additives, material format (film, plaque, multilayer, composite), reduction potential, hydrophilicity, and breakdown products. However, the extent of the effects of some of these factors remains unclear. Biodegradation is influenced by the susceptibility of the polymer carbon backbone to microbial attack. Studies of PHBV biodegradation can be divided into two categories: those that report the degree of biodegradation, and those that report the mass or weight loss over the study duration. The latter is more prevalent in the literature because it is easier to measure; however, according to ASTM standards, mass loss by itself is not adequate to determine the degree of biodegradability of polymers. Furthermore, the long duration of biodegradation makes complete studies following ASTM standards very rare.
Consequently, laboratory studies often employ standardized test methods to ensure reproducibility and comparability of the results. For instance, ISO 14855 [4] measures the ultimate biodegradability of PHBV in controlled composting environments and determines the extent of biodegradability in terms of CO2 evolution, while ASTM D5338 [5] measures biodegradability in aerobic environments under controlled composting at thermophilic temperatures. These test methods are essential to the development of biodegradable plastics such as PHBV and in the standardization of their properties and performances. However, the test conditions, although standardized, can be quite different from the conditions in the “real environment” of use, and the results of the biodegradability of PHBV can differ from the standardized tests.
Indeed, the existing literature on PHBV documents the extent of variation across environments and formulations. For example, neat PHBV has been found to display low rates of mass loss in soil environments. On the other hand, the incorporation of natural fibers may significantly enhance the degradation rates. This is supported by the different water absorption rates and degradation mechanisms [6]. Similarly, composite materials containing lignocellulosic fillers such as wood flour have shown improved degradation rates when subjected to soil burial tests [7]. In aquatic environments, PHBV films containing phenolic additives such as catechin, ferulic acid, and vanillin were shown by respirometry analysis to degrade. This shows that the degradation kinetics may depend on the medium [8]. Even in composting tests conducted under nominally controlled conditions, disintegration and degradation rates can vary depending on the multilayer structure or the blend composition, underlining the complexity of extrapolating laboratory results to real-world scenarios [9].
Such a heterogeneous environment poses a practical challenge in lab testing, whereas field testing in a real-world environment is expensive and time-consuming. However, decisions regarding formulation, processing pathways, and end-of-life need to be made in early-stage testing. Hence, there is a significant interest in developing a data-driven approach that can learn relationships between material descriptors, additives, and environmental factors in predicting biodegradation. Recent work within the larger domain of polymer biodegradation has demonstrated the ability of machine learning (ML) methods to predict biodegradation endpoints such as percent biodegradation given input features [10]. This presents a clear path forward for faster testing and hypothesis generation. Beyond polymer-specific datasets, interpretable ML frameworks have been proposed for enhancing prediction and understanding of primary vs. ultimate biodegradation endpoints, highlighting the importance of both prediction and mechanistic insight [11]. Similar work has utilized rank-based learning methods to overcome issues and biases inherent within degradability data, where experimental conditions and reporting are highly varied [12]. ML techniques have also been used simultaneously to predict biodegradation behavior in aquatic environments using meta-analytic datasets that include both material properties and experimental conditions [13].
However, several issues specific to the predictive modeling of PHBV biodegradability remain to be investigated. First, PHBV is often blended with plasticizers such as citrate esters, natural fibers and fillers such as cellulose and wood flour, mineral fillers such as calcium carbonate, and functional additives such as antimicrobials and antioxidants, all of which can influence water sorption, crystallinity, interfacial microstructure, and microbial degradability [3]. Second, the results of PHBV biodegradability are expressed in diverse forms, such as ultimate biodegradation (CO2 evolution), disintegration, fragmentation, mass loss, changes in molecular weight, and surface/morphological changes [14]. Third, scaling up from small laboratory experiments to medium/large laboratory tests, and then to the field, is complicated by non-linear effects, e.g., oxygen transfer, temperature gradients, and moisture heterogeneities, which make it difficult to extrapolate the results from standardized conditions.
In this context, this study presents for the first time a computational framework that automates the process from literature-based data collection to the prediction of weight loss in scl-PHA biopolymers. The model accurately predicts physical disintegration across various experimental scales and environments, including soil, marine, freshwater, and compost settings. Initially, a comprehensive dataset on degradation-related weight and mass loss of PHBV-based formulations was manually curated by systematically collecting, assembling, and harmonizing data on material characteristics, environmental exposure conditions, and biodegradation behavior. The dataset included a combination of numerical and categorical variables. The mixed data types and potential nonlinear relationships motivated the use of advanced ML methods combined with appropriate preprocessing strategies. Subsequently, multiple regression-based QSAR models were developed and rigorously validated for predicting the biodegradation behavior of the investigated formulations. The models achieved high predictive accuracy, indicating robust structure-degradation relationships. However, given the complexity and heterogeneity of polymer structures and degradation processes, the model’s ability to generalize across different material–environment combinations should be further examined. Finally, the most important features governing biodegradation-induced weight loss were identified, and their effect on the predictions was examined. Degradation environment, exposure time, and hydroxybutyrate (HB) ratio were revealed as key factors in weight loss, showing that although weight loss increased with exposure duration and temperature, soil and polymer conditions (such as high crystallinity) hindered it significantly.
The final predictive model was implemented as a user-friendly web application on the Jaqpot computational platform (https://jaqpot.org/, accessed on 2 April 2026) and is openly accessible to the scientific community through the ANIPH virtual organization.

2. Materials and Methods

2.1. Workflow of Model Development

The overall methodological workflow adopted for this study is presented below in Figure 1 and involves distinct phases of data curation, preprocessing, and data-informed modeling. The raw data on weight loss were collected manually from the literature and underwent an initial curation step, including consistency checking, quality control, and time-series alignment. The next phase involved data preprocessing, including feature engineering, normalization, and handling of missing values, to yield the processed dataset. The processed dataset was split into training and test sets, followed by outlier detection on the training set. When outliers were detected, factor analysis of mixed data (FAMD) [15] and interquartile range (IQR) analysis [16] were performed to reduce their effects. One-hot encoding of the categorical variables and feature importance analysis were then performed. Finally, the regression models were developed using the ensemble learning algorithms Random Forest [17] and Extreme Gradient Boosting (XGBoost v. 2.1.3) [18] to predict the weight or mass loss of PHBV-based formulations under different environmental conditions and test scales. Both models are well-suited for polymer informatics applications due to their ability to capture nonlinear relationships and interactions among descriptors. Model performance was evaluated using the coefficient of determination (R²), root mean squared error (RMSE), and mean absolute error (MAE) for the training, test, and validation sets. To further enhance performance, grid search–based hyperparameter optimization was conducted using cross-validation on the training data. Optimal hyperparameters were selected by minimizing prediction error, and the final models were retrained before being evaluated on the test set.
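As a minimal illustration of the three evaluation metrics named above, the following sketch computes R², RMSE, and MAE from their standard definitions. The function name `regression_metrics` and the numbers are hypothetical and are not taken from the study's dataset or code.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (R2, RMSE, MAE) for two equal-length sequences of numbers."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual and total sums of squares.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return r2, rmse, mae

# Hypothetical weight-loss values (%), for illustration only.
observed = [0.0, 10.0, 20.0, 30.0]
predicted = [2.0, 8.0, 22.0, 28.0]
r2, rmse, mae = regression_metrics(observed, predicted)
print(f"R2={r2:.3f}  RMSE={rmse:.2f}  MAE={mae:.2f}")
```

RMSE and MAE are expressed in the units of the target (here, percentage points of weight loss), whereas R² is dimensionless.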

2.2. Biodegradation-Induced Weight/Mass Loss Database Construction for PHBV-Based Formulations

The first stage in constructing the biodegradation-induced weight or mass loss database of PHBV-based formulations containing different additives or building blocks was data collection. The quality of the data collected was critical to the overall study. Data were obtained through a comprehensive literature-mining process that included peer-reviewed journal articles, theses, and published experimental studies addressing PHBV degradation under diverse environmental conditions. The selected studies investigated the degradation behavior of the studied formulations in soil, compost, freshwater, and marine environments, considering medium- and large-scale laboratory tests as well as real environmental exposure conditions. Emphasis was placed on identifying studies that reported quantitative degradation metrics based on “weight or mass loss”, used as an indicator of biodegradation-induced disintegration, quantifying the physical loss of material mass over time. These studies also revealed associated environmental parameters and material characteristics of the investigated formulations.

2.2.1. Literature Search Strategy

In accordance with the PRISMA 2020 guidelines [19,20], a literature search was carried out in the Scopus database [21]. The search identified relevant literature on the properties associated with biodegradation-induced weight or mass loss of PHBV-based formulations, including natural and synthetic additives such as plasticizers, fillers, stabilizers, antioxidants, bio-based compounds, and other components.
The literature search used a combination of Boolean operators, with a focus on the Title, Abstract, and Keywords fields (TITLE-ABS-KEY). Three concept blocks were used with the AND operator, and each block was used to identify literature on the following: (1) the copolymer PHBV as the biodegradable material, (2) weight/mass loss properties associated with the biodegradation of the material, (3) a set of additives/components used with the PHBV-based formulations to enhance their biodegradability.
The final search query in the Scopus database was structured as follows:
TITLE-ABS-KEY ((“PHBV” OR “polyhydroxybutyrate-co-valerate” OR “poly(3-hydroxybutyrate-co-3-hydroxyvalerate)”) AND (“biodegradation” OR “decomposition” OR “microbial degradation”) AND (“weight loss” OR “mass loss” OR “weight reduction” OR “mass reduction” OR “disintegration” OR “fragmentation” OR “erosion” OR “biodegradation rate” OR “percentage mass loss” OR “weight loss %”) AND (“lignin” OR “lignins” OR “citric acid” OR “citric ester” OR “acetyl tributyl citrate” OR “ATBC” OR “triacetin” OR “triethyl citrate” OR “TEC” OR “epoxidized soybean oil” OR “epoxidized cottonseed oil” OR “epoxidized natural rubber” OR “ENR” OR “soybean oil” OR “Vish-E filler” OR “starch” OR “starch-based fillers” OR “cornstarch” OR “alginate” OR “alginic acid” OR “pure cellulose” OR “cellulose fibers” OR “wood flour” OR “woodflour” OR “WF” OR “wheat straw fibre” OR “wheat straw fiber” OR “lignocellulosic” OR “miscanthus” OR “olive pomace” OR “propionylated abaca fiber” OR “catechin” OR “ferulic acid” OR “vanillin” OR “polylactic acid” OR “PLA” OR “poly(ε-caprolactone)” OR “PCL” OR “polybutylene adipate-co-terephthalate” OR “PBAT” OR “polyethylene oxide” OR “PEO” OR “flax fibers” OR “flax fibres” OR “calcium carbonate” OR “CaCO3” OR “halloysite” OR “modified halloysite” OR “lignin-coated cellulose nanocrystals” OR “boron nitride” OR “quercetin” OR “DDGS” OR “distillers dried grains with solubles” OR “posidonia oceanica” OR “gallic acid” OR “ammonium quaternary salts” OR “castor oil” OR “limonene” OR “thymol” OR “oregano essential oil” OR “sorbitol” OR “maltodextrin” OR “dicumyl peroxide” OR “Licowax”)).
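The structure of this query (quoted terms joined by OR within each parenthesized group, and the groups joined by AND inside a single TITLE-ABS-KEY field) can be sketched programmatically. The helper names `or_block` and `build_query` are hypothetical, and the term lists are deliberately abbreviated; this is not the tool used by the authors.

```python
def or_block(terms):
    """Join quoted search terms with OR and wrap the group in parentheses."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

def build_query(*blocks):
    """AND together several OR-groups inside one TITLE-ABS-KEY field."""
    return "TITLE-ABS-KEY (" + " AND ".join(or_block(b) for b in blocks) + ")"

# Abbreviated term lists; the complete lists are those of the query above.
polymer = ["PHBV", "poly(3-hydroxybutyrate-co-3-hydroxyvalerate)"]
process = ["biodegradation", "microbial degradation"]
outcome = ["weight loss", "mass loss"]
additive = ["lignin", "wood flour"]

query = build_query(polymer, process, outcome, additive)
print(query)
```

Keeping each term list as a separate Python list makes it straightforward to extend one group (for example, adding a new additive) without touching the others.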

2.2.2. Study Selection and Screening

The records identified through the database search were further exported for evaluation. Titles and abstracts were initially screened to assess their relevance in relation to the biodegradation of PHBV-based formulations containing diverse additives under laboratory or natural environmental conditions. No restrictions were placed on the publication year to ensure a comprehensive historical record was obtained. Additionally, only articles published in English were included. During the screening phase, review articles, conference papers, editorials, notes, and book chapters were excluded. Full-text articles were subsequently assessed in detail according to pre-defined inclusion and exclusion criteria.
The inclusion criteria comprised studies that:
(i) Investigated PHBV or PHBV-based composites;
(ii) Examined biodegradation, microbial degradation, or disintegration processes under laboratory or real environment conditions;
(iii) Reported quantitative, time-resolved degradation metrics based on weight or mass loss (%);
(iv) Evaluated the incorporation of additives such as fillers, plasticizers, nucleating agents, antioxidants, bio-derived materials, or other functional compounds within the polymer matrix.
The exclusion criteria eliminated studies that examined polymers other than PHBV; studied degradation mechanisms unrelated to biodegradation or microbial activity; employed solely gas evolution-based mineralization methods (e.g., measurement of evolved CO2) without measuring mass or weight loss; or examined compounds that were not incorporated into the polymer matrix.
Initially, 56 publications from 1991 to March 2025 were identified using the advanced search tool of the Scopus database with the predefined search query. After removing duplicates and an initial screening, a full-text evaluation was conducted for all publications, following specific eligibility criteria. Only publications relevant to the specific scope of this study were considered, namely those on biodegradation-induced weight or mass loss of PHBV-based formulations containing specific additives, conducted under soil, compost, freshwater, marine, or other environmentally relevant conditions, including medium- to large-scale experiments conducted in the laboratory or under natural environmental conditions. Further, only publications containing primary experimental data were eligible. Although excluding non-English publications may have omitted relevant findings from regions with substantial research activity (e.g., Japan and China), this criterion was applied because English is the primary language of international scientific communication. Further, only full-text publications were considered to ensure access to all necessary details on the methodologies and experiments used. A total of 17 peer-reviewed publications [6,7,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36] formed the final database used for this study’s analysis and prediction of biodegradation-induced weight or mass loss for PHBV-based formulations containing different additives and building blocks. The study selection approach is illustrated in the PRISMA 2020 flow diagram [19] depicted below in Figure 2.

2.2.3. Data Extraction and Synthesis

For each eligible study, information related to PHBV formulation parameters and physicochemical properties was systematically extracted, including polymer composition, molecular and structural characteristics, and the type and concentration of additives or bio-derived compounds incorporated into the matrix. In addition, details regarding the biodegradation conditions, such as exposure environment, test duration, specimen size or geometry, and experimental setup, were collected with the reported quantitative degradation outcomes based on weight or mass loss (%). These data were used to conduct qualitative analysis to establish trends in biodegradation-induced disintegration behavior associated with different additive categories and environmental conditions.
All available published data, including numeric values, polymer names, additive descriptors, and physicochemical property data, were organized in tabular format (Microsoft Excel files) to facilitate transfer and subsequent analysis. When property data were only available in graphical form, numerical values were manually extracted and curated from figures, plots, and digitized tables reported in the main text using the Plotdigitizer software (version 3) [37].

2.2.4. Data Curation

Following data extraction, curated databases were constructed in a structured spreadsheet format (Microsoft Excel) to systematically capture the multivariate aspects of biodegradation-induced weight or mass loss in PHBV-based systems. The extracted data from all eligible studies were integrated into a unified database composed of five worksheets, each corresponding to a distinct data category, as summarized below in Table 1.
This structured organization ensured that uniform representation of material descriptors, additive properties, environmental exposure conditions, sample characteristics, and quantified degradation results was achieved.
The worksheets were developed to ensure data harmonization and cross-referencing of heterogeneous data collected from different experimental procedures and data representation styles. In addition, a worksheet was created to record all abbreviations, symbols, and definitions used for data variables to ensure proper understanding during subsequent data processing and modeling.
Each study was assigned a unique identification number (study ID), and all corresponding data were subsequently entered as individual rows in the database. In situations where a study reported more than one physicochemical property or experimental condition, distinct data instances were created to represent individual conditions. However, all these data instances, which represented varying experimental conditions, still carried the study ID of their respective original study. In total, 17 distinct study IDs were included, corresponding to 226 independent instances in “Worksheet_1_Materials_features” and 133 independent instances in “Worksheet_2_Environmental_features”, “Worksheet_3_Biodegradation_features”, and “Worksheet_4_Time_points_features”, collectively yielding 1546 time-resolved weight or mass loss observations. This structure enabled both within-study and between-study analyses, allowing variability to be examined at the instance level as well. In Table 2, we summarize the distribution of instances and data pairs (time point, weight or mass loss percentage) per study. Note that the study IDs were non-sequential as they were extracted from a larger data library.
Tables A1–A4 in Appendix A provide a comprehensive summary of all input (feature) and output (target) variables used to generate the PHBV-based formulations’ biodegradation-induced weight or mass loss data library. Table A1 lists 46 material descriptors (23 categorical and 23 numerical) associated with PHBV-based formulations’ composition, additives, molecular and physicochemical properties, morphological characteristics, and surface characteristics. Table A2 outlines 29 environmental features categorized into 17 numerical and 9 categorical variables and describing a range of biochemical environment variables, physical medium, temperature, pH, moisture, salinity, nutrient composition, solids composition, and standardized testing protocols.
Table A3 details the biodegradation-induced weight or mass loss response variables included, and Table A4 summarizes the degradation time-series output variables. Each variable is described based on its definition, measurement unit, range of values, and type (categorical or numerical). The “Worksheet_3_Biodegradation_features” and “Worksheet_4_Time_points_features” datasets were organized in a hierarchical format comprising multiple study- and instance-level observations, and representing repeated measurements recorded under the same experimental conditions, see, for example, the first five rows in Table 3 and Table 4, respectively. Note that not all instances have the same number of data points.
A comprehensive raw data library for PHBV biodegradation-induced weight or mass loss, including all variables across the four categories, is available in the AUA Zenodo repository [38].
Because the data were extracted manually from heterogeneous literature sources, some additional noise may be present in the dataset. This was mitigated through iterative quality checks using descriptive statistics. Model predictions should therefore be interpreted in terms of long-term degradation trends rather than short-term behavior, for which high numerical precision would be required.

2.2.5. Data Pre-Processing

To facilitate statistical analysis, the dataset “Worksheet_3_Biodegradation_features” was transformed into a long format. As a result, each weight loss value was melted into a separate row, such that every row represented a single weight loss observation associated with its corresponding “Study_id”, “Instance”, and “Sample_name”. This restructuring preserved the hierarchical relationships in the data while enabling instance-level analysis of weight loss outcomes.
The resulting dataset was combined with the corresponding melted “Worksheet_4_Time_points_features” dataset. This integration aligned each weight loss observation with its associated time point, while retaining the study-, instance-, and sample-level identifiers. The final merged “t_y_values” dataset comprised 1546 rows and 5 columns, with each row representing a single degradation time–weight loss % observation pair.
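The melt-and-merge step described above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the wide-row layout (a `values` list per instance), the `index` key, and the column name `Time_days` are hypothetical, while “Study_id”, “Instance”, “Sample_name”, and “Weight_loss_%” follow the names used in the text.

```python
ID_KEYS = ("Study_id", "Instance", "Sample_name")

def melt(rows, value_name):
    """Turn each wide row into one long row per non-missing measurement."""
    long_rows = []
    for row in rows:
        ids = {k: row[k] for k in ID_KEYS}
        for index, value in enumerate(row["values"]):
            if value is not None:
                long_rows.append({**ids, "index": index, value_name: value})
    return long_rows

def row_key(row):
    """Identifiers plus measurement position, used to align the two tables."""
    return tuple(row[k] for k in ID_KEYS) + (row["index"],)

# Hypothetical wide rows: one list of repeated measurements per instance.
weight_wide = [
    {"Study_id": 1, "Instance": 1, "Sample_name": "A", "values": [0.0, 5.2, 11.8]},
    {"Study_id": 1, "Instance": 2, "Sample_name": "B", "values": [0.0, 3.1, None]},
]
time_wide = [
    {"Study_id": 1, "Instance": 1, "Sample_name": "A", "values": [0, 30, 60]},
    {"Study_id": 1, "Instance": 2, "Sample_name": "B", "values": [0, 30, None]},
]

weight_long = melt(weight_wide, "Weight_loss_%")
time_lookup = {row_key(r): r["Time_days"] for r in melt(time_wide, "Time_days")}

# One row per (time point, weight loss) pair, with five columns in total.
t_y_values = [
    {**{k: r[k] for k in ID_KEYS},
     "Time_days": time_lookup[row_key(r)],
     "Weight_loss_%": r["Weight_loss_%"]}
    for r in weight_long
]
print(len(t_y_values), "observation pairs")
```

Because instances do not all have the same number of data points, missing positions are skipped during melting rather than padded.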
The melted and merged “t_y_values” dataset was subsequently integrated with the “Properties” dataset, which arose from merging the “Worksheet_1_Materials_features” and “Worksheet_2_Environmental_features” datasets. This final merge enriched each observation with the corresponding physicochemical and environmental properties, resulting in a comprehensive dataset comprising 20 columns, as summarized in Table 5 below. Features with more than 80% missing values were removed.
The final merged dataset consisted of two identifier columns (“Study_id” and “Instance”), one target variable, “Weight_loss_%”, and 17 feature columns. Of the feature variables, seven were numerical, and ten were categorical, capturing a diverse set of formulation, material, and experimental characteristics. This combination of variable types enabled comprehensive modeling of weight loss behavior.
To reduce sparsity and improve interpretability, categorical features with many small or semantically similar categories were consolidated. For instance, the “Degradation_Environment” feature originally contained 12 categories, some with very few observations (e.g., “vermicompost” with 4 instances). These categories were mapped into four broader, domain-relevant groups: Soil, Marine/Aquatic, Compost/Organic Fertilizer, and Laboratory/Mineral Media. This approach preserved experimental context while simplifying the dataset to facilitate visualization, modeling, and interpretation. Table 6 illustrates the mapping of original to merged categories for “Degradation_Environment”. The same procedure was applied to the features: “Degradation_mechanism”, “Sample_shape/Morphology”, “PHA_degrading_microbes”, and “Additive_type_1”. In Table 7, a summary of all features affected by this categorical dimensional reduction is detailed.
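A consolidation of this kind is typically expressed as a lookup table from raw labels to merged groups. In the sketch below the four group names come from the text, but the raw labels and their assignments are hypothetical placeholders (only “vermicompost” is named in the text, and placing it in the compost group is an assumption); the actual mapping is the one reported in Table 6.

```python
# Hypothetical raw labels mapped to the four merged groups named in the text.
ENVIRONMENT_GROUPS = {
    "garden soil": "Soil",
    "seawater": "Marine/Aquatic",
    "river water": "Marine/Aquatic",
    "vermicompost": "Compost/Organic Fertilizer",  # assumed assignment
    "mineral salt medium": "Laboratory/Mineral Media",
}

def consolidate(label, mapping=ENVIRONMENT_GROUPS):
    """Map a raw label to its merged group; fail loudly on unmapped labels."""
    key = label.strip().lower()
    if key not in mapping:
        raise KeyError(f"No merged category defined for {label!r}")
    return mapping[key]

raw = ["Seawater", "vermicompost", "Garden soil", "river water"]
merged = [consolidate(x) for x in raw]
print(merged)
```

Raising an error on unmapped labels, rather than silently passing them through, helps ensure that no rare category is left unconsolidated.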
Following the categorical consolidation, missing values (18.3%) in the numerical feature “T_deg” (degradation temperature) were addressed using a targeted imputation strategy. Since degradation temperature was primarily determined by the degradation environment, missing “T_deg” values were filled using the mean temperature of comparable conditions from the same study id. For example, missing values of a given study id in the “Marine/Aquatic” environment were imputed using the mean temperature from all other instances in the same environment. This approach preserved the environmental context of each observation while ensuring that the dataset remained complete for subsequent analysis.
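The targeted imputation described above amounts to a group-wise mean fill, where the group is defined by study ID and degradation environment. The following sketch illustrates the idea; the function name `impute_t_deg` and the temperature values are hypothetical, and rows without any known value in their group are left missing.

```python
from collections import defaultdict

def impute_t_deg(rows):
    """Fill missing T_deg with the mean of known values from the same
    study ID and degradation environment; leave it missing otherwise."""
    known = defaultdict(list)
    for r in rows:
        if r["T_deg"] is not None:
            known[(r["Study_id"], r["Degradation_Environment"])].append(r["T_deg"])
    filled = []
    for r in rows:
        r = dict(r)  # copy so the input rows are not modified
        values = known.get((r["Study_id"], r["Degradation_Environment"]))
        if r["T_deg"] is None and values:
            r["T_deg"] = sum(values) / len(values)
        filled.append(r)
    return filled

# Hypothetical rows, for illustration only.
rows = [
    {"Study_id": 1, "Degradation_Environment": "Marine/Aquatic", "T_deg": 20.0},
    {"Study_id": 1, "Degradation_Environment": "Marine/Aquatic", "T_deg": 24.0},
    {"Study_id": 1, "Degradation_Environment": "Marine/Aquatic", "T_deg": None},
    {"Study_id": 2, "Degradation_Environment": "Soil", "T_deg": None},
]
imputed = impute_t_deg(rows)
print([r["T_deg"] for r in imputed])
```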
Finally, rows containing missing values in any of the feature columns were excluded, resulting in a final dataset of 1467 complete observations. The distributions of all features in the final dataset are summarized in two figures: Figure 3 shows the distributions of numerical features, while Figure 4 presents the distributions of categorical features, highlighting the range and skewness of each variable.

2.3. Biodegradation QSAR Model Development and Validation

The dataset was partitioned into training and testing subsets to enable robust and unbiased evaluation of model performance. The split was performed before dimensionality reduction via factor analysis of mixed data (FAMD), outlier removal, feature encoding, and importance-based feature selection, so that the test set remained unseen during these steps. The train-test split was performed at the instance level, allowing time and weight loss data from the same instance to appear in both sets.
To rigorously evaluate model performance on held-out data, entire instances were excluded from training and testing. From study IDs with more than two instances, one instance with more than five data points was randomly selected and reserved in a validation set. This approach enabled assessment of the model’s ability to predict polymer weight or mass loss behavior across different properties. Figure 5 shows a representative example for Study id = 24, which consists of 4 instances. Nevertheless, as the held-out instances were derived from the same experimental studies, some correlation related to shared properties and measurement conditions may remain, and the validation results should therefore be interpreted as an initial indication of model generalization rather than a fully independent assessment.
The unseen (validation) set comprised 167 observations (11.4%). Of the remaining observations, 20% (260) were held out as the test set, and the other 80% (1040) were used for training.
FAMD was used to find the most informative features. This method is suitable for datasets containing both numerical and categorical variables, as it simultaneously captures variance in continuous features and associations among categorical features. We retained a reduced number of components for the final transformation, keeping only components with eigenvalues above 1%, so that each retained component contributed meaningfully to explaining the overall variance. The original feature matrix was transformed into a lower-dimensional representation, preserving the essential structure of the data and facilitating subsequent modeling and analysis.
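The hold-out rule and the resulting set sizes can be illustrated with a short sketch. The selection function follows the rule stated above (one randomly chosen instance with more than five data points from each study ID that has more than two instances); the function names, the seed, and the per-instance point counts are hypothetical, whereas the totals 1467 and 167 are the figures reported in the text.

```python
import random

def select_validation_instances(points_per_instance, seed=0):
    """points_per_instance maps (study_id, instance) -> number of data points.
    Returns the set of (study_id, instance) pairs reserved for validation."""
    rng = random.Random(seed)
    by_study = {}
    for (study, instance), n_points in points_per_instance.items():
        by_study.setdefault(study, []).append((instance, n_points))
    held_out = set()
    for study, instances in sorted(by_study.items()):
        if len(instances) > 2:  # study IDs with more than two instances
            eligible = sorted(i for i, n in instances if n > 5)
            if eligible:
                held_out.add((study, rng.choice(eligible)))
    return held_out

def split_counts(n_total, n_validation, test_fraction=0.2):
    """Return (validation, test, training) sizes after the hold-out."""
    remaining = n_total - n_validation
    n_test = round(remaining * test_fraction)
    return n_validation, n_test, remaining - n_test

# Hypothetical point counts: study 24 has four instances, study 7 has two.
example = {(24, 1): 12, (24, 2): 12, (24, 3): 4, (24, 4): 9, (7, 1): 10, (7, 2): 10}
held_out = select_validation_instances(example)
counts = split_counts(1467, 167)
print(held_out, counts)
```

With the reported totals, the arithmetic gives 1467 − 167 = 1300 remaining observations, of which 260 form the test set and 1040 the training set.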
To determine the influence of extreme values on subsequent analyses, we employed an outlier detection procedure based on the interquartile range (IQR) for each FAMD component. Observations falling outside the range [Q1 − α IQR, Q3 + α IQR], where Q1 and Q3 are the first and third quartiles and α is a scaling factor, were flagged as outliers and excluded from the dataset. The effect of this procedure is discussed in Section 3.5, where we demonstrate that polymer informatics models benefit from data diversity rather than aggressive statistical filtering.
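A minimal sketch of the IQR rule applied per component is given below. The value α = 1.5 is a common default and is an assumption here, since the text does not state the value used; the quartile convention (`method="inclusive"`) and the toy component scores are likewise illustrative.

```python
from statistics import quantiles

def iqr_bounds(values, alpha=1.5):
    """Return (Q1 - alpha*IQR, Q3 + alpha*IQR) for one component."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - alpha * iqr, q3 + alpha * iqr

def flag_outliers(components, alpha=1.5):
    """components: list of rows, each a list of component scores.
    A row is flagged if any of its scores lies outside that component's bounds."""
    n_comp = len(components[0])
    bounds = [iqr_bounds([row[j] for row in components], alpha) for j in range(n_comp)]
    return [
        any(not (lo <= row[j] <= hi) for j, (lo, hi) in enumerate(bounds))
        for row in components
    ]

# Hypothetical scores on two components; the fifth row is extreme on component 0.
scores = [[0.1, 1.0], [0.2, 1.1], [0.0, 0.9], [0.3, 1.2], [5.0, 1.0], [0.15, 1.05]]
flags = flag_outliers(scores)
```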
After removing the outliers, categorical features were further encoded using one-hot encoding to enable compatibility with tree-based learning algorithms. Feature importance analysis was also performed using a Random Forest (RF)-based model to reduce dimensionality and improve interpretability. Features exhibiting positive importance values were retained for further modeling, while features with negligible contributions were excluded. This importance-driven feature selection reduced the complexity of the model and the risk of overfitting, while preserving the variables most relevant to polymer weight loss behavior.

2.4. Model Explanation & Information Extraction

We were interested in predicting the biodegradation-induced weight or mass loss percentage (%) based on the degradation time point (in days) and polymer properties. However, it was important to highlight that weight loss did not increase monotonically over time (Figure 6). While the correlation matrix (see Figure 7) confirmed that degradation time and weight or mass loss were positively correlated (0.46), the relationship was moderate rather than strong, indicating the influence of additional factors and justifying the use of a multivariate modeling approach.
The complete processed dataset used for analysis and modelling is provided in https://github.com/FSL-AUA/Weight-loss-model.git, accessed on 2 April 2026.
RF and XGBoost models were initially trained and evaluated using the complete dataset without outlier exclusion to assess the robustness of the proposed polymer informatics framework. The final dataset consisted of 43 selected descriptors, with 1040 samples in the training set and 260 samples in the test set.
Model performance was evaluated using:
  • R2, which measures the proportion of variance in the observed data that is explained by the model:
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
where $y_i$ is the true value, $\hat{y}_i$ is the predicted value, and $\bar{y}$ is the mean of the true values.
  • MAE, defined by:
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$
  • RMSE, which quantifies the average magnitude of prediction errors:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
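For reference, the three metrics can be computed directly from their definitions. The sketch below uses plain Python and hypothetical weight loss values; it is an illustration of the formulas rather than the evaluation code used in the study.

```python
from math import sqrt

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error."""
    return sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical observed and predicted weight loss (%).
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 39.0]
r2 = r2_score(y_true, y_pred)      # 1 - 18/500 = 0.964
mae_value = mae(y_true, y_pred)    # (2 + 2 + 3 + 1) / 4 = 2.0
rmse_value = rmse(y_true, y_pred)  # sqrt(18 / 4)
```

Because RMSE squares the residuals before averaging, it is never smaller than MAE and is more sensitive to a few large errors.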

3. Results & Discussion

3.1. Dataset Construction

Literature data mining of 17 peer-reviewed research studies was undertaken to manually curate a structured dataset on the biodegradation-induced weight or mass loss of PHBV biopolymers formulated with a range of additives and building blocks. The curated raw dataset used for the data pre-processing step is publicly available in the associated data repository (AUA Zenodo repository) [38]. An overview of the compositional, molecular, physicochemical, environmental descriptors, and biodegradation outcomes of the PHBV-based formulations assessed across the multiple cited studies is presented in Table A1, Table A2, Table A3 and Table A4 in Appendix A. The biodegradation endpoint was precisely defined in line with standardized biodegradability assessment practices and based on biodegradation-induced disintegration measured as time-series data and expressed as “weight/mass loss percentage”. Degradation performance metrics were obtained directly from published weight/mass loss percentage–time curves, extracted from the literature using the digitization of graphical data.

3.2. Performance of QSAR-Based Degradation Models

RF and XGBoost models effectively predicted polymer weight loss, achieving test coefficient of determination (R2) values above 0.92. The predictive performances of the models are summarized in Table 8. Both models achieved high accuracy on the training dataset, with R2 exceeding 0.96. While test R2 values remained strong, the corresponding RMSE and MAE indicated moderate prediction errors on individual observations. These error metrics reflected the inherent variability in polymer degradation data and showed that, although the models effectively captured overall trends, some deviations remained at specific time points or under less common environmental conditions. Overall, the relatively low MAE compared to the typical range of weight loss values indicated that predictions were generally reliable for practical applications, especially in assessing long-term degradation behavior.
Grid search–based hyperparameter optimization was performed to enhance the predictive performance of both models. This procedure systematically evaluated a predefined range of hyperparameter combinations, such as the number of trees, maximum tree depth, learning rate, and minimum samples per leaf, using cross-validation on the training dataset. By assessing model performance across these configurations, the grid search identified the combination of hyperparameters that minimized prediction error and improved generalization to unseen data. The optimal hyperparameters selected for each model, which guided the final model training, are summarized in Table 9.
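The mechanics of grid search with k-fold cross-validation can be illustrated with a deliberately small stand-in. The sketch below tunes the single hyperparameter `k` of a toy one-dimensional nearest-neighbour regressor; this model, the grid, and the synthetic data are assumptions chosen to keep the example dependency-free, and they replace the RF and XGBoost models and the hyperparameters of Table 9 used in the study.

```python
from itertools import product
from statistics import mean

def knn_predict(train, x, k):
    """Average target of the k training points nearest to x (toy 1-D regressor)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return mean(y for _, y in nearest)

def cv_rmse(data, k, n_folds=5):
    """Mean RMSE over n_folds folds; fold f validates on indices i with i % n_folds == f."""
    scores = []
    for f in range(n_folds):
        val = data[f::n_folds]
        train = [p for i, p in enumerate(data) if i % n_folds != f]
        sq_err = [(y - knn_predict(train, x, k)) ** 2 for x, y in val]
        scores.append(mean(sq_err) ** 0.5)
    return mean(scores)

def grid_search(data, grid, n_folds=5):
    """Evaluate every hyperparameter combination and return the one with the lowest CV RMSE."""
    results = {}
    for combo in product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        results[combo] = cv_rmse(data, n_folds=n_folds, **params)
    best = min(results, key=results.get)
    return dict(zip(grid.keys(), best)), results

# Synthetic data: a linear trend with a small alternating perturbation.
data = [(float(x), 2.0 * x + (1.0 if x % 2 else -1.0)) for x in range(30)]
grid = {"k": [1, 3, 5]}
best_params, results = grid_search(data, grid)
```

The same loop structure applies to larger grids: the cost grows with the product of the grid sizes times the number of folds, which is why the grid is usually restricted to a predefined range.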
The tuned models were evaluated using cross-validation (5-fold) and test datasets, as summarized in Table 10. Both tuned models showed improved generalization compared to the baseline configurations, with reduced overfitting as evidenced by the smaller discrepancy between training and test R2 values.
The comparisons of the predicted and the true weight or mass loss percentages for the tuned models are presented in Figure 8 and Figure 9 for the RF and the XGBoost model, respectively.
Figure 10 presents the convergence of the RMSE for both models. For the RF model (Figure 10a), training RMSE decreased with increasing number of trees, while test RMSE quickly stabilized, indicating stable generalization behavior. For the XGBoost model (Figure 10b), the evolution of training and test RMSE with increasing boosting rounds exhibited early convergence of test performance.
To evaluate whether the imputation of the degradation temperature (T_deg) introduced bias into the model, a sensitivity analysis was conducted by retraining the models without this feature while keeping the dataset and optimized hyperparameters unchanged. The resulting performance metrics were compared with those obtained using the imputed dataset (Table 11). Only minor decreases in predictive performance were observed (approximately 0.01 in test R2 for both models), indicating that the targeted imputation strategy did not substantially bias the models toward average environmental conditions.
The tuned models were evaluated on an independent unseen dataset (validation set) not used during training or hyperparameter optimization to further assess model robustness, see Figure 5. The resulting R2 scores are reported in Table 12. The XGBoost model retained strong predictive performance on the unseen dataset, whereas the RF model exhibited a substantial decrease in R2. This result indicated that XGBoost provided superior robustness to dataset shifts and experimental variability when outliers were not excluded. The best-performing estimated time series from each model are presented in Figure 11, along with their corresponding R2 scores. Overall, five out of six cases were estimated accurately. For the remaining case that performed poorly, the predictions were, however, accurate for longer degradation times. This was particularly important because accurately estimating long degradation times was critical to correctly classifying a polymer as biodegradable or not. These results further highlight that the model performs best in capturing general degradation trends, whereas precise quantitative predictions may be less reliable for formulations that are underrepresented in the training data or for shorter degradation times.

3.3. SHAP-Based Interpretation of Degradation Models

SHapley Additive exPlanations (SHAP) values were applied to quantify feature importance by measuring the contribution of each input variable to the model’s predictions. The SHAP summary plot provided a global overview of how different features impacted an ML model’s output across the entire dataset. It should be noted that SHAP values describe the contribution of variables to model predictions and therefore reflect statistical associations within the dataset rather than causal relationships governing biodegradation mechanisms. The SHAP summary plot for the RF model (Figure 12) indicated that biodegradation-induced weight or mass loss was primarily governed by the “degradation environment”, and soil environment emerged as the most influential feature. Observations corresponding to soil conditions predominantly exhibited negative SHAP values, indicating reduced predicted weight loss relative to other environments. This reflected the restrictive nature of soil systems, where limited oxygen diffusion, heterogeneous moisture distribution, and variable microbial accessibility constrained degradation processes. Beyond environmental effects, polymer composition also played a critical role, with higher HB content consistently associated with lower predicted weight loss, likely due to increased crystallinity and reduced enzymatic accessibility. Degradation time exhibited a clear positive contribution, although its impact was secondary to environmental constraints in the RF model. Additional features, including “experimental scale”, “degradation temperature”, “additive content” and “microbial dominance” revealed moderate contributions, whereas “degradation mechanisms” exhibited the lowest contribution.
In contrast, the feature importance plot of the SHAP analysis of the XGBoost model (Figure 13) indicated that the primary factor affecting the weight/mass loss was the “degradation time”. Short degradation times were associated with strongly negative SHAP values, while longer exposure periods increased the predicted mass loss, reflecting cumulative degradation processes and potential acceleration phases. Although the “degradation environment” remained a highly important factor, it ranked lower than degradation time in the XGBoost model. Other variables, including “temperature”, “additive concentration”, and “polymer composition”, exhibited more pronounced effects compared to the RF model, indicating differences in feature utilization between the two modeling approaches.
The difference in top-ranked features between the two models was attributed to the distinct ways in which RF and XGBoost handle feature interactions and data structure. RF, relying on bagged decision trees, tends to highlight features that consistently reduce impurity across many trees, favoring variables with strong marginal effects such as categorical environment types. XGBoost, on the other hand, builds trees sequentially with gradient boosting, emphasizing features that help correct residual errors from previous iterations. This enables XGBoost to capture temporal effects and subtle nonlinear trends, making degradation time a more influential factor in its predictions. Overall, these results suggest that environmental conditions play a crucial role in determining the possibility of biodegradation, while time-dependent and formulation-specific factors regulate the progression and kinetics of biodegradation. It should be noted that these findings reflect statistical associations captured by the model rather than direct causal mechanisms controlling biodegradation.
Figure 14 and Figure 15 show the SHAP force plots explaining the RF and XGBoost regression predictions, respectively. The prediction of weight loss (%) for an individual test observation was illustrated by decomposing the model output into feature-level contributions. The prediction was constructed by starting from the model’s expected value (base value) and sequentially adding the contributions of each input feature. In the force plot, features shown in red pushed the prediction toward higher predicted weight loss, while features shown in blue pushed it toward lower predicted weight loss, with the length of each bar representing the magnitude of its contribution. The differing effects of features across test instances were clearly shown. For example, the feature “degradation_enviroment_soil” in the top case pushed the prediction toward a higher value, whereas in the bottom case it had the opposite effect.
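The additive construction behind a force plot (base value plus per-feature contributions equals the prediction) can be sketched with exact Shapley values for a tiny hypothetical model. The function `toy_model`, its coefficients, and the single reference point used as baseline are all assumptions for illustration; SHAP libraries typically average over a background dataset and use faster tree-specific estimators.

```python
from itertools import permutations
from math import factorial

def toy_model(features):
    """Hypothetical weight loss model of (time in days, is_soil, HB ratio)."""
    time, is_soil, hb = features
    return 0.2 * time - 10.0 * is_soil - 0.1 * hb + 0.05 * time * (1 - is_soil)

def shapley_values(model, x, baseline):
    """Exact Shapley values relative to one baseline point: average each feature's
    marginal contribution over all orders of switching features from baseline to x."""
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)
        prev = model(current)
        for j in order:
            current[j] = x[j]
            new = model(current)
            phi[j] += new - prev
            prev = new
    return [p / factorial(n) for p in phi]

x = [100.0, 1.0, 90.0]        # observation to explain
baseline = [50.0, 0.0, 80.0]  # reference point defining the base value
base_value = toy_model(baseline)
phi = shapley_values(toy_model, x, baseline)
reconstructed = base_value + sum(phi)  # equals toy_model(x)
```

Here the time feature receives a positive contribution and the soil indicator a negative one, mirroring how red and blue bars push a force plot in opposite directions.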

3.4. Key Features Influencing PHBV Weight Loss

The study revealed that biodegradation-induced weight loss of PHBV-based formulations was primarily governed by a combination of environmental, temporal, and material-specific descriptors, which are summarized in Table 13.

Model-Specific Prioritization

It is noteworthy that both ML models used in this work ranked the features in Table 13 in order of importance, with minor differences:
  • For the RF model: The soil environment was ranked first, followed by the adjusted HB ratio.
  • For the XGBoost model: Degradation time was ranked first, followed by the soil environment.
Despite minor differences in ranking between the two ML models, both models consistently ranked environmental and temporal descriptors as the dominant factors in determining the physical degradation (weight loss) of PHBV materials. Because these results are specific to the experimental dataset, the predictions are best interpreted as representations of broader trends rather than accurate quantitative forecasts under conditions outside the applicability domain (AD) of this framework.

3.5. Effect of Outlier Exclusion

To examine the influence of outlier exclusion on model robustness, the performance of the tuned RF and XGBoost models was compared for datasets with and without statistical outlier removal. Thirty-six of the 1040 training samples (3.5%) were excluded from the training set based on distribution-based criteria (see Section 2.3).
As summarized above in Table 14, outlier exclusion modestly increased cross-validation (CV) mean R2 and significantly reduced CV standard deviation (SD) for both models, indicating improved internal stability and reduced variance across folds. However, this increased stability did not consistently translate to improved predictive performance on test data. In particular, test set R2 values decreased after outlier exclusion for both RF (from 0.930 to 0.911) and XGBoost (from 0.931 to 0.897), suggesting a loss of generalization when statistically extreme observations were removed.
Evaluation on the independent unseen dataset revealed a more subtle effect: both models showed a modest improvement after outlier exclusion (RF: R2 from 0.716 to 0.741; XGBoost: R2 from 0.829 to 0.856), maintaining strong generalization in both cases. Nevertheless, the overall reduction in test set performance indicated that many observations identified as statistical outliers likely corresponded to meaningful polymer degradation mechanisms rather than experimental noise. Given the intrinsic heterogeneity of polymer degradation processes, excluding such samples could reduce the diversity of the training data and limit the model’s ability to generalize across different material–environment combinations. Similar observations were reported in previous polymer informatics studies [39,40].
The presence of outliers was further examined using a Williams plot [41], in which standardized residuals (δ) were plotted against leverage values (h). The leverage threshold was defined as h* = 3(p + 1)/n, where p is the number of model parameters, and n is the number of samples. Samples with |δ| > 3 or h > h* were considered potential outliers or influential points. Although some samples exhibited leverage values exceeding the warning threshold (Figure 16), their standardized residuals remained within acceptable limits, indicating reliable predictions and confirming the robustness of the model within an extended applicability domain (AD). In Table A5 of Appendix A, the ranges of the input numerical features are reported, defining the model’s AD.
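The Williams plot criteria can be sketched for the simplest possible case of one descriptor plus an intercept (p = 1), where the leverage has the closed form h_i = 1/n + (x_i − x̄)²/Σ_j (x_j − x̄)². This single-descriptor setting, the descriptor values, and the standardized residuals below are hypothetical; with many descriptors the leverages are the diagonal of the hat matrix.

```python
def leverages(x):
    """Leverage of each sample for a design with an intercept and one descriptor."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]

def williams_flags(x, std_residuals, p=1):
    """Return the threshold h* = 3(p + 1)/n and a flag per sample
    that is True when |delta| > 3 or h > h*."""
    n = len(x)
    h_star = 3.0 * (p + 1) / n
    h = leverages(x)
    flags = [abs(d) > 3 or hi > h_star for hi, d in zip(h, std_residuals)]
    return h_star, flags

# Hypothetical descriptor values and standardized residuals.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
delta = [0.5, -1.0, 0.2, 3.4, -0.3, 0.1, 0.8, -0.6, 0.4, 0.2]
h_star, flags = williams_flags(x, delta)
h = leverages(x)
```

In this toy case the fourth sample is flagged for its large residual and the last sample for its high leverage despite a small residual, which corresponds to the influential but well-predicted points described in the text.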

3.6. Web Implementation of the Model

The source code for developing the model is publicly available at: https://github.com/FSL-AUA/Weight-loss-model.git, accessed on 2 April 2026. The model has been implemented as a web service on the Jaqpot 5 modelling platform (https://app.jaqpot.org/, accessed on 2 April 2026) and is publicly available (following free registration) at the following URL: https://app.jaqpot.org/dashboard/models/2343/description (accessed on 27 January 2026). Before deployment, the model was evaluated using comprehensive AD analysis to ensure new input samples belong to the descriptor space covered by the training data. The Leverage and Bounding Box methods were applied. Thus, the user can verify if the selected values lie within the AD.
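A bounding box check of this kind can be sketched in a few lines. The function name `outside_bounding_box` is hypothetical, and the two ranges are taken from Tables A1 and A2 purely for illustration; the ranges that define the deployed model's AD are those reported in Table A5, and the Jaqpot implementation may differ.

```python
def outside_bounding_box(sample, bounds):
    """Return the names of features that are missing from the sample
    or fall outside the closed interval [low, high] seen in training."""
    return [
        name
        for name, (low, high) in bounds.items()
        if name not in sample or not (low <= sample[name] <= high)
    ]

# Illustrative ranges only (HB ratio in mol%, temperature in °C).
bounds = {
    "Adjusted_HB_ratio_formulation": (14.0, 99.0),
    "T_biodeg": (20.0, 60.0),
}

inside = outside_bounding_box(
    {"Adjusted_HB_ratio_formulation": 92.0, "T_biodeg": 25.0}, bounds
)
outside = outside_bounding_box(
    {"Adjusted_HB_ratio_formulation": 92.0, "T_biodeg": 70.0}, bounds
)
```

An empty list means every checked descriptor lies within the box; a bounding box ignores correlations between descriptors, which is one reason it is commonly paired with a leverage-based check.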

4. Conclusions

This study presented an ML framework for predicting biodegradation-induced weight (mass) loss behavior of PHBV-based formulations with different additives as a function of degradation time, using a variety of heterogeneous descriptors that capture material composition, environmental conditions, and experimental parameters. A heterogeneous dataset, comprising repeated measurements from 17 independent studies, was systematically preprocessed and transformed to maintain its hierarchical structure and support data analysis at the instance level. This was achieved through a series of data operations, including categorical consolidation, conditional mean imputation based on degradation environment, and FAMD, which enabled robust and meaningful management of mixed numerical and categorical data.
Using the engineered dataset, both ML models, i.e., RF and XGBoost, showed excellent predictive performance for polymer biodegradation-induced weight loss (%). Without outlier exclusion, tuned RF and XGBoost achieved high test coefficients of determination (R2 = 0.930 and 0.931, respectively), indicating that approximately 93% of the variance in experimental weight-loss data could be explained by the selected descriptors. Cross-validation further confirmed model reliability, with mean CV R2 values of 0.887 (SD = 0.039) for RF and 0.884 (SD = 0.040) for XGBoost.
When evaluated on an unseen dataset, XGBoost exhibited superior robustness and generalization capability compared to RF. Specifically, XGBoost achieved an unseen R2 of 0.829, whereas RF achieved a lower unseen R2 of 0.716. This indicated that XGBoost outperformed the RF model and can be used as a reliable tool for predicting polymer degradation, while keeping in mind the domain of applicability imposed by the structure of the dataset used in this study.
Additionally, feature importance analysis showed model-specific prioritization of degradation drivers. In RF, soil environment was the most important feature, followed by the adjusted HB ratio. In contrast, the XGBoost model emphasized degradation time, followed by soil environment. Overall, both models consistently identified temporal and environmental descriptors as the dominant factors.
Outlier analysis showed that statistically extreme data points often corresponded to meaningful polymer degradation mechanisms, rather than experimental noise. Although outlier exclusion improved internal model stability—evidenced by reduced cross-validation SD for both RF (from 0.039 to 0.015) and XGBoost (from 0.040 to 0.014)—it resulted in decreased test-set performance (RF: R2 = 0.911; XGBoost: R2 = 0.897). In contrast, modest improvements were observed in unseen-dataset performance after outlier exclusion (RF: R2 increased to 0.741; XGBoost to 0.856), indicating a trade-off between internal stability and predictive accuracy.
This study provided a reliable data-driven framework that can be used to predict the degradation behaviour of PHBV-based biodegradable materials and indicated that ML models can be used to assess their long-term degradation and biodegradability. However, the predictive capability of the proposed models is inherently constrained by the variability and heterogeneity of the literature-derived data, and extrapolation to real-world environmental conditions should be approached with caution. Although the current ML framework predicts degradation-induced weight loss with high accuracy, weight loss is indicative of physical disintegration rather than complete mineralization. It reflects the empirical loss of physical mass and is therefore a precursor to, but not a definitive measure of, complete conversion to CO2 and biomass. Future versions of the framework should take into consideration the possible formation of microplastics during disintegration. Aligning predictive frameworks with ultimate biodegradation standards (such as CO2 evolution) remains essential to ensure that biodegradability claims meet rigorous environmental safety requirements and do not result in the persistence of micro-scale polymer fragments in the environment.
In future work, the data-driven polymer informatics framework can be applied to a larger and more heterogeneous dataset, including additional physically meaningful descriptors such as polymer crystallinity. We also plan to expand the dataset by including new experimental data, which will allow further validation of the predictive robustness of the QSAR models. Similarly, other ML models can be used to improve predictive accuracy and generalization.

Author Contributions

Conceptualization, M.I.K. and C.M.; methodology, L.M. and M.I.K.; software, L.M.; validation, C.M.; formal analysis, L.M. and M.I.K.; investigation, L.M., M.I.K. and C.M.; resources, M.I.K.; data curation, M.I.K., N.P.S. and K.V.F.; writing—original draft preparation, M.I.K.; writing—review and editing, C.M.; visualization, L.M., N.P.S. and K.V.F.; supervision, C.M.; project administration, C.M.; funding acquisition, C.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received funding from the Horizon Europe European Commission project ANIPH (Grant Agreement No. 101181943).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The datasets used for training and evaluating the ML models were derived from curated biodegradation data reported in the literature and compiled within the framework of this study. The implementation of the ML models, including data preprocessing, feature selection, model training, and evaluation scripts, is publicly available via GitHub at: https://github.com/FSL-AUA/Weight-loss-model.git, accessed on 27 January 2026. In addition, the trained models and their associated feature sets are accessible through the Jaqpot platform: Weight or mass loss regression model A.N.I.P.H. (Jaqpot ID: 2343): https://app.jaqpot.org/dashboard/models/2343/description, accessed on 27 January 2026.

Acknowledgments

The work presented is based on research conducted within the framework of the Horizon Europe European Commission project ANIPH (Grant Agreement No. 101181943). The content of the paper is the sole responsibility of its authors and does not necessarily reflect the views of the European Commission. All the authors would like to thank Haralambos Sarimveis, National Technical University of Athens, for hosting the biodegradation-induced weight/mass loss web tool on the Jaqpot platform.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations were used in this manuscript:
ML: Machine Learning
QSAR: Quantitative Structure-Activity Relationship
FAMD: Factor Analysis of Mixed Data
IQR: Interquartile Range
RMSE: Root Mean Square Error
MAE: Mean Absolute Error
XGBoost: Extreme Gradient Boosting
SHAP: Shapley Additive Explanations
ATBC: Acetyl Tributyl Citrate
CaCO3: Calcium Carbonate
CO2: Carbon Dioxide
CV: Cross-Validation
DDGS: Distillers Dried Grains with Solubles
ENR: Epoxidized Natural Rubber
HB: Hydroxybutyrate
HV: Hydroxyvalerate
ISO: International Organization for Standardization
kDa: Kilodalton
Mn: Number-Average Molecular Weight
Mw: Weight-Average Molecular Weight
Mv: Viscosity-Average Molecular Weight
PBAT: Poly(butylene adipate-co-terephthalate)
PCL: Poly(ε-caprolactone)
PDI: Polydispersity Index
PEO: Polyethylene Oxide
PHA: Polyhydroxyalkanoate
PHAs: Polyhydroxyalkanoates
PHBV: Poly(3-hydroxybutyrate-co-3-hydroxyvalerate)
PLA: Polylactic Acid
PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses
R2: Coefficient of Determination
TEC: Triethyl Citrate
TOCA: Total Organic Carbon Availability
VS: Volatile Solids
WF: Wood Flour

Appendix A

Table A1 reports the material features included in the PHBV biodegradation-induced weight/mass loss database, which is uploaded to the AUA Zenodo repository [38]. For each feature, the table provides the feature name as used in the dataset, a concise scientific description, the measurement unit when applicable, and the observed value range across all samples. The materials properties dataset comprises 46 distinct features (23 categorical & 23 numerical) describing material descriptors encompassing composition, additives, molecular descriptors, physicochemical properties, morphological characteristics, and surface characteristics of PHBV-based materials. The data were obtained from references [6,7,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36].
Table A1. Overview of materials descriptors (independent variables) of the PHBV-based biodegradation-induced weight/mass loss database.
| Feature Name | Description | Unit | Range | Type |
|---|---|---|---|---|
| Study_id | Unique identifier for study instance | - | varies | Categorical |
| Instance | Experimental run number | - | varies | Categorical |
| Sample_name | Identifier of the PHBV-based sample or formulation | - | varies | Categorical |
| Sample ratio | Weight-to-weight ratio of components in the formulation (wt/wt) | ratio, % | 0–100 | Categorical |
| Monomer_A | Primary monomer composing the PHBV copolymer | - | varies | Categorical |
| Monomer_B | Secondary monomer composing the PHBV copolymer | - | varies | Categorical |
| Adjusted_HB_ratio_formulation | HB ratio in formulation | mol% | 14–99 | Numerical |
| Adjusted_HV_ratio_formulation | Hydroxyvalerate (HV) ratio in formulation | mol% | 0.5–19 | Numerical |
| Additives | Presence of additives in the formulation | - | Yes/No | Categorical |
| Additive1_name | Name of the first additive | - | varies | Categorical |
| Additive_type_1 | Type of the first additive | - | varies | Categorical |
| Additive1_percentage | Weight fraction of the first additive | wt% | 0–100 | Numerical |
| Additive2_name | Name of the second additive | - | varies | Categorical |
| Additive_type_2 | Type of the second additive | - | varies | Categorical |
| Additive2_percentage | Weight fraction of the second additive | wt% | 0–100 | Numerical |
| Additive3_name | Name of the third additive | - | varies | Categorical |
| Additive_type_3 | Type of the third additive | - | varies | Categorical |
| Additive3_percentage | Weight fraction of the third additive | wt% | 0–100 | Numerical |
| PHBV_weight_percentage_final_formulation | PHBV content in the final formulation | wt% | 0–100 | Numerical |
| Viscocimetric_molar_mass, Mv | Molar mass by viscosity | g/mol | 128–155 | Numerical |
| Soil_burial_time_for_Mv | Soil burial time before test | months | 0–3 | Numerical |
| Mw | Weight-average molecular weight | kDa | 375–690 | Numerical |
| Soil_burial_time_for_Mw (months) | Soil burial time before test | months | 0–12 | Numerical |
| Relative Mw (%) | Relative molecular weight | % | 29–100 | Numerical |
| Incubation_time_for_relative_Mw (days) | Incubation period for relative Mw | days | 1–27 | Numerical |
| Mn | Number-average molar mass | kDa | 111–350 | Numerical |
| Soil_burial_time_for_Mn (months) | Soil burial time before test | months | 0–12 | Numerical |
| PDI | Polydispersity index | ratio | 2.2–2.9 | Numerical |
| Soil_burial_time_for_PDI (months) | Soil burial time before test | months | 0–12 | Numerical |
| Moisture_content_formulations (%) | Moisture in the formulation | % | 0–19 | Numerical |
| Soil_burial_time_for_moisture_content (months) | Soil burial time before test | months | 0–12 | Numerical |
| Density | Density of PHBV-based material | g/cm3 | 0.79–1.25 | Numerical |
| Void_content (%) | Void content in the sample | % | 0–5 | Numerical |
| Sample_shape/Morphology | Physical shape or morphology of the PHBV-based sample (e.g., film, pellet, fiber, composite) | - | varies | Categorical |
| Size_diameter | Diameter of the PHBV sample or morphological element (e.g., particle, sphere, or cylindrical structure), as reported in the source studies | cm | 9.0 | Numerical |
| Size_length | Length of the PHBV sample or morphological element (e.g., films or tubes), as reported in the source studies | cm | 2.5–20 | Numerical |
| Size_Width | Width of the PHBV sample or morphological element (e.g., films), as reported in the source studies | cm | 1–20 | Numerical |
| Size_Thickness | Thickness of the PHBV sample or morphological element (e.g., films), as reported in the source studies | cm | 0.01–0.3 | Numerical |
| Water_absorption_capacity | Water absorption capacity | % | 0.7–33 | Numerical |
| Immersion_days | Duration of exposure | days | 1–9 | Numerical |
| Thickness_swelling (%) | Thickness swelling after immersion | % | 0.96–6.5 | Numerical |
| Immersion_days | Duration of exposure | days | 1–9 | Numerical |
| Film_solubility | Solubility of PHBV films | % | 1–21 | Numerical |
| Static_water_contact_angle | Static water contact angle indicating hydrophilicity | deg | 57–83 | Numerical |
Table A2 summarizes the environmental and experimental conditions under which PHBV biodegradation experiments were conducted, which were included in the biodegradation database uploaded to the AUA Zenodo repository [38]. For each feature, the table reports the feature name, a concise description, measurement unit (when applicable), the observed range across all studies and the feature type (categorical or numerical). Environmental conditions were described using 29 features, including 13 numerical and 16 categorical variables encompassing biochemical condition, the physical medium, temperature, pH, moisture, salinity, nutrient content, solids composition, microbial presence, and standardized testing protocols. The data were obtained from references [6,7,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36].
Table A2. Overview of environmental and experimental features (independent variables) of the PHBV-based biodegradation-induced weight/mass loss database.
| Feature Name | Description | Unit | Range | Type |
|---|---|---|---|---|
| Sample_name | Identifier of the PHBV-based sample tested | - | varies | Categorical |
| Parameter_evaluated | Environmental parameters evaluated in the study | - | varies | Categorical |
| Degradation_condition | General biodegradation condition (e.g., aerobic, anaerobic, soil, marine, compost) | - | varies | Categorical |
| Degradation_mechanism | Dominant biodegradation mechanism reported | - | varies | Categorical |
| T_biodeg | Temperature during the biodegradation experiment | °C | 20–60 | Numerical |
| T_biodeg_units | Unit used to report biodegradation temperature | - | celsius | Categorical |
| T_biodeg_winter_marine | Marine biodegradation temperature under winter conditions | °C | 8–15 | Numerical |
| T_biodeg_summer_marine | Marine biodegradation temperature under summer conditions | °C | 20–25 | Numerical |
| TH20_winter_marine | Water exposure time (Th20) in winter marine conditions | - | 12–29 | Numerical |
| TH20_summer_marine | Water exposure time (Th20) in summer marine conditions | - | 27 | Numerical |
| Water_pH_marine | pH of marine water during biodegradation | - | 7.2–8.2 | Numerical |
| Salinity_marine | Salinity of the marine environment | ppt | 37–38 | Numerical |
| Soil_moisture | Moisture content of soil | % | 20–50.5 | Numerical |
| Soil_pH | pH of soil during biodegradation | - | 6.6–7.2 | Numerical |
| Soil_T | Soil temperature during biodegradation | °C | 22.7–29 | Numerical |
| Compost_moisture | Moisture content of compost | % | 55–81 | Numerical |
| Compost_pH | pH of the compost environment | - | 6–8.2 | Numerical |
| Compost_T | Compost temperature during biodegradation | °C | 21–58 | Numerical |
| TOCA | Total organic carbon availability | % | 0.13–0.55 | Numerical |
| C_N_ratio | Carbon-to-nitrogen ratio of the environment | - | 18–28.9 | Numerical |
| VS | Volatile solids fraction of the biodegradation environment | % | 26.6 | Numerical |
| PHA_degrading_microbes | Presence or abundance of PHA-degrading microorganisms | - | varies | Categorical |
| Degradation_Environment | Classified degradation environment | - | varies | Categorical |
| ASTM/ISO | Standard used for biodegradation testing | - | varies | Categorical |
Table A3 summarizes the biodegradation response variables included in the PHBV biodegradation database uploaded to the AUA Zenodo repository [38]. For each feature, the table reports the feature name, a concise description, the measurement unit, the observed value range across all studies, and whether the feature is categorical or numerical. The data were obtained from references [6,7,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36].
Table A3. Overview of weight loss features of the PHBV-based biodegradation database.
Feature Name | Description | Unit | Range | Type
Sample_name | Identifier of the PHBV-based sample | - | varies | Categorical
Weight_loss_mass_loss_percentage (dependent target value) | Lost mass due to biodegradation | % | 0–100 | Numerical
Table A4 summarizes the temporal variables used to describe the progression of PHBV biodegradation included in the biodegradation dataset uploaded to the AUA Zenodo repository [38]. For each feature, the table reports the feature name, a concise description, the measurement unit, the observed value range across all studies, and whether the feature is categorical or numerical. The data were obtained from references [6,7,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36].
Table A4. Overview of time-point features of the PHBV-based biodegradation database.
Feature Name | Description | Unit | Range | Type
Sample_name | Identifier of the PHBV-based sample | - | - | Categorical
Biodegradation_time | Time at which the weight or mass loss was measured | Days | 0–360 | Numerical
The numerical input ranges reported in Table A5 correspond to the minimum and maximum values observed in the final training dataset used for model development. These ranges also define the recommended limits for user inputs in the Jaqpot platform, so that predictions outside the model’s applicability domain are avoided (the ranges are also displayed in the platform).
Table A5. Numerical input ranges used for the implementation of the QSAR model.
Feature Name | Unit | Range
Additive1_percentage | wt% | 0–70
Additive2_percentage | wt% | 0–50
Additive3_percentage | wt% | 0–20
Adjusted_HB_ratio_formulation | mol% | 28–99
Adjusted_HV_ratio_formulation | mol% | 0.5–19
T_deg | °C | 20–60
Degradation_time | days | 0–360
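As an illustration of how the ranges in Table A5 could be used to screen user inputs before a prediction is requested, the following minimal sketch flags values that fall outside the tabulated limits. The function name `out_of_range_inputs` and the dictionary-based input format are assumptions made for this example; they do not describe the Jaqpot implementation.

```python
# Numerical input ranges copied from Table A5 as (minimum, maximum).
INPUT_RANGES = {
    "Additive1_percentage": (0.0, 70.0),            # wt%
    "Additive2_percentage": (0.0, 50.0),            # wt%
    "Additive3_percentage": (0.0, 20.0),            # wt%
    "Adjusted_HB_ratio_formulation": (28.0, 99.0),  # mol%
    "Adjusted_HV_ratio_formulation": (0.5, 19.0),   # mol%
    "T_deg": (20.0, 60.0),                          # °C
    "Degradation_time": (0.0, 360.0),               # days
}


def out_of_range_inputs(sample):
    """Return {feature: value} for every numerical input outside Table A5.

    `sample` is a hypothetical dict of user inputs; features that are
    absent from it are not checked.
    """
    violations = {}
    for name, (low, high) in INPUT_RANGES.items():
        if name in sample and not (low <= sample[name] <= high):
            violations[name] = sample[name]
    return violations


# Hypothetical query: the temperature lies above the 20–60 °C training range.
query = {
    "Adjusted_HB_ratio_formulation": 95.0,
    "T_deg": 75.0,
    "Degradation_time": 120.0,
}
print(out_of_range_inputs(query))  # {'T_deg': 75.0}
```

An empty result only means that each numerical input lies within its individual range; it does not by itself guarantee that the combination of inputs lies inside the applicability domain.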

References

  1. Bolla, M.; Pettinato, M.; Perego, P.; Ferrari, P.F.; Fabiano, B. Polyhydroxyalkanoates production from laboratory to industrial scale: A review. Int. J. Biol. Macromol. 2025, 310, 143255.
  2. Filippou, K.; Bouzani, E.; Kora, E.; Ntaikou, I.; Papadopoulou, K.; Lyberatos, G. Polyhydroxyalkanoates Production from Simulated Food Waste Condensate Using Mixed Microbial Cultures. Polymers 2025, 17, 2042.
  3. Ibrahim, M.I.; Alsafadi, D.; Alamry, K.A.; Hussein, M.A. Properties and Applications of Poly(3-hydroxybutyrate-co-3-hydroxyvalerate) Biocomposites. J. Polym. Environ. 2021, 29, 1010–1030.
  4. ISO 14855-2:2018; Determination of the Ultimate Aerobic Biodegradability of Plastic Materials under Controlled Composting Conditions—Method by Analysis of Evolved Carbon Dioxide—Part 2: Gravimetric Measurement of Carbon Dioxide Evolved in a Laboratory-Scale Test. International Organization for Standardization: Geneva, Switzerland, 2018.
  5. ASTM D5338-15R21; Standard Test Method for Determining Aerobic Biodegradation of Plastic Materials under Controlled Composting Conditions, Incorporating Thermophilic Temperatures. ASTM International: West Conshohocken, PA, USA, 2021. Available online: https://store.astm.org/d5338-15r21.html (accessed on 2 April 2026).
  6. Zaidi, Z.; Mawad, D.; Crosky, A. Soil Biodegradation of Unidirectional Polyhydroxybutyrate-co-Valerate (PHBV) Biocomposites Toughened with Polybutylene-Adipate-co-Terephthalate (PBAT) and Epoxidized Natural Rubber (ENR). Front. Mater. 2019, 6, 275.
  7. Chan, C.M.; Vandi, L.J.; Pratt, S.; Halley, P.; Richardson, D.; Werker, A.; Laycock, B. Insights into the biodegradation of PHA/wood composites: Micro- and macroscopic changes. Sustain. Mater. Technol. 2019, 21, e00099.
  8. Read, T.; Chan, C.M.; Chaléat, C.; Laycock, B.; Pratt, S.; Lant, P. The effect of additives on the biodegradation of polyhydroxyalkanoate (PHA) in marine field trials. Sci. Total Environ. 2024, 931, 172771.
  9. Lyshtva, P.; Kuusik, A.; Voronova, V. Degradation and Disintegration Behavior of PHBV- and PLA-Based Films Under Composting Conditions. Sustainability 2025, 17, 8657.
  10. Cardoso, R.; André da Costa, C.; Marques de Figueiredo, R.; Zehetmeyer, G.; Schmith, J. A method to predict the percentage of biodegradation in polymeric materials using LSTM neural networks. Comput. Electr. Eng. 2024, 118, 109473.
  11. Jiang, S.; Liang, Y.; Shi, S.; Wu, C.; Shi, Z. Improving predictions and understanding of primary and ultimate biodegradation rates with machine learning models. Sci. Total Environ. 2023, 893, 166623.
  12. Yuan, W.; Hibi, Y.; Tamura, R.; Sumita, M.; Nakamura, Y.; Naito, M.; Tsuda, K. Revealing factors influencing polymer degradation with rank-based machine learning. Patterns 2023, 4, 100846.
  13. Lin, C.; Zhang, H. Polymer biodegradation in aquatic environments: A machine learning model informed by meta-analysis of structure–biodegradation relationships. Environ. Sci. Technol. 2024, 58, 11245–11255.
  14. Silva, R.R.A.; Suprani Marques, C.; Rodrigues Arruda, T.; Cocco Teixeira, S.; Veloso de Oliveira, T. Biodegradation of polymers: Stages, measurement, standards and prospects. Macromol 2023, 3, 371–399.
  15. Pagès, J. Multiple Factor Analysis by Example Using R; CRC Press: Boca Raton, FL, USA, 2014.
  16. Rousseeuw, P.J.; Hubert, M. Robust statistics for outlier detection. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2011, 1, 73–79.
  17. Svetnik, V.; Liaw, A.; Tong, C.; Culberson, J.C.; Sheridan, R.P.; Feuston, B.P. Random forest: A classification and regression tool for compound classification and QSAR modeling. J. Chem. Inf. Comput. Sci. 2003, 43, 1947–1958.
  18. Chen, T. XGBoost: A Scalable Tree Boosting System; Cornell University: Ithaca, NY, USA, 2016.
  19. Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021, 372, n71.
  20. Page, M.J.; Moher, D.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. PRISMA 2020 explanation and elaboration: Updated guidance and exemplars for reporting systematic reviews. BMJ 2021, 372, n160.
  21. Scopus Database; Elsevier: Amsterdam, The Netherlands; Available online: https://www.scopus.com (accessed on 8 January 2026).
  22. Seggiani, M.; Cinelli, P.; Balestri, E.; Mallegni, N.; Stefanelli, E.; Rossi, A.; Lardicci, C.; Lazzeri, A. Novel Sustainable Composites Based on Poly(hydroxybutyrate-co-hydroxyvalerate) and Seagrass Beach-CAST Fibers: Performance and Degradability in Marine Environments. Materials 2018, 11, 772.
  23. Syahirah, W.N.; Azami, N.A.; Huong, K.H.; Amirul, A.A. Preparation, characterization and biodegradation of blend films of poly(3-hydroxybutyrate-co-3-hydroxyvalerate) with natural biopolymers. Polym. Bull. 2021, 78, 3973–3993.
  24. Gordon, S.H.; Imam, S.H.; Shogren, R.L.; Govind, N.S.; Greene, R.V. A semiempirical model for predicting biodegradation profiles of individual polymers in starch–poly(β-hydroxybutyrate-co-β-hydroxyvalerate) bioplastic. J. Appl. Polym. Sci. 2000, 76, 1767–1776.
  25. Choi, J.S.; Park, W.H. Thermal and mechanical properties of poly(3-hydroxybutyrate-co-3-hydroxyvalerate) plasticized by biodegradable soybean oils. Macromol. Symp. 2003, 197, 65–76.
  26. Liu, H.; Gao, Z.; Hu, X.; Wang, Z.; Su, T.; Yang, L.; Yan, S. Blending modification of PHBV/PCL and its biodegradation by Pseudomonas mendocina. J. Polym. Environ. 2017, 25, 156–164.
  27. Brunel, D.G.; Pachekoski, W.M.; Dalmolin, C.; Agnelli, J.A.M. Natural additives for poly(hydroxybutyrate-co-hydroxyvalerate) (PHBV): Effect on mechanical properties and biodegradation. Mater. Res. 2014, 17, 1145–1156.
  28. Iqbal, M.; Rizal, S.; Bairwan, R.D.; Zein, I.; Khalil, H.P.S.A. Enhanced biodegradable packaging composites: Performance of poly(3-hydroxybutyrate-co-3-hydroxyvalerate) (PHBV) reinforced with propionylated abaca fiber mats. Polym. Compos. 2025, 46, 7614–7632.
  29. Oliveira, P.R.; Mendoza, P.X.; Crespo, J.D.S.; Daitx, T.D.S.; Carli, L.N. Biodegradation study of poly(hydroxybutyrate-co-hydroxyvalerate)/halloysite/oregano essential oil compositions in simulated soil conditions. Int. J. Biol. Macromol. 2024, 277, 133768.
  30. Feijoo, P.; Marín, A.; Samaniego-Aguilar, K.; Sánchez-Safont, E.; Lagarón, J.M.; Gámez-Pérez, J.; Cabedo, L. Effect of the Presence of Lignin from Woodflour on the Compostability of PHA-Based Biocomposites: Disintegration, Biodegradation and Microbial Dynamics. Polymers 2023, 15, 2481.
  31. Brdlík, P.; Borůvka, M.; Běhálek, L.; Lenfeld, P. The Influence of Additives and Environment on Biodegradation of PHBV Biocomposites. Polymers 2022, 14, 838.
  32. Imam, S.H.; Gordon, S.H.; Shogren, R.L.; Tosteson, T.R.; Govind, N.S.; Greene, R.V. Degradation of starch-poly(β-hydroxybutyrate-co-β-hydroxyvalerate) bioplastic in tropical coastal waters. Appl. Environ. Microbiol. 1999, 65, 431–437.
  33. Ramsay, B.A.; Langlade, V.; Carreau, P.J.; Ramsay, J.A. Biodegradability and mechanical properties of poly(β-hydroxybutyrate-co-β-hydroxyvalerate)-starch blends. Appl. Environ. Microbiol. 1993, 59, 1242–1246.
  34. Avella, M.; Rota, G.L.; Martuscelli, E.; Raimo, M.; Santagata, G.; Latterini, F. Poly(3-hydroxybutyrate-co-3-hydroxyvalerate) and wheat straw fibre composites: Thermal, mechanical properties and biodegradation behaviour. J. Mater. Sci. 2000, 35, 829–836.
  35. Sánchez-Safont, E.L.; González-Ausejo, J.; Gámez-Pérez, J.; Lagarón, J.M.; Cabedo, L. Poly(3-hydroxybutyrate-co-3-hydroxyvalerate)/purified cellulose fiber composites by melt blending: Characterization and degradation in composting conditions. J. Renew. Mater. 2016, 4, 123–132.
  36. La Fuente Arias, C.I.; González-Martínez, C.; Chiralt, A. Biodegradation behavior of poly(3-hydroxybutyrate-co-3-hydroxyvalerate) containing phenolic compounds in seawater in laboratory testing conditions. Sci. Total Environ. 2024, 944, 173920.
  37. PlotDigitizer (Version 3) [Software]. Available online: https://plotdigitizer.com/ (accessed on 8 January 2026).
  38. Kotzabasaki, M. Data library of biodegradation-induced weight/mass loss of PHBV-based formulations developed under the ANIPH Project. Zenodo 2026.
  39. Mairpady, A.; Mourad, A.H.I.; Mozumder, M.S. Accelerated discovery of the polymer blends for cartilage repair through data-mining tools and machine-learning algorithm. Polymers 2022, 14, 1802.
  40. Tang, W.; Li, Y.; Yu, Y.; Wang, Z.; Xu, T.; Chen, J.; Li, X. Development of models predicting biodegradation rate rating with multiple linear regression and support vector machine algorithms. Chemosphere 2020, 253, 126666.
  41. Belsley, D.A.; Kuh, E.; Welsch, R.E. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity; John Wiley & Sons: Hoboken, NJ, USA, 2005.
Figure 1. Schematic overview of the data-driven workflow developed in this study.
Figure 2. PRISMA 2020 flow diagram [19] illustrating the identification, screening, eligibility, and inclusion of studies investigating the biodegradation-induced weight or mass loss of PHBV-based materials containing natural and synthetic additives.
Figure 3. Distribution of model input variables: the target value (top left) and numerical features.
Figure 4. Distribution of model input categorical variables.
Figure 5. Train/test split and held-out data points (validation set) for Study id 24.
Figure 6. Weight loss percentage over time for two study instances.
Figure 7. Correlation matrix showing pairwise relationships among numerical features.
Figure 8. RF: Comparison of predicted and true weight loss percentages for the training set (left) and the test set (right).
Figure 9. XGBoost: Comparison of predicted and true weight loss percentages for the training set (left) and the test set (right).
Figure 10. Convergence behavior of the RMSE for the training and test sets as a function of (a) the number of trees for the RF model and (b) the number of estimators for the XGBoost model.
Figure 11. Comparison of predicted and true weight loss (%) of fully unseen instances. The dashed red line represents perfect agreement (y = x).
Figure 12. Top 10 feature contributions of the RF model using SHAP values.
Figure 13. Top 10 feature contributions of the XGBoost model using SHAP values.
Figure 14. SHAP force plots explaining the RF regression predictions of weight loss (%) for two individual test observations: study_id = 18, instance = 2b (top) and study_id = 12, instance = 1 (bottom), showing positive (red) and negative (blue) feature contributions relative to the model’s base value.
Figure 15. SHAP force plots explaining the XGBoost regression predictions of weight loss (%) for two individual test observations: study_id = 18, instance = 2b (top) and study_id = 12, instance = 1 (bottom), showing positive (red) and negative (blue) feature contributions relative to the model’s base value.
Figure 16. Williams plot showing standardized residuals (δ) versus leverage values (h) for the training (blue) and test (red) datasets. The dashed horizontal lines represent δ = ±3, and the vertical dashed line indicates the warning leverage threshold (h* = 0.0894).
Table 1. Each of the five worksheets represented a distinct category of data within the PHBV-based biodegradation-induced weight or mass loss data library.
Worksheets | Parameters
Worksheet_1_Materials_features | Composition details, Molecular weight distributions (Mw and Mn), Additive presence, Respective additive concentrations, etc.
Worksheet_2_Environmental_features | Temperature, pH, Moisture levels, Oxygen availability, Microbial activity, etc.
Worksheet_3_Biodegradation_features | Weight or mass loss percentages
Worksheet_4_Time_points_features | Degradation time points
Worksheet_5_Metadata | Study_id, Title, Doi
Table 2. Distribution of Instances and data pairs (time point, weight or mass loss percentage) across studies.
Study Id | No. of Instances | No. of Data Pairs | Reference
4 | 2 | 14 | [22]
5 | 48 | 192 | [23]
6 | 3 | 14 | [7]
8 | 1 | 6 | [24]
9 | 3 | 24 | [25]
10 | 1 | 9 | [26]
11 | 2 | 6 | [27]
12 | 6 | 253 | [28]
13 | 5 | 142 | [29]
14 | 4 | 77 | [6]
15 | 3 | 126 | [30]
16 | 12 | 12 | [31]
18 | 20 | 404 | [32]
19 | 8 | 93 | [33]
20 | 6 | 20 | [34]
21 | 5 | 70 | [35]
24 | 4 | 84 | [36]
Total | 133 | 1546 | -
Table 3. The first 5 rows of the “Worksheet_3_Biodegradation_features” dataset.
Study_Id | Instance | Sample_Name | Weight_Loss %
4 | 5 | PCA20 | 2.69511, 4.61221, 6.68475, 9.5863, 11.76247, 13.47231, 22.90236
4 | 6 | PCA | 0.67438, 1.34796, 2.12516, 3.88682, 6.06299, 6.73656, 10.20806
5 | 1 | PHBV/SOIL | 0.57803, 4.62428, 9.24855, 10.4046
5 | 2 | PHBV/STARCH/SOIL | 0.86705, 5.78035, 10.4046, 12.4277
5 | 3 | PHBV/STARCH/SOIL | 2.60116, 8.09249, 13.8728, 15.8959
Table 4. The first 5 rows of the “Worksheet_4_Time_points_features” dataset.
Study_Id | Instance | Sample_Name | Degradation_Time (Days)
4 | 5 | PCA20 | 30, 60, 90, 150, 240, 300, 360
4 | 6 | PCA | 30, 60, 90, 150, 240, 300, 360
5 | 1 | PHBV/SOIL | 7, 14, 21, 28
5 | 2 | PHBV/STARCH/SOIL | 7, 14, 21, 28
5 | 3 | PHBV/STARCH/SOIL | 7, 14, 21, 28
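Tables 3 and 4 store, for each instance, a series of weight-loss values and a matching series of time points. The sketch below shows one possible way to combine the two worksheets into individual data pairs (time point, weight loss). It is only an illustration: the function name `to_data_pairs` and the dictionary layout are assumptions, and the values are the Study 5 rows shown in the two tables.

```python
# Study 5 rows of Worksheet_3 (weight loss, %) and Worksheet_4 (time, days),
# keyed by (Study_Id, Instance); values copied from Tables 3 and 4.
weight_loss = {
    (5, 1): [0.57803, 4.62428, 9.24855, 10.4046],
    (5, 2): [0.86705, 5.78035, 10.4046, 12.4277],
}
time_points = {
    (5, 1): [7, 14, 21, 28],
    (5, 2): [7, 14, 21, 28],
}


def to_data_pairs(times, losses):
    """Merge the two worksheets into long-format rows, one per data pair."""
    rows = []
    for key in sorted(times):
        t_series, w_series = times[key], losses[key]
        # Every time point must have exactly one weight-loss measurement.
        if len(t_series) != len(w_series):
            raise ValueError(f"length mismatch for instance {key}")
        study_id, instance = key
        for day, loss in zip(t_series, w_series):
            rows.append({
                "Study_id": study_id,
                "Instance": instance,
                "Degradation_time_(days)": float(day),
                "Weight_loss_%": loss,
            })
    return rows


pairs = to_data_pairs(time_points, weight_loss)
print(len(pairs))  # 8 data pairs: 2 instances x 4 time points
print(pairs[0])
```

Each resulting row corresponds to one of the data pairs counted in Table 2, and the column names follow those listed in Table 5.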
Table 5. Data type and completeness summary of the merged dataset.
Column | Completeness | Dtype
Study_id | 100.0% | int64
Instance | 100.0% | int64
Degradation_time_(days) | 100.0% | float64
Weight_loss_% | 100.0% | float64
Adjusted_HB_ratio_formulation (mol%) | 99.6% | float64
Adjusted_HV_ratio_formulation (mol%) | 99.6% | float64
Degradation_condition | 100.0% | object
Degradation_mechanism | 100.0% | object
Additives | 100.0% | object
T_deg | 81.7% | float64
Degradation_Environment | 100.0% | object
Additive_type_1 | 100.0% | object
Additive1_percentage (wt%) | 95.3% | float64
Additive_type_2 | 100.0% | object
Additive2_percentage (wt%) | 98.1% | float64
Additive_type_3 | 100.0% | object
Additive3_percentage (wt%) | 98.3% | float64
Sample_shape/Morphology | 100.0% | object
PHA_degrading_microbes | 100.0% | object
Experimental_Scale | 100.0% | object
Table 6. Category Grouping of the feature “Degradation Environment”.
Original Category | Count | Merged Category | Count
Soil | 572 | Soil | 586
field soil | 14 | Soil | -
marine | 508 | Marine/Aquatic | 608
freshwater | 100 | Marine/Aquatic | -
industrial compost | 126 | Compost/Organic Fertilizer | 250
compost | 110 | Compost/Organic Fertilizer | -
vermicompost | 4 | Compost/Organic Fertilizer | -
thermophilic compost | 4 | Compost/Organic Fertilizer | -
organic fertilizer | 6 | Compost/Organic Fertilizer | -
liquid mineral medium | 102 | Laboratory/Mineral Media | 102
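The grouping in Table 6 amounts to a lookup from each original category to its merged category. The following sketch reproduces the merged counts from the original counts in the table; the names `MERGE_MAP` and `ORIGINAL_COUNTS` are illustrative and are not taken from the authors' code.

```python
from collections import Counter

# Original category -> merged category, as listed in Table 6.
MERGE_MAP = {
    "Soil": "Soil",
    "field soil": "Soil",
    "marine": "Marine/Aquatic",
    "freshwater": "Marine/Aquatic",
    "industrial compost": "Compost/Organic Fertilizer",
    "compost": "Compost/Organic Fertilizer",
    "vermicompost": "Compost/Organic Fertilizer",
    "thermophilic compost": "Compost/Organic Fertilizer",
    "organic fertilizer": "Compost/Organic Fertilizer",
    "liquid mineral medium": "Laboratory/Mineral Media",
}

# Number of data pairs per original category, copied from Table 6.
ORIGINAL_COUNTS = {
    "Soil": 572,
    "field soil": 14,
    "marine": 508,
    "freshwater": 100,
    "industrial compost": 126,
    "compost": 110,
    "vermicompost": 4,
    "thermophilic compost": 4,
    "organic fertilizer": 6,
    "liquid mineral medium": 102,
}

# Sum the original counts within each merged category.
merged_counts = Counter()
for category, count in ORIGINAL_COUNTS.items():
    merged_counts[MERGE_MAP[category]] += count

print(dict(merged_counts))
print(sum(merged_counts.values()))
```

The merged counts sum to 1546, which matches the total number of data pairs reported in Table 2.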
Table 7. Summary of Category Grouping for the categorical features.
Feature | No. of Initial Categories | No. of Merged Categories
Degradation_mechanism | 9 | 4
Degradation_Environment | 10 | 4
Additive_type_1 | 7 | 5
Sample_shape/Morphology | 15 | 5
PHA_degrading_microbes | 18 | 5
Table 8. Model performance without outlier exclusion.
Model | Set | R² | RMSE | MAE
Random Forest | Training | 0.969 | 5.91 | 2.93
Random Forest | Test | 0.925 | 8.98 | 5.12
XGBoost | Training | 0.967 | 6.07 | 3.52
XGBoost | Test | 0.922 | 9.16 | 5.32
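For readers less familiar with the three metrics reported in Table 8, the sketch below computes R², RMSE and MAE from their standard definitions. It is a minimal illustration on made-up numbers and is not the evaluation code used in the study; the function name `regression_metrics` is an assumption.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R2, RMSE and MAE for paired lists of true and predicted values."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / n
    # Residual and total sums of squares.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R2 is undefined when all true values are identical.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return {"r2": r2, "rmse": rmse, "mae": mae}


# Made-up weight-loss values (%) used only to exercise the function.
y_true = [0.0, 10.0, 20.0, 30.0]
y_pred = [2.0, 9.0, 21.0, 28.0]
metrics = regression_metrics(y_true, y_pred)
print({name: round(value, 3) for name, value in metrics.items()})
```

Because RMSE squares the residuals before averaging, it is never smaller than MAE; a large gap between the two, as in the test rows of Table 8, suggests that a few predictions have comparatively large errors.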
Table 9. Optimized Hyperparameter Values for both regressors.
Random Forest Parameter | Best Value | XGBoost Parameter | Best Value
max_depth | 10 | max_depth | 6
min_samples_leaf | 1 | learning_rate | 0.01
min_samples_split | 5 | colsample_bytree | 0.8
n_estimators | 200 | n_estimators | 800
- | - | subsample | 0.7
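The tuned values in Table 9 can be written down as plain configuration dictionaries. The sketch below does this and, as an assumption, shows how they might be passed to `RandomForestRegressor` from scikit-learn and `XGBRegressor` from xgboost; the libraries actually used by the authors are not stated in this section, and `build_models` and `random_state` are hypothetical additions for the example.

```python
# Tuned hyperparameters copied from Table 9.
RF_PARAMS = {
    "max_depth": 10,
    "min_samples_leaf": 1,
    "min_samples_split": 5,
    "n_estimators": 200,
}
XGB_PARAMS = {
    "max_depth": 6,
    "learning_rate": 0.01,
    "colsample_bytree": 0.8,
    "n_estimators": 800,
    "subsample": 0.7,
}


def build_models(random_state=0):
    """Instantiate both regressors with the Table 9 settings.

    Assumes scikit-learn and xgboost are installed; `random_state` is a
    hypothetical seed added for reproducibility, not a value from the paper.
    """
    from sklearn.ensemble import RandomForestRegressor
    from xgboost import XGBRegressor

    rf = RandomForestRegressor(random_state=random_state, **RF_PARAMS)
    xgb = XGBRegressor(random_state=random_state, **XGB_PARAMS)
    return rf, xgb


print(sorted(RF_PARAMS))
print(sorted(XGB_PARAMS))
```

The imports are placed inside `build_models` so that the dictionaries can be inspected even when the two libraries are not installed.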
Table 10. Performance of tuned models without outlier exclusion.
Metric | Random Forest (Tuned) | XGBoost (Tuned)
Train R² | 0.955 | 0.959
Test R² | 0.930 | 0.931
CV Mean R² | 0.887 | 0.884
CV SD R² | 0.039 | 0.040
Table 11. Sensitivity analysis on the effect of the imputed degradation temperature (t_deg).
Model | t_deg | Train R² | Test R²
Random Forest (Tuned) | With | 0.955 | 0.930
Random Forest (Tuned) | Without | 0.948 | 0.918
XGBoost (Tuned) | With | 0.959 | 0.931
XGBoost (Tuned) | Without | 0.950 | 0.918
Table 12. Model performance on the unseen dataset.
Model | R² (Unseen Dataset)
Random Forest | 0.716
XGBoost | 0.829
Table 13. Key features influencing PHBV-based formulations’ weight loss as derived from ML models.
Feature | Influence on Weight Loss
Degradation Environment | This environmental factor was identified as the most influential feature in the RF model. Specifically, the soil environment was associated with significantly reduced predicted weight loss compared to other environments. This was attributed to soil’s restrictive nature, which included limited oxygen diffusion and heterogeneous moisture distribution.
Degradation Time | This temporal factor was identified as a dominant feature, especially in the XGBoost model. Naturally, weight loss increased over time as the physical disintegration of the polymer progressed.
Adjusted HB Ratio | Polymer composition, specifically the HB ratio, played a critical role. The models indicated that the chemical structure and monomer ratio of the PHBV-based formulations were key predictors of how quickly the material physically lost mass.
Additive Presence and Type | The inclusion of additives, such as natural fibers (e.g., wood flour) or plasticizers, significantly modified the degradation rate. For instance, natural fibers enhanced degradation by increasing water uptake and microbial accessibility.
Table 14. Comparison of model performance with and without outlier exclusion.
Model | Outlier Exclusion | Train R² | Test R² | CV Mean R² | CV SD R² | Unseen R²
Random Forest | No | 0.955 | 0.930 | 0.887 | 0.039 | 0.716
Random Forest | Yes | 0.958 | 0.911 | 0.890 | 0.015 | 0.741
XGBoost | No | 0.959 | 0.931 | 0.884 | 0.040 | 0.829
XGBoost | Yes | 0.956 | 0.897 | 0.884 | 0.014 | 0.856