Article

Artificial Intelligence and Extraction of Bioactive Compounds: The Case of Rosemary and Pressurized Liquid Extraction

by Martha Mantiniotou 1, Vassilis Athanasiadis 1, Konstantinos G. Liakos 2, Eleni Bozinou 1 and Stavros I. Lalas 1,*
1 Department of Food Science and Nutrition, University of Thessaly, Terma N. Temponera Street, 43100 Karditsa, Greece
2 Department of Electrical and Computer Engineering, University of Thessaly, Sekeri Street, 38334 Volos, Greece
* Author to whom correspondence should be addressed.
Processes 2025, 13(6), 1879; https://doi.org/10.3390/pr13061879
Submission received: 19 May 2025 / Revised: 10 June 2025 / Accepted: 12 June 2025 / Published: 13 June 2025

Abstract

Rosemary (Rosmarinus officinalis or Salvia rosmarinus) is an aromatic herb that possesses numerous health-promoting and antioxidant properties. Pressurized Liquid Extraction (PLE) is an efficient, environmentally friendly technique for obtaining valuable compounds from natural sources. The optimal PLE conditions were established as 25% v/v aqueous ethanol at 160 °C for 25 min, with a liquid-to-solid ratio of 10 mL/g. The optimal extract exhibited high polyphenol content and strong antioxidant activity across several assays. The recovered bioactive compounds have potential applications in the food, pharmaceutical, and cosmetics sectors, in addition to serving as feed additives. This research compares two distinct optimization models: one statistical, derived from experimental data, and the other based on artificial intelligence (AI). The objective was to evaluate whether AI could replicate the experimental models and ultimately supplant the laborious experimental process, yielding the same results more rapidly and adaptably. To further enhance data interpretation and predictive capability, six machine learning models were applied to the original dataset. Owing to the limited sample size, synthetic data were generated using Random Forest (RF)-based resampling and Gaussian noise addition. The augmented dataset significantly improved model performance, and among the models tested, the RF algorithm achieved the highest accuracy.

1. Introduction

Rosemary (Rosmarinus officinalis L.), a perennial species of the Lamiaceae family, is distinguished by its distinctive scent, culinary use, and therapeutic properties [1]. Phylogenetic studies have established that rosemary belongs to the genus Salvia and is accordingly referred to as Salvia rosmarinus [2]. Rosemary originates from the Mediterranean region, but this aromatic plant, characterized by its needle-like foliage, is now cultivated successfully in numerous locations worldwide [3]. Rosemary’s therapeutic properties have been exploited in traditional folk medicine to address a range of ailments, including headaches, stomach discomfort, and respiratory disorders, as well as for pain relief [1,4,5,6].
Several methodologies have been investigated for the recovery of bioactive compounds, mainly rosmarinic acid and carnosic acid, from rosemary leaves at a laboratory scale. Certain extraction processes, especially conventional methods, are often associated with various disadvantages, including the utilization of hazardous solvents, the degradation of target compounds resulting from elevated temperatures, prolonged extraction durations, challenges in implementation, and significant economic and energy expenditures. In recent years, the concepts of “Green chemistry” and “eco-extraction” have emerged [7]. Recent studies indicate that extraction processes have become more energy-efficient, safer for users, and environmentally friendly compared to previous methods, all while maintaining extraction efficiency. The intensification of extraction processes, considering these various aspects, should emerge as a new challenge for the design of such processes.
Despite extensive research on rosemary leaf extracts, gaps remain in exploring less-studied aspects and emerging opportunities driven by technological advancements. One such critical area is the use of green, non-toxic solvents for the sustainable recovery of bioactive compounds from plant materials. This study aims to bridge this gap by investigating the combined effects of green solvent mixtures, optimizing extraction conditions, and evaluating their performance. Rosemary leaves were subjected to Pressurized Liquid Extraction (PLE), which combines elevated pressure to enhance mass transfer and elevated temperatures to help facilitate the diffusion of the solvent into the sample by diminishing its viscosity [8]. The study examined the influence of eco-friendly solvent mixtures, specifically water and ethanol, alongside key process parameters such as temperature and extraction duration. Additionally, a partial least squares (PLS) model was utilized to identify the optimal extraction conditions.
In parallel, the growing integration of artificial intelligence (AI), particularly machine learning (ML) and Deep Learning (DL), has enabled more accurate modeling, prediction, and optimization across the food, beverage, pharmaceutical, and cosmetic industries. ML techniques are increasingly applied in bioactive compound prediction, formulation optimization [9], sensory analysis, and green extraction process modeling [10]. Recent studies demonstrate the effectiveness of algorithms such as Random Forest (RF), Support Vector Machines (SVMs), and Artificial Neural Networks (ANNs) in predicting antioxidant capacity, total polyphenol content (TPC), and other physicochemical properties from experimental variables [11]. In this context, the present study incorporates multiple machine learning approaches to predict the antioxidant potential of rosemary extracts under varying PLE conditions, thereby enhancing process efficiency and supporting sustainable product development in food, nutraceutical, and cosmetic applications [12].
While ML methods have been increasingly used for modeling extraction processes, most prior studies rely on relatively large experimental datasets or focus on specific extraction methods with extensive data availability. In contrast, PLE of rosemary leaves is a process with high experimental cost and limited data availability, which hinders the effective application of conventional ML techniques. This study addresses this gap by integrating ML models with data augmentation strategies to enhance model robustness under small-sample conditions. To our knowledge, this is one of the first studies applying such an approach to optimize green extraction of bioactive compounds from rosemary, offering insights that can support broader adoption of AI-assisted extraction workflows.

2. Materials and Methods

2.1. Chemicals and Reagents

A deionizing column (mixed-bed ion exchange resin, ensuring conductivity below 1 µS/cm at standard flow rate and operating pressure) was used to produce the deionized water for all experiments. All polyphenolic standards for the HPLC determination, along with L-ascorbic acid (99%), 2,4,6-tris(2-pyridyl)-s-triazine (TPTZ) (≥98%), 2,2-diphenyl-1-picrylhydrazyl (DPPH), and hydrochloric acid (37%), were purchased from Sigma-Aldrich (Darmstadt, Germany) and were of at least 97% purity. Acetonitrile was acquired from Labkem (Barcelona, Spain). Sodium carbonate (anhydrous, 99.5%), rutin (≥94%), and formic acid (99.8%) were purchased from Penta (Prague, Czech Republic). Iron (III) chloride hexahydrate (97%) was obtained from Merck (Darmstadt, Germany). Folin–Ciocalteu reagent, gallic acid (97%), and ethanol (99.8%) were acquired from Panreac Co. (Barcelona, Spain).

2.2. Rosemary Leaves’ Raw Material and Sample Preparation

Rosemary leaves were obtained from a local plant shop in the Karditsa region (Central Greece). The leaves were washed carefully and manually dried with paper towels. They were then lyophilized in a Biobase BK-FD10 freeze-dryer (Jinan, China). The moisture content was determined as 53.2 ± 3.8%. The dried material was then sieved in an Analysette 3 PRO sieving machine (Fritsch GmbH, Oberstein, Germany), and a powder with an average particle diameter of 497 μm was obtained. The powder was stored in a freezer at –40 °C until further analysis.

2.3. Experimental Design

A custom-designed Response Surface Methodology (RSM) with four factors at five levels was employed to optimize the extraction conditions for TPC, antioxidant activity (FRAP and DPPH assays), and ascorbic acid content (AAC) obtained from rosemary powder by the Pressurized Liquid Extraction (PLE) technique. A PLE system (Fluid Management Systems, Inc., Watertown, MA, USA) was used for all extractions. The independent variables examined were the ethanol concentration (C, % v/v) as X1, the liquid-to-solid ratio (R, mL/g) as X2, the extraction temperature (T, °C) as X3, and the extraction time (t, min) as X4. To assess the method’s repeatability, 17 experimental runs, including one central point, were conducted; each run was replicated three times, and the average response values were used for subsequent analysis.
Stepwise regression was utilized to refine the model’s predictive precision by reducing the variance arising from the estimation of superfluous terms, leading to a second-order polynomial equation that relates each response to the four independent variables:
$Y_k = \beta_0 + \sum_{i=1}^{4}\beta_i X_i + \sum_{i=1}^{4}\beta_{ii} X_i^2 + \sum_{i=1}^{3}\sum_{j=i+1}^{4}\beta_{ij} X_i X_j$ (1)
where Xi and Xj denote the independent variables and Yk the predicted response variable. In the model, β0 is the intercept, while βi, βii, and βij are the regression coefficients of the linear, quadratic, and interaction terms, respectively.
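The RSM fitting and optimization in this study were performed in JMP Pro 16 (see Section 2.9). Purely as an illustration, a second-order model of the form of Equation (1) could be fit in Python with scikit-learn as sketched below; the design points, ranges, and response values are placeholders rather than the published data, and the stepwise term selection used by the authors is not reproduced.

```python
# Illustrative sketch only: the RSM model in this study was fitted in JMP Pro 16.
# Placeholder design points and a toy response are used, not the published data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(0, 100, 17),   # X1: ethanol concentration, % v/v (placeholder range)
    rng.uniform(10, 30, 17),   # X2: liquid-to-solid ratio, mL/g
    rng.uniform(40, 160, 17),  # X3: temperature, degC
    rng.uniform(5, 25, 17),    # X4: time, min
])
y = 10 + 0.05 * X[:, 2] + 0.1 * X[:, 3] - 0.02 * X[:, 0] + rng.normal(0, 1, 17)  # toy TPC

# degree=2 generates the linear (X_i), quadratic (X_i^2), and interaction (X_i X_j)
# terms that make up Equation (1).
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                      LinearRegression())
model.fit(X, y)
print(model.predict(np.array([[25.0, 10.0, 160.0, 25.0]])))
```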

2.4. Total Polyphenolic Content (TPC) Determination Through Spectrophotometric Evaluation

The Folin–Ciocalteu methodology [13] was used to evaluate TPC and express the results in milligrams of gallic acid equivalents (GAEs) per gram of dry weight (dw). A calibration curve (10–100 mg/L of gallic acid, R2 = 0.9996) in water was used to assess the results. Briefly, after mixing 100 μL of the properly diluted extract with 100 μL of the Folin–Ciocalteu reagent for 2 min, 800 μL of a 5% w/v sodium carbonate solution was subsequently added. Following a 20 min incubation at 40 °C, in the absence of light exposure, the absorbance of the solution was measured at 740 nm in a Shimadzu UV-1900i UV/Vis spectrophotometer (Kyoto, Japan). Sample incubation at 40 °C was conducted utilizing an Elmasonic P70H ultrasonic bath from Elma Schmidbauer GmbH (Singen, Germany). Each analysis was performed in triplicate and the average was used to assess the results.

2.5. Ferric-Reducing Antioxidant Power (FRAP) Evaluation of Antioxidant Activity

A previously established study provides a thorough description of the method used to test the antioxidant capacity of the extracts utilizing the common electron-transfer method [13]. This method entailed identifying the decrease in the iron oxidation state from +3 to +2. Briefly, 50 μL of the properly diluted sample was combined with 50 μL of FeCl3 solution (4 mM in 0.05 M HCl). Subsequently, the samples were incubated at 37 °C for 30 min. After a 5 min interval, 900 μL of TPTZ solution (1 mM in 0.05 M HCl) was added, and the absorbance was measured at 620 nm. A calibration curve of ascorbic acid (50–500 μM in 0.05 M HCl, R2 = 0.9997) was utilized, and the results were expressed as μmol of ascorbic acid equivalents (AAEs) per gram of dw. Each analysis was performed in triplicate and the average was used to evaluate the results.

2.6. Evaluation of Radical Scavenging Activity

A previously described assay [14] for DPPH scavenging was employed. The absorbance at 515 nm was initially measured immediately and 30 min later by combining 25 μL of properly diluted sample extract with 975 μL of DPPH solution (100 μmol/L in methanol). A calibration curve of the antiradical activity of ascorbic acid (100–1000 μmol/L in methanol, R2 = 0.9926) was used, and the results were expressed as μmol of ascorbic acid equivalents (AAEs) per gram of dw. Each analysis was performed in triplicate and the average was used to evaluate the results.

2.7. HPLC Quantification of Polyphenolic Compounds

High-Performance Liquid Chromatography coupled with a Diode Array Detector (HPLC-DAD) was used to identify the individual polyphenols in the rosemary leaf extracts, based on our prior research [15]. The liquid chromatograph (model CBM-20A) and diode array detector (model SPD-M20A) were supplied by Shimadzu Europa GmbH (Duisburg, Germany). The detection wavelength ranged from 200 to 800 nm. The compounds were injected at a volume of 20 μL and separated at 40 °C on a Phenomenex Luna C18(2) column (100 Å, 5 μm, 4.6 mm × 250 mm) from Phenomenex Inc. (Torrance, CA, USA). The mobile phase consisted of 0.5% formic acid in acetonitrile (B) and 0.5% aqueous formic acid (A). The gradient program started at 0% B and increased to 40% B, then to 50% B over 10 min and to 70% B over a further 10 min, where it was held constant for 10 min. The mobile phase flow rate was kept constant at 1 mL/min. Compounds were identified by comparing their absorbance spectra and retention times to those of purified standards and were quantified using calibration curves (0–50 μg/mL).

2.8. Ascorbic Acid Content (AAC)

The ascorbic acid content of the samples was quantified as mg/g of dry weight, as previously described by Athanasiadis et al. [15]. A total of 500 μL of 10% (v/v) Folin–Ciocalteu reagent and 100 μL of sample extract were combined with 900 μL of 10% (w/v) trichloroacetic acid in an Eppendorf tube. The absorbance was promptly assessed at 760 nm following 10 min of storage in darkness.

2.9. Statistical Analysis

The RSM and distribution analysis were statistically evaluated utilizing JMP® Pro 16 software (SAS, Cary, NC, USA). The Kolmogorov–Smirnov test assessed the normality of the data. ANOVA and the Tukey HSD multiple comparison test were employed to ascertain any significant differences. The results were reported as means accompanied by measures of variability.

2.10. Initial Data Set Exploration and Visualization

The initial dataset comprised the 17 experimental samples of rosemary extract that, as described above, were evaluated under varying PLE conditions. For the development of the ML models, four features were used as inputs: ethanol concentration (% v/v), liquid-to-solid ratio (mL/g), extraction temperature (°C), and extraction time (min); four features were used as outputs: TPC, FRAP, DPPH, and AAC. Of the data, 80% was allocated for training and 20% for testing the ML models. Figure 1 presents the distribution of experimental variables and antioxidant responses, illustrating balanced sampling across extraction parameters and greater variability in antioxidant outcomes.
To assess variability and central tendencies, a combined boxplot was created (Figure 2). The extraction parameters (C, RL/S, T, and t) showed narrow interquartile ranges and symmetry, indicating controlled conditions. In contrast, the antioxidant responses—especially FRAP and DPPH—exhibited wide variability and outliers, reflecting greater sensitivity to extraction settings. TPC showed moderate spread, while AAC remained tightly clustered. These patterns suggest that antioxidant outcomes are more affected by experimental variation than the input parameters.
To assess the variation across extraction conditions and antioxidant responses, two complementary heatmaps were produced and are presented together in Figure 3. Plot (A) displays the raw value matrix, highlighting absolute differences among samples. The high-intensity region in the FRAP column (design point 13) corresponds to elevated antioxidant activity, also reflected in TPC. AAC values remained consistently low across all samples. Plot (B) shows the standardized (z-score) version of the same matrix, enabling scale-independent comparison. The same sample exhibited z-scores above +2 in FRAP and TPC, confirming its outlier status. Other samples with moderate DPPH or AAC responses became more distinct through normalization. Together, the heatmaps reveal both high-performing conditions and hidden patterns across the dataset.
To explore the variable relationships, a Pearson correlation matrix was computed (Figure 4). Strong positive correlations between TPC, FRAP, and DPPH (r = 0.81–0.91) indicate phenolics’ central role in antioxidant capacity. The ethanol concentration (C, %) was negatively correlated with both TPC and FRAP, suggesting diminishing returns at higher concentrations. The temperature and solvent ratio showed moderate positive correlations with AAC, while the extraction time had minimal influence on any response. These findings align with the heatmap results and underscore the compound-specific effects of extraction parameters.
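The text does not specify the software used for these exploratory plots. The sketch below shows, under the assumption of a pandas-based workflow, how the z-score standardization behind Figure 3B and the Pearson correlation matrix of Figure 4 could be computed; the values are placeholders rather than the experimental data.

```python
# Minimal exploratory sketch (assumed pandas workflow; placeholder values, not the
# real 17 design points).
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "C (%)": rng.uniform(0, 100, 17),
    "RL/S (mL/g)": rng.uniform(10, 30, 17),
    "T (degC)": rng.uniform(40, 160, 17),
    "t (min)": rng.uniform(5, 25, 17),
    "TPC": rng.uniform(20, 80, 17),
    "FRAP": rng.uniform(200, 900, 17),
    "DPPH": rng.uniform(100, 550, 17),
    "AAC": rng.uniform(1, 10, 17),
})

# Column-wise z-scores enable the scale-independent comparison of Figure 3B
z_scores = (df - df.mean()) / df.std(ddof=0)

# Pearson correlation matrix between extraction parameters and responses (Figure 4)
corr = df.corr(method="pearson")
print(corr.round(2))
# A heatmap could then be rendered, e.g. with seaborn:
# import seaborn as sns; sns.heatmap(corr, annot=True, cmap="vlag")
```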

2.11. ML Regressor Development

To develop the ML-based regressors for the initial dataset, six regression algorithms were trained to model the relationships between the extraction parameters and the antioxidant responses. The models included Linear Regression [16], Ridge Regression [17], Lasso Regression [18], RF Regression, Gradient Boosting (GB) Regression [19], and Adaptive Boosting (AdaBoost) Regression [20].
Each model was implemented as a multi-output regressor. Hyperparameter tuning was conducted using grid search with 5-fold cross-validation, owing to the limited amount of data. The Ridge and Lasso regressors were tuned for regularization strength (α), while the tree-based models were optimized for the number of estimators, maximum depth, learning rate (for the boosting models), and minimum samples per split. The full list of model parameters and their tested values is shown in Table 1.
The selected hyperparameters were chosen for their direct impact on model complexity, generalization, and performance. For the regularized linear models (Ridge and Lasso), the regularization strength α controls the degree of penalty applied to large coefficients, helping to reduce overfitting, which is especially important in small datasets. Smaller values allow more flexibility, while larger values enforce stronger shrinkage of less informative predictors. To further mitigate the risk of overfitting, all models were trained and evaluated using 5-fold cross-validation; in addition, regularization (the Ridge and Lasso penalties) and parameter tuning were applied to control model complexity and improve generalization, particularly given the small size of the original dataset.
In the tree-based models RF, GB, and AdaBoost, the “number of estimators” determines how many trees are used to build the ensemble; more trees generally improve performance but increase computation. The “maximum tree depth” controls how complex each tree can be, balancing fit versus overfitting. The “minimum samples split” parameter sets the minimum number of samples required to split a node, helping to regularize the model by preventing overly deep trees. The “learning rate”, used in boosting algorithms, scales how much each tree contributes to the final prediction—lower rates typically yield better generalization at the cost of longer training.
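The article does not name the ML library used. A minimal sketch of the tuning setup described above, assuming a scikit-learn implementation, is given below; the parameter grids are illustrative stand-ins for the values listed in Table 1, and the data arrays are placeholders.

```python
# Sketch of the multi-output grid search with 5-fold cross-validation (assumed
# scikit-learn). Grids are illustrative; the tested values appear in Table 1.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(2)
X = rng.uniform(size=(17, 4))   # placeholder inputs: C, R, T, t
Y = rng.uniform(size=(17, 4))   # placeholder outputs: TPC, FRAP, DPPH, AAC

# Regularized linear model tuned for alpha
ridge = GridSearchCV(MultiOutputRegressor(Ridge()),
                     param_grid={"estimator__alpha": [0.01, 0.1, 1.0, 10.0]},
                     cv=5, scoring="neg_root_mean_squared_error")
ridge.fit(X, Y)

# Tree ensemble tuned for ensemble size, depth, and minimum samples per split
rf = GridSearchCV(MultiOutputRegressor(RandomForestRegressor(random_state=0)),
                  param_grid={"estimator__n_estimators": [100, 300],
                              "estimator__max_depth": [3, 5, None],
                              "estimator__min_samples_split": [2, 4]},
                  cv=5, scoring="neg_root_mean_squared_error")
rf.fit(X, Y)
print(ridge.best_params_, rf.best_params_)
```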

2.12. Machine Learning Regressor Evaluation

In this study, the performance of the regression models was evaluated using four standard metrics: Mean Absolute Error (MAE) [21], Mean Squared Error (MSE) [21], Root Mean Squared Error (RMSE) [22], and the Coefficient of Determination (R2) [23]. These metrics quantify the difference between the predicted values $\hat{y}_i$ and the actual experimental values $y_i$, based on a total of n observations.
MAE measures the average magnitude of errors in a set of predictions, without considering their direction. It is calculated as the mean of the absolute differences between actual and predicted values (2).
$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left| y_i - \hat{y}_i \right|$ (2)
MSE penalizes larger errors more strongly by squaring them. It is the average of the squared differences between actual and predicted values (3).
$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left( y_i - \hat{y}_i \right)^2$ (3)
RMSE is the square root of the MSE and provides an error measure in the same units as the original response variable, making it more interpretable (4):
$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left( y_i - \hat{y}_i \right)^2}$ (4)
R2 represents the proportion of variance in the actual values that is predictable from the independent variables. A value of 1 indicates perfect prediction, while 0 means the model explains none of the variance (5):
$R^2 = 1 - \frac{\sum_{i=1}^{n}\left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n}\left( y_i - \bar{y} \right)^2}$ (5)
These metrics together offer a robust framework for comparing the prediction accuracy, error dispersion, and explanatory power of each machine learning regressor tested in this study.
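For completeness, a minimal sketch of Equations (2)–(5), assuming scikit-learn's metric functions and placeholder arrays, is shown below.

```python
# Minimal sketch of the evaluation metrics in Equations (2)-(5), assuming
# scikit-learn; the arrays are placeholder values, not study results.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.1, 2.4, 5.6, 4.8, 3.9])   # placeholder actual values
y_pred = np.array([2.9, 2.7, 5.1, 4.9, 4.2])   # placeholder predictions

mae = mean_absolute_error(y_true, y_pred)        # Equation (2)
mse = mean_squared_error(y_true, y_pred)         # Equation (3)
rmse = np.sqrt(mse)                              # Equation (4)
r2 = r2_score(y_true, y_pred)                    # Equation (5)
print(f"MAE={mae:.3f}, MSE={mse:.3f}, RMSE={rmse:.3f}, R2={r2:.3f}")
```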

2.13. Generative Model Development

To address the limitations imposed by the small sample size (n = 17), a synthetic dataset was generated to enhance the modeling capacity and generalization of the machine learning algorithms. New input combinations were uniformly sampled within the observed range of the original extraction parameters: ethanol concentration (C, % v/v), liquid-to-solid ratio (RL/S, mL/g), extraction temperature (T, °C), and extraction time (t, min). A total of 100 synthetic input samples were created to expand the dataset in a balanced and controlled manner.
To estimate the corresponding antioxidant responses, namely total polyphenol content (TPC), ferric reducing antioxidant power (FRAP), DPPH radical scavenging activity, and ascorbic acid content (AAC), a pre-trained RF model, previously identified as the best-performing regressor, was employed. RF is an ensemble learning method based on decision trees that captures nonlinear relationships by aggregating the predictions of multiple base learners. The predicted output $\hat{y}$ for a given input vector $x$ is computed as the average of the predictions from all trees in the forest (6):
$\hat{y} = \frac{1}{T}\sum_{t=1}^{T} f_t(x)$ (6)
where T is the total number of decision trees in the ensemble, and $f_t(x)$ is the prediction of the t-th tree for input $x$. In this study, RF was applied both as a predictive model and, in the generative phase, to estimate antioxidant outcomes for synthetically generated feature combinations.
To simulate natural variability and reduce overfitting to deterministic predictions, Gaussian noise was added to the RF-generated outputs. This augmentation mimics experimental uncertainty and improves the realism of the synthetic samples. The final synthetic response $\tilde{y}$ was computed as follows (7):
$\tilde{y} = \hat{y} + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)$ (7)
where $\epsilon$ is a noise term drawn from a normal distribution with zero mean and variance $\sigma^2$. In this study, σ was set to 5% of the standard deviation of the respective real target variable, providing a balance between stability and variability.
The resulting synthetic data points were merged with the original experimental dataset to create a mixed dataset. The rationale for this approach is that the RF model captures the complex nonlinear interactions present in the original data, while the controlled addition of Gaussian noise (set at 5% of the standard deviation of each real target variable) introduces realistic variability and avoids overfitting to deterministic predictions, balancing model fidelity with enhanced generalization potential. Owing to the limitations of the available computational resources, the size of the synthetic dataset was intentionally kept small (100 samples) to allow an initial proof-of-concept evaluation. Future work will explore more extensive data augmentation using more advanced generative techniques and more powerful hardware.
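A minimal sketch of this augmentation procedure, assuming a NumPy/scikit-learn implementation with placeholder data, is given below: inputs are sampled uniformly within the observed ranges, the four responses are predicted with the pre-trained RF model (Equation (6)), Gaussian noise at 5% of each real target's standard deviation is added (Equation (7)), and the synthetic samples are merged with the real ones.

```python
# Sketch of the augmentation described above (assumed NumPy/scikit-learn);
# the real dataset here is a placeholder with the same shape as in the study.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X_real = rng.uniform(size=(17, 4))          # placeholder C, R, T, t
Y_real = rng.uniform(size=(17, 4))          # placeholder TPC, FRAP, DPPH, AAC

rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_real, Y_real)

# 1) 100 synthetic input combinations, uniformly sampled within observed ranges
lo, hi = X_real.min(axis=0), X_real.max(axis=0)
X_syn = rng.uniform(lo, hi, size=(100, 4))

# 2) RF-predicted responses, Equation (6)
Y_hat = rf.predict(X_syn)

# 3) Gaussian noise with sigma = 5% of each real target's std, Equation (7)
sigma = 0.05 * Y_real.std(axis=0, ddof=1)
Y_syn = Y_hat + rng.normal(0.0, sigma, size=Y_hat.shape)

# 4) Mixed dataset = real + synthetic samples
X_mix = np.vstack([X_real, X_syn])
Y_mix = np.vstack([Y_real, Y_syn])
print(X_mix.shape, Y_mix.shape)   # (117, 4) (117, 4)
```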

3. Results and Discussion

3.1. Optimization of PLE Parameters

The extraction procedure may be challenging due to the presence of several distinct bioactive compounds which lead to variations in solubility and polarity [24]. Moreover, various processing parameters along with the extraction technique might significantly affect both the extract yield and antioxidant capacity. Table 2 presents how the variables under investigation affect the examined responses, while in Table 3 the ANOVA applied to the RSM quadratic polynomial model is presented.

3.1.1. Model Analysis

The following Equations (8)–(11) represent regression models related to the extraction process, predicting key response variables: total phenolic content (TPC), ferric reducing antioxidant power (FRAP), DPPH radical scavenging capacity, and ascorbic acid content (AAC). Each equation includes linear, quadratic, and interaction terms, highlighting the complex relationships between experimental factors. The models contain only significant terms. The regression models highlight the impact of solvent composition, temperature, and duration on extraction efficiency. Notably, the linear and quadratic terms suggest nonlinear relationships between variables, indicating optimal conditions for maximizing antioxidant yield. The FRAP and DPPH equations show a strong dependency on the extraction conditions, particularly the extraction time (X4). The presence of interaction terms suggests that the combined effects of multiple variables influence antioxidant potential, emphasizing the need for precise parameter optimization. Longer extraction times allow more bioactive compounds, including antioxidants, to dissolve into the solvent. The presence of the quadratic term (X4²) and the interactions (X2X4 and X1X4) suggests that the extraction time has an optimal range: too short a time may limit compound release, while excessive duration could lead to degradation or reduced efficiency. The interaction terms imply that the extraction time does not act alone. For example, X2X4 in DPPH suggests that time interacts with another variable, the liquid-to-solid ratio, to influence antioxidant capacity.
TPC = 11.74 + 0.49X1 − 1.22X2 + 0.69X3 + 1.51X4 − 0.006X1² + 0.017X2² − 0.002X3² + 0.002X1X2 − 0.016X1X4 − 0.002X2X3 − 0.015X2X4 (8)
FRAP = −58.84 + 6.67X1 − 1.73X2 + 3.51X3 + 63.39X4 − 0.107X1² − 1.31X4² + 0.11X1X2 − 0.038X1X3 − 0.44X2X4 (9)
DPPH = −330.79 + 4.24X1 − 6.60X2 + 10.86X3 + 23.97X4 − 0.041X1² + 0.089X2² − 0.036X3² + 0.061X1X2 − 0.018X1X3 − 0.182X1X4 − 0.029X2X3 − 0.274X2X4 (10)
AAC = 0.14 + 0.13X1 + 0.08X2 + 0.05X3 − 0.0007X1² − 0.0003X1X3 (11)
Figure 5 shows how each parameter and their combinations affect the responses of the parameters under study. The predicted optimal values of PLE parameters along with the predicted TPC and FRAP, DPPH, and AAC values, along with the desirability of the model, are presented in Table 4.

3.1.2. Impact of Extraction Parameters on Assays Through Pareto Plot Analysis

In a Pareto plot (Figure 6), the orthogonal estimate typically refers to a statistical method used to estimate the effects of different factors while minimizing the correlation between them. This approach helps in identifying the most significant contributors to a given outcome by ensuring that the estimates are independent of each other.
It seems that temperature (parameter X3) has a significantly positive effect on all responses. Another factor that seems to be very important is the solvent composition (X1), where increasing the percentage of ethanol has a negative effect on all responses except ascorbic acid, where it has a positive effect. It is worth noting that the extraction duration (X4) does not significantly affect any of the responses, but there is a trend where increased extraction times positively affect all responses.

3.2. Principal Component Analysis (PCA) and Multivariate Component Analysis (MCA)

The interactions between assays and extraction conditions were investigated through correlation analyses, which included PCA and MCA, as illustrated in Figure 7 and described in Table 5, respectively. The correlation analyses were conducted to ascertain the relationships between the variables and TPC, FRAP, DPPH, and AAC within the context of PCA. The chart demonstrates that PC1 and PC2 contributed 67.6% and 24.9% of the variance, respectively, together accounting for 92.5% of the total variance. The independent variables were therefore considered to exert a significant influence on the analysis. The graph demonstrated that TPC, FRAP, DPPH, and the extraction temperature and duration (X3 and X4) were positively correlated within both components and were represented in close proximity. AAC was considerably improved by the increased concentration of ethanol (X1) and liquid-to-solid ratio (X2), which explains their strong correlation; their combined impact on the extraction parameters was comparable. Conversely, the placement of AAC in PC2, at a significant distance from the other variables, may indicate a diminished relationship between them. Previous research has suggested a positive correlation between an increase in ethanol concentration and AAC recovery [25].
In addition, the MCA provides further insights into the interrelationships between variables. The primary benefit of this approach is its ability to determine the degree of positive or negative correlation between the variables under investigation. Table 5 delineates the results of this investigation. The pattern of robust positive correlations (>0.77) between the antioxidant assays and total phenolic content (TPC) was previously substantiated [26]. Finally, the negative correlation between ascorbic acid content (AAC) and all other responses (TPC, FRAP, and DPPH) should be emphasized; it is particularly noteworthy that a molecule with considerable antioxidant activity of its own correlates negatively with the antioxidant assays.

3.3. Partial Least Squares (PLS) Analysis

The PLS model was employed to assess the influence of the extraction condition parameters (X1, X2, X3, and X4). Figure 8 illustrates the prediction profiler alongside a desirability function that features extrapolation control and includes a variable importance plot (VIP). The extraction of bioactive compounds is significantly influenced by various factors, with temperature, solvent composition, and extraction duration being the most critical [27]. Initially, it is important to note that the extraction process can be complicated by the differing solubility and polarity of polyphenols [28]. Concerning the PLE technique, it is evident that the X1 parameter exhibited the most statistically significant impact (p < 0.05) compared to the other parameters in the extraction process, as demonstrated by the variable importance plot (VIP) presented in Figure 8B. The observations previously noted from the 3D models of the response surface were corroborated in Figure 8A, indicating that the optimal conditions were 25% v/v aqueous ethanol, a liquid-to-solid ratio of 10 mL/g, and a temperature of 160 °C. The high efficiency observed at 160 °C is likely due to the enhanced solubility of polyphenols and increased solvent penetration into the plant material. Elevated temperatures reduce surface tension, improving mass transfer and extraction yield. Concerning the duration of extraction, it appeared to exert the least significant influence on the process; consequently, the longest duration was selected, as it favored AAC recovery. The extraction process was not significantly influenced by the temperature or extraction duration; nevertheless, elevated temperatures coupled with long extraction times were favored. The solute–matrix interaction can be significantly reduced by the PLE technique, which is primarily due to the influence of van der Waals forces or hydrogen bonds, particularly in the presence of elevated temperature and pressure. This reduces energy demands, improves the efficacy of solute molecular extraction, and decreases the viscosity of the solvent. This reduces the solvent’s resistance to the matrix, thereby facilitating its diffusion into the sample [29]. The model favored a prolonged extraction duration, as prior research has substantiated the effectiveness of both brief [30] and prolonged [28] intervals. While elevated temperatures also facilitate the extraction of bioactive compounds in other techniques, such as stirring [31], by enhancing their solubility, it is important to note that many thermolabile compounds may experience degradation under these conditions [32].
Table 6 shows the TPC, FRAP, DPPH, and AAC values of the optimal extract. The results of the present study are worth comparing with those of our previous work, in which four different extraction techniques were applied to rosemary leaves, namely stirring, pulsed electric field (PEF)-assisted extraction, and ultrasound probe- and ultrasound bath-assisted extraction [31]. It is noteworthy that in that work the highest TPC was obtained by stirring, yet in the present work PLE gave a ~320% higher yield. A similar pattern was observed for FRAP, where PLE resulted in a ~455% higher yield. The highest DPPH value was previously observed for ultrasound bath-assisted extraction; however, PLE gave a ~516% greater result. Regarding AAC, ultrasound probe-assisted extraction gave the highest value, while PLE showed only ~10% better performance than PEF-assisted extraction. Unlike conventional methods, PLE offers improved recovery of bioactive compounds in a shorter time frame, reducing energy consumption and solvent waste. Traditional extraction methods, such as ultrasound-assisted extraction (UAE), exhibit lower polyphenol recovery than PLE. Other researchers have also studied the TPC and antioxidant capacity of rosemary leaves. More specifically, Hashem Hashempur et al. [33] utilized a deep eutectic solvent consisting of ammonium acetate and lactic acid, along with ultrasound, and the TPC we obtained was ~334% higher than their result. Kabubii et al. [34] also determined the TPC of crude rosemary extracts, which was ~52% lower than ours. In general, PLE treatment of rosemary leaves appears to lead to higher yields than other extraction techniques.
The experimental results and PLS model predictions exhibit outstanding concordance, as evidenced by the high correlation coefficient of 0.981 and the substantial R2 value of 0.962. Furthermore, the p-value of less than 0.0001 indicates that the relationship between the actual and predicted values is statistically significant.
Table 7 lists the individual polyphenols identified in the optimal extract by HPLC-DAD, while Table 8 provides the calibration equations of the standard compounds. The compound with the highest concentration was hesperidin, followed by rosmarinic acid and quercetin 3-D-galactoside. In our previous work, the compound with the highest concentration was rosmarinic acid in all cases; it is therefore worth noting how the parameters applied in each extraction technique greatly affect the profile of the final extracts obtained. Nevertheless, the same compounds were also identified in this work. Other researchers, such as Xie et al. [35], Sammer and Samarrai [36], Baptista et al. [37], and Miljanović et al. [38], have reported compounds such as hesperidin, apigenin and its derivatives, rosmarinic acid, and carnosic acid in rosemary leaves.

3.4. Performance of Machine Learning Regressors on the Original Data

In the following sections, we focus on reporting the key findings regarding the performance of the ML regressors, with an emphasis on practical insights relevant to extraction optimization. The detailed technical analysis is intentionally limited, in line with the overall scope of this experimental study.
From Figure 9, it can be observed that all regression models achieved relatively good performance on the training dataset. In particular, RF, GB, and AdaBoost demonstrated very high training accuracy, with GB achieving an R2 of 1.00 and RF and AdaBoost closely following with R2 values of 0.87 and 0.99, respectively. These results indicate that ensemble-based models fit the training data extremely well, though they may risk overfitting.
In contrast, Figure 10 reveals substantial performance degradation across all models when evaluated on the testing dataset. The Linear and Ridge regressors showed poor predictive ability, with test R2 scores of approximately −3.29 and −1.31, respectively. Among all models, RF achieved the best overall test performance in terms of error, with an MAE of 0.81, an MSE of 0.91, an RMSE of 0.91, and a test R2 of −1.66, outperforming the other regressors under the given constraints.
Despite the overall lower performance on test data, RF was selected as the most efficient regressor due to its balance between training fit and test error, as well as its suitability for data generation in the augmentation phase that followed. Based on this model, synthetic data were produced and evaluated in combination with the original dataset.

3.5. Performance of Machine Learning Regressors on the Synthetic Dataset

Our next step was to compare our regressors based on our synthetic data. A synthetic dataset, consisting of 100 samples, was generated using RF-based predictions with Gaussian noise added to introduce controlled variability. This synthetic dataset maintained the same feature distribution as the original dataset to ensure comparability.
Ensemble-based models demonstrated high performance compared to linear models. Specifically, the GB regressor achieved the highest training accuracy with an R2 of 1.00, followed by the RF regressor with an R2 of 0.98 and AdaBoost with an R2 of 0.95. These models also recorded notably low error metrics, indicating an almost perfect fit to the training data. In contrast, Linear Regression, Ridge Regression, and Lasso Regression yielded identical training performance, each reaching an R2 of 0.71, which indicates a moderate capacity to model the underlying relationships within the synthetic data. Their MAE and RMSE values were also consistently higher than those of the ensemble models (Figure 11).
For the test set, the GB-based and RF-based regressors again outperformed the other models, with R2 values of 0.93 and 0.91 respectively, along with the lowest RMSE scores, suggesting strong predictive ability on unseen data. The AdaBoost regressor also performed well with an R2 of 0.88, although with slightly higher error values. Linear models showed consistent but comparatively limited predictive performance on the test data, with all three achieving an R2 of 0.70 and similar error magnitudes (Figure 12).

3.6. Performance of Machine Learning Regressors on the Mixed Dataset

To further evaluate the model performance in a data-rich scenario, we trained and tested our regressors using a mixed dataset consisting of both real experimental samples from our initial dataset and synthetically generated data.
The model training results showed clear performance differentiation among algorithm families. Ensemble models again demonstrated high accuracy. The GB-based regressor achieved the highest training performance, with an R2 of 1.00 and near-zero error values across all metrics: MAE = 0.03, MSE = 0.001, and RMSE = 0.04. The RF-based regressor followed with an R2 of 0.95 and comparatively low error values: MAE = 0.13, MSE = 0.05, and RMSE = 0.22. The AdaBoost regressor also performed well during training, with an R2 of 0.93. In contrast, the linear models—Linear Regression, Ridge Regression, and Lasso Regression—yielded nearly identical training outcomes, each with an R2 of 0.61 and RMSE of approximately 0.39 to 0.40. These results indicate that the linear models were only moderately successful in capturing the increased variance introduced through the mixed data (Figure 13).
The testing performance followed a similar trend but revealed more nuanced differences in generalization capability. The GB-based regressor achieved a test R2 of 0.76, while the RF-based regressor reached 0.84, indicating that both models retained strong generalization on unseen data. The AdaBoost regressor also maintained respectable performance with a test R2 of 0.74. Notably, the GB and RF regressors both achieved low RMSE values on the test set of 0.16 and 0.21, respectively, underscoring their effectiveness in handling diverse and noise-augmented data distributions. The linear models again exhibited limited predictive strength on the test set, with all three achieving an R2 from 0.59 to 0.60 and RMSE values ranging from 0.28 to 0.29 (Figure 14).

3.7. Cross-Evaluation of RF Models on Real, Synthetic, and Mixed Data

Based on the previous results, the best regressor was the RF model trained on the mixed dataset, i.e., on the combined original and synthetic data. To further examine the robustness and generalization capabilities of this model, we conducted a cross-dataset evaluation in which models trained on one dataset were tested on the other datasets. The training datasets included the original experimental data, the synthetically generated dataset, and the mixed dataset combining both sources. Each model’s predictive accuracy was then assessed across all three datasets using the standard performance metrics.
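A sketch of this cross-evaluation loop, assuming scikit-learn and using placeholder datasets with the same shapes as in the study, is shown below.

```python
# Sketch of the cross-dataset evaluation (assumed scikit-learn): an RF model is
# trained on each dataset (original, synthetic, mixed) and scored on the others.
# The datasets here are placeholders, not the study data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rng = np.random.default_rng(4)
datasets = {
    "original":  (rng.uniform(size=(17, 4)),  rng.uniform(size=(17, 4))),
    "synthetic": (rng.uniform(size=(100, 4)), rng.uniform(size=(100, 4))),
}
X_mix = np.vstack([datasets["original"][0], datasets["synthetic"][0]])
Y_mix = np.vstack([datasets["original"][1], datasets["synthetic"][1]])
datasets["mixed"] = (X_mix, Y_mix)

for train_name, (X_tr, Y_tr) in datasets.items():
    model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, Y_tr)
    for test_name, (X_te, Y_te) in datasets.items():
        if test_name == train_name:
            continue
        Y_pred = model.predict(X_te)
        print(f"train={train_name:9s} test={test_name:9s} "
              f"MAE={mean_absolute_error(Y_te, Y_pred):.2f} "
              f"RMSE={np.sqrt(mean_squared_error(Y_te, Y_pred)):.2f} "
              f"R2={r2_score(Y_te, Y_pred):.2f}")
```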
The variability introduced by the Gaussian noise in synthetic data was controlled and systematically evaluated, as described in Section 2.13, to ensure that the augmented dataset preserved meaningful variance while remaining consistent with the statistical characteristics of the original data.
When the model trained on the original dataset was tested on the synthetic dataset, it produced a test MAE of 0.60 and an RMSE of 0.80, with an R2 of 0.48, indicating a moderate capacity to generalize to the artificial data. Somewhat better error performance was observed when the same model was evaluated on the mixed dataset, yielding a lower test RMSE of 0.62, although with a slightly lower R2 of 0.43.
In contrast, the model trained solely on the synthetic data demonstrated poor performance on the original data, with an R2 of only 0.04 and an RMSE of 0.44, suggesting a substantial gap in representational fidelity between the synthetic and real data distributions. However, when the same synthetically trained model was tested on the mixed dataset, performance improved drastically, achieving an R2 of 0.86 and RMSE of 0.31. This highlights that the synthetic model generalized well within synthetic-heavy contexts but struggled with real experimental variability.
The model trained on the mixed dataset exhibited the strongest overall generalization. It achieved a low RMSE of 0.27 and a relatively high R2 of 0.63 when tested on the original data. Most notably, it yielded the best cross-dataset performance when tested on the synthetic dataset, with an RMSE of 0.38 and an R2 of 0.88. Overall, the RF regressor trained on the mixed dataset provided an approximately 20% increase in performance, which is satisfactory given the limited number of samples (Figure 15).
These results validate our objective of generating new samples, demonstrating that synthetic data can enhance the development of robust machine learning regressors for accurately predicting total phenolic content.

3.8. Feature Importance Analysis Across RF-Based Models

To investigate how the model training data influences the learned relationships between extraction parameters and antioxidant responses, feature importance scores were extracted from each RF model trained on the original, synthetic, and mixed datasets. These scores were derived from the individual estimators trained for each target (TPC, FRAP, DPPH, and AAC), and visualized side by side (Figure 16).
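The sketch below illustrates, under the assumption of a scikit-learn MultiOutputRegressor of RF estimators and placeholder data, how such per-target importance scores can be extracted and tabulated.

```python
# Sketch (assumed scikit-learn): per-target feature importances are read from the
# individual RF estimators inside a MultiOutputRegressor, one estimator per response.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(5)
X = rng.uniform(size=(117, 4))   # placeholder mixed inputs: C, RL/S, T, t
Y = rng.uniform(size=(117, 4))   # placeholder outputs: TPC, FRAP, DPPH, AAC

features = ["C (%)", "RL/S (mL/g)", "T (degC)", "t (min)"]
targets = ["TPC", "FRAP", "DPPH", "AAC"]

model = MultiOutputRegressor(
    RandomForestRegressor(n_estimators=300, random_state=0)).fit(X, Y)

importances = pd.DataFrame(
    {t: est.feature_importances_ for t, est in zip(targets, model.estimators_)},
    index=features)
print(importances.round(2))      # one column of importance scores per response
```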
In the model trained on the original dataset, the most influential feature overall was temperature (T, °C), particularly for FRAP and TPC, where it accounted for roughly 35–40% of the total importance. This aligns with experimental expectations, as thermal energy often enhances compound release. RL/S and C (%) also showed moderate contributions, while extraction time had the lowest influence across all targets.
In contrast, the model trained exclusively on synthetic data placed much greater emphasis on C (%), especially for FRAP (0.78) and TPC (0.72). This shift likely reflects the statistical bias introduced during synthetic generation, where concentration appeared as a dominant predictor due to its nonlinear interactions captured by RF. Meanwhile, T (°C) and t (min) showed very low influence (<0.10) across all targets.
In the mixed model, a balanced importance distribution emerged. C (%) again held strong predictive power for TPC and FRAP (~0.66), but now RL/S and T (°C) also gained relevance, particularly for AAC (0.57) and DPPH (0.39), indicating a more nuanced learning of underlying relationships. Time remained the least influential, consistent across all models.
This comparison reveals how the training dataset affects not only model accuracy but also which experimental parameters are deemed most critical. Mixed training produces models that are both accurate and biologically plausible, while synthetic-only training can exaggerate the significance of specific variables.
However, it should be noted that the feature importance results presented here are influenced by the synthetic component of the mixed dataset. As such, there is a potential risk of bias in the interpretation of variable importance, particularly for features that may exhibit amplified or diminished effects in synthetic samples. This limitation highlights the need for cautious interpretation and suggests that future studies should validate these findings using larger experimental datasets.

3.9. Actual vs. Predicted Performance Across RF-Based Models

To visually assess prediction accuracy and generalization, scatter plots comparing actual vs. predicted values were generated for all four antioxidant targets, TPC, FRAP, DPPH, and AAC, using the RF-based models trained on the original, synthetic, and mixed datasets (Figure 17). All values were plotted in standardized, z-score space, and each subplot includes the R2 as a quantitative measure of fit.
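A sketch of how such standardized actual-versus-predicted panels can be produced, assuming matplotlib and scikit-learn with placeholder arrays, is given below.

```python
# Sketch (assumed matplotlib/scikit-learn): actual-vs-predicted scatter per target
# in z-score space, annotated with R2, in the style of Figure 17. Placeholder data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score

rng = np.random.default_rng(6)
targets = ["TPC", "FRAP", "DPPH", "AAC"]
Y_true = rng.normal(size=(20, 4))                     # placeholder, already z-scored
Y_pred = Y_true + rng.normal(scale=0.4, size=(20, 4)) # placeholder predictions

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for k, ax in enumerate(axes):
    ax.scatter(Y_true[:, k], Y_pred[:, k], s=20)
    lims = [Y_true[:, k].min(), Y_true[:, k].max()]
    ax.plot(lims, lims, "k--", lw=1)                  # identity (perfect-fit) line
    ax.set_title(f"{targets[k]} (R2={r2_score(Y_true[:, k], Y_pred[:, k]):.2f})")
    ax.set_xlabel("Actual (z-score)")
    ax.set_ylabel("Predicted (z-score)")
plt.tight_layout()
plt.show()
```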
The model trained exclusively on the original dataset exhibited poor predictive performance across all targets. R2 values were consistently negative, indicating substantial overfitting and a lack of generalization to unseen data. DPPH with R2 = −3.11 and AAC with R2 = −1.40 showed a complete breakdown of predictive capacity, while even the best-performing targets, TPC with R2 = −1.16 and FRAP with R2 = −0.97, failed to demonstrate any meaningful alignment between predicted and actual values.
In contrast, the model trained on synthetic data produced improved results, with three of the four targets yielding positive R2 values. DPPH was best predicted with R2 = 0.86, followed by TPC with R2 = 0.51 and FRAP with R2 = 0.17. Nevertheless, AAC continued to exhibit poor predictability, with an R2 of −1.38. These results suggest that while synthetic data can partially capture the data structure of some antioxidant properties, it remains insufficient for accurately modeling targets like AAC without real data inputs.
The RF-based model trained on the mixed dataset, which combined both the original and synthetic samples, delivered the most balanced and reliable performance. TPC and DPPH both achieved strong fits with R2 values of 0.88, while FRAP also performed well with R2 = 0.70. AAC, however, remained a challenging target, with only a marginally positive R2 of 0.07. The mixed data approach thus demonstrated the strongest generalization overall, effectively integrating the empirical variability of real measurements with the expanded coverage of synthetically generated patterns.
These results highlight that while the RF-based mixed model improved the predictive alignment for TPC and DPPH, challenges remain in accurately modeling AAC, which may reflect inherent biological variability or limited representation in the training data.

3.10. Partial Dependence Analysis of RF-Based Models

To better interpret how individual input features influenced the model predictions, partial dependence plots (PDPs) were generated for each antioxidant response (TPC, FRAP, DPPH, and AAC) across the four predictors: concentration of solvent (C, %), solvent ratio (RL/S, mL/g), temperature (T, °C), and extraction time (t, min). The PDPs visualize the marginal effect of each predictor after averaging out the influence of other variables, offering insight into the modeled relationships learned by RF models trained on different datasets (original, synthetic, and mixed).
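A sketch of how such one-way PDPs can be generated, assuming scikit-learn's PartialDependenceDisplay and placeholder data (with one single-output RF fitted per response purely for illustration), is shown below.

```python
# Sketch (assumed scikit-learn >= 1.0): one-way partial dependence curves per
# response, in the style of Figures 18-20. Data are placeholders; one single-output
# RF is fitted per target here for simplicity.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay

rng = np.random.default_rng(7)
X = rng.uniform(size=(117, 4))            # placeholder mixed inputs: C, RL/S, T, t
Y = rng.uniform(size=(117, 4))            # placeholder outputs: TPC, FRAP, DPPH, AAC
feature_names = ["C (%)", "RL/S (mL/g)", "T (degC)", "t (min)"]
targets = ["TPC", "FRAP", "DPPH", "AAC"]

for k, name in enumerate(targets):
    rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, Y[:, k])
    disp = PartialDependenceDisplay.from_estimator(
        rf, X, features=[0, 1, 2, 3], feature_names=feature_names, kind="average")
    disp.figure_.suptitle(f"Partial dependence for {name}")
plt.show()
```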
For the model trained on the original dataset, PDPs (Figure 18) revealed generally smooth but shallow response curves, indicating low model sensitivity to input variation. The results of C (%) and RL/S showed modest negative slopes for most targets, particularly TPC and FRAP, suggesting that higher solvent concentrations and solvent ratios tended to reduce predicted antioxidant values. Temperature exhibited a more pronounced positive relationship with TPC and FRAP, while time (t) influenced predictions positively in nearly all cases, though effects on AAC appeared erratic. These modest trends are consistent with the limited representational power of the model trained on the small original dataset, which likely restricted the learned functional relationships to low-variance approximations.
In contrast, the synthetic-trained model (Figure 19) exhibited steeper and more structured response patterns across nearly all features. For example, C (%) and RL/S displayed strong nonlinear declines for TPC, FRAP, and DPPH, whereas temperature exhibited a clear sigmoidal increase, particularly evident for DPPH. The time variable contributed positively to predictions, with increasing slopes across most plots. These sharper transitions suggest that the synthetic model was able to capture more defined patterns between variables, although some over-smoothing and artifacts were apparent in less reliable targets such as AAC, where PDP curves were more erratic.
The model trained on the mixed dataset (Figure 20) presented the most coherent and biologically plausible trends. The PDPs across targets demonstrated well-defined, monotonic relationships. C (%) consistently showed negative associations with antioxidant capacity, while RL/S exhibited declining effects, particularly for FRAP and TPC. Temperature maintained a strong positive relationship, especially for DPPH and FRAP. Extraction time (t) showed clear and mostly monotonic increases in partial dependence, indicating its significant influence on yield-related outcomes. Compared to the other models, the mixed-trained model exhibited more stable and interpretable PDPs, which is consistent with its higher predictive accuracy and generalization capacity.
Overall, the partial dependence analysis reinforces the conclusion that models trained on a mixture of real and synthetic data achieve superior learning of underlying relationships between process variables and antioxidant outcomes. The synthetic model was able to capture sharp feature effects, but only the mixed model exhibited smooth, consistent trends aligned with expected extraction behavior. These findings support the inclusion of controlled synthetic data to augment and stabilize learning in low-sample experimental contexts.
Despite these encouraging results, several limitations remain. The relatively small size of the original dataset constrains the ability of the models to fully capture the underlying variability of the extraction process. The current synthetic data generation approach, while effective, is based on RF predictions with Gaussian noise, which may not fully reflect the true complexity of the system. Future work should explore more advanced generative modeling techniques, such as Variational Autoencoders or Generative Adversarial Networks, to enhance the diversity and realism of synthetic data. Additionally, expanding the experimental dataset and incorporating additional physicochemical or spectral variables could further improve model accuracy and generalization, supporting more robust and transferable AI-assisted extraction models.

3.11. Model Prediction Accuracy at Optimal Conditions

To assess how accurately RF models predicted antioxidant outcomes under optimal extraction conditions, we compared model predictions against experimentally reported values for four antioxidant metrics: TPC, FRAP, DPPH, and AAC. Figure 21 presents a comparative bar plot showing the predicted values from each model along with their absolute errors, visualized as value ± error for each target.
The model trained solely on the original data exhibited the highest prediction error across all targets. For TPC, the model predicted 55.9 compared to the reported 78.2, yielding an absolute error of 22.4. Similarly, FRAP and DPPH were underestimated at 768.3 and 480.7, corresponding to absolute errors of 146.5 and 398.0, respectively. AAC was also substantially underestimated, at 9.9, with an error of 7.9. These results highlight the limitations of training exclusively on small, original datasets, especially when attempting to extrapolate to optimal regions.
The model trained on the synthetic data demonstrated improved performance for most targets. TPC was predicted at 60.4 with an error of 17.8, while FRAP reached 832.8 with an error of 82.0, and DPPH was predicted at 463.3 with an error of 415.4. The DPPH prediction thus remained notably poor, and the AAC prediction (9.1, with an error of 8.8) remained similarly distant from the reported value.
The model trained on the mixed dataset yielded the most accurate and consistent predictions across all targets. TPC was predicted at 64.0 with error = 14.2, and FRAP at 865.2 with error = 49.6, showing strong alignment with the experimental values. DPPH prediction also improved, with a value of 541.8 and an error of 336.9. AAC was predicted at 9.2 with an error of 8.7. While some targets, particularly DPPH and AAC, remained challenging to predict accurately, the mixed model consistently outperformed the other models in terms of proximity to the experimental data.
In summary, these results confirm that training on a mixed dataset comprising both real and synthetic samples enhances the model’s ability to generalize and make reliable predictions under optimal conditions. The mixed model showed the lowest aggregate absolute error and the closest alignment to the reported values across all antioxidant targets. Models trained solely on synthetic data exhibited poor generalization to real samples, underscoring the need for empirical grounding. In addition, predictions for specific targets such as AAC and DPPH remained less accurate, likely due to high intrinsic variability or limited representation within the training data.

4. Conclusions

This study successfully optimized the extraction conditions for rosemary leaves using PLE, demonstrating its potential as an efficient and environmentally friendly technique for recovering bioactive compounds. The optimized PLE parameters yielded extracts rich in antioxidants, polyphenols, and ascorbic acid, highlighting the suitability of PLE for such applications.
In parallel, ML approaches were applied to model and predict antioxidant responses based on extraction parameters. While the RF-based mixed model showed improved generalization compared to models trained solely on experimental or synthetic data, the small sample size and reliance on data augmentation introduce limitations to the robustness of the conclusions. In particular, feature importance results may be influenced by synthetic data, and test set performance indicates that further model refinement is needed to ensure reliable predictions in real-world scenarios.
Future research should focus on expanding experimental datasets to improve model training and validation, applying advanced generative methods for more realistic data augmentation, and conducting real-world testing of model predictions. Additionally, evaluating model transferability across different plant matrices and extraction systems, as well as validating the process at industrial scale, will be important steps toward broader practical implementation of AI-assisted extraction optimization.

Author Contributions

Conceptualization, V.A. and S.I.L.; methodology, V.A.; software, V.A.; validation, V.A.; formal analysis, M.M. and V.A.; investigation, M.M. and E.B.; resources, S.I.L.; data curation, M.M. and K.G.L.; writing—original draft preparation, M.M. and K.G.L.; writing—review and editing, V.A., M.M., K.G.L., E.B. and S.I.L.; visualization, M.M. and K.G.L.; supervision, V.A. and S.I.L.; project administration, S.I.L.; funding acquisition, S.I.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ghasemzadeh Rahbardar, M.; Hosseinzadeh, H. Toxicity and Safety of Rosemary (Rosmarinus officinalis): A Comprehensive Review. Naunyn. Schmiedebergs Arch. Pharmacol. 2025, 398, 9–23. [Google Scholar] [CrossRef] [PubMed]
  2. Aamer, H.A.; Al-Askar, A.A.; Gaber, M.A.; El-Tanbouly, R.; Abdelkhalek, A.; Behiry, S.; Elsharkawy, M.M.; Kowalczewski, P.Ł.; El-Messeiry, S. Extraction, Phytochemical Characterization, and Antifungal Activity of Salvia Rosmarinus Extract. Open Chem. 2023, 21, 20230124. [Google Scholar] [CrossRef]
  3. de Macedo, L.M.; Santos, É.M.d.; Militão, L.; Tundisi, L.L.; Ataide, J.A.; Souto, E.B.; Mazzola, P.G. Rosemary (Rosmarinus officinalis L., Syn Salvia rosmarinus Spenn.) and Its Topical Applications: A Review. Plants 2020, 9, 651. [Google Scholar] [CrossRef]
  4. Ahmed, H.M.; Babakir-Mina, M. Investigation of Rosemary Herbal Extracts (Rosmarinus officinalis) and Their Potential Effects on Immunity. Phytother. Res. 2020, 34, 1829–1837. [Google Scholar] [CrossRef]
  5. González-Minero, F.J.; Bravo-Díaz, L.; Ayala-Gómez, A. Rosmarinus officinalis L. (Rosemary): An Ancient Plant with Uses in Personal Healthcare and Cosmetics. Cosmetics 2020, 7, 77. [Google Scholar] [CrossRef]
  6. Aziz, E.; Batool, R.; Akhtar, W.; Shahzad, T.; Malik, A.; Shah, M.A.; Iqbal, S.; Rauf, A.; Zengin, G.; Bouyahya, A.; et al. Rosemary Species: A Review of Phytochemicals, Bioactivities and Industrial Applications. S. Afr. J. Bot. 2022, 151, 3–18. [Google Scholar] [CrossRef]
  7. Dhenge, R.; Rinaldi, M.; Ganino, T.; Lacey, K. Recent and Novel Technology Used for the Extraction and Recovery of Bioactive Compounds from Fruit and Vegetable Waste. In Wealth out of Food Processing Waste; CRC Press: Boca Raton, FL, USA, 2024; ISBN 978-1-00-326919-9. [Google Scholar]
  8. Christoforidis, A.; Mantiniotou, M.; Athanasiadis, V.; Lalas, S.I. Caffeine and Polyphenolic Compound Recovery Optimization from Spent Coffee Grounds Utilizing Pressurized Liquid Extraction. Beverages 2025, 11, 74. [Google Scholar] [CrossRef]
  9. Galanakis, C.M.; Aldawoud, T.M.S.; Rizou, M.; Rowan, N.J.; Ibrahim, S.A. Food Ingredients and Active Compounds against the Coronavirus Disease (COVID-19) Pandemic: A Comprehensive Review. Foods 2020, 9, 1701. [Google Scholar] [CrossRef]
  10. Martins, R.; Barbosa, A.; Advinha, B.; Sales, H.; Pontes, R.; Nunes, J. Green Extraction Techniques of Bioactive Compounds: A State-of-the-Art Review. Processes 2023, 11, 2255. [Google Scholar] [CrossRef]
  11. Kim, H.C.; Ha, S.Y.; Yang, J.-K. Antioxidant Activity of Ultrasonic Assisted Ethanol Extract of Ainsliaea acerifolia and Prediction of Antioxidant Activity with Machine Learning. BioResources 2024, 19, 7637–7652. [Google Scholar] [CrossRef]
  12. Kunjiappan, S.; Ramasamy, L.K.; Kannan, S.; Pavadai, P.; Theivendren, P.; Palanisamy, P. Optimization of Ultrasound-Aided Extraction of Bioactive Ingredients from Vitis Vinifera Seeds Using RSM and ANFIS Modeling with Machine Learning Algorithm. Sci. Rep. 2024, 14, 1219. [Google Scholar] [CrossRef] [PubMed]
  13. Kalompatsios, D.; Athanasiadis, V.; Mantiniotou, M.; Lalas, S.I. Optimization of Ultrasonication Probe-Assisted Extraction Parameters for Bioactive Compounds from Opuntia macrorhiza Using Taguchi Design and Assessment of Antioxidant Properties. Appl. Sci. 2024, 14, 10460. [Google Scholar] [CrossRef]
  14. Shehata, E.; Grigorakis, S.; Loupassaki, S.; Makris, D.P. Extraction Optimisation Using Water/Glycerol for the Efficient Recovery of Polyphenolic Antioxidants from Two Artemisia Species. Sep. Purif. Technol. 2015, 149, 462–469. [Google Scholar] [CrossRef]
  15. Athanasiadis, V.; Chatzimitakos, T.; Mantiniotou, M.; Kalompatsios, D.; Bozinou, E.; Lalas, S.I. Investigation of the Polyphenol Recovery of Overripe Banana Peel Extract Utilizing Cloud Point Extraction. Eng 2023, 4, 3026–3038. [Google Scholar] [CrossRef]
  16. Freedman, D.A. Statistical Models: Theory and Practice, 2nd ed.; Cambridge University Press: Cambridge, UK, 2009; ISBN 978-0-511-81586-7. [Google Scholar]
  17. Hoerl, A.E.; Kennard, R.W. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 1970, 12, 55–67. [Google Scholar] [CrossRef]
  18. Tibshirani, R. Regression Shrinkage and Selection via the Lasso. J. R. Stat. Soc. Ser. B Methodol. 1996, 58, 267–288. [Google Scholar] [CrossRef]
  19. Friedman, J.H. Greedy Function Approximation: A Gradient Boosting Machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
  20. Freund, Y.; Schapire, R.E. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. J. Comput. Syst. Sci. 1997, 55, 119–139. [Google Scholar] [CrossRef]
  21. Willmott, C.J.; Matsuura, K. Advantages of the Mean Absolute Error (MAE) over the Root Mean Square Error (RMSE) in Assessing Average Model Performance. Clim. Res. 2005, 30, 79–82. [Google Scholar] [CrossRef]
  22. Chai, T.; Draxler, R.R. Root Mean Square Error (RMSE) or Mean Absolute Error (MAE)?—Arguments against Avoiding RMSE in the Literature. Geosci. Model Dev. 2014, 7, 1247–1250. [Google Scholar] [CrossRef]
  23. Kvålseth, T.O. Cautionary Note about R2. Am. Stat. 1985, 39, 279–285. [Google Scholar] [CrossRef]
  24. Thoo, Y.; Ng, S.Y.; Khoo, M.; Mustapha, W.; Ho, C. A Binary Solvent Extraction System for Phenolic Antioxidants and Its Application to the Estimation of Antioxidant Capacity in Andrographis paniculata Extracts. Int. Food Res. J. 2013, 20, 1103–1111. [Google Scholar]
  25. Mantiniotou, M.; Athanasiadis, V.; Kalompatsios, D.; Lalas, S.I. Optimization of Carotenoids and Other Antioxidant Compounds Extraction from Carrot Peels Using Response Surface Methodology. Biomass 2025, 5, 3. [Google Scholar] [CrossRef]
  26. Segovia, F.J.; Luengo, E.; Corral-Pérez, J.J.; Raso, J.; Almajano, M.P. Improvements in the Aqueous Extraction of Polyphenols from Borage (Borago officinalis L.) Leaves by Pulsed Electric Fields: Pulsed Electric Fields (PEF) Applications. Ind. Crops Prod. 2015, 65, 390–396. [Google Scholar] [CrossRef]
  27. More, P.R.; Jambrak, A.R.; Arya, S.S. Green, Environment-Friendly and Sustainable Techniques for Extraction of Food Bioactive Compounds and Waste Valorization. Trends Food Sci. Technol. 2022, 128, 296–315. [Google Scholar] [CrossRef]
  28. Athanasiadis, V.; Mantiniotou, M.; Kalompatsios, D.; Makrygiannis, I.; Alibade, A.; Lalas, S.I. Evaluation of Antioxidant Properties of Residual Hemp Leaves Following Optimized Pressurized Liquid Extraction. AgriEngineering 2025, 7, 1. [Google Scholar] [CrossRef]
  29. Zhou, J.; Wang, M.; Carrillo, C.; Zhu, Z.; Brncic, M.; Berrada, H.; Barba, F.J. Impact of Pressurized Liquid Extraction and pH on Protein Yield, Changes in Molecular Size Distribution and Antioxidant Compounds Recovery from Spirulina. Foods 2021, 10, 2153. [Google Scholar] [CrossRef] [PubMed]
  30. Anticona, M.; Blesa, J.; Lopez-Malo, D.; Frigola, A.; Esteve, M.J. Effects of Ultrasound-Assisted Extraction on Physicochemical Properties, Bioactive Compounds, and Antioxidant Capacity for the Valorization of Hybrid Mandarin Peels. Food Biosci. 2021, 42, 101185. [Google Scholar] [CrossRef]
  31. Athanasiadis, V.; Chatzimitakos, T.; Mantiniotou, M.; Kalompatsios, D.; Kotsou, K.; Makrygiannis, I.; Bozinou, E.; Lalas, S.I. Optimization of Four Different Rosemary Extraction Techniques Using Plackett–Burman Design and Comparison of Their Antioxidant Compounds. Int. J. Mol. Sci. 2024, 25, 7708. [Google Scholar] [CrossRef]
  32. Antony, A.; Farid, M. Effect of Temperatures on Polyphenols during Extraction. Appl. Sci. 2022, 12, 2107. [Google Scholar] [CrossRef]
  33. Hashem Hashempur, M.; Zareshahrabadi, Z.; Shenavari, S.; Zomorodian, K.; Rastegari, B.; Karami, F. Deep Eutectic Solvent-Based Extraction of Rosemary Leaves: Optimization Using Central Composite Design and Evaluation of Antioxidant and Antimicrobial Activities. New J. Chem. 2025, 49, 4495–4505. [Google Scholar] [CrossRef]
  34. Kabubii, Z.N.; Mbaria, J.M.; Mathiu, P.M.; Wanjohi, J.M.; Nyaboga, E.N. Diet Supplementation with Rosemary (Rosmarinus officinalis L.) Leaf Powder Exhibits an Antidiabetic Property in Streptozotocin-Induced Diabetic Male Wistar Rats. Diabetology 2024, 5, 12–25. [Google Scholar] [CrossRef]
  35. Xie, L.; Li, Z.; Li, H.; Sun, J.; Liu, X.; Tang, J.; Lin, X.; Xu, L.; Zhu, Y.; Liu, Z.; et al. Fast Quantitative Determination of Principal Phenolic Anti-Oxidants in Rosemary Using Ultrasound-Assisted Extraction and Chemometrics-Enhanced HPLC–DAD Method. Food Anal. Methods 2023, 16, 386–400. [Google Scholar] [CrossRef]
  36. Samer, A.J.A.; Samarrai, O.R.A. Phytochemical Screening, Antioxidant Power and Quantitative Analysis by HPLC of Isolated Flavonoids from Rosemary. Samarra J. Pure Appl. Sci. 2025, 6, 15–29. Available online: https://sjpas.com/index.php/sjpas/article/view/861 (accessed on 10 June 2025).
  37. Baptista, A.; Menicucci, F.; Brunetti, C.; dos Santos Nascimento, L.B.; Pasquini, D.; Alderotti, F.; Detti, C.; Ferrini, F.; Gori, A. Unlocking the Hidden Potential of Rosemary (Salvia rosmarinus Spenn.): New Insights into Phenolics, Terpenes, and Antioxidants of Mediterranean Cultivars. Plants 2024, 13, 3395. [Google Scholar] [CrossRef]
  38. Miljanović, A.; Dent, M.; Grbin, D.; Pedisić, S.; Zorić, Z.; Marijanović, Z.; Jerković, I.; Bielen, A. Sage, Rosemary, and Bay Laurel Hydrodistillation By-Products as a Source of Bioactive Compounds. Plants 2023, 12, 2394. [Google Scholar] [CrossRef]
Figure 1. Histograms with kernel density estimates showing the distribution of extraction parameters and antioxidant response variables for rosemary samples. While extraction settings are uniformly distributed due to the design structure, antioxidant responses such as FRAP and DPPH exhibit skewed distributions, indicating variability in sample performance.
Figure 2. Combined boxplot of all extraction parameters and antioxidant response variables. Antioxidant metrics (FRAP and DPPH) exhibit larger spread and more outliers than extraction parameters, indicating greater variability and sensitivity to experimental conditions.
Figure 2. Combined boxplot of all extraction parameters and antioxidant response variables. Antioxidant metrics (FRAP and DPPH) exhibit larger spread and more outliers than extraction parameters, indicating greater variability and sensitivity to experimental conditions.
Processes 13 01879 g002
Figure 3. Plot (A) is a heatmap of raw data values for extraction parameters and antioxidant responses across 17 experimental conditions. Brighter colors indicate higher absolute values. The most intense FRAP activity was observed in sample #13. Plot (B) is a z-score normalized heatmap of the same dataset. Standardization enables direct comparison across all features. Red shades represent values above the mean, while blue shades indicate values below the mean. Strong positive deviations in FRAP and TPC are evident in sample #13.
Figure 4. Pearson correlation matrix among extraction parameters and antioxidant response variables. Values range from −1 (perfect negative correlation) to +1 (perfect positive correlation). Strong internal consistency was observed among antioxidant metrics (TPC, FRAP, and DPPH), while concentration (C, %) negatively correlates with TPC and FRAP.
Figure 5. TPC, showing the (A) covariation of X1 (ethanol concentration, C, % v/v) and X2 (liquid-to-solid ratio, R, mL/g); (B) covariation of X1 and X4 (extraction time, t, min); (C) covariation of X2 and X3 (extraction temperature, T, °C); and (D) covariation of X2 and X4. FRAP, showing the (E) covariation of X1 and X2; (F) covariation of X1 and X3; and (G) covariation of X2 and X4. DPPH, showing the (H) covariation of X1 and X2; (I) covariation of X1 and X3; (J) covariation of X1 and X4; (K) covariation of X2 and X3; and (L) covariation of X2 and X4. AAC, showing the (M) covariation of X1 and X3.
Figure 6. Pareto plots illustrating the significance of parameter estimates for the PLE technique across TPC (A), FRAP (B), DPPH (C), and AAC (D), with a pink asterisk marking significant values (p < 0.05). Positive estimates are shown in blue, while negative ones are represented in red.
Figure 7. PCA for the measured variables. Each X variable is shown in blue.
Figure 8. Plot (A) shows the optimization of the PLE technique for rosemary extracts through the partial least squares (PLS) prediction profiler and a desirability function with extrapolation control. Plot (B) shows the Variable Importance Plot (VIP) graph, showing the VIP values for each predictor variable in the PLE technique, with a red dashed line marking the 0.8 significance level.
Figure 9. Histograms of the performance of our six ML models on our original training set.
Figure 10. Histograms of the performance of our six ML models on our original test set.
Figure 11. Histograms of the performance of our six ML models on our synthetic training set.
Figure 12. Histograms of the performance of our six ML models on our synthetic test set.
Figure 13. Histograms of the performance of our six ML models on our mixed training set.
Figure 14. Histograms of the performance of our six ML models on our mixed test set.
Figure 15. Cross-dataset evaluation of the RF regression model trained on the original, synthetic, and mixed datasets. Each group of bars represents the performance metrics MAE, MSE, RMSE, and R2 obtained when the model trained on one dataset was tested on another. Results highlight the generalization ability of each training regime across data domains. Models trained on the mixed dataset showed superior cross-domain performance, particularly when evaluated on both the original and synthetic test sets.
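For reference, the four metrics reported in Figure 15 can be computed as in the brief sketch below (a generic scikit-learn example with placeholder arrays, not the authors’ evaluation script).

```python
# Generic sketch of the metrics used in the cross-dataset comparison (Figure 15);
# y_true and y_pred are placeholders for a test set's targets and a model's predictions.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(y_true, y_pred):
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)            # RMSE penalizes large deviations more strongly than MAE
    r2 = r2_score(y_true, y_pred)  # fraction of variance explained
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

print(evaluate(np.array([3.0, 5.0, 7.0]), np.array([2.8, 5.4, 6.5])))
```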
Figure 16. Feature importance scores from RF models trained on the (left) original, (middle) synthetic, and (right) mixed datasets. Importance scores reflect each feature’s contribution to predicting antioxidant targets (TPC, FRAP, DPPH, and AAC). Feature influence varies significantly depending on the dataset used for training.
Figure 17. Predicted vs. actual standardized values for antioxidant targets using RF-based models trained on the original (top row), synthetic (middle row), and mixed (bottom row) datasets. Diagonal red dashed lines represent the ideal 1:1 relationship. R2 values quantify model fit for each case.
Figure 18. Partial dependence plots showing the marginal effect of each feature on antioxidant predictions from the RF original model.
Figure 19. Partial dependence plots showing the marginal effect of each feature on antioxidant predictions from the RF synthetic model.
Figure 20. Partial dependence plots showing the marginal effect of each feature on antioxidant predictions from the RF mixed model.
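Partial dependence plots of the kind shown in Figures 18–20 can be produced with scikit-learn’s inspection module; the sketch below is illustrative only, fitting an RF regressor to randomly generated stand-in data with the four extraction parameters as features (in practice, the standardized extraction dataset and the fitted models from the study would be used instead).

```python
# Illustrative sketch: partial dependence of a fitted RF model on the four extraction
# parameters. The data below are random stand-ins, not the study's dataset.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.uniform(size=(40, 4)) * [100, 60, 120, 20] + [0, 10, 40, 5],
                 columns=["C", "R", "T", "t"])                    # ethanol %, L/S ratio, temperature, time
y = 40 + 0.2 * X["T"] - 0.15 * X["C"] + rng.normal(0, 3, len(X))  # toy antioxidant response

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
PartialDependenceDisplay.from_estimator(rf, X, features=["C", "R", "T", "t"])
plt.tight_layout()
plt.show()
```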
Figure 21. Comparison of predicted versus reported antioxidant values under optimal extraction conditions using RF-based models trained on the original, synthetic, and mixed datasets. Bars represent predicted values for TPC, FRAP, DPPH, and AAC, with numeric labels showing predicted value ± absolute error relative to the experimentally reported values. The mixed model produced the most accurate predictions overall, with reduced errors across most targets.
Table 1. Summary of machine learning regression models and the corresponding hyperparameters tuned during grid search. Default parameters were used for Linear Regression, while regularization strengths (α) and core structural parameters (number of estimators, max depth, min samples split, tree depth, and learning rate) were varied for the other models.
Model | Tuned Parameters | Values Tested
Linear Regression | None | -
Ridge | α | [0.1, 1.0, 10.0]
Lasso | α | [0.001, 0.01, 0.1, 1.0]
RF | n_estimators, max_depth, min_samples_split | [100, 200], [None, 10], [2, 5]
GB | n_estimators, learning_rate, max_depth | [100, 200], [0.05, 0.1], [3, 5]
AdaBoost | n_estimators, learning_rate | [50, 100], [0.5, 1.0]
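A minimal sketch of how the Table 1 grids could be searched with scikit-learn’s GridSearchCV is given below; the demonstration data are random placeholders, and the cross-validation settings are illustrative rather than the authors’ exact configuration.

```python
# Sketch of tuning the Table 1 hyperparameter grids with GridSearchCV.
# The demonstration data are random placeholders; substitute the standardized extraction dataset.
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV

models = {
    "Linear Regression": (LinearRegression(), {}),  # no tuned parameters (defaults)
    "Ridge": (Ridge(), {"alpha": [0.1, 1.0, 10.0]}),
    "Lasso": (Lasso(), {"alpha": [0.001, 0.01, 0.1, 1.0]}),
    "RF": (RandomForestRegressor(random_state=0),
           {"n_estimators": [100, 200], "max_depth": [None, 10], "min_samples_split": [2, 5]}),
    "GB": (GradientBoostingRegressor(random_state=0),
           {"n_estimators": [100, 200], "learning_rate": [0.05, 0.1], "max_depth": [3, 5]}),
    "AdaBoost": (AdaBoostRegressor(random_state=0),
                 {"n_estimators": [50, 100], "learning_rate": [0.5, 1.0]}),
}

rng = np.random.default_rng(0)
X_demo, y_demo = rng.normal(size=(40, 4)), rng.normal(size=40)

for name, (estimator, grid) in models.items():
    search = GridSearchCV(estimator, grid, cv=3, scoring="neg_mean_absolute_error")
    search.fit(X_demo, y_demo)
    print(f"{name}: best parameters {search.best_params_}")
```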
Table 2. Experimental results for the four examined independent variables and the dependent variables’ responses to the PLE technique.
Design Point | C (%) (X1) | RL/S (mL/g) (X2) | T (°C) (X3) | t (min) (X4) | TPC * | FRAP * | DPPH * | AAC *
1 | 75 | 10 | 40 | 20 | 37.20 | 612.74 | 243.81 | 7.47
2 | 75 | 55 | 130 | 20 | 26.84 | 366.39 | 152.88 | 9.46
3 | 25 | 25 | 160 | 10 | 44.43 | 740.02 | 389.29 | 9.95
4 | 0 | 55 | 130 | 5 | 37.03 | 451.66 | 230.22 | 12.02
5 | 0 | 10 | 40 | 5 | 28.84 | 331.68 | 67.14 | 3.14
6 | 100 | 25 | 160 | 25 | 18.96 | 214.42 | 108.73 | 9.90
7 | 100 | 70 | 160 | 10 | 34.14 | 413.30 | 174.82 | 14.65
8 | 100 | 25 | 70 | 10 | 11.26 | 119.68 | 66.67 | 8.35
9 | 25 | 70 | 70 | 10 | 41.10 | 467.27 | 185.26 | 11.05
10 | 25 | 70 | 160 | 25 | 47.41 | 615.91 | 251.75 | 15.56
11 | 75 | 10 | 130 | 5 | 60.26 | 286.87 | 388.69 | 11.89
12 | 0 | 55 | 40 | 20 | 31.88 | 233.39 | 100.86 | 5.45
13 | 0 | 10 | 130 | 20 | 75.85 | 1089.81 | 816.36 | 6.38
14 | 75 | 55 | 40 | 5 | 30.09 | 393.47 | 150.16 | 10.24
15 | 100 | 70 | 70 | 25 | 19.17 | 169.93 | 85.11 | 13.34
16 | 25 | 25 | 70 | 25 | 49.64 | 696.79 | 427.46 | 8.44
17 | 50 | 40 | 100 | 15 | 52.69 | 893.17 | 497.11 | 12.45
* Values represent the mean of triplicate determinations; TPC, total polyphenol content (in mg GAE/g dw); FRAP, ferric reducing antioxidant power (in μmol AAE/g dw); DPPH, antiradical activity (in μmol AAE/g dw); AAC, ascorbic acid content (in mg/g dw).
Table 3. Analysis of variance (ANOVA) for the response surface quadratic polynomial model of the PLE technique.
Factor | TPC | FRAP | DPPH | AAC
Stepwise regression coefficients
Intercept | 42.88 * | 715.8 * | 356.2 * | 11.04 *
X1—ethanol concentration | −10.7 * | −170 * | −96.8 * | 1.204 *
X2—liquid-to-solid ratio | −5.69 | −81.1 | −103 * | 2.283 *
X3—temperature | 7.183 | 96.73 * | 93.54 * | 1.956 *
X4—extraction time | 0.854 | 66.9 | 39.24 | -
X1X2 | 3.746 | 166.6 * | 92.01 | -
X1X3 | - | −114 | −54.1 | −1.02
X1X4 | −8.19 | - | −91 | -
X2X3 | −4.4 | - | −52.1 | -
X2X4 | −4.51 | −131 * | −82.1 | -
X3X4 | - | - | - | -
X1² | −13.8 | −268 * | −103 | −1.69
X2² | 15.5 | - | 79.73 | -
X3² | −8.56 | - | −130 | -
X4² | - | −131 | - | -
ANOVA
F-value | 4.507 | 7.716 | 4.362 | 10.65
p-Value | 0.0545 ns | 0.0067 * | 0.0833 ns | 0.0006 *
R² | 0.908 | 0.908 | 0.929 | 0.829
Adjusted R² | 0.707 | 0.791 | 0.716 | 0.751
RMSE | 8.747 | 121.9 | 104.7 | 1.633
PRESS | 3913 | 363,375 | 580,653 | 57.61
CV | 42.46 | 55.93 | 77.03 | 32.76
DF (total) | 16 | 16 | 16 | 16
* The values significantly affected responses at a probability level of 95% (p < 0.05). TPC, total polyphenol content; FRAP, ferric reducing antioxidant power; DPPH, antiradical activity; AAC, ascorbic acid content; ns, non-significant; F-value, test for comparing model variance with residual (error) variance; p-Value, probability of seeing the observed F-value if the null hypothesis is true; RMSE, root mean square error; PRESS, predicted residual error sum of squares; CV, coefficient of variation; DF, degree of freedom.
Table 4. Maximum predicted responses and optimum extraction conditions for the dependent variables.
Parameters | C (%) (X1) | RL/S (mL/g) (X2) | T (°C) (X3) | t (min) (X4) | Desirability | Stepwise Regression
TPC (mg GAE/g dw) | 25 | 10 | 130 | 20 | 0.9292 | 76.19 ± 17.92
FRAP (μmol AAE/g dw) | 25 | 10 | 160 | 20 | 0.9907 | 1117.68 ± 212.88
DPPH (μmol AAE/g dw) | 0 | 10 | 130 | 20 | 0.8728 | 799.03 ± 271.68
AAC (mg/g dw) | 50 | 70 | 160 | - | 0.9318 | 15.28 ± 2.23
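As a worked consistency check between Tables 3 and 4 (assuming the stepwise models operate on coded factor levels running from −1 at the lowest setting to +1 at the highest, so that the AAC optimum of C = 50%, RL/S = 70 mL/g, and T = 160 °C corresponds to coded values of 0, +1, and +1, respectively), the AAC model reproduces the predicted maximum:

$$\mathrm{AAC}_{\max} = 11.04 + 1.204(0) + 2.283(+1) + 1.956(+1) - 1.02(0)(+1) - 1.69(0)^2 \approx 15.28~\mathrm{mg/g~dw},$$

which matches the 15.28 ± 2.23 mg/g dw entry in Table 4.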
Table 5. Multivariate correlation analysis of measured variables.
Responses | TPC | FRAP | DPPH | AAC
TPC |  | 0.7796 | 0.9129 | −0.0738
FRAP |  |  | 0.8549 | −0.0139
DPPH |  |  |  | −0.0722
AAC |  |  |  | 
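The coefficients in Table 5 (and the matrix in Figure 4) are pairwise Pearson correlations; a minimal pandas sketch using the response columns of Table 2 is shown below (illustrative only, not the authors’ exact script).

```python
# Sketch: Pearson correlations among the antioxidant responses reported in Table 2.
import pandas as pd

responses = pd.DataFrame({
    "TPC":  [37.20, 26.84, 44.43, 37.03, 28.84, 18.96, 34.14, 11.26, 41.10,
             47.41, 60.26, 31.88, 75.85, 30.09, 19.17, 49.64, 52.69],
    "FRAP": [612.74, 366.39, 740.02, 451.66, 331.68, 214.42, 413.30, 119.68, 467.27,
             615.91, 286.87, 233.39, 1089.81, 393.47, 169.93, 696.79, 893.17],
    "DPPH": [243.81, 152.88, 389.29, 230.22, 67.14, 108.73, 174.82, 66.67, 185.26,
             251.75, 388.69, 100.86, 816.36, 150.16, 85.11, 427.46, 497.11],
    "AAC":  [7.47, 9.46, 9.95, 12.02, 3.14, 9.90, 14.65, 8.35, 11.05,
             15.56, 11.89, 5.45, 6.38, 10.24, 13.34, 8.44, 12.45],
})
print(responses.corr(method="pearson").round(4))  # should approximate the coefficients in Table 5
```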
Table 6. The partial least squares (PLS) prediction profiler determined the maximum desirability for all variables under the optimal extraction conditions for the PLE technique.
Parameters | C (%) (X1) | RL/S (mL/g) (X2) | T (°C) (X3) | t (min) (X4) | Desirability | Partial Least Squares (PLS) Regression | Experimental Values
TPC (mg GAE/g dw) | 25 | 10 | 160 | 25 | 0.8429 | 80.29 | 78.23 ± 0.63
FRAP (μmol AAE/g dw) |  |  |  |  |  | 1118.22 | 914.82 ± 1.53
DPPH (μmol AAE/g dw) |  |  |  |  |  | 817.20 | 878.7 ± 6.34
AAC (mg/g dw) |  |  |  |  |  | 10.20 | 17.83 ± 0.25
Table 7. Polyphenolic compound concentrations in the rosemary extract obtained under the optimal PLE conditions.
Polyphenolic Compound | Concentration (μg/g dw)
Catechin | 239 ± 11
Quercetin 3-D-galactoside | 1114 ± 45
Luteolin-7-glucoside | 236 ± 13
Kaempferol-3-glucoside | 442 ± 19
Hesperidin | 3711 ± 96
Rosmarinic acid | 1570 ± 58
Apigenin | 245 ± 7
Kaempferol | 72 ± 2
Rosmanol | 731 ± 27
Carnosic acid | 889 ± 32
Total identified | 9250 ± 311
Values represent the mean of triplicate determinations ± standard deviation.
Table 8. Calibration curve equations for each compound identified through HPLC-DAD.
Polyphenolic Compounds (Standards) | Equation (Linear) | R² | Retention Time (min) | UVmax (nm)
Catechin | y = 11,920.78x − 128.19 | 0.997 | 20.933 | 278
Quercetin 3-D-galactoside | y = 41,489.69x − 35,577.55 | 0.993 | 34.598 | 257
Luteolin-7-glucoside | y = 34,875.94x − 16,827.36 | 0.999 | 35.949 | 347
Kaempferol-3-glucoside | y = 50,916.85x − 42,398.83 | 0.996 | 38.724 | 265
Hesperidin | y = −30,502.75x − 30,502.75 | 0.995 | 40.249 | 283
Rosmarinic acid | y = 50,281.27x − 113,633.31 | 0.995 | 41.644 | 329
Apigenin | y = 95,483.52x − 5214.26 | 0.998 | 55.860 | 227
Kaempferol | y = 93,385.02x − 18,613.03 | 0.999 | 56.883 | 265
Rosmanol | y = 5509.45x − 10,899.23 | 0.994 | 65.924 | 288
Carnosic acid | y = 8883.45x + 101,483.30 | 0.992 | 77.870 | 284
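Assuming, as is standard for HPLC-DAD quantification, that y in Table 8 denotes the detector response (peak area) and x the analyte concentration in the calibration’s working units, each curve can be inverted to convert a measured area into a concentration; for the catechin curve, for example:

$$x_{\mathrm{catechin}} = \frac{y + 128.19}{11{,}920.78}.$$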
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
