Machine Learning-Driven Prediction of CO2 Solubility in Brine: A Hybrid Grey Wolf Optimizer (GWO)-Assisted Gaussian Process Regression (GPR) Approach

Hashemi, Seyed Hossein; Torabi, Farshid; Tontiwachwuthikul, Paitoon

doi:10.3390/en18154205

by Seyed Hossein Hashemi ¹, Farshid Torabi ²,* and Paitoon Tontiwachwuthikul ³

¹ Energy Systems Engineering, University of Regina, Regina, SK S4S 0A2, Canada
² Energy and Process Systems Engineering, University of Regina, Regina, SK S4S 0A2, Canada
³ Clean Energy Technologies Research Institute (CETRi), Faculty of Engineering and Applied Science, University of Regina, Regina, SK S4S 0A2, Canada
* Author to whom correspondence should be addressed.

Energies 2025, 18(15), 4205; https://doi.org/10.3390/en18154205

Submission received: 7 July 2025 / Revised: 25 July 2025 / Accepted: 5 August 2025 / Published: 7 August 2025

(This article belongs to the Special Issue Advances in Carbon Dioxide (CO₂) Enhanced Oil Recovery (EOR) and Carbon Capture and Storage (CCS))

Abstract

The solubility of CO₂ in brine systems is critical for both carbon storage and enhanced oil recovery (EOR) applications. In this study, Gaussian Process Regression (GPR) with eight different kernels was optimized using the Grey Wolf Optimizer (GWO) algorithm to model this important phase behavior. Among the tested kernels, the ARD Matern 3/2 and ARD Matern 5/2 kernels achieved the highest predictive accuracies, with R² values of 0.9961 and 0.9960, respectively, on the test data. This demonstrates superior performance in capturing CO₂ solubility trends. The GWO algorithm effectively tuned the hyperparameters for all kernel configurations, while the ARD capability successfully quantified the influence of key physicochemical parameters on CO₂ solubility. The outstanding performance of the ARD Matern 3/2 and ARD Matern 5/2 kernels suggests their particular suitability for modeling complex thermodynamic behaviors in brine systems. Furthermore, this study integrates fundamental thermodynamic principles into the modeling framework, ensuring all predictions adhere to physical laws while maintaining excellent accuracy (test R² > 0.98). These results highlight how machine learning can improve CO₂ injection processes, both for underground carbon storage and enhanced oil production.

Keywords:

CO₂ solubility; Gaussian process regression; Grey Wolf Optimizer; ARD kernels; carbon storage; enhanced oil recovery

1. Introduction

Rising CO₂ levels in the atmosphere and growing concerns about climate change have made carbon capture and storage (CCS) a practical approach to reducing emissions. By capturing CO₂ at the source and storing it in underground geological formations, significant emission reductions can be achieved [1]. Global storage capacity estimates vary widely, ranging from 8000 to 55,000 gigatons (1 gigaton = 10¹² kg), based on comprehensive analyses of sediment thickness and other geological factors [2]. Deep salt formations are considered the most promising CO₂ storage sites due to their high capacity [3,4], which makes understanding CO₂ solubility in brine crucial. Precise quantification of CO₂ solubility in brine is critical for two key applications: (1) optimizing injection parameters for secure, long-term geological storage in saline aquifers [5], and (2) enhancing hydrocarbon recovery through carbonated water flooding (CWF) operations [6]. Therefore, accurately predicting CO₂ solubility in water across wide ranges of temperature, pressure, and mineral ion concentrations is crucial for both storage and enhanced oil recovery applications. One recent approach to improving these solubility estimates in saline solutions is the use of machine learning algorithms, an approach that has attracted significant attention from researchers in the field.

Table 1 provides an overview of previous studies predicting CO₂ solubility in brine using machine learning and computational algorithms. Bhattacherjee et al. (2023) [7] investigated CO₂ solubility in saline solutions using machine learning, with Extreme Gradient Boosting (XGBoost) emerging as the most accurate algorithm (1.3% average deviation). This ML approach demonstrated comparable precision to traditional equations of state while significantly reducing computational complexity. Sadeghi et al. (2015) [8] studied CO₂ solubility in NaCl brine (0–600 bar, 283–383 K) using both thermodynamic modeling and neural networks, finding that a 5-neuron neural network (R² = 0.975, 3.41% error) outperformed the optimized thermodynamic model (3.55% error). Zou et al. [9] investigated CO₂ solubility in brines (273–453 K, 0.06–100 MPa) using neural networks, demonstrating that a Levenberg–Marquardt optimized cascade forward neural network (CFNN-LM) achieved the highest accuracy (R² = 0.9949, 5.37% error), with pressure having the strongest positive effect while temperature and salt concentrations showed negative impacts due to salting-out effects.

Jeon and Lee [10] developed an accurate ANN model (4.9% error) for CO₂ solubility prediction in brines across wide pressure-temperature ranges, outperforming conventional methods. Yang et al. [11] developed an accurate ANN model (R² > 0.99) predicting CO₂ solubility in both water and brine across reservoir conditions, revealing distinct dissolution mechanisms between pure and saline systems. Mohammadian et al. [12] developed four machine learning models (XGB, MLP, KNN, GA) that accurately predict CO₂ solubility in brine (R² 0.95–0.99), identifying pressure as the most influential factor, with validated performance even beyond original dataset ranges (298–373 K, up to 200 atm).

Du et al. [13] found XGBoost to be the most accurate ML model (R² = 0.9926) for predicting CO₂ solubility in brine across wide pressure (0.098–140 MPa) and temperature (−10 to 450 °C) ranges, with pressure showing the strongest positive effect. Karaei et al. [14] developed a precise least squares support vector machine model to predict CO₂ solubility in brine, demonstrating how temperature, pressure and salinity affect dissolution behavior in enhanced oil recovery systems.

While machine learning approaches have improved CO₂ solubility predictions in brine, existing studies exhibit key limitations: (1) most focus on limited ionic compositions (primarily Na⁺ and Cl⁻), neglecting the complexity of natural brine systems; (2) current models often rely on single-algorithm frameworks without systematic hyperparameter optimization, potentially limiting accuracy; and (3) the performance of Gaussian Process Regression (GPR)—a robust probabilistic method—remains underexplored for this application, particularly with advanced kernel optimization.

This study bridges these gaps by introducing a hybrid GPR-GWO (Grey Wolf Optimizer) model that advances the state of the art in three ways: (1) employing GWO to optimize GPR kernel functions and hyperparameters, enhancing prediction accuracy beyond conventional machine learning methods; (2) incorporating a broader range of ions (e.g., Br⁻, Fe²⁺, Sr²⁺, NH₄⁺) to better represent real-world brine chemistry; and (3) providing a probabilistic framework that quantifies prediction uncertainty, crucial for risk assessment in CCS and EOR applications. By outperforming existing models (Table 1), our approach offers a more reliable tool for optimizing CO₂ storage and hydrocarbon recovery in diverse saline environments. In this study, we also introduce a key innovation: our model incorporates fundamental physics rules to ensure accurate CO₂ solubility predictions. By requiring the model to follow known chemical behaviors (like how solubility changes with temperature, pressure, and salinity), we achieve more reliable results that make scientific sense. This physics-based approach helps our predictions stay accurate even in conditions beyond our training data.

2. Data and Methods

2.1. Algorithms Used in This Work

Given the outstanding performance of Gaussian Process Regression (GPR) in petroleum systems modeling [15], this study employs GPR with multiple kernel functions to predict carbon dioxide (CO₂) solubility in brine. The model parameters are optimized using the Grey Wolf Optimizer (GWO) algorithm. The key steps are:

(1) Gaussian Process Regression (GPR)
▪ Uses multiple kernel functions (Matern, RBF, etc.)
▪ Incorporates mineral ion data as input features

(2) Optimization with Grey Wolf Optimizer (GWO)
▪ Automatically adjusts hyperparameters (e.g., length scale)
▪ Improves model accuracy using a nature-inspired search method

(3) Implementation Steps
▪ Data preprocessing → Kernel selection → GWO optimization → Model validation

This combined approach (GPR + GWO) provides an accurate and reliable model for predicting CO₂ solubility in saline water. We divided the dataset randomly into two parts:

1: 70% for training the model
2: 30% for testing the model

Random splitting helps ensure that both sets have similar characteristics. All reported performance results come only from the test set, which was completely withheld during training, so they measure how well the model performs on unseen data and guard against overestimating its accuracy.

In this study, the random number generator is seeded with rng(42) before the data are split into training and test sets, so the same split is generated in every run. This allows a fair comparison between optimization trials and kernel configurations, because each is evaluated on identical datasets. Without this consistency, differences in performance could not be separated from differences in the split.
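The seeded 70/30 split described above can be sketched as follows. This is a minimal illustration in Python with NumPy, not the authors' code: the `rng(42)` call suggests a MATLAB workflow, and the function name `train_test_split_seeded` is hypothetical. A NumPy generator seeded with 42 will not reproduce the same rows as MATLAB's `rng(42)`; only the principle (one fixed seed, one fixed split) carries over.

```python
import numpy as np

def train_test_split_seeded(n_samples, train_fraction=0.7, seed=42):
    """Return (train_idx, test_idx) drawn from one seeded random permutation."""
    rng = np.random.default_rng(seed)        # fixed seed -> identical split on every run
    order = rng.permutation(n_samples)       # shuffle the row indices once
    n_train = int(round(train_fraction * n_samples))
    return order[:n_train], order[n_train:]  # first 70% train, remaining 30% test

# 500 samples, as in the dataset described in this paper
train_idx, test_idx = train_test_split_seeded(500)
print(len(train_idx), len(test_idx))  # 350 150
```

Because the indices come from a single permutation, the two subsets cannot overlap, and calling the function again with the same seed returns the same split.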

2.1.1. Flowchart Used

This study develops a Gaussian process regression (GPR) model optimized by the Grey Wolf Optimizer (GWO) to predict CO₂ solubility in brine, as illustrated in the flowchart below.

Data Preparation
▪ Input Structure: Features (‘X’) include 13 parameters (T, P, and 11 ion concentrations) as columns, with each row representing one sample. Targets (‘T’) are column-oriented CO₂ solubility values.
▪ Preprocessing: All features are standardized to equalize their influence on the kernel.
Train-Test Split (70–30%)
▪ Randomly split data into training (70%) and testing (30%) sets.
Hyperparameter Optimization
▪ Population Setup:
   ‘n_wolves = 30’: Balances exploration and computational cost.
   ‘n_iterations = 50’: Determined via early stopping if fitness plateaus (<0.1% R² improvement over 5 iterations).
▪ Hyperparameter Bounds:
   Length scales (‘[0.001, 100]’): Wide range accommodates diverse feature sensitivities.
   Sigma (signal variance) (‘[0.001, 10]’): Reflects expected magnitude of solubility variations.
   These bounds ensure a wide search space while preventing numerical instability in GPR.
▪ Kernel Selection: Evaluates 8 kernels (e.g., Matern 3/2, ARD Squared Exponential) during GWO iterations.
GWO Core Mechanics
▪ Leader Hierarchy: Alpha (best), Beta, and Delta wolves guide updates.
▪ Position Update:
   A = 2*a*rand() - a; % Exploration coefficient (a decreases linearly from 2 to 0)
   C = 2*rand(); % Random perturbation
   new_position = (alpha_pos - A*abs(C*alpha_pos - current_pos))/3 + … % Beta/Delta terms
GPR Training with Optimized Kernel
▪ Kernel Configuration: ARD (Automatic Relevance Determination): Each feature gets a unique length scale (optimized by GWO).
Prediction & Evaluation
▪ Predict on training/test sets.
▪ Metrics: R², MAE, RMSE.
Output
▪ Optimal hyperparameters.
▪ Performance metrics and plots.
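The GWO mechanics listed in the flowchart (leader hierarchy, coefficients A and C, averaged position update) can be sketched as a small box-constrained minimizer. This is an illustrative Python/NumPy version under stated assumptions, not the authors' implementation: the function name `gwo_minimize` and the toy objective are hypothetical, the early-stopping rule is omitted for brevity, and a real run would replace the toy objective with a GPR fitness such as a validation error.

```python
import numpy as np

def gwo_minimize(objective, lower, upper, n_wolves=30, n_iterations=50, seed=42):
    """Minimize `objective` inside box bounds with a basic Grey Wolf Optimizer."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    wolves = rng.uniform(lower, upper, size=(n_wolves, lower.size))
    fitness = np.array([objective(w) for w in wolves])
    best_pos, best_fit = wolves[np.argmin(fitness)].copy(), float(fitness.min())

    for it in range(n_iterations):
        # Alpha, Beta and Delta are the three best wolves of the current pack.
        leaders = wolves[np.argsort(fitness)[:3]].copy()
        a = 2.0 * (1.0 - it / max(n_iterations - 1, 1))  # linear decay from 2 to 0
        candidate = np.zeros_like(wolves)
        for leader in leaders:
            A = 2.0 * a * rng.random(wolves.shape) - a   # exploration coefficient
            C = 2.0 * rng.random(wolves.shape)           # random perturbation
            candidate += leader - A * np.abs(C * leader - wolves)
        wolves = np.clip(candidate / 3.0, lower, upper)  # average of the three pulls, kept in bounds
        fitness = np.array([objective(w) for w in wolves])
        if fitness.min() < best_fit:                     # remember the best wolf seen so far
            best_pos, best_fit = wolves[np.argmin(fitness)].copy(), float(fitness.min())
    return best_pos, best_fit

# Toy usage: two "hyperparameters" searched inside the bounds quoted above
# (length scale in [0.001, 100], sigma in [0.001, 10]); the target is made up.
lower, upper = [0.001, 0.001], [100.0, 10.0]
target = np.array([50.0, 4.0])
toy_objective = lambda p: float(np.sum((p - target) ** 2))
best_pos, best_fit = gwo_minimize(toy_objective, lower, upper)
```

Tracking the best wolf across iterations is a design choice of this sketch: the pack is fully replaced each iteration, so without it the final population could be slightly worse than an earlier one.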

2.1.2. GPR Kernel Parameters

Gaussian process regression (GPR) is a non-parametric Bayesian method for regression that effectively balances exploration and exploitation, making it particularly useful in optimization and active learning tasks [16]. Hyperparameter optimization is critical in machine learning, especially for Gaussian Process Regression (GPR), where poor initialization can lead to local minima and high computational costs [17]. This study focuses on hyperparameter optimization in Gaussian Process Regression (GPR) for predicting CO₂ solubility in brine. We systematically evaluate eight kernel functions: Squared Exponential (RBF), Matern 3/2, Matern 5/2, Rational Quadratic, ARD Squared Exponential, ARD Matern 3/2, ARD Matern 5/2, and ARD Rational Quadratic. Using the Grey Wolf Optimizer (GWO), we automatically tune their hyperparameters including length scales (σl), signal variance (σf), and shape parameters (α) where applicable. The proposed GPR-GWO framework demonstrates enhanced accuracy while maintaining computational efficiency, particularly valuable for carbon capture applications. Table 2 presents the details of different kernel functions evaluated in this Gaussian process regression study for hyperparameter optimization.
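For orientation, the standard textbook forms of the four kernel families named above are written out below, using the symbols from this section (length scale σl, signal standard deviation σf, shape parameter α). These are the commonly used parameterizations and are given as an assumption; Table 2 of the article may state them in a slightly different form.

```latex
% r is the Euclidean distance between two inputs x and x'
r = \lVert x - x' \rVert

% Squared Exponential (RBF)
k_{\mathrm{SE}}(x, x') = \sigma_f^2 \exp\!\left(-\frac{r^2}{2\sigma_l^2}\right)

% Matern 3/2
k_{3/2}(x, x') = \sigma_f^2 \left(1 + \frac{\sqrt{3}\,r}{\sigma_l}\right)
                 \exp\!\left(-\frac{\sqrt{3}\,r}{\sigma_l}\right)

% Matern 5/2
k_{5/2}(x, x') = \sigma_f^2 \left(1 + \frac{\sqrt{5}\,r}{\sigma_l} + \frac{5 r^2}{3\sigma_l^2}\right)
                 \exp\!\left(-\frac{\sqrt{5}\,r}{\sigma_l}\right)

% Rational Quadratic
k_{\mathrm{RQ}}(x, x') = \sigma_f^2 \left(1 + \frac{r^2}{2\alpha\sigma_l^2}\right)^{-\alpha}

% ARD variants: replace r / \sigma_l by a distance with one length scale \sigma_m per feature m
\frac{r}{\sigma_l} \;\longrightarrow\;
\sqrt{\sum_{m=1}^{d} \frac{(x_m - x'_m)^2}{\sigma_m^2}}, \qquad d = 13 \text{ in this study}
```

In the ARD forms, a large σ_m shrinks the contribution of feature m to the distance, which is why the learned length scales can be read as inverse relevance.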

Uncertainty Quantification (UQ) is critical for reliability in scientific applications [23]. We employed Gaussian Process Regression (GPR), which inherently provides uncertainty estimates through predictive variance (σ²) alongside point predictions. While GPR accounts for epistemic uncertainty, aleatoric uncertainty (data noise) was addressed via hyperparameter optimization of the noise parameter (sigma) during training.
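How GPR yields a predictive variance alongside each point prediction can be illustrated with a compact NumPy sketch. It assumes the standard ARD Matern 3/2 kernel and the usual Cholesky-based posterior equations; the names `ard_matern32`, `gpr_predict`, `sigma_f` (signal standard deviation) and `sigma_n` (noise standard deviation) are illustrative and not taken from the article, and the toy data are made up.

```python
import numpy as np

def ard_matern32(X1, X2, length_scales, sigma_f):
    """ARD Matern 3/2 kernel matrix between the rows of X1 and the rows of X2."""
    scaled = (X1[:, None, :] - X2[None, :, :]) / length_scales  # one length scale per feature
    r = np.sqrt(np.sum(scaled ** 2, axis=-1))
    s = np.sqrt(3.0) * r
    return sigma_f ** 2 * (1.0 + s) * np.exp(-s)

def gpr_predict(X_train, y_train, X_new, length_scales, sigma_f, sigma_n):
    """Posterior mean and latent-function variance of a GP with a constant mean."""
    y_mean = y_train.mean()                        # simple constant mean function
    K = ard_matern32(X_train, X_train, length_scales, sigma_f)
    K += sigma_n ** 2 * np.eye(len(X_train))       # observation noise on the diagonal
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train - y_mean))
    K_star = ard_matern32(X_train, X_new, length_scales, sigma_f)
    mean = y_mean + K_star.T @ alpha
    v = np.linalg.solve(L, K_star)
    var = np.maximum(sigma_f ** 2 - np.sum(v ** 2, axis=0), 0.0)  # clip round-off negatives
    return mean, var

# Toy one-feature example: query one training location and one far-away point.
X_train = np.arange(5, dtype=float).reshape(-1, 1)
y_train = np.sin(X_train[:, 0])
X_new = np.array([[2.0], [100.0]])
mean, var = gpr_predict(X_train, y_train, X_new, np.array([1.0]), sigma_f=1.0, sigma_n=0.01)
```

In this toy case the variance is close to zero at the training location and approaches `sigma_f ** 2` far from the data, which is the behavior that makes the predictive variance useful as an uncertainty estimate.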

2.1.3. Physics-Informed GPR Model

Machine learning models that follow physics rules give predictions that are accurate and make sense in the real world [24]. In this work, we incorporate physical knowledge directly into the Gaussian Process Regression (GPR) model to ensure predictions obey fundamental physical laws. Although standard GPR models learn patterns purely from data, they may sometimes produce results that contradict known physical behaviors due to data noise or limited training samples. To address this, we apply a physics-informed objective function that penalizes the model when its predicted solubility response behaves inconsistently with established physical principles. Specifically, we numerically estimate the partial derivatives of the predicted CO₂ solubility with respect to temperature, pressure, and salinity using finite difference approximations. According to physical chemistry, CO₂ solubility should decrease with increasing temperature and salinity, but increase with pressure. Therefore, when the model’s predicted derivatives violate these expected monotonic trends (e.g., solubility increases with temperature), a penalty term proportional to the magnitude of the violation is added to the objective function. This combined objective, balancing data fit and physical consistency, guides the optimization of the GPR hyperparameters. As a result, the trained model not only fits the available data accurately but also respects the underlying physics, improving prediction reliability and generalization outside the training domain.
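The penalty described above can be sketched in Python as follows. This is a minimal illustration, not the authors' objective function: the central-difference step, the use of mean squared error as the data-fit term, the penalty weight, and the column layout (0 = temperature, 1 = pressure, 2 = salinity) are all assumptions made for the example.

```python
import numpy as np

def monotonicity_penalty(predict, X, feature_idx, expected_sign, step=1e-2):
    """Mean size of wrong-signed slopes of `predict` along one feature (0 if none)."""
    X_plus, X_minus = X.copy(), X.copy()
    X_plus[:, feature_idx] += step
    X_minus[:, feature_idx] -= step
    slope = (predict(X_plus) - predict(X_minus)) / (2.0 * step)  # central finite difference
    violation = np.maximum(0.0, -expected_sign * slope)          # positive only when the sign is wrong
    return float(np.mean(violation))

def physics_informed_loss(predict, X, y, constraints, weight=1.0):
    """Data-fit error plus weighted monotonicity penalties."""
    mse = float(np.mean((predict(X) - y) ** 2))
    penalty = sum(monotonicity_penalty(predict, X, idx, sign) for idx, sign in constraints)
    return mse + weight * penalty

# Hypothetical column layout: 0 = temperature, 1 = pressure, 2 = salinity.
# Expected trends: solubility falls with T, rises with P, falls with salinity.
constraints = [(0, -1), (1, +1), (2, -1)]

rng = np.random.default_rng(0)
X = rng.random((20, 3))
good_model = lambda Z: -0.5 * Z[:, 0] + 2.0 * Z[:, 1] - 0.3 * Z[:, 2]
bad_model = lambda Z: +0.5 * Z[:, 0] + 2.0 * Z[:, 1] - 0.3 * Z[:, 2]   # wrong temperature trend

penalty_good = sum(monotonicity_penalty(good_model, X, i, s) for i, s in constraints)
penalty_bad = sum(monotonicity_penalty(bad_model, X, i, s) for i, s in constraints)
```

Here the physically consistent toy model incurs no penalty, while the model with the wrong temperature trend is penalized in proportion to its slope (about 0.5). In a hyperparameter search, a loss of this kind would be evaluated for each candidate set of hyperparameters.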

2.2. Data Collection

In this study, 500 samples were collected to evaluate and predict CO₂ solubility in brine for applications in underground carbon storage and enhanced oil recovery. The dataset includes 13 parameters such as temperature, pressure, and concentrations of ions (sodium, calcium, magnesium, potassium, chloride, sulfate, bicarbonate, iron, strontium, etc.). The dataset used in this study consists of experimentally measured CO₂ solubility values in brine across a wide range of temperatures (273–453 K), pressures (0.06–100 MPa), and ionic compositions (including Na⁺, K⁺, Mg²⁺, Ca²⁺, Cl⁻, SO₄²⁻, HCO₃⁻, Br⁻, Fe²⁺, Sr²⁺, and NH₄⁺). The experimental CO₂ solubility data were obtained from published scientific studies and carefully processed to ensure reliable and consistent results. This preprocessing included unit normalization to maintain consistent measurement scales across all parameters, as well as outlier removal to eliminate anomalous data points that could skew the results. To rigorously evaluate the model’s performance, the dataset was randomly split into training (70%) and testing (30%) subsets using a fixed random seed (rng(42)), ensuring reproducibility and fair comparison across different optimization trials. Performance metrics such as R² were calculated exclusively on the test set.
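The evaluation metrics used in this work (R², MAE, RMSE) follow their usual definitions, which can be written in a few lines of Python. The function name `regression_metrics` and the four-point example are illustrative only; in the workflow described here, the inputs would be the measured and predicted solubilities of the held-out test set.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R², MAE and RMSE for paired measured and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    ss_res = np.sum(resid ** 2)                       # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MAE": float(np.mean(np.abs(resid))),
        "RMSE": float(np.sqrt(np.mean(resid ** 2))),
    }

# Small made-up example (not data from the study)
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
# R2 ≈ 0.98, MAE ≈ 0.15, RMSE ≈ 0.158
```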

Table 3 provides detailed information about the collected dataset, including temperature/pressure conditions and target ion concentrations. The complete dataset is available in the Supplementary Materials.

3. Results and Discussion

This study evaluates the performance of machine learning algorithms, focusing on Gaussian process regression with different kernels, using 500 experimental samples (350 training, 150 testing) with 13 input features including temperature, pressure, and ion concentrations (sodium, calcium, iron, strontium, chloride, sulfate, bicarbonate, etc.) for predicting CO₂ solubility. Figure 1 presents the predicted CO₂ solubility in brine using Gaussian Process Regression (GPR) with the Squared Exponential kernel, optimized by the Grey Wolf Optimizer (GWO). The optimal hyperparameters were determined as:

Length Scale: 66.141
Sigma: 3.782

The model demonstrated excellent performance with:

Training R² = 0.9957
Test R² = 0.9793

The length scale (66.141) suggests the model captures long-range trends in the data, indicating smooth variations in CO₂ solubility across input parameters (temperature, pressure, and ion concentrations). The sigma value (3.782) reflects moderate data variability, balancing flexibility and generalization.

Figure 1. CO₂ solubility prediction using optimized Gaussian Process Regression with Squared Exponential kernel optimized by Grey Wolf Optimizer (GWO): (a) Training results, (b) Testing results.

Figure 2 illustrates the predicted CO₂ solubility in brine using Gaussian Process Regression (GPR) with the Matern 3/2 kernel, optimized through the Grey Wolf Optimizer (GWO). The optimal hyperparameters obtained were:

Length Scale: 57.208
Sigma: 9.865

The model delivered outstanding results:

Training R²: 0.9966
Test R²: 0.9819

The selected length scale (57.208) suggests the model effectively captures intermediate-scale patterns in the data, offering a balanced response to both local fluctuations and broader trends in CO₂ solubility across variables such as temperature, pressure, and ionic concentrations. The relatively high sigma value (9.865) reflects the model’s capacity to accommodate a wider range of data variability, enhancing its flexibility while maintaining robust generalization on unseen data.

Figure 2. CO₂ solubility prediction in brine using Gaussian Process Regression (GPR) with Matern 3/2 kernel optimized by Grey Wolf Optimizer (GWO): (a) Training results, (b) Testing results.

In this study, the Gaussian Process Regression (GPR) model employing the Matern 5/2 kernel was optimized using the Grey Wolf Optimizer (GWO) algorithm, as illustrated in Figure 3. The optimization process resulted in a length scale of 51.400 and a signal variance (σ) of 10.000. The model exhibited excellent performance, achieving an R² value of 0.9959 on the training data and 0.9813 on the test data. These findings confirm that the Matern 5/2 kernel, when tuned via GWO, yields a highly accurate and generalizable model with minimal risk of overfitting.

As shown in Figure 4, the Gaussian Process Regression (GPR) model employing the Rational Quadratic kernel was optimized using the Grey Wolf Optimizer (GWO) algorithm. The optimization process yielded a length scale of 63.907, an alpha value of 2.324, and a signal variance (σ) of 7.569. The model demonstrated excellent performance, with an R² of 0.9957 on the training data and 0.9804 on the test data. These findings confirm that the Rational Quadratic kernel, when tuned using GWO, offers a highly accurate and generalizable predictive model.

As shown in Figure 5, the Gaussian Process Regression (GPR) model using the Automatic Relevance Determination (ARD) Squared Exponential kernel was optimized through the Grey Wolf Optimizer (GWO) algorithm. The optimal hyperparameters included individual length scales for each of the 13 input features, ranging from 0.001 to 57.929, reflecting the model’s ability to adaptively weigh feature relevance. The study analyzed thirteen key parameters in the brine samples: (1) ammonium (NH₄⁺), (2) chloride (Cl⁻), (3) sodium (Na⁺), (4) potassium (K⁺), (5) calcium (Ca²⁺), (6) magnesium (Mg²⁺), (7) sulfate (SO₄²⁻), (8) bromide (Br⁻), (9) strontium (Sr²⁺), (10) iron (Fe²⁺), (11) bicarbonate (HCO₃⁻), along with (12) pressure and (13) temperature measurements. These parameters were selected to comprehensively characterize the brine chemistry and its interaction with CO₂ under varying conditions. Specifically, the length scales for features 1 to 13 were: 53.319, 47.600, 44.914, 38.035, 0.001, 33.310, 57.929, 31.718, 10.375, 30.789, 12.304, 28.851, and 20.864, respectively. The signal variance (σ) was optimized to 3.369.

The model demonstrated outstanding performance with an R² of 0.9970 on the training dataset and 0.9892 on the test dataset. These results confirm that the ARD Squared Exponential kernel, when finely tuned using GWO, yields a highly accurate and interpretable model with excellent generalization capability.

Figure 6 shows the Gaussian Process Regression (GPR) model with an Automatic Relevance Determination (ARD) Matern 3/2 kernel, optimized using the Grey Wolf Optimizer (GWO) algorithm. The optimized hyperparameters included distinct length scales for each of the 13 input features, ranging from 1.175 to 99.844, highlighting the model’s ability to adaptively assess feature importance. The length scales for features 1 to 13 were: 54.840, 61.916, 1.742, 20.246, 98.421, 27.175, 1.175, 68.120, 1.931, 89.016, 99.844, 48.230, and 16.950, respectively. The signal variance (σ) was optimized to 3.944. The model achieved exceptional performance, with an R² of 0.9983 on the training set and 0.9961 on the test set. These results demonstrate that the ARD Matern 3/2 kernel, when fine-tuned with GWO, produces a highly accurate, interpretable, and robust predictive model.

Figure 7 presents the Gaussian Process Regression (GPR) model using an ARD Matern 5/2 kernel, optimized with the Grey Wolf Optimizer (GWO). The model automatically determined each feature’s importance through optimized length scales ranging from 3.070 to 63.066 across the 13 input features. Key parameters included:

Length scales: 31.346, 17.390, 3.762, 53.374, 41.885, 17.647, 3.070, 63.066, 54.515, 20.266, 45.695, 36.488, 9.548
Signal variance (σ): 4.590

The model achieved highly accurate predictions:

Training R²: 0.9971 (99.71% of the variance explained)
Test R²: 0.9960 (99.60% of the variance explained)

These results show that the ARD Matern 5/2 kernel effectively balances accuracy and interpretability when optimized with GWO, making it well suited to modeling complex data.

Figure 7. CO₂ solubility prediction in brine using Gaussian Process Regression (GPR) model with ARD Matern 5/2 kernel optimized by GWO: (a) Training results, (b) Testing results.

Figure 8 presents the Gaussian Process Regression (GPR) model employing an Automatic Relevance Determination (ARD) Rational Quadratic kernel, hyperparameter-optimized via the Grey Wolf Optimizer (GWO) algorithm. The optimized hyperparameters comprised distinct length scales for all 13 input features, ranging from 10.033 (Feature 6) to 74.506 (Feature 1), demonstrating the model’s adaptive feature importance assessment. The length scales for Features 1–13 were: 74.506, 55.424, 40.331, 28.313, 17.091, 10.033, 22.545, 60.380, 23.589, 32.909, 46.520, 32.746, and 19.381, respectively. The kernel’s alpha (α) and signal variance (σ) were optimized to 3.945 and 5.129, reflecting the data’s non-linear dynamics and noise tolerance. The model performed strongly, with an R² of 0.9969 (training) and 0.9896 (test). These results underscore the ARD Rational Quadratic kernel’s ability to balance flexibility and interpretability, with GWO fine-tuning enabling robust, high-accuracy predictions.

The results presented in Figures 1–8 demonstrate an important characteristic of Gaussian Process Regression (GPR) optimization: different kernel configurations can lead to distinct hyperparameter values while maintaining comparable predictive performance. Our analysis of eight different kernels shows that each successfully optimized the GPR model, but proposed fundamentally different combinations of length scales and sigma parameters. This phenomenon occurs because the optimization landscape of GPR models typically contains multiple valid solutions. The interaction between length scales and sigma is complex, with their relative ratios often being more important than their absolute values, as evidenced by the kernel formulas in Table 2. Each kernel type processes data differently based on its mathematical properties, leading to variations in the optimized parameters even when achieving similar prediction accuracy. It is worth noting that large length scale values do not necessarily indicate a modeling issue such as excessive smoothness. Instead, they may reflect the physical nature of the system. For example, when a feature has a weak or slowly varying effect on the target variable, the model tends to assign a larger length scale to that input. In contrast, more influential features typically receive smaller length scale values, indicating greater sensitivity in the model output. Therefore, the learned hyperparameters, especially the length scales, can provide meaningful insight into the physical relevance and variability of each input variable.

Figure 9 demonstrates the feature importance derived from the GPR model with ARD Matern 3/2 kernel. The feature importance analysis using the ARD Matern 3/2 kernel in the Gaussian Process Regression (GPR) model showed that features 7, 3, and 9 had the highest influence on CO₂ solubility prediction. These features exhibited notably larger inverse length scale values, indicating a stronger relevance to the model output. The remaining features showed relatively lower importance, highlighting the model’s ability to distinguish dominant input variables.
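The ranking reported above can be reproduced directly from the ARD Matern 3/2 length scales listed earlier, taking relevance as the inverse length scale. The short Python sketch below is illustrative: the normalization to shares is an added convenience, not something stated in the article, and the feature numbers are the 1-based indices used in the text.

```python
import numpy as np

# ARD Matern 3/2 length scales for features 1 to 13, as reported for Figure 6
length_scales = np.array([54.840, 61.916, 1.742, 20.246, 98.421, 27.175, 1.175,
                          68.120, 1.931, 89.016, 99.844, 48.230, 16.950])

relevance = 1.0 / length_scales            # smaller length scale -> more relevant feature
share = relevance / relevance.sum()        # normalized to sum to 1 (illustrative choice)
ranking = np.argsort(-relevance) + 1       # 1-based feature numbers, most relevant first

print(ranking[:3])  # [7 3 9]
```

If the feature numbering given for the ARD Squared Exponential model also applies here, features 7, 3, and 9 correspond to sulfate, sodium, and strontium, respectively; the text does not state this mapping explicitly for Figure 9.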

The closely comparable performance metrics across all kernels (Figures 1–8) confirm that multiple approaches can be equally valid for a given problem. The choice between them should consider:

(1): Interpretability requirements (some kernels provide clearer feature importance)
(2): Flexibility needs (certain kernels handle complex patterns better)
(3): Practical implementation constraints

The optimized GPR models demonstrated strong predictive performance for CO₂ solubility, with test R² values of at least 0.9793 across all eight kernels and at least 0.9892 for the ARD variants. Three key findings emerged from the hyperparameter analysis. First, the automatically learned length scales provided quantitative insights into feature importance, revealing which physicochemical factors most significantly influence solubility. Second, among the kernel functions, the ARD Matern 3/2 and ARD Matern 5/2 kernels delivered the best performance (test R² = 0.9961 and 0.9960, respectively), combining high accuracy with clear physical interpretability, which is particularly valuable for capturing the system’s multiscale behavior. Third, the ARD Rational Quadratic kernel’s parameters (σ = 5.129, α = 3.945) indicated its capacity to model data variability across different scales. Notably, the ARD variants differ only slightly from one another in test R² (0.9892 to 0.9961), and all of them exceed their non-ARD counterparts (0.9793 to 0.9819). This suggests that the automatic relevance determination (ARD) mechanism, through its adaptive feature weighting, contributes more substantially to model success than the specific kernel choice. These results collectively highlight ARD-GPR’s dual strengths: exceptional predictive power for engineering applications and physically meaningful interpretation through learned length scales, offering both accurate forecasts and mechanistic understanding of CO₂ solubility behavior. The ARD results should be interpreted carefully, because the model was trained on data from different sources with different conditions and methods. This variation in the data may affect how important each feature appears. In this study, the Grey Wolf Optimizer (GWO) performed well in tuning the Gaussian Process Regression hyperparameters, yielding reliable results for every kernel.

Figure 10 presents CO₂ solubility predictions generated by the physics-informed Gaussian Process Regression (GPR) model employing the ARD Matern 3/2 kernel. This kernel was selected to enforce physical constraints due to its demonstrated superior accuracy (as shown in Figure 6), enabling both thermodynamic consistency and high-fidelity predictions. The optimized GPR model demonstrated robust predictive performance, achieving a training R² of 0.9971 and testing R² of 0.9857. The learned length scales ([0.0012, 98.2534, 100.0000, 39.2315, 4.0386, 1.4870, 29.2326, 4.2875, 1.6872, 17.9518, 9.9475, 8.2305, 4.6160]) and noise parameter σ = 3.9562 reflect the model’s automatic feature relevance identification. Notably, feature 1 (length scale = 0.0012) emerged as the most dominant physical driver of CO₂ solubility, followed by features 6, 9, and 5 (with scales of 1.4870, 1.6872, and 4.0386, respectively). In contrast, features 2 and 3 exhibited negligible influence. Intermediate values for the remaining features suggest moderate but physically meaningful contributions. The kernel’s σ = 3.9562 value indicates effective noise handling aligned with the expected uncertainty in the experimental data. Furthermore, the model correctly reproduces the expected physical trends: solubility increases with pressure while decreasing with temperature and salinity. These results demonstrate that integrating domain knowledge through hybrid data-physics modeling can yield predictions that are not only statistically accurate but also physically consistent, thereby addressing a major limitation of purely data-driven approaches in scientific and engineering contexts.

This study’s optimized GPR model delivers three key benefits for carbon management applications. First, its high predictive accuracy (test R² ≥ 0.9793) enables reliable CO₂ solubility estimation across diverse reservoir conditions, improving site selection for storage and EOR projects while reducing experimental costs. Second, this capability supports risk mitigation through early detection of solubility constraints and predictive modeling of CO₂ plume behavior. Third, operational efficiencies are achieved by substituting modeling for portions of experimental work, optimizing injection strategies, and accelerating project timelines. These combined advantages offer industry practitioners a robust tool for more economical and effective CCS/EOR implementation.

4. Conclusions

The solubility of carbon dioxide (CO₂) in brine systems is fundamentally important for both carbon storage and enhanced oil recovery (EOR) applications. This study applied Gaussian Process Regression (GPR) coupled with the Grey Wolf Optimizer (GWO) to predict CO₂ solubility in brine systems under various thermodynamic conditions. Of the eight kernel functions evaluated, ARD Matern 3/2 and ARD Matern 5/2 achieved the highest predictive accuracies (R² = 0.9961 and 0.9960, respectively), while all other kernels still demonstrated strong performance (R² > 0.97). The GWO algorithm proved particularly effective in tuning hyperparameters, consistently identifying optimal length scales and sigma values across all kernel configurations. The ARD kernels, which assign a separate length scale to each input, made it possible to quantify how strongly factors such as temperature, pressure, and water salinity affect CO₂ solubility. The robustness of these models across diverse conditions, combined with their ability to handle complex, non-linear relationships, makes them particularly valuable for optimizing carbon management strategies.

Moreover, this study incorporates fundamental physical laws (the thermodynamic principles governing gas solubility in brine) directly into the modeling framework, delivering a predictive tool that combines physical realism with computational accuracy. The demonstrated performance (test R² > 0.97) indicates that the physics-informed approach not only matches experimental data but also extrapolates reliably to untested conditions, a crucial capability for practical CCS and EOR applications. This framework sets a new standard for data-driven modeling in subsurface fluid systems, where honoring physical laws is as important as statistical accuracy. Furthermore, the flexibility to choose among multiple high-performing kernel configurations allows practitioners to select the model that best balances accuracy, computational efficiency, and interpretability for their specific application. These findings establish machine learning-optimized GPR as a powerful tool for advancing carbon capture, utilization, and storage (CCUS) technologies.
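
The hyperparameter search summarized above can be illustrated with a minimal Grey Wolf Optimizer sketch. This is not the authors' implementation: the function name `grey_wolf_optimizer`, its parameters, the population size, and the toy objective are all assumptions chosen for illustration. In a GPR setting the objective would presumably be an error metric evaluated over kernel hyperparameters (for example, log length scales and log σ_f), which is likewise an assumption here.

```python
import random


def grey_wolf_optimizer(objective, bounds, n_wolves=20, n_iter=100, seed=0):
    """Minimize `objective` over box `bounds` with a basic GWO loop.

    Hypothetical helper for illustration; defaults are arbitrary choices.
    Requires n_wolves >= 3 so that alpha, beta, and delta leaders exist.
    """
    rng = random.Random(seed)
    wolves = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_wolves)]
    best_pos, best_val = None, float("inf")

    for t in range(n_iter + 1):
        # Rank the pack; the three best wolves act as alpha, beta, delta.
        ranked = sorted(wolves, key=objective)
        leaders = [list(w) for w in ranked[:3]]
        leader_val = objective(leaders[0])
        if leader_val < best_val:
            best_pos, best_val = leaders[0], leader_val
        if t == n_iter:
            break

        # Control parameter `a` decreases linearly from 2 toward 0.
        a = 2.0 * (1.0 - t / n_iter)
        for wolf in wolves:
            for d, (lo, hi) in enumerate(bounds):
                estimate = 0.0
                for leader in leaders:
                    A = 2.0 * a * rng.random() - a
                    C = 2.0 * rng.random()
                    D = abs(C * leader[d] - wolf[d])
                    estimate += leader[d] - A * D
                # New position: mean of the three leader-guided moves, clipped.
                wolf[d] = min(max(estimate / 3.0, lo), hi)

    return best_pos, best_val


# Toy stand-in objective (not a GPR loss): squared distance from the origin.
def toy_objective(x):
    return sum(v * v for v in x)


position, value = grey_wolf_optimizer(toy_objective, [(-5.0, 5.0), (-5.0, 5.0)])
print(position, value)
```

Searching in log space and clipping to bounds are common design choices for positive hyperparameters such as length scales, but whether the study did so is not stated in this section.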

5. Future Research Directions

Combining thermodynamic principles with machine learning, especially kernel-based Gaussian Process Regression (GPR), offers significant potential for advancing CO₂ solubility predictions. Future work should explore: (1) hybrid models that tightly integrate physical laws with data-driven methods, (2) incorporation and refinement of thermodynamic models to improve physical interpretability, (3) adaptive kernel designs for broader applicability, (4) validation of the model using independent and more homogeneous datasets to further evaluate its robustness and improve the generalizability of the conclusions, and (5) extension to multicomponent brine systems. This approach can bridge gaps between accuracy, scalability, and practicality in carbon storage and EOR.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/en18154205/s1.

Author Contributions

Conceptualization, S.H.H. and F.T.; methodology, S.H.H., F.T. and P.T.; software, S.H.H. and F.T.; validation, S.H.H., F.T. and P.T.; formal analysis, S.H.H. and F.T.; writing—original draft preparation, S.H.H., F.T. and P.T.; writing—review and editing, F.T. and P.T.; supervision, F.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Izadpanahi, A.; Kumar, N.; Tassinari, C.; Ali, M.; Ahmad, T.; Pinto, M.A. A Review of Carbon Storage in Saline Aquifers: Key Obstacles and Solutions. Geoenergy Sci. Eng. 2025, 250, 213806.
2. Izadpanahi, A.; Blunt, M.; Kumar, N.; Ali, M.; Tassinari, C.; Pinto, M.A. A review of carbon storage in saline aquifers: Mechanisms, prerequisites, and key considerations. Fuel 2024, 369, 131744.
3. Ismail, I.; Gaganis, V. Carbon Capture, Utilization, and Storage in Saline Aquifers: Subsurface Policies, Development Plans, Well Control Strategies and Optimization Approaches—A Review. Clean Technol. 2023, 5, 609–637.
4. Gunatilake, T.; Zappone, A.; Zhang, Y.; Zbinden, D.; Mazzotti, M.; Wiemer, S. Quantitative Modeling and Assessment of CO₂ Storage in Saline Aquifers: A Case Study in Switzerland. Carbon Capture Sci. Technol. 2025, 14, 100360.
5. Ratnakar, R.; Chaubey, V.; Dindoruk, B. A novel computational strategy to estimate CO₂ solubility in brine solutions for CCUS applications. Appl. Energy 2023, 342, 121134.
6. Pradhan, S.; Bhattacherjee, R.; Aichele, C.; Bikkina, P. Determination of CO₂ solubility in brines and produced waters of various salinities for CO₂ EOR and storage applications. Chem. Eng. J. 2025, 507, 160401.
7. Bhattacherjee, R.; Botchway, K.; Pashin, J.; Chakraborty, G.; Bikkina, P. Machine learning-based prediction of CO₂ fugacity coefficients: Application to estimation of CO₂ solubility in aqueous brines as a function of pressure, temperature, and salinity. Int. J. Greenh. Gas Control 2023, 128, 103971.
8. Sadeghi, A.; Salami, H.; Taghikhani, V.; Robert, M. A comprehensive study on CO₂ solubility in brine: Thermodynamic-based and neural network modeling. Fluid Phase Equilibria 2015, 403, 153–159.
9. Zou, X.; Zhu, Y.; Lv, J.; Zhou, Y.; Ding, B.; Liu, W.; Xiao, K.; Zhang, Q. Toward Estimating CO₂ Solubility in Pure Water and Brine Using Cascade Forward Neural Network and Generalized Regression Neural Network: Application to CO₂ Dissolution Trapping in Saline Aquifers. ACS Omega 2024, 9, 4705–4720.
10. Jeon, P.R.; Lee, C.H. Artificial neural network modelling for solubility of carbon dioxide in various aqueous solutions from pure water to brine. J. CO2 Util. 2021, 47, 101500.
11. Yang, S.; Wang, D.; Dong, Z.; Li, Y.; Du, D. ANN prediction of the CO₂ solubility in water and brine under reservoir conditions. AIMS Geosci. 2025, 11, 201–227.
12. Mohammadian, E.; Liu, B.; Riazi, A.; Huang, J. Evaluation of Different Machine Learning Frameworks to Estimate CO₂ Solubility in NaCl Brines: Implications for CO₂ Injection into Low-Salinity Formations. Lithosphere 2022, 1615832.
13. Du, X.; Thakur, G.C. Development of Advanced Machine Learning Models for Predicting CO₂ Solubility in Brine. Energies 2025, 18, 1202.
14. Karaei, A.M.; Honarvar, B.; Azdarpour, A.; Mohammadian, E. On prediction of carbon dioxide solubility in aqueous systems of NaCl using LSSVM algorithm. Energy Sources Part A Recovery Util. Environ. Eff. 2022, 44, 2801–2810.
15. Hashemi, S.H.; Torabi, F. Machine Learning-Based Prediction of Scale Inhibitor Efficiency in Oilfield Operations. Processes 2025, 13, 1964.
16. Schulz, E.; Speekenbrink, M.; Krause, A. A tutorial on Gaussian process regression: Modelling, exploring, and exploiting functions. J. Math. Psychol. 2018, 85, 1–16.
17. Ulapane, N.; Thiyagarajan, K.; Kodagoda, S. Hyper-Parameter Initialization for Squared Exponential Kernel-based Gaussian Process Regression. In Proceedings of the 2020 15th IEEE Conference on Industrial Electronics and Applications (ICIEA), Kristiansand, Norway, 9–13 November 2020; pp. 1154–1159.
18. Available online: https://www.mathworks.com/help/stats/kernel-covariance-function-options.html (accessed on 6 July 2025).
19. Kanagawa, M.; Hennig, P.; Sejdinovic, D.; Sriperumbudur, B. Gaussian Processes and Kernel Methods: A Review on Connections and Equivalences. arXiv 2018, arXiv:1807.02582.
20. Beckers, T. An Introduction to Gaussian Process Models. arXiv 2021, arXiv:2102.05497.
21. Rasmussen, C.E.; Williams, C.K.I. Chapter 4: Covariance Functions. In Gaussian Processes for Machine Learning; MIT Press: Cambridge, MA, USA, 2006.
22. Snoek, J.; Larochelle, H.; Adams, R.P. Practical Bayesian optimization of machine learning algorithms. Adv. Neural Inf. Process. Syst. 2012, 25, 2960–2968.
23. Li, K.Q.; Yin, Z.Y.; Zhang, N.; Liu, Y. A data-driven method to model stress-strain behaviour of frozen soil considering uncertainty. Cold Reg. Sci. Technol. 2023, 213, 103906.
24. Li, K.Q.; Yin, Z.Y.; Zhang, N.; Li, J. A PINN-based modelling approach for hydromechanical behaviour of unsaturated expansive soils. Comput. Geotech. 2024, 169, 106174.
25. Rumpf, B.; Nicolaisen, H.; Maurer, G. Solubility of carbon dioxide in aqueous solutions of ammonium chloride at temperatures from 313 K to 433 K and pressures up to 10 MPa. Berichte Bunsenges. Für Phys. Chem. 1994, 98, 1077–1081.
26. El-Maghraby, R.M.; Pentland, C.H.; Iglauer, S.; Blunt, M.J. A fast method to equilibrate carbon dioxide with brine at high pressure and elevated temperature including solubility measurements. J. Supercrit. Fluids 2012, 62, 55–59.
27. Zhao, H.; Dilmore, R.; Allen, D.E.; Hedges, S.W.; Soong, Y.; Lvov, S.N. Measurement and modeling of CO₂ solubility in natural and synthetic formation brines for CO₂ sequestration. Environ. Sci. Technol. 2015, 49, 1972–1980.
28. Li, Z.; Dong, M.; Li, S.; Dai, L. Densities and Solubilities for Binary Systems of Carbon Dioxide + Water and Carbon Dioxide + Brine at 59 °C and Pressures to 29 MPa. J. Chem. Eng. Data 2004, 49, 1026–1031.
29. Poulain, M.; Messabeb, H.; Lach, A.; Contamine, F.; Cézac, P.; Serin, J.P.; Dupin, J.C.; Martinez, H. Experimental Measurements of Carbon Dioxide Solubility in Na–Ca–K–Cl Solutions at High Temperatures and Pressures up to 20 MPa. J. Chem. Eng. Data 2019, 64, 2497–2503.
30. Rumpf, B.; Maurer, G. An Experimental and Theoretical Investigation on the Solubility of Carbon Dioxide in Aqueous Solutions of Strong Electrolytes. Berichte Bunsenges. Für Phys. Chem. 1993, 97, 85–97.
31. Cruz, J.L.; Neyrolles, E.; Contamine, F.; Cézac, P. Experimental Study of Carbon Dioxide Solubility in Sodium Chloride and Calcium Chloride Brines at 333.15 and 453.15 K for Pressures up to 40 MPa. J. Chem. Eng. Data 2021, 66, 249–261.
32. Stewart, P.B.; Munjal, P. Solubility of Carbon Dioxide in Pure Water, Synthetic Sea Water, and Synthetic Sea Water Concentrates at −5° to 25° C. and 10- to 45-Atm. Pressure. J. Chem. Eng. Data 1970, 15, 67–71.
33. Tang, Y.; Bian, X.; Du, Z.; Wang, C. Measurement and prediction model of carbon dioxide solubility in aqueous solutions containing bicarbonate anion. Fluid Phase Equilibria 2015, 386, 56–64.

Figure 3. CO₂ solubility prediction in brine using Gaussian Process Regression (GPR) with Matern 5/2 kernel optimized by Grey Wolf Optimizer (GWO): (a) Training results, (b) Testing results.

Figure 4. CO₂ solubility prediction in brine using Gaussian Process Regression (GPR) with Rational Quadratic kernel optimized by Grey Wolf Optimizer (GWO): (a) Training results, (b) Testing results.

Figure 5. CO₂ solubility prediction using Automatic Relevance Determination (ARD) Squared Exponential kernel: (a) Training results, (b) Testing results.

Figure 6. CO₂ solubility prediction in brine using Gaussian Process Regression (GPR) model with ARD Matern 3/2 kernel optimized by GWO: (a) Training results, (b) Testing results.

Figure 8. CO₂ solubility prediction in brine using Gaussian Process Regression (GPR) model with ARD Rational Quadratic kernel optimized by GWO: (a) Training results, (b) Testing results.

Figure 9. Feature importance in the GPR model using the ARD Matern 3/2 kernel, quantified by the inverse length scale (1/length scale) values. The measured parameters (numbered 1–13 in order) are: NH₄⁺, Cl⁻, Na⁺, K⁺, Ca²⁺, Mg²⁺, SO₄²⁻, Br⁻, Sr²⁺, Fe, HCO₃⁻, Pressure, Temperature.
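
The relevance measure named in the Figure 9 caption, the inverse of each ARD length scale, can be computed as in the short sketch below. The helper `ard_relevance` is hypothetical, and the length-scale values in the usage lines are placeholders invented for illustration. They are not values reported in this study.

```python
def ard_relevance(length_scales, feature_names):
    """Rank features by inverse ARD length scale (larger means more influential).

    Hypothetical helper: a short length scale means the kernel varies quickly
    along that input, so 1/length_scale is used as the relevance score.
    """
    if len(length_scales) != len(feature_names):
        raise ValueError("one length scale is required per feature")
    if any(s <= 0 for s in length_scales):
        raise ValueError("length scales must be positive")
    scores = [(name, 1.0 / s) for name, s in zip(feature_names, length_scales)]
    # Sort from most to least relevant.
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Placeholder length scales for three of the inputs listed in the caption.
example_names = ["Pressure", "Temperature", "Cl-"]
example_length_scales = [0.5, 2.0, 4.0]  # illustrative only, not paper results
for name, score in ard_relevance(example_length_scales, example_names):
    print(f"{name}: {score:.3f}")
```

Because the score depends on the units of each input, such a ranking is only meaningful when the inputs were standardized before fitting. Whether that was done here is not stated in the caption.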

Figure 10. Comparison of actual vs. predicted CO₂ solubility in brine using optimized physics-informed GPR model: (a) Training results, (b) Testing results.

Table 1. Comparative analysis of current and previous approaches for CO₂ solubility prediction in brine.

Studied Ions	Machine Learning Algorithm Used	References
Na⁺, Cl⁻	Linear Regression (LR), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), Extreme Gradient Boosting (XGB)	[7]
Na⁺, Cl⁻	Artificial Neural Network (ANN)	[8]
Na⁺, K⁺, Ca²⁺, Mg²⁺, Cl⁻, HCO₃⁻, SO₄²⁻	Cascade Forward Neural Network (CFNN), Generalized Regression Neural Network (GRNN)	[9]
Na⁺, K⁺, Mg²⁺, Ca²⁺, Cl⁻, SO₄²⁻, HCO₃⁻	Feed-forward Back-propagation Neural Network (BPNN)	[10]
Na⁺, Cl⁻	Multilayer Perceptron (MLP)	[11]
Na⁺, Cl⁻	XGBoost (XGB), K-Nearest Neighbor (KNN), Multilayer Perceptron (MLP), Genetic Algorithm (used to derive an empirical equation)	[12]
Na⁺, Cl⁻	Decision Tree (DT), Random Forest (RF), XGBoost, Multilayer Perceptron (MLP), Support Vector Regression with Radial Basis Function Kernel (SVR-RBF)	[13]
Na⁺, Cl⁻	Least Squares Support Vector Machine (LSSVM) optimized by Particle Swarm Optimization (PSO)	[14]
Na⁺, K⁺, Mg²⁺, Ca²⁺, Cl⁻, SO₄²⁻, HCO₃⁻, Br⁻, Fe²⁺, Sr²⁺, NH₄⁺	A Multi-Kernel Gaussian Process Regression (GPR) Framework Optimized by Grey Wolf Algorithm and Physics-Informed GPR Model	This work

Table 2. Kernel Functions and Their Hyperparameters.

Kernel Function	Mathematical Formula	Hyperparameters	References
Squared Exponential	$k(x_i, x_j \mid \theta) = \sigma_f^{2} \exp\left[-\frac{1}{2} \frac{(x_i - x_j)^{T} (x_i - x_j)}{\sigma_l^{2}}\right]$	σ_l, σ_f	[18,19,20,21]
Matern 3/2	$k(x_i, x_j \mid \theta) = \sigma_f^{2} \left(1 + \frac{\sqrt{3}\, r}{\sigma_l}\right) \exp\left(-\frac{\sqrt{3}\, r}{\sigma_l}\right)$, $r = \sqrt{(x_i - x_j)^{T} (x_i - x_j)}$	σ_l, σ_f	[18,19,20,21]
Matern 5/2	$k(x_i, x_j \mid \theta) = \sigma_f^{2} \left(1 + \frac{\sqrt{5}\, r}{\sigma_l} + \frac{5 r^{2}}{3 \sigma_l^{2}}\right) \exp\left(-\frac{\sqrt{5}\, r}{\sigma_l}\right)$, $r = \sqrt{(x_i - x_j)^{T} (x_i - x_j)}$	σ_l, σ_f	[18,19,20,21]
Rational Quadratic	$k(x_i, x_j \mid \theta) = \sigma_f^{2} \left(1 + \frac{r^{2}}{2 \alpha \sigma_l^{2}}\right)^{-\alpha}$, $r = \sqrt{(x_i - x_j)^{T} (x_i - x_j)}$	σ_l, σ_f, α	[18,20,21]
ARD Squared Exponential	$k(x_i, x_j \mid \theta) = \sigma_f^{2} \exp\left[-\frac{1}{2} \sum_{m=1}^{d} \frac{(x_{im} - x_{jm})^{2}}{\sigma_m^{2}}\right]$	θ_m = log σ_m (m = 1, 2, …, d); θ_d₊₁ = log σ_f	[18,20,22]
ARD Matern 3/2	$k(x_i, x_j \mid \theta) = \sigma_f^{2} \left(1 + \sqrt{3}\, r\right) \exp\left(-\sqrt{3}\, r\right)$, $r = \sqrt{\sum_{m=1}^{d} \frac{(x_{im} - x_{jm})^{2}}{\sigma_m^{2}}}$	θ_m = log σ_m (m = 1, 2, …, d); θ_d₊₁ = log σ_f	[18]
ARD Matern 5/2	$k(x_i, x_j \mid \theta) = \sigma_f^{2} \left(1 + \sqrt{5}\, r + \frac{5}{3} r^{2}\right) \exp\left(-\sqrt{5}\, r\right)$, $r = \sqrt{\sum_{m=1}^{d} \frac{(x_{im} - x_{jm})^{2}}{\sigma_m^{2}}}$	θ_m = log σ_m (m = 1, 2, …, d); θ_d₊₁ = log σ_f	[18,22]
ARD Rational Quadratic	$k(x_i, x_j \mid \theta) = \sigma_f^{2} \left(1 + \frac{1}{2 \alpha} \sum_{m=1}^{d} \frac{(x_{im} - x_{jm})^{2}}{\sigma_m^{2}}\right)^{-\alpha}$	θ_m = log σ_m (m = 1, 2, …, d); θ_d₊₁ = log σ_f; α	[18]
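
As a reading aid for Table 2, the ARD kernels can be written as a few plain Python functions. This is a sketch using only the standard library, not the toolbox code used in the study, and the function names are assumptions. The isotropic kernels in the first four rows follow by passing the same length scale σ_l for every input dimension.

```python
import math


def scaled_distance(xi, xj, length_scales):
    """Euclidean distance after dividing each input by its own length scale."""
    return math.sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(xi, xj, length_scales)))


def ard_squared_exponential(xi, xj, length_scales, sigma_f):
    r = scaled_distance(xi, xj, length_scales)
    return sigma_f ** 2 * math.exp(-0.5 * r ** 2)


def ard_matern32(xi, xj, length_scales, sigma_f):
    r = scaled_distance(xi, xj, length_scales)
    return sigma_f ** 2 * (1.0 + math.sqrt(3.0) * r) * math.exp(-math.sqrt(3.0) * r)


def ard_matern52(xi, xj, length_scales, sigma_f):
    r = scaled_distance(xi, xj, length_scales)
    poly = 1.0 + math.sqrt(5.0) * r + (5.0 / 3.0) * r ** 2
    return sigma_f ** 2 * poly * math.exp(-math.sqrt(5.0) * r)


def ard_rational_quadratic(xi, xj, length_scales, sigma_f, alpha):
    r = scaled_distance(xi, xj, length_scales)
    return sigma_f ** 2 * (1.0 + r ** 2 / (2.0 * alpha)) ** (-alpha)


# Illustrative call with made-up inputs: two points with three features each.
x_a, x_b = [1.0, 0.2, 3.0], [1.5, 0.1, 2.0]
print(ard_matern32(x_a, x_b, length_scales=[1.0, 0.5, 2.0], sigma_f=1.0))
```

Every kernel returns σ_f² when the two inputs coincide, and the rational quadratic kernel approaches the squared exponential as α grows, which gives a quick consistency check on the formulas.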

Table 3. Input parameters and target variable (CO₂ solubility) for the Gaussian process regression model.

Brine Composition	Pressure (MPa)	Temperature (K)	CO₂ Solubility (mol/kg)	References
NH₄Cl + H₂O	0.48–9.69	313.15–433.15	0.09–1.1519	[25]
NaCl + KCl + H₂O	0.34–9	306.15–343.15	0.045–1.105	[26]
NaCl + KCl + MgCl₂ + CaCl₂ + Na₂SO₄ + SrCl₂ + NaBr + H₂O	10–17.5	323.15–423.15	0.326–0.956	[27]
Formation Brine Sample: Ca, Na, Mg, K, Fe, Cl, SO₄	1.76–20.87	332.15	0.24–0.958	[28]
Salt Solution: Na, K, Ca, Cl	1.01–19.93	323–423	0.1305–0.8517	[29]
Al₂(SO₄)₃ + H₂O; Na₂SO₄ + H₂O	0.185–9.868	313–433	0.049–0.7272	[30]
NaCl + CaCl₂ + H₂O	6.06–40.05	333.15–453.15	0.3–1.37	[31]
NaCl + CaCl₂ + MgSO₄ + MgCl₂ + KCl + NaHCO₃ + NaBr + H₂O	0.101325–4.5596	268.15–298.15	0.025–1.4573	[32]
Formation Brine Sample: Ca, Na, Mg, K, Fe, Cl, SO₄, HCO₃	8–40	308.15–408.15	0.46–1.6155	[33]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

