1. Introduction
Rising atmospheric CO2 from industrial activity drives climate change, necessitating effective carbon capture and storage solutions. Storage in saline aquifers is a promising method for high-capacity, long-term CO2 sequestration [1,2]. This process is driven by the dissolution of CO2 into brine and the subsequent geochemical reactions that lead to permanent mineral trapping [3]. The dissolution of CO2 in saline aquifers increases brine density by 0.1–1%, initiating natural convection that further enhances CO2 dissolution through gravitational mixing [4]. The solubility of CO2 is primarily controlled by reservoir temperature, pressure, and salinity [5].
Determining CO2 solubility is essential for predicting its behavior and migration within geological storage sites, such as saline aquifers. Because experimental measurements are often time-consuming and costly, machine learning has emerged as a powerful tool for predicting CO2 solubility in brine accurately and efficiently, and researchers have increasingly adopted it in recent years. Hashemi et al. [6] employed a Gaussian Process Regression (GPR) model optimized by the Grey Wolf Optimizer (GWO) to achieve highly accurate predictions of CO2 solubility in brine. Their physically constrained machine learning framework demonstrates significant potential for optimizing CO2 injection strategies in both carbon storage and enhanced oil recovery applications. Davoodi et al. [7] developed hybrid Long Short-Term Memory (LSTM) models optimized with metaheuristic algorithms, where the LSTM-COA model achieved the highest accuracy and robustness in predicting CO2 solubility in diverse brine systems for carbon storage. Wei et al. [8] developed a fusion model combining BPNN, GRNN, and XGBoost algorithms to predict CO2 and H2S solubility in brine, achieving superior performance over previous methods for applications in carbon storage and sour gas management. Yang et al. [9] developed an Artificial Neural Network (ANN) model that effectively predicts CO2 solubility in both pure water and brine, highlighting distinct dissolution behaviors between the two fluid systems for CCUS applications. Bhattacherjee et al. [10] demonstrated that an Extreme Gradient Boosting machine learning model provides a rapid and accurate method for estimating CO2 fugacity coefficients, which subsequently enabled precise calculations of CO2 solubility in saline solutions. Sadeghi et al. [11] developed both thermodynamic and neural network models for predicting CO2 solubility in NaCl brine, demonstrating that the optimized neural network offered comparable accuracy to the established thermodynamic approach for geological sequestration applications. Zou et al. [12] developed a Cascade Forward Neural Network optimized with the Levenberg–Marquardt algorithm (CFNN-LM) to accurately predict CO2 solubility in multi-component brines, identifying pressure as the most influential parameter and ranking the salting-out effects of various salts for CCS applications.
In this study, to address the critical need for predicting carbon dioxide solubility in brine for carbon storage, we propose a hybrid approach combining machine learning with metaheuristic optimization. Previous studies [6,7,8,9,10,11,12] have employed various machine learning and optimization approaches, including GPR-GWO [6], LSTM-COA [7], BPNN-GRNN-XGBoost fusion [8], ANN [9], XGBoost [10], thermodynamic and neural network models [11], and CFNN-LM [12], to predict CO2 solubility in brine systems. Despite their valuable contributions, these studies are generally limited in two key aspects: (1) they focus on a narrow range of machine learning models or specific optimization algorithms, and (2) the ionic composition of the brine systems considered is often restricted to common ions (e.g., Na+, K+, Ca2+, Mg2+), with limited attention to more complex and diverse ionic species. In contrast, the current study introduces a comprehensive framework that evaluates four distinct machine learning algorithms, namely Neural Networks (NN), Decision Trees (DT), Support Vector Regression (SVR), and Gradient Boosting Machines (GBM), each optimized using the Ant Colony Optimization (ACO) algorithm, which has rarely been applied in this domain. Furthermore, the dataset employed in this work encompasses a wide variety of ions, thereby expanding the chemical complexity and applicability of the models to real-world CCUS scenarios. This combination of diverse learning paradigms, a rarely used optimization strategy, and an expanded ionic database positions the current study as a meaningful advancement in the predictive modeling of CO2 solubility for carbon capture, utilization, and storage applications.
3. Results and Discussion
This study implemented and compared four machine learning models, namely Neural Network (NN), Decision Tree (DT), Gradient Boosting Machine (GBM), and Support Vector Regression (SVR), all optimized using the Ant Colony Optimization (ACO) algorithm, to predict carbon dioxide solubility in mineral compounds. The Pearson correlation matrix of all input features and CO2 solubility is presented in Figure 1. The matrix quantifies linear pairwise dependencies among ionic species, thermodynamic variables, and the target variable. Pressure shows the highest correlation with the target variable, a moderately strong positive linear correlation with CO2 solubility (r = 0.57). This indicates that pressure is the dominant linear driver of solubility within the investigated range and is consistent with the thermodynamic expectation of increased gas dissolution under elevated pressure. Temperature exhibits a weak-to-moderate negative correlation with CO2 solubility (r = −0.29), indicating that solubility decreases with increasing temperature. This trend aligns with the exothermic nature of gas dissolution in aqueous systems and further supports the physical consistency of the dataset. Among ionic species, HCO3− shows a moderate positive correlation with CO2 solubility (r = 0.42), while CO32− also presents a positive correlation (r = 0.35). These values are higher than those of most other ions and suggest that carbonate system components are more directly associated with dissolved CO2 levels, likely due to equilibrium interactions within the carbonate–bicarbonate system. In contrast, NH4+ demonstrates a moderate negative correlation with solubility (r = −0.37). Sulfate (SO42−) shows a weak negative relationship (r = −0.13), while K+ exhibits a weak-to-moderate positive correlation (r = 0.29). The remaining monovalent and divalent ions (Cl−, Na+, Ca2+, Mg2+, Sr2+, Fe2+) display correlations close to zero (|r| ≈ 0.00–0.04), indicating negligible linear dependence with CO2 solubility when evaluated individually.
A notable feature of the matrix is the extremely high intercorrelation among several ionic species. For example, Cl−, Na+, Ca2+, Mg2+, and Fe2+ show near-perfect correlations (r ≈ 0.95–1.00) with one another. Similarly, Br− and Sr2+ are strongly correlated (r = 0.97). These strong interdependencies indicate significant multicollinearity within the ionic composition variables, likely arising from common brine formulations or charge-balanced salt systems reported in the literature sources. Such multicollinearity suggests that individual linear coefficients may not independently represent the physicochemical contribution of each ion.
Importantly, because Pearson correlation captures only linear dependence, weak pairwise correlations do not necessarily imply negligible influence. The complex electrolyte–gas interactions governing CO2 dissolution are inherently nonlinear and multivariate. Therefore, while the correlation matrix provides useful exploratory insight, it does not fully describe the underlying dependencies, further justifying the application of nonlinear machine learning techniques in this study. In solid–liquid–gas equilibrium systems involving CO2 dissolution in brines, interionic and ion–molecule interactions are inherently nonlinear and composition-dependent. Even if two ions are linearly correlated in concentration across the compiled dataset, their physicochemical influence on activity coefficients, complex formation, or carbonate equilibria may differ substantially.
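The two points made above, that proportional ion concentrations produce near-perfect correlations and that a weak Pearson coefficient does not rule out a strong nonlinear dependence, can be illustrated with a minimal sketch. The data below are hypothetical and are not taken from the compiled dataset; `pearson_r` is an illustrative helper implementing the standard definition of the coefficient.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical concentrations: the second ion is a fixed multiple of the first,
# as happens when both originate from dilutions of the same stock brine.
ion_a = [0.0, 0.5, 1.0, 1.5, 2.0]
ion_b = [0.3 * c for c in ion_a]
r_collinear = pearson_r(ion_a, ion_b)

# Hypothetical nonlinear response: y is fully determined by x, but the
# dependence is symmetric, so the linear correlation vanishes.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [v ** 2 for v in x]
r_nonlinear = pearson_r(x, y)

print(f"collinear ions:     r = {r_collinear:.3f}")
print(f"quadratic response: r = {r_nonlinear:.3f}")
```

In the second case the target is an exact function of the input, yet the coefficient is zero, which is why near-zero entries in a correlation matrix are not by themselves grounds for discarding a feature.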
Figure 2 illustrates the distribution plots of the 556 collected samples, offering important insight into the statistical structure and thermodynamic consistency of the dataset. The ionic composition variables exhibit clustered and discrete concentration levels, indicating that the dataset was compiled from multiple literature sources with distinct brine formulations rather than from a single continuous experimental campaign. For several ions, including NH4+, Cl−, Na+, Ca2+, and Mg2+, a considerable number of samples are concentrated at or near zero concentration. This suggests that many reported brine systems did not contain these species, while other subsets of the dataset represent specific chemical environments with elevated concentrations. In contrast, ions such as SO42−, HCO3−, and CO32− display multiple clustered concentration levels, reflecting variations in carbonate–sulfate equilibria among different brine systems. Trace species such as Fe2+ and Br− exhibit narrow concentration ranges, indicating limited variability across the compiled studies.
The pressure–solubility relationship demonstrates a clear positive trend. CO2 solubility increases systematically with increasing pressure across the investigated range (approximately 0–45 MPa). The relationship appears approximately linear at lower pressures and gradually transitions to a milder slope at higher pressures, which is consistent with the expected behavior of gas dissolution under increasing compressibility effects. This supports the thermodynamic consistency of the collected data: the near-linear low-pressure regime aligns with Henry’s law, and the flattening at higher pressures reflects the expected departure from it. In contrast, the temperature–solubility plot reveals an overall negative dependence of CO2 solubility on temperature across the range of approximately 280–450 K. Higher temperatures correspond to reduced solubility levels, which is consistent with the exothermic nature of gas dissolution in aqueous systems. The data coverage across this wide temperature interval supports the applicability of the developed model to a broad spectrum of subsurface and CCUS-related conditions.
The scatter plots of individual ionic species versus CO2 solubility indicate predominantly nonlinear and composition-dependent behavior. Divalent cations such as Ca2+, Mg2+, and Sr2+ tend to correspond to slightly reduced solubility levels at higher concentrations, suggesting a salting-out effect driven by increased ionic strength. Carbonate and bicarbonate species (HCO3− and CO32−) display structured clusters, reflecting their involvement in chemical equilibrium reactions that influence dissolved CO2 speciation. Overall, the absence of simple linear trends between most individual ions and solubility suggests that multivariate nonlinear modeling approaches, such as machine learning, are appropriate for capturing the complex interactions within the system. Importantly, despite the clustered nature of several ionic variables, the dataset spans a broad operational domain in pressure, temperature, and chemical composition. This diversity supports the robustness and generalization capability of the predictive model developed in this study.
In Figure 3, the predicted results of carbon dioxide solubility in the presence of minerals are presented based on training, validation, and testing, using the Neural Network Algorithm optimized by Ant Colony Optimization. The ACO search yielded an optimal architecture of 3 layers and 15 neurons.
The model performance, expressed as R2, RMSE, and MAE, is as follows:
Training R2: 0.968 (RMSE = 0.0701, MAE = 0.0268);
Validation R2: 0.887 (RMSE = 0.1200, MAE = 0.0473);
Testing R2: 0.930 (RMSE = 0.1083, MAE = 0.0572).
These results indicate that the optimized neural network demonstrates strong predictive ability and good generalization across training, validation, and test datasets.
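For reference, the three reported metrics can be computed as in the sketch below. It assumes the standard definitions of the coefficient of determination, root mean squared error, and mean absolute error; the text does not state the exact formulas used, and the sample values are hypothetical.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return R2, RMSE, and MAE for paired observations and predictions."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)            # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(r) for r in residuals) / n,
    }

# Hypothetical measured and predicted solubilities (arbitrary units).
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
metrics = regression_metrics(measured, predicted)
print({name: round(value, 4) for name, value in metrics.items()})
```

Because RMSE squares the residuals before averaging, it is never smaller than MAE, and a large gap between the two points to a few large errors rather than uniformly distributed ones.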
In Figure 4, the prediction results for carbon dioxide solubility in mineral compounds are presented based on training, validation, and testing data, utilizing a Decision Tree model optimized by Ant Colony Optimization (ACO). The model was fine-tuned with the following optimal parameters: MinLeaf of 4, MinParent of 23, and MaxSplit of 135. The resulting decision tree contains 67 total nodes, with 34 leaf nodes, and reaches a maximum depth of 7. The performance evaluation using the R2 metric demonstrates the model’s effectiveness across different datasets:
Training R2: 0.914 (RMSE = 0.1152, MAE = 0.0812);
Validation R2: 0.848 (RMSE = 0.1390, MAE = 0.0920);
Testing R2: 0.912 (RMSE = 0.1215, MAE = 0.0893).
These results confirm that the ACO-optimized decision tree is accurate and reliable for predicting CO2 solubility in mineral systems, with a balanced structure that mitigates overfitting while maintaining explanatory power.
In Figure 5, the predicted solubility of carbon dioxide in mineral compounds based on training, validation, and testing data is presented using a Support Vector Regression (SVR) model optimized by Ant Colony Optimization (ACO).
The following optimal hyperparameters were identified for the SVR model with an RBF kernel: C = 52.64, ε = 0.001, and γ = 5.26.
The model’s performance was evaluated using the R2, RMSE, and MAE metrics, yielding the following results:
Training R2: 0.942 (RMSE = 0.0948, MAE = 0.0432);
Validation R2: 0.854 (RMSE = 0.1363, MAE = 0.0659);
Testing R2: 0.889 (RMSE = 0.1363, MAE = 0.0659).
These results indicate that the ACO-optimized SVR model successfully learned complex relationships within the training data while maintaining a strong ability to generalize. The high training score reflects an excellent fit, and the competitive testing score demonstrates the model’s robustness and reliability for predicting CO2 solubility in new, unseen mineral compositions.
In Figure 6, the prediction results for carbon dioxide solubility in mineral compounds are presented based on training, validation, and testing, employing a Gradient Boosting Machine (GBM) model optimized by Ant Colony Optimization (ACO). The model was fine-tuned with the following optimal parameters: MinLeaf of 4, MinParent of 23, MaxSplit of 13, nTrees of 50, and a Learning Rate of 0.2695. The model demonstrated exceptional performance:
Training R2: 0.995 (RMSE = 0.0285, MAE = 0.0178);
Validation R2: 0.995 (RMSE = 0.0261, MAE = 0.0193);
Testing R2: 0.986 (RMSE = 0.0478, MAE = 0.0362).
These results underscore the GBM model’s superior capability in capturing the complex relationships governing CO2 solubility in mineral systems, establishing it as a highly reliable tool for this application.
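The boosting principle behind these results can be sketched in a few lines. The example below is an illustration only and not the implementation used in this study: it assumes squared-error loss, one-split trees (stumps) instead of trees constrained by MinLeaf, MinParent, and MaxSplit, and a hypothetical one-dimensional saturating curve in place of the real dataset. Only the tree count (50) and learning rate (0.2695) are taken from the text.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression tree (stump) to the residuals by least squares."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # the largest value would empty the right leaf
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean

def fit_gbm(x, y, n_trees=50, learning_rate=0.2695):
    """Each new stump is fitted to the residuals left by the current ensemble."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [pi + learning_rate * predict_stump(stump, xi)
                 for xi, pi in zip(x, preds)]
    return {"base": base, "learning_rate": learning_rate, "stumps": stumps}

def predict_gbm(model, xi):
    correction = sum(predict_stump(s, xi) for s in model["stumps"])
    return model["base"] + model["learning_rate"] * correction

def mse(y, preds):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)

# Hypothetical saturating pressure-solubility curve (not the study's data).
pressure = [float(p) for p in range(1, 41)]
solubility = [p / (p + 10.0) for p in pressure]

model = fit_gbm(pressure, solubility)
baseline_mse = mse(solubility, [model["base"]] * len(solubility))
boosted_mse = mse(solubility, [predict_gbm(model, p) for p in pressure])
print(f"mean-only MSE: {baseline_mse:.5f}, boosted MSE: {boosted_mse:.5f}")
```

The learning rate scales each stump's correction, so a smaller value needs more trees to reach the same training error; this is the trade-off the hyperparameter search has to balance.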
All models achieved strong predictive performance, with test R2 scores ranging from 0.889 to 0.986. The Gradient Boosting Machine emerged as the top performer, achieving R2 = 0.995 on both the training and validation sets and 0.986 on the test set, which indicates strong generalization with minimal overfitting. Its ensemble approach of sequentially correcting the errors of multiple shallow trees, combined with the tuned learning rate (0.2695) and tree count (50), allowed it to capture complex, nonlinear relationships in the data. The Neural Network also performed well (test R2 = 0.930), benefiting from its deeper architecture (3 layers, 15 neurons) compared with simpler networks and showing relatively low error rates, although its validation score (0.887) is noticeably below its training score (0.968). The Decision Tree reached a test R2 of 0.912, nearly identical to its training score (0.914). Its key strength is interpretability and transparency: with 67 nodes and a depth of 7, the model’s structure and decision paths can be visualized and understood, providing insight into which mineral features most influence solubility. The Support Vector Regression delivered solid results (test R2 = 0.889), the lowest test score of the four models. The optimal parameters (C = 52.64, ε = 0.001, γ = 5.26) allowed the RBF kernel to model complex relationships, while the regularization parameter limited overfitting.
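The comparison above can be reproduced directly from the reported scores. The short sketch below ranks the models by test R2 and computes the difference between training and test R2 as a simple indicator of overfitting; the numbers are the R2 values reported in this section, while the gap itself is an illustrative diagnostic rather than a metric used in the study.

```python
# R2 values as reported for the ACO-optimized models.
reported_r2 = {
    "NN":  {"train": 0.968, "val": 0.887, "test": 0.930},
    "DT":  {"train": 0.914, "val": 0.848, "test": 0.912},
    "SVR": {"train": 0.942, "val": 0.854, "test": 0.889},
    "GBM": {"train": 0.995, "val": 0.995, "test": 0.986},
}

# Train-minus-test gap: larger values suggest more overfitting.
gaps = {name: round(s["train"] - s["test"], 3) for name, s in reported_r2.items()}

# Models ordered from highest to lowest test R2.
ranking = sorted(reported_r2, key=lambda name: reported_r2[name]["test"], reverse=True)

for name in ranking:
    print(f"{name}: test R2 = {reported_r2[name]['test']:.3f}, train-test gap = {gaps[name]:.3f}")
```

Read this way, the Decision Tree has the smallest gap and the SVR the largest, while the GBM combines the highest test score with a gap below 0.01.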
The choice of the optimal model depends on the project’s specific priority. If maximum predictive accuracy is paramount, the GBM model is unequivocally the best choice. If model interpretability and transparency are critical for scientific understanding, the Decision Tree is highly valuable. For scenarios requiring a robust neural network with strong performance, the optimized Neural Network with 3 layers is an excellent candidate. The SVR offers balanced performance when consistent generalization across all datasets is needed. This comparative analysis demonstrates that ACO is an effective metaheuristic for tuning diverse ML models, successfully navigating different hyperparameter spaces to enhance their predictive performance for a complex scientific problem.
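To make the tuning procedure concrete, the following is a minimal sketch of ant colony optimization over a discrete hyperparameter grid. It is not the configuration used in this study: the grid, the colony size, the evaporation rate, and the objective `toy_validation_error` (a stand-in for a model's validation error) are all hypothetical, chosen only so that the example runs on its own.

```python
import math
import random

def aco_search(grid, objective, n_ants=10, n_iters=30, evaporation=0.3, seed=0):
    """Minimise a non-negative objective over a discrete grid with a basic ACO scheme."""
    rng = random.Random(seed)
    names = list(grid)
    # One pheromone level per candidate value of each hyperparameter.
    pheromone = {name: [1.0] * len(grid[name]) for name in names}
    best_choice, best_score = None, float("inf")
    for _ in range(n_iters):
        colony = []
        for _ in range(n_ants):
            # Each ant picks one value per hyperparameter, with probability
            # proportional to the pheromone on that value.
            idx = {name: rng.choices(range(len(grid[name])),
                                     weights=pheromone[name])[0]
                   for name in names}
            choice = {name: grid[name][i] for name, i in idx.items()}
            score = objective(choice)
            colony.append((score, idx))
            if score < best_score:
                best_choice, best_score = choice, score
        # Evaporation: all trails fade, so poor early choices are forgotten.
        for name in names:
            pheromone[name] = [(1.0 - evaporation) * t for t in pheromone[name]]
        # Deposit: ants with lower error reinforce their choices more strongly.
        for score, idx in colony:
            for name, i in idx.items():
                pheromone[name][i] += 1.0 / (1.0 + score)
    return best_choice, best_score

def toy_validation_error(params):
    """Hypothetical error surface with its minimum at C=100, gamma=1, epsilon=0.001."""
    return ((math.log10(params["C"]) - 2.0) ** 2
            + math.log10(params["gamma"]) ** 2
            + (math.log10(params["epsilon"]) + 3.0) ** 2)

# Hypothetical SVR-style search space.
grid = {
    "C": [0.1, 1.0, 10.0, 100.0, 1000.0],
    "gamma": [0.01, 0.1, 1.0, 10.0],
    "epsilon": [0.001, 0.01, 0.1],
}

best_params, best_score = aco_search(grid, toy_validation_error)
print(best_params, round(best_score, 4))
```

In a real application the objective would train the model with the sampled hyperparameters and return its validation error, which makes each ant evaluation the dominant cost; the same loop applies unchanged to any of the four model types because only the grid and the objective differ.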
In Table 1, a comparison of model performance with and without ACO optimization is presented. The results clearly show that using Ant Colony Optimization to tune the machine learning models consistently improved their performance on unseen test data. This means the optimization algorithm successfully found better hyperparameters for each model, making them more reliable for predicting CO2 solubility in new situations.
For the Neural Network, ACO had a clear impact. While the training score stayed nearly the same, the test R2 rose from 0.901 to 0.930 and the validation score improved from 0.835 to 0.887. This shows that the optimized network, with its three layers and fifteen neurons, generalized better and overfit the training data less. The Support Vector Regression model saw the largest test-set benefit from ACO. Its test score increased from 0.846 to 0.889, and the validation score rose from 0.803 to 0.854. This improvement indicates that ACO was important for finding suitable C, epsilon, and gamma values, making the SVR model more robust. The Decision Tree model also became more accurate after optimization, with its test score increasing from 0.897 to 0.912. The Gradient Boosting Machine was already the best performer, but ACO still improved it: the test score increased from 0.979 to 0.986, and the validation score rose from 0.957 to match the training score at 0.995. This suggests that ACO found a well-balanced learning rate and tree count, reducing the small amount of overfitting and yielding a stable and accurate model.
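The size of each improvement follows from simple arithmetic on the test scores quoted above. The sketch below computes the gain in test R2 for each model, using only the values stated in the text; the variable names are illustrative.

```python
# Test R2 (without ACO, with ACO) as quoted in the text.
test_r2 = {
    "NN":  (0.901, 0.930),
    "SVR": (0.846, 0.889),
    "DT":  (0.897, 0.912),
    "GBM": (0.979, 0.986),
}

# Absolute gain in test R2 attributable to the ACO-tuned hyperparameters.
gains = {model: round(after - before, 3) for model, (before, after) in test_r2.items()}
largest_gain = max(gains, key=gains.get)

for model, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{model}: +{gain:.3f}")
print("largest gain:", largest_gain)
```

The gains shrink as the untuned score approaches 1, which is expected: the GBM had the least unexplained variance left to recover, so its absolute improvement is the smallest even though its final score is the highest.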