Integration of Machine Learning and Feature Analysis for the Optimization of Enhanced Oil Recovery and Carbon Sequestration in Oil Reservoirs

Mepaiyeda, Bukola; Ezeh, Michal; Olafadehan, Olaosebikan; Oladipupo, Awwal; Adebayo, Opeyemi; Osaro, Etinosa

doi:10.3390/chemengineering10010001


by Bukola Mepaiyeda ¹, Michal Ezeh ¹,*, Olaosebikan Olafadehan ², Awwal Oladipupo ³, Opeyemi Adebayo ¹ and Etinosa Osaro ⁴,*

¹ Department of Petroleum and Gas Engineering, University of Lagos, Akoka 100213, Nigeria
² Department of Chemical Engineering, University of Lagos, Akoka 100213, Nigeria
³ Department of Chemical Engineering, University of Michigan, Ann Arbor, MI 48109, USA
⁴ Department of Chemical and Biomolecular Engineering, University of Notre Dame, Notre Dame, IN 46556, USA
* Authors to whom correspondence should be addressed.

ChemEngineering 2026, 10(1), 1; https://doi.org/10.3390/chemengineering10010001

Submission received: 25 September 2025 / Revised: 14 November 2025 / Accepted: 9 December 2025 / Published: 19 December 2025


Abstract

The dual imperative of mitigating carbon emissions and maximizing hydrocarbon recovery has amplified global interest in carbon capture, utilization, and storage (CCUS) technologies. These integrated processes hold significant promise for achieving net-zero targets while extending the productive life of mature oil reservoirs. However, their effectiveness hinges on a nuanced understanding of the complex interactions between geological formations, reservoir characteristics, and injection strategies. In this study, a comprehensive machine learning-based framework is presented for estimating CO₂ storage capacity and enhanced oil recovery (EOR) performance simultaneously in subsurface reservoirs. The methodology combines simulation-driven uncertainty quantification with supervised machine learning to develop predictive surrogate models. Simulation results were used to generate a diverse dataset of reservoir and operational parameters, which served as inputs for training and testing three machine learning models: Random Forest, Extreme Gradient Boosting (XGBoost), and Artificial Neural Networks (ANN). The models were trained to predict three key performance indicators (KPIs): cumulative oil production (bbl), oil recovery factor (%), and CO₂ sequestration volume (SCF). All three models exhibited exceptional predictive accuracy, achieving coefficients of determination (R²) greater than 0.999 across both training and testing datasets for all KPIs. Specifically, the Random Forest and XGBoost models consistently outperformed the ANN model in terms of generalization, particularly for CO₂ sequestration volume predictions. These results underscore the robustness and reliability of machine learning models for evaluating and forecasting the performance of CO₂-EOR and sequestration strategies. To enhance model interpretability and support decision-making, SHapley Additive exPlanations (SHAP) analysis was applied. 
SHAP, grounded in cooperative game theory, offers a model-agnostic approach to feature attribution by assigning an importance value to each input parameter for a given prediction. The SHAP results provided transparent and quantifiable insights into how geological and operational features such as porosity, injection rate, water production rate, pressure, etc., affect key output metrics. Overall, this study demonstrates that integrating machine learning with domain-specific simulation data offers a scalable approach for optimizing CCUS operations. The insights derived from the predictive models and SHAP analysis can inform strategic planning, reduce operational uncertainty, and support more sustainable oilfield development practices.

Keywords:

carbon capture utilization and storage; cumulative oil production; cumulative CO₂ storage; machine learning; oil reservoir; oil recovery factor

1. Introduction

With the global increase in energy demand, and the oil and gas sector accounting for approximately 80% of the world’s energy [1,2,3], sustainable production is needed to reduce carbon emissions associated with fossil fuel production [4,5]. Currently, one of the primary methods for reducing CO₂ is carbon capture, utilization, and storage (CCUS) [6,7,8,9,10,11]. An important component of CCUS involves identifying suitable storage options for the captured CO₂, including underground geological formations such as coal seams, oil, and gas reservoirs, residual oil zones, basalt formations, and deep saline aquifers [11,12,13]. Among these options, oil and gas reservoirs are considered the preferred choice for CO₂ storage due to the benefits of enhanced oil recovery, which not only increases oil production [14] but also has the potential to offset the cost of sequestration and extend the reservoir’s production life [15,16,17,18,19]. Additionally, the known geological characteristics and proven containment capacities of these reservoirs through previous hydrocarbon extraction make them highly promising [9,20]. As a result of these dual benefits, carbon (IV) oxide-enhanced oil recovery (CO₂–EOR) has emerged as a significant option for efficiently utilizing produced CO₂ as part of CCUS operations in the oil and gas industry [15,21].

Recently, CO₂–EOR has been used commercially to enhance oil recovery from light and medium-gravity reservoirs nearing the end of their development [22,23]. In a typical CO₂–EOR operation, injected CO₂ that is exposed to the proper conditions can dissolve in and displace residual oil trapped in rock pores. The mutual solubility of crude oil and CO₂, which depends on the temperature and pressure conditions of the geologic reservoir, is the fundamental principle of CO₂–EOR [24]. The suitability of CO₂–EOR and storage varies across reservoir types according to their technical and physical properties [25]. These differences mean that several factors must be considered, and they often complicate the optimization of CO₂–EOR and the accurate modeling of CO₂ storage. Reservoir heterogeneity is one major factor that impacts oil recovery and CO₂–EOR efficiency, as high-permeability zones may hinder the reservoir’s oil displacement and storage capacity [18,19,26]. Another factor is the secure, long-term storage of CO₂ in geological formations [27,28], which involves subsurface trapping mechanisms such as structural, residual, mineral, and solubility trapping. Understanding these mechanisms and how they depend on reservoir conditions and injection rates is crucial for effective implementation and for identifying the most suitable techniques for optimizing injection and ensuring proper storage [29,30]. Therefore, addressing these challenges requires advances in reservoir characterization, enhanced data-driven models, and accurate modeling of CO₂ storage to maximize oil recovery and carbon sequestration [31,32,33,34].

There are three approaches to implementing CO₂–EOR in conjunction with carbon sequestration: conventional, which focuses primarily on EOR rather than CO₂ storage; advanced, in which the operator attempts to utilize and store more CO₂ while increasing oil recovery; and maximum storage, in which CO₂–EOR focuses primarily on carbon sequestration rather than oil recovery [35]. Most researchers have focused on the conventional approach, optimizing enhanced oil recovery rather than CO₂ storage [36], but recent studies have examined the performance of CO₂–EOR and carbon storage simultaneously in oil reservoirs, especially through the integration of machine learning (ML). ML has become a transformative tool in energy studies, material design, drug discovery, and petroleum engineering because of its ability to identify patterns, model complex relationships, and make predictions based on data. Its applications span diverse fields, from healthcare to energy systems, driven by its capacity to analyze large datasets and generate actionable insights. Within subsurface reservoir engineering and CO₂ storage, ML has gained traction for tasks such as predicting storage capacity, optimizing enhanced oil recovery (EOR) processes, and monitoring injection dynamics [37,38,39,40]. ML approaches are generally categorized into three paradigms. Supervised learning, the most widely applied in reservoir studies, utilizes labeled datasets to predict target outputs such as cumulative oil production or CO₂ storage capacity. By contrast, unsupervised learning uncovers hidden structures within unlabeled data, such as clustering geological formations based on porosity and permeability. Reinforcement learning, although less commonly applied in this domain, trains algorithms to make sequential decisions in dynamic systems, such as optimizing injection rates over time [41,42,43,44].

ML gives deeper insight into the key parameters influencing CO₂–EOR and carbon storage and helps optimize their performance [45,46,47,48]. It complements and improves on conventional reservoir engineering methods, which often fall short in accurately predicting the behavior of CO₂ in such reservoirs [18,19,32,49].

Several ML algorithms have been widely used in this area of research, including neural networks, multivariate adaptive regression splines, support vector machines, long short-term memory networks, and complex ensemble methods, amongst others [50]. Ampomah et al. [51] focused on co-optimizing oil recovery and CO₂ storage in the Farnsworth Unit hydrocarbon reservoir in Ochiltree County, Texas, using a genetic optimization algorithm. The results showed that the reservoir could potentially store 95% CO₂ and obtain 80% oil recovery, and that proxy modeling was useful in reducing computation time in CO₂–EOR co-optimization. Ampomah et al. [36] further devised a neural network and an uncertainty model for the co-optimization of oil recovery and CO₂ storage. This work extended Ampomah et al. [51] by co-optimizing the operational variables under an additional layer of geologic uncertainty. Furthermore, Van and Chon [52] utilized ANN to effectively predict and manage the CO₂ flooding process in a combined CCS–EOR project. The three parameters quantitatively predicted were oil recovery, net CO₂ storage, and cumulative CO₂ production. Their results indicated an optimal oil recovery of 22% to 30% of oil in place and approximately 21,000–29,000 tons of CO₂ sequestered underground after 35 cycles if injection began at 60% water saturation. Chen et al. [53] predicted the performance of CO₂ storage and oil recovery using multiple models, including multivariate adaptive regression splines (MARS), support vector regression (SVR), and random forest (RF). Their research identified MARS as the model with the best predictive accuracy, whereas SVR had the lowest.

You et al. [54,55] employed hybridized multi-layer and radial basis function (RBF) neural networks, and ANN-based proxy models and particle swarm optimization (PSO), respectively, to co-optimize oil recovery field performance and CO₂ storage in the Farnsworth Unit (FWU). They developed workflows that highlighted the feasibility of ML to optimize a development plan to maximize oil output and CO₂ storage. Additionally, Vo Thanh et al. [45] worked on the application of ANN for predicting the performance of CO₂–enhanced oil recovery and storage in residual oil zones. The database was created using uncertainty parameters, which included the well operations and geological aspects, and the model had a root mean square error of less than 2%. Most recently, Khan et al. [56] utilized proxy models of multilayered neural networks (MLNN) combined with PSO and genetic algorithm (GA) to forecast cumulative oil production and CO₂ storage. Their results showed that MLNN–PSO had higher predictive accuracy and PSO developed more proxies 16 times faster, while GA proxies were 10 times faster than the reservoir simulation in finding the optimal solution. The most common ANN architecture is the artificial feed-forward neural network and the multi-layer neural network [57].

Overall, the integration of ML for optimizing CO₂–EOR and carbon sequestration in oil reservoirs has advanced, and models have shown promising results in terms of their predictive accuracy and efficiency. However, challenges remain in the capability of models in terms of generalizability across different geological and reservoir parameters, and because neural networks are the most popular ML algorithm in this field [58], there is a need for an adequate balance in predictive accuracy with interpretability to communicate how the model works. As a result, the main objective of this study is to evaluate the optimization of CO₂–EOR and carbon sequestration in terms of the oil recovery factor, cumulative oil produced, and the CO₂ sequestration volume, taking into consideration both geological and reservoir data to investigate possible heuristics for developing a machine learning model that is suitable for a wide range of geological formations.

2. Methodology

This study aims to develop a data-driven model to evaluate and optimize oil recovery and CO₂ storage simultaneously in oil reservoirs. A comprehensive workflow is devised that combines reservoir simulation modeling with machine learning. The key performance indicators are the oil recovery factor, cumulative oil production, and cumulative CO₂ storage capacity. The reservoir simulation is described first, covering the software, assumptions, and parameters, followed by the machine learning algorithms, which are detailed together with the code and data provision.

2.1. Reservoir Simulation

In this study, a three-dimensional reservoir model is constructed using CMG GEM, a compositional simulator to model the fluid dynamics of the oil reservoir and evaluate the performance of CO₂–WAG injections for enhanced oil recovery. The reservoir simulation workflow used in this study is shown in Figure 1, which outlines the stages in the model setup. The process begins with data preprocessing and defining the static reservoir model, including geological and petrophysical properties. This is followed by the dynamic model setup, where initial and boundary conditions, relative permeability functions, and fluid PVT models are defined. The simulation is then run over the specified production period, and the results are analyzed. Once the model performance is validated, the final simulation outputs are used to generate the datasets for the machine learning. The assumptions made in the modeling of the CO₂–WAG under miscible conditions in the simulator are (1) dissolved CO₂ may exist in the produced oil, and (2) the injected gas utilized was pure CO₂.

The reservoir model consists of 50 blocks in both the x and y directions and 6 blocks in the z-direction, giving a total of 15,000 grid cells. The parameters of the reservoir simulation model are created based on typical reservoir characteristics. Geological, geophysical, and reservoir data are integrated into the model, and the inverted five-spot (4 producers and 1 injector) well pattern used to evaluate the CO₂–WAG injection for oil recovery and carbon sequestration is depicted in Figure 2.

The CO₂–WAG process is implemented to limit the issue of low viscosity of gas injection when compared to oil [59], and is modeled as a cyclic alternation of water and CO₂ with the injection well configured in a central location to achieve an optimized sweep efficiency. The equation of state used in the model of the 3D reservoir model is Peng-Robinson, as it provides accurate phase behavior predictions for systems containing hydrocarbons and non-hydrocarbons such as CO₂. The injection period lasted for 10 years, and a multi-component fluid system was incorporated. The fluid system includes water, oil (hydrocarbons), and CO₂, which are the active fluid phases frequently found in oil reservoirs. Figure 3a,b shows the respective oil–water and gas–liquid relative permeability curves used in the simulation, which are essential for modeling multi-phase flow and are calibrated to reflect wettability and represent how fluid mobility is a function of water saturation.

The reservoir parameters of the simulation model are given in Table 1. The developed simulation model is used to generate the required data to build the machine learning models.

2.2. Machine Learning Method

To evaluate reservoir performance and optimize predictive accuracy, a comprehensive modeling workflow was developed combining data preprocessing, machine learning, and interpretability techniques. The workflow begins with isolating predictor variables from target outputs, namely Cumulative Oil Production (bbl), Oil Recovery Factor (%), and CO₂ Sequestration Volume. This separation ensures the models are trained exclusively on relevant inputs. The dataset, consisting of 2062 sample points generated from the reservoir simulations, is then divided into training and testing sets using an 80/20 train-test split, enabling robust assessment of model generalization on unseen data.

Three primary models are employed: Random Forest, XGBoost, and Artificial Neural Network (ANN). These models were selected for their proven capability to capture complex, non-linear relationships within high-dimensional datasets, making them particularly well-suited for reservoir performance prediction. To standardize the feature scales and streamline the training process, a scikit-learn pipeline is constructed. It integrates StandardScaler for normalization and each model in succession. A five-fold cross-validation (CV) strategy is applied to the training set to evaluate model stability, mitigate overfitting, and ensure reproducibility. The five-fold CV was applied with data shuffling and a fixed random seed (42). Each fold was fully independent, with the preprocessing pipeline (including scaling and encoding) applied within the cross-validation loop, ensuring that no information from the test folds leaked into training folds. This design ensures that preprocessing and model fitting occur inside each fold’s training phase. Model performance is measured using the R² score, MAE, and MSE metrics. After training, models are assessed on both the training and test datasets.
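The split, pipeline, and cross-validation setup described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the synthetic data, the Random Forest settings, and all variable names are assumptions, and only the 80/20 split, the StandardScaler step, the shuffled five-fold CV, and the seed of 42 come from the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the simulation dataset (assumed: 2062 rows, 10 features).
rng = np.random.default_rng(42)
X = rng.normal(size=(2062, 10))
y = 3.0 * X[:, 0] + np.sin(X[:, 1]) + 0.1 * rng.normal(size=2062)

# 80/20 train-test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler lives inside the pipeline, so every CV fold refits it
# on that fold's training portion only (no leakage from held-out data).
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestRegressor(n_estimators=50, random_state=42)),
])

cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_r2 = cross_val_score(pipe, X_train, y_train, cv=cv, scoring="r2")

# Final fit on the full training set, then evaluation on the held-out test set.
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
metrics = {
    "r2": r2_score(y_test, y_pred),
    "mae": mean_absolute_error(y_test, y_pred),
    "mse": mean_squared_error(y_test, y_pred),
}
print(cv_r2.mean(), metrics)
```

Swapping the `"model"` step for another regressor leaves the split and CV logic unchanged, which is the main reason to keep preprocessing inside the pipeline.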

To further enhance model transparency, SHAP (SHapley Additive exPlanations) analysis is performed. This interpretability step identifies the most influential features contributing to each target prediction, helping stakeholders understand the decision-making process behind the model, especially critical in the case of ensemble models like Random Forest and XGBoost (See Figure 4). The purpose of this work is to develop surrogate machine learning models for the prediction and evaluation of the optimization of CO₂–EOR and carbon sequestration in terms of the oil recovery factor, cumulative oil produced, and the CO₂ sequestration volume.
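To make the idea behind SHAP concrete, the sketch below computes exact Shapley values by brute force for a three-feature toy function. It does not use the `shap` library and is not the procedure used in this study: the toy model, the feature names, and the single-baseline treatment of "absent" features are all illustrative assumptions.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x relative to a baseline point.

    Features absent from a coalition take their baseline value.
    The cost grows as 2^n, so this is only practical for a few features.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight |S|! (n - |S| - 1)! / n! for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical toy response: an interaction term plus a linear pressure term.
def toy_model(z):
    porosity, injection_rate, pressure = z
    return 100.0 * porosity * injection_rate + 0.01 * pressure


x = [0.25, 2.0, 5000.0]
baseline = [0.20, 1.0, 4000.0]
phi = shapley_values(toy_model, x, baseline)
print(phi, sum(phi), toy_model(x) - toy_model(baseline))
```

The attributions sum to the difference between the prediction at `x` and at the baseline, which is the additivity property that gives SHAP its name.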

2.2.1. Artificial Neural Network

The ANN consists of three main components: the input layer, multiple hidden layers, and the output layer. The input layer accepts features describing the reservoir, such as porosity, permeability, and injection rates. These features, denoted as x, are standardized to have zero mean and unit variance to ensure numerical stability during training:

x_{\text{scaled}} = (x - μ) / σ

(1)

The input features are passed through multiple hidden layers, each consisting of neurons that apply a weighted transformation followed by a non-linear activation function. Mathematically, the output of a single neuron is expressed as:

h_{j} = φ (\sum_{i = 1}^{n} ω_{i, j} x_{i} + b_{j})

(2)

In this study, the Rectified Linear Unit (ReLU) activation function is used:

φ(x) = \max(0, x)

(3)

which introduces non-linearity while avoiding the vanishing gradient problem encountered in earlier activation functions such as sigmoid or tanh.

The final layer of the ANN produces predictions for the target variables. For a single target \bar{y}_{k}, the output is given by:

{\bar{y}}_{k} = \sum_{j = 1}^{m} ω_{k, j} h_{j} + b_{k}

(4)

The ANN is trained by minimizing the mean squared error (MSE), a loss function that quantifies the average squared difference between predicted and true values:

\text{MSE} = \frac{1}{N} \sum_{i = 1}^{N} {(y_{i} - {\bar{y}}_{i})}^{2}

(5)
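Equations (1) to (5) can be traced in a few lines of NumPy. The sketch below runs one forward pass through a single hidden layer with untrained random weights; the array shapes, the layer width of 16, and the random data are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 8 samples, 7 input features, 16 hidden neurons, 1 target.
X = rng.normal(loc=5.0, scale=2.0, size=(8, 7))
y = rng.normal(size=(8, 1))

# Eq. (1): standardize each feature to zero mean and unit variance.
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - mu) / sigma

# Randomly initialized weights and zero biases (untrained, illustration only).
W1, b1 = rng.normal(scale=0.1, size=(7, 16)), np.zeros(16)
W2, b2 = rng.normal(scale=0.1, size=(16, 1)), np.zeros(1)

# Eqs. (2)-(3): weighted sum followed by the ReLU activation.
H = np.maximum(0.0, X_scaled @ W1 + b1)

# Eq. (4): linear output layer.
y_hat = H @ W2 + b2

# Eq. (5): mean squared error between true and predicted values.
mse = float(np.mean((y - y_hat) ** 2))
print(mse)
```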

The Adam optimizer is used for training. It is an adaptive gradient descent algorithm that combines momentum and adaptive learning rates for efficient convergence [60]. The weight updates build on the basic gradient-descent step:

ω_{t + 1} = ω_{t} - η \frac{\partial \text{Loss}}{\partial ω_{t}}

(6)
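Equation (6) is written as a plain gradient step. Adam replaces the raw gradient in that step with bias-corrected running averages of the gradient and of its square. The sketch below applies this to a one-dimensional quadratic; the values β₁ = 0.9, β₂ = 0.999, and ε = 10⁻⁸ are commonly used defaults assumed here, not settings reported in this study, and the toy objective is hypothetical.

```python
import math


def adam_minimize(grad, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimize a scalar function with Adam, given its gradient."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g        # running mean of gradients (momentum)
        v = beta2 * v + (1.0 - beta2) * g * g    # running mean of squared gradients
        m_hat = m / (1.0 - beta1 ** t)           # bias correction for the zero start
        v_hat = v / (1.0 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


# Toy loss (w - 3)^2 with gradient 2 (w - 3); the minimum is at w = 3.
w_opt = adam_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_opt)
```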

The performance of the ANN model is highly sensitive to hyperparameter choices, such as the number of neurons, the number of hidden layers, and the learning rate. To identify the optimal configuration, Keras Tuner was employed with a random search strategy. This approach explores a predefined hyperparameter space, selecting the best configuration based on validation loss. For this study, the search space included: neurons per layer: 32 to 128, number of hidden layers: 1 to 4, and the learning rate: 10⁻⁴ to 10⁻² on a logarithmic scale. The overall workflow of the ANN algorithm is described in Algorithm 1, and its architecture is shown in Figure 5 below.

Algorithm 1 Neural Network Prediction Workflow

[“Factor (%)”, “Cumulative CO₂ Stored (SCF)”]
Output: Trained models, Metrics (R2), Plots
1. Split data into training (80%) and test (20%) sets
2. Standardize input features X using StandardScaler
3. Initialize empty lists for models, scalers, results, and metrics
4. For each target variable (i = 0 to 2):
        a. Standardize the target variable using StandardScaler
        b. Define model-building function with Keras Tuner:
               i. Create Sequential model with:
                       - Input layer: Dense (units = 32–128, relu, input_dim = 7)
                       - 1–4 hidden layers: Dense (units = 32–128, relu)
                       - Output layer: Dense (1)
               ii. Compile with Adam optimizer (learning_rate = 1 × 10⁻⁴ to 1 × 10⁻²) and MSE loss
        c. Initialize RandomSearch tuner with max_trials = 5, executions_per_trial = 2
        d. Search for the best hyperparameters using X_train and y_train_scaled
        e. Retrieve the best hyperparameters and rebuild the model
        f. Train the best model with 200 epochs, 30% validation split, batch_size = 32
        g. Store the trained model
        h. Predict on X_test and inverse transform predictions and true values
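The random search in steps 4b to 4e can be illustrated without Keras Tuner. The sketch below only shows how configurations might be drawn from the stated search space and ranked; the stand-in objective is hypothetical and replaces the validation loss that a real run would obtain by training each candidate network.

```python
import math
import random


def sample_config(rng):
    """Draw one configuration from the search space given in the text."""
    n_hidden = rng.randint(1, 4)                          # 1 to 4 hidden layers
    units = [rng.randint(32, 128) for _ in range(n_hidden)]
    # Log-uniform draw: uniform in log10 space between 1e-4 and 1e-2.
    learning_rate = 10.0 ** rng.uniform(-4.0, -2.0)
    return {"n_hidden": n_hidden, "units": units, "learning_rate": learning_rate}


def random_search(objective, n_trials, seed=42):
    """Evaluate n_trials random configurations and return the best (loss, config)."""
    rng = random.Random(seed)
    trials = [sample_config(rng) for _ in range(n_trials)]
    scored = [(objective(cfg), cfg) for cfg in trials]
    return min(scored, key=lambda pair: pair[0])


# Hypothetical stand-in for a validation loss; a real run would train a network.
def fake_validation_loss(cfg):
    return abs(math.log10(cfg["learning_rate"]) + 3.0) + 0.001 * sum(cfg["units"])


best_loss, best_cfg = random_search(fake_validation_loss, n_trials=5)
print(best_loss, best_cfg)
```

Sampling the learning rate in log space gives each decade between 10⁻⁴ and 10⁻² equal probability, which is what a logarithmic scale means in this context.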

2.2.2. Random Forest

Random Forest is a bagging-based ensemble algorithm that constructs a multitude of decision trees during training and outputs the mean prediction across the ensemble. Each tree is built using a random subset of the data (bootstrapped sampling) and a random subset of features, promoting diversity among the trees and reducing overfitting. This randomness, combined with aggregation, improves the model’s generalization capability. In this study, Random Forest was used for its ability to capture complex, nonlinear interactions between input features and target variables while remaining interpretable through feature importance metrics. This makes it suitable for predicting Cumulative Oil Production (bbl), Oil Recovery Factor (%), and CO₂ Sequestration Volume, because studies have shown that these targets exhibit nonlinear relationships with reservoir and wellbore parameters. The architecture of the Random Forest algorithm is shown in Figure 6, and the overall workflow of the algorithm is described in Algorithm 2 below.

Algorithm 2 Random Forest Regressor

Input: Features X, Target y, Number of trees n_estimators, Max depth, Random state
Output: Trained model, Predictions

1. Initialize ensemble of n_estimators decision trees
2. Standardize features X using StandardScaler
3. For each tree in ensemble:
        a. Sample random subset of data (bootstrap sampling)
        b. Select random subset of features at each split
        c. Build decision tree:
                i. For each node:
                        - Choose the best feature and split point to minimize MSE
                        - Split data into left and right child nodes
                ii. Continue until max depth or minimum samples are reached
        d. Store tree in ensemble
4. Return ensemble of trees
5. For new data X_test:
        a. Standardize X_test
        b. For each tree:
                i. Predict y_tree by traversing tree based on feature values
        c. Compute final prediction: y_test_pred = mean (y_tree for all trees)
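Step 5c of Algorithm 2 can be checked directly with scikit-learn: the forest prediction is the mean of the individual tree predictions. The data, hyperparameters, and names below are illustrative assumptions. Standardization is left out of the sketch because tree splits depend only on the ordering of feature values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic nonlinear data (assumption, not the simulation dataset).
rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 4))
y = X[:, 0] ** 2 + np.sin(3.0 * X[:, 1]) + 0.05 * rng.normal(size=300)

# Bootstrap sampling is on by default; max_features=0.5 draws a random
# half of the features at each split.
forest = RandomForestRegressor(
    n_estimators=50, max_depth=6, max_features=0.5, random_state=0
).fit(X, y)

X_new = rng.uniform(size=(5, 4))

# One row of predictions per tree, then the ensemble mean.
per_tree = np.stack([tree.predict(X_new) for tree in forest.estimators_])
manual_mean = per_tree.mean(axis=0)
print(np.allclose(manual_mean, forest.predict(X_new)))
```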

2.2.3. XGBoost

XGBoost, or Extreme Gradient Boosting, is a boosting-based ensemble algorithm that builds decision trees sequentially, where each tree attempts to correct the residual errors of its predecessor. It incorporates several advanced regularization techniques (such as L1 and L2 penalties) and efficient tree pruning strategies that enhance both predictive accuracy and computational performance. Unlike Random Forest, which grows trees independently, XGBoost optimizes a loss function using gradient descent in a stage-wise fashion. Its robustness against overfitting and its ability to handle missing values and outliers make it particularly well-suited for structured data analysis. In this work, XGBoost was selected for its high predictive performance and its flexibility in modeling complex functional relationships. The architecture of the XGBoost algorithm is shown in Figure 7, and the overall workflow of the algorithm is described in Algorithm 3 below.

Algorithm 3 XGBoost Regressor

Input: Features X, Target y, Number of estimators n_estimators, Learning rate, Max depth, Random state, Regularization parameters
Output: Trained model, Predictions

1. Initialize model with constant prediction: y_pred = 0
2. Standardize features X using StandardScaler
3. For each estimator (weak learner, decision tree):
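        a. Compute gradients (first derivative of loss with respect to y_pred, e.g., for the squared-error loss ½(y_pred − y)²: g = y_pred − y)
        b. Compute hessians (second derivative of loss, e.g., for the same loss: h = 1)
        c. Build decision tree on the gradient statistics (leaf value = −(sum of gradients)/(sum of hessians + lambda), i.e., a regularized fit to the residuals y − y_pred):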
                i. For each node:
                        - Choose best feature and split point to maximize gain:
                            Gain = (sum of gradients in left child)²/(sum of hessians in left child + lambda)
                                      + (sum of gradients in right child)²/(sum of hessians in right child + lambda)
                                      − (sum of gradients in parent)²/(sum of hessians in parent + lambda)
                        - Apply L1 (alpha) and L2 (lambda) regularization
                ii. Continue until max depth or minimum samples reached
        d. Update predictions: y_pred = y_pred + learning_rate × tree_prediction
4. Return ensemble of trees
5. For new data X_test:
        a. Standardize X_test
        b. Initialize y_test_pred = 0
        c. For each tree:
                i. Predict contribution
                ii. Update y_test_pred = y_test_pred + learning_rate × tree_prediction
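The boosting loop of Algorithm 3 can be sketched from scratch with depth-one trees (stumps). This is a simplified illustration, not the XGBoost library: it uses only the L2 term (lambda), omits L1 (alpha) and pruning, and takes the gradient with respect to the prediction, so each leaf value is −G/(H + lambda), a step against the gradient. The data and all function names are assumptions.

```python
import numpy as np


def fit_stump(X, g, h, lam):
    """Find the single split that maximizes the gain formula of step 3c."""
    G, H = g.sum(), h.sum()
    parent_score = G ** 2 / (H + lam)
    best = None
    for j in range(X.shape[1]):
        # Dropping the largest value keeps the right child non-empty.
        for thr in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= thr
            GL, HL = g[left].sum(), h[left].sum()
            GR, HR = G - GL, H - HL
            gain = GL ** 2 / (HL + lam) + GR ** 2 / (HR + lam) - parent_score
            if best is None or gain > best[0]:
                best = (gain, j, thr, -GL / (HL + lam), -GR / (HR + lam))
    return best[1:]


def predict_stump(stump, X):
    j, thr, w_left, w_right = stump
    return np.where(X[:, j] <= thr, w_left, w_right)


def boost(X, y, n_estimators=50, learning_rate=0.3, lam=1.0):
    y_pred = np.zeros_like(y)
    stumps = []
    for _ in range(n_estimators):
        g = y_pred - y          # gradient of 0.5 * (y_pred - y)^2 w.r.t. y_pred
        h = np.ones_like(y)     # hessian of the same loss
        stump = fit_stump(X, g, h, lam)
        y_pred = y_pred + learning_rate * predict_stump(stump, X)
        stumps.append(stump)
    return stumps, y_pred


rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = 2.0 * X[:, 0] + np.sin(6.0 * X[:, 1])
stumps, y_fit = boost(X, y)
mse_start = float(np.mean(y ** 2))           # error of the constant prediction 0
mse_end = float(np.mean((y - y_fit) ** 2))   # training error after boosting
print(mse_start, mse_end)
```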

3. Results and Discussions

3.1. Baseline Reservoir Simulation Model Results

The reservoir simulation model of the CO₂–WAG injection process was run with the reservoir and geological parameters shown in Table 1. The simulation study was carried out to obtain the input and output datasets for training and testing the machine learning models, and it also yielded valuable insights into the CO₂–EOR potential of the reservoir. Four producers and one injector, which was a WAG well, were placed in a 5-spot pattern. The reservoir model showed significant heterogeneity in both permeability and porosity. The permeability varied in all directions, with the horizontal permeability higher than the vertical permeability, while the porosity ranged from 19 to 29%. A 34-year forecasting period from 1990 to 2024 was simulated. The WAG well was defined in January 2000, 10 years into the forecasting period, and injection continued until January 2010 at 5000 psi with a maximum rate of 30 MMSCF/day. The CO₂ injection is alternated with water injection, and the water acts to sweep the oil towards the production zone. At the end of the forecasting period, the cumulative oil produced was 49.8 MMbbl and the oil recovery factor reached 85%. Furthermore, 93 MMSCF of CO₂ was injected and 66 MMSCF was produced, leaving a cumulative 27 MMSCF stored. The results obtained from the baseline reservoir model simulation are summarized in Table 2, and Figure 8 shows the forecasted cumulative oil production from the field history.
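As a quick consistency check on the CO₂ volumes quoted above, the stored volume follows from a simple balance on injected and produced gas. The retained fraction is not reported in the text and is derived here only for illustration:

```latex
V_{\text{stored}} = V_{\text{injected}} - V_{\text{produced}} = 93 - 66 = 27\ \text{MMSCF},
\qquad
\frac{V_{\text{stored}}}{V_{\text{injected}}} = \frac{27}{93} \approx 0.29
```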

3.2. Exploratory Data Analysis (EDA)

Exploratory data analysis (EDA) was conducted to gain preliminary insights into the structure, distribution, and relationships among the reservoir and production variables. One of the central components of this analysis was the construction of a correlation heatmap (Figure 9), which quantifies the pairwise linear relationships between the variables using Pearson correlation coefficients. This heatmap helps identify redundancies, reveal underlying patterns, and guide informed feature selection before modeling.

Figure 9 presents the Pearson correlation heatmap of the generated dataset. A Pearson correlation heatmap is a graphical representation that displays the strength and direction of linear relationships between multiple variables, using color gradients to illustrate correlation coefficients. This visualization enables a comprehensive assessment of the linear relationships between key reservoir, injection, and production variables. The majority of the cumulative variables exhibit very strong positive correlations, particularly among cumulative gas production, cumulative gas injection, and cumulative water injection, with coefficients approaching unity. Similarly, cumulative oil production and the oil recovery factor display near-perfect correlation, suggesting that these variables carry largely overlapping information. This level of multicollinearity has important implications for model development, as it may lead to overfitting or instability in regression-based or interpretable models if not addressed. Consequently, such features should either be consolidated or subjected to dimensionality reduction techniques such as principal component analysis.

A notable pattern is the moderately strong negative correlation observed between the water injection rate and gas injection rate (r ≈ −0.41), which may reflect operational strategies where increased injection of one fluid is accompanied by a reduction in the other. Additionally, while the instantaneous oil production rate shows only modest correlation with cumulative metrics (e.g., gas production and gas-oil ratio), it maintains meaningful associations with several injection variables. This suggests that while the oil rate is not strongly governed by cumulative values alone, it may still be sensitive to ongoing injection activities. The CO₂ sequestration volume shows consistently high correlations (r > 0.90) with cumulative gas and water injection, cumulative oil production, and the oil recovery factor. This reinforces its relevance as a proxy or outcome measure in enhanced oil recovery studies where CO₂ injection is employed. Note also that no correlation values appear for the average reservoir temperature. The temperature is constant across samples, so its variance is zero and the Pearson coefficient is undefined. Overall, the correlation heatmap reveals several strongly correlated feature pairs. As a result, the models were built in two ways: once using all the features, and once after dropping features with a correlation coefficient above 0.95. This approach allowed assessment of the effect of including and excluding the most correlated features.
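The correlation-based filtering described above can be sketched with NumPy. The column names, the synthetic data, and the rule of dropping the later feature of each highly correlated pair are assumptions; the text only states the 0.95 threshold. The constant temperature column shows why its correlations come out undefined.

```python
import numpy as np


def drop_highly_correlated(X, names, threshold=0.95):
    """Keep features in order, dropping any with |r| > threshold to a kept one."""
    # A constant column has zero variance, so its correlations are 0/0 = NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        corr = np.corrcoef(X, rowvar=False)
    kept_idx, dropped = [], []
    for j in range(len(names)):
        # Comparisons with NaN are False, so constant columns are never flagged.
        if any(abs(corr[j, k]) > threshold for k in kept_idx):
            dropped.append(names[j])
        else:
            kept_idx.append(j)
    return [names[k] for k in kept_idx], dropped, corr


rng = np.random.default_rng(0)
n = 500
gas_inj = rng.uniform(0, 100, n)
water_inj = 0.8 * gas_inj + rng.normal(0, 1.0, n)   # nearly collinear with gas_inj
pressure = rng.normal(4000, 200, n)                 # independent of the others
temperature = np.full(n, 180.0)                     # constant across samples

X = np.column_stack([gas_inj, water_inj, pressure, temperature])
names = ["cum_gas_inj", "cum_water_inj", "avg_pressure", "avg_temperature"]
kept, dropped, corr = drop_highly_correlated(X, names)
print(kept, dropped)
```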

3.3. Results from Machine Learning Models

The input parameters for the machine learning models are average reservoir pressure, cumulative gas (CO₂) injection, cumulative gas (CO₂) production, cumulative water production, cumulative water injection, gas injection rate, water injection rate, average temperature, gas–oil ratio, and water cut. These parameters are used to generate models that predict the oil recovery factor, cumulative oil produced, and cumulative CO₂ stored, based on the simulation data generated. It was observed that models built using all features and those with redundant features removed show no significant difference in performance. The focus of this study is to gain deeper insights from feature analysis; therefore, the models presented utilize all available features.

The parity plots (Figure 10, Figure 11 and Figure 12) show that all the algorithms performed well on both the training and test sets for the three targets. Predictions align very closely with actual values (points fall along the diagonal), indicating excellent predictive accuracy and minimal overfitting. The models generalize well across all targets. This shows that comparatively simple models that require less data, like Random Forest and XGBoost, are effective in capturing the dynamics of the system.

Figure 13a–c provides insight into the stability, efficiency, and generalization capability of the ANN model. Across all three targets, the convergence patterns indicate successful learning of the underlying data relationships, with minimal overfitting and efficient training dynamics. Figure 13a, representing cumulative oil production, shows a steep initial decline in loss during the first 10 epochs, followed by smooth stabilization after approximately 50 epochs. The near-zero final losses indicate efficient learning and highlight the model’s ability to capture the non-linear interplay between reservoir parameters and production outcomes. Similarly, Figure 13b, for the oil recovery factor, shows almost identical loss behavior for the training and validation datasets. This close alignment across epochs reflects strong generalization to unseen data and avoidance of overfitting, a critical requirement for predictive reliability in operational contexts. Figure 13c illustrates the model’s capacity to handle the large-scale CO₂ sequestration volume target. Despite the large scale of the target, the losses converge smoothly and stabilize, highlighting the robustness of the model in handling high-dimensional data. The small gap between training and validation losses underscores the robustness of the training process, with effective regularization preventing overfitting. The rapid reduction in loss during early epochs, particularly in Figure 13a,b, suggests efficient learning of dominant patterns in the data, driven by optimized hyperparameters such as learning rate and batch size. In Figure 13c, the consistent alignment of the loss curves further supports the suitability of the ANN architecture for modeling complex, large-scale storage outcomes. These curves collectively emphasize the stability of the training process and the model’s adaptability to diverse scales and complexities, supporting reliable performance across all reservoir and CO₂ storage predictions.

Performance of Models

Table 3 summarizes the performance metrics of the models in terms of R², MSE, and MAE for both the training and testing datasets. The models achieved exceptional accuracy across all target variables. The R² values, which quantify the proportion of variance explained by the models, approached 1.0 for all targets, indicating that the models captured nearly all the variability in the data. The performances of all the models are very close despite their different levels of complexity.

For cumulative oil production, all models showed excellent performance on both training and test datasets. XGBoost and Random Forest achieved the highest test accuracy, each with an R² of 0.9999, confirming their strong predictive capability. Random Forest obtained the lowest test MAE of 1.82 × 10⁴ bbl, indicating the closest match to the true values, while ANN performed weaker, with a test R² of 0.9996 and a higher test MAE of 2.91 × 10⁵ bbl. In predicting the oil recovery factor, both Random Forest and XGBoost again exhibited strong and stable performance, each maintaining a test R² of 0.9999. Random Forest recorded the lowest test MAE of 0.0317%, demonstrating high precision in capturing variations in recovery efficiency. ANN, while still accurate with an R² of 0.9997, showed comparatively larger test errors (test MAE of 0.2149%), suggesting a less precise representation of the recovery trend. For CO₂ sequestration volume, XGBoost and Random Forest generalized comparably, each achieving a test R² of 0.9998, with test MAEs of 8.41 × 10⁷ SCF and 6.40 × 10⁷ SCF, respectively. ANN showed noticeably higher test errors, with a test MAE of 1.65 × 10⁸ SCF. This indicates that XGBoost and Random Forest are more robust in modeling the complex nonlinear relationships governing CO₂ storage capacity.
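The three metrics reported in Table 3 follow their standard definitions and can be computed as in the short sketch below. The sample values are made up for illustration and are not taken from the simulation dataset.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R2, MSE, and MAE for one target variable."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    mse = np.mean(residual ** 2)            # units: target units squared
    mae = np.mean(np.abs(residual))         # units: target units
    ss_res = np.sum(residual ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1.0 - ss_res / ss_tot              # dimensionless
    return {"R2": r2, "MSE": mse, "MAE": mae}

# Hypothetical true and predicted values for a single target.
metrics = regression_metrics([10.0, 20.0, 30.0, 40.0], [12.0, 18.0, 33.0, 39.0])
print(metrics)
```

Because MSE and MAE carry the units of the target, their magnitudes are not comparable between targets: an MAE of order 10⁷ SCF for CO₂ sequestration volume can coexist with an R² close to 1 when the target itself spans a much larger range. R² is dimensionless and is therefore the metric that can be compared across the three targets.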

To understand the influence of various reservoir and operational parameters on key performance indicators, SHAP analysis was employed. SHAP is a model-agnostic interpretability technique grounded in cooperative game theory. It assigns each feature an importance value for a particular prediction, providing transparent insights into how each input feature contributes to the model’s output. The SHAP summary plots presented in Figure 14, based on the XGBoost model, reveal the relative importance and directional influence of input variables on three critical targets: cumulative oil production, oil recovery factor, and CO₂ sequestration volume. The SHAP analysis of the Random Forest model is not presented because its results are nearly identical to those of XGBoost. This is expected, because both techniques rely on decision trees, which split the data based on the features that best reduce error or impurity. For cumulative oil production, the SHAP values indicate that cumulative water production, cumulative gas production, and gas-oil ratio are the most influential predictors.

Notably, cumulative water injection and cumulative gas injection also exhibit strong impacts. High values of cumulative water and gas production tend to contribute positively to cumulative oil output, suggesting that these parameters serve as proxies for overall production maturity and development extent. A high gas-oil ratio also shows a positive relationship with oil production in many cases, possibly reflecting effective pressure maintenance or improved recovery mechanisms in gas-cap drive systems. Conversely, lower values of the water injection rate are associated with decreased cumulative oil production, highlighting the importance of adequate reservoir support during secondary recovery.

When examining the oil recovery factor, a similar trend is observed. Features such as cumulative water and gas production, along with cumulative water and gas injection volumes, remain dominant contributors. Interestingly, water cut exhibits a negatively skewed distribution in terms of SHAP values. Higher water cut values tend to decrease the predicted recovery factor, aligning with known reservoir behavior where excessive water production may indicate coning or breakthrough, leading to diminished oil productivity. The oil rate and gas-oil ratio also influence the recovery factor, although their effects appear more complex and nonlinear, as indicated by the broader SHAP value distributions.

The SHAP analysis for CO₂ sequestration volume demonstrates a different but intuitively consistent feature importance profile. Here, cumulative water injection, cumulative gas injection, and their respective rates dominate the influence landscape. High injection volumes and rates correspond to positive SHAP values, reflecting their direct physical relationship to storage capacity in the reservoir. In other words, the more fluid injected, particularly under conditions conducive to miscibility, the greater the predicted volume of CO₂ sequestration. Average reservoir pressure and temperature, though less dominant, still play contributory roles in shaping sequestration outcomes, likely due to their effects on phase behavior and injectivity. Across all three target variables, cumulative water and gas production emerge as consistently influential features. This underscores the integral role of these parameters in capturing the temporal evolution and operational history of the reservoir. Moreover, the SHAP plots reveal not just the importance of individual features but also how their magnitude (high or low) affects the model’s prediction, offering deeper insight into reservoir dynamics beyond conventional feature ranking. The SHAP analysis of the ANN model ranks some features differently but shows a high level of agreement with the Random Forest and XGBoost results, as seen in Figure 15.

Figure 15 shows that gas and water injection strongly impact CO₂ storage, while oil rate and gas–oil ratio significantly influence recovery, similarly to the other algorithms, but with a different ordering of the remaining features across targets. This shows that different machine learning algorithms see the physical processes in the data in different ways, depending on their framework and architecture.
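The game-theoretic attribution behind the SHAP analysis in this section can be illustrated with an exact Shapley computation on a small example. The sketch below is an assumption-laden toy: the function `toy_model`, its three inputs, and the all-zero baseline are hypothetical and do not represent the surrogate models or the estimators used in the study. Practical SHAP implementations rely on faster model-specific or sampling-based estimators, since the exact sum grows as 2ⁿ with the number of features.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Features outside a coalition are replaced by their baseline value.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size that excludes feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def toy_model(z):
    # Hypothetical response: one additive term plus one interaction term.
    return 3.0 * z[0] + 2.0 * z[1] * z[2]

x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
print(sum(phi), toy_model(x) - toy_model(baseline))
```

The additive feature receives its full contribution, while the two interacting features split their joint effect equally. The attributions sum to the difference between the prediction and the baseline output, which is the property that lets SHAP summary plots be read as signed contributions to each prediction.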

4. Conclusions

This study integrates numerical reservoir simulation with machine learning algorithms to develop a robust workflow for predicting cumulative oil production, oil recovery factor, and CO₂ sequestration volumes in CO₂–EOR reservoir modeling. The simulation outputs were used to construct the training and testing datasets, serving as inputs and target variables for the machine learning models, including Random Forest, XGBoost, and Artificial Neural Network. The proposed models were trained and validated on datasets representing a range of reservoir and operational conditions.

All models demonstrated exceptional predictive performance, with R² values above 0.999 across both training and testing datasets for all target variables, indicating their strong capability to capture complex relationships among geological and reservoir parameters. Specifically, XGBoost showed the highest overall performance, particularly in predicting CO₂ sequestration volumes, achieving a test R² of 0.9998. Random Forest also performed exceptionally well, slightly surpassing XGBoost and ANN in the cumulative oil production and recovery factor predictions. The analysis of the training and validation loss curves further confirmed the ANN model’s stability, efficiency, and generalization capacity. The close alignment of the curves throughout training reflects minimal overfitting, while the rapid convergence during early epochs highlights the effectiveness of the selected hyperparameters and network architecture. Although the models in this study accurately predicted the three target variables, it is important to acknowledge that potential biases, measurement noise, and unmodeled geological complexities may arise when applying these models to field-scale data. Therefore, future research should focus on validating the models with field data, incorporating a broader range of geological and operational parameters, and exploring additional reservoir and injection scenarios to enhance the models’ generalizability and practical applicability.

Author Contributions

Project Administration, B.M.; Supervision, B.M.; Conceptualization, B.M., M.E. and A.O.; Formal Analysis, M.E., E.O., A.O. and O.A.; Resources, B.M. and O.O.; Methodology, M.E., E.O., A.O. and O.A.; Visualization, M.E., E.O., A.O., O.O. and O.A.; Software, M.E., A.O., E.O. and O.A.; Writing—original draft preparation, M.E. and E.O.; Writing—review & editing, B.M., M.E., O.O., A.O., E.O. and O.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original data presented in this study can be found on the GitHub page which can be accessed here: https://github.com/theOsaroJ/MLFA-CMG, accessed on 24 October 2025. Further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Abbreviations

The following abbreviations are used in this manuscript:

bbl	barrel
$b_{j}$	bias of the input $j$ neuron
$b_{k}$	bias of the output $k$ neuron
$h_{j}$	output of the $j$ neuron
$m$	number of neurons in the last hidden layer
MAE	mean absolute error
MSE	mean squared error
MMbbl	million barrels
MMSCF	million standard cubic feet
$N$	number of samples
$R^{2}$	coefficient of determination
SCF	standard cubic foot
$ω_{i, j}$	weight connecting the $i$ input to the $j$ neuron
$ω_{k, j}$	weight for the output neuron
$ω_{t}$	weight at iteration $t$
$X$	input layer features
$x$	an input feature
$y_{i}$	true value for the $i$ sample
${\bar{y}}_{i}$	predicted value for the $i$ sample
Greek Letters
$(\partial \mathrm{Loss} / \partial ω_{t})$	gradient of the loss function
$φ$	activation function
$η$	learning rate
$μ$	mean of input
$σ$	standard deviation of input

Figure 1. Flowchart for the process of reservoir simulation using CMG GEM.

Figure 2. The 3D reservoir model depicting the five-spot well pattern.

Figure 3. (a) Oil-water relative permeability curve and (b) gas-liquid relative permeability curve.

Figure 4. Model development workflow diagram.

Figure 5. ANN architecture for Cumulative Oil Production (bbl), Oil Recovery Factor (%), and CO₂ Sequestration Volume.

Figure 6. Random Forest algorithm architectural design.

Figure 7. XGBoost algorithm architectural design.

Figure 8. Cumulative oil production versus time for the baseline reservoir case, showing both the field history and prediction period.

Figure 9. Correlation heatmap of features and targets.

Figure 10. XGBoost model performance parity plot.

Figure 11. Random Forest model performance parity plot.

Figure 12. ANN model performance parity plot.

Figure 13. ANN Training and validation loss for (a) cumulative oil production, (b) oil recovery factor, and (c) CO₂ sequestration volume.

Figure 14. SHAP analysis for XGBoost model.

Figure 15. SHAP analysis for ANN model.

Table 1. Reservoir simulation model parameters.

Properties	Values
Porosity (%)	25
Reservoir Temperature (°F)	186
Oil Specific Gravity	0.845
Oil Viscosity (cp)	0.74
Initial Oil Formation Volume Factor, Boi (rb/STB)	1.168
Initial Pressure (psia)	2090
Bubble-point pressure (psia)	1493
Water viscosity (cp)	0.45
°API	36

Table 2. Summary of the result of the baseline reservoir model.

Results	Values
Cumulative Oil Production in 1990 (MMbbl)	15.51
Oil Recovery in 1990 (%)	27.00
Cumulative Oil Production in 2024 (MMbbl)	49.80
Oil Recovery in 2024 (%)	85.00
Cumulative CO₂ Injection (MMSCF)	93.10
Cumulative CO₂ Production (MMSCF)	66.09
Cumulative CO₂ Stored (MMSCF)	27.01

Table 3. Comparison of model performance metrics and target means.

Target	Model	Train R²	Train MSE	Train MAE	Test R²	Test MSE	Test MAE
Cumulative oil production (bbl)	RF	0.9999	1.2390 × 10⁸	6.5036 × 10³	0.9999	8.5501 × 10⁸	1.8217 × 10⁴
Cumulative oil production (bbl)	ANN	0.9999	3.8700 × 10⁹	3.8700 × 10⁴	0.9996	1.3702 × 10¹¹	2.9078 × 10⁵
Cumulative oil production (bbl)	XGBoost	0.9999	8.3866 × 10⁸	1.9076 × 10⁴	0.9999	6.6720 × 10⁹	5.4175 × 10⁴
Oil recovery factor (%)	RF	0.9999	0.0004	0.0115	0.9999	0.0026	0.0317
Oil recovery factor (%)	ANN	0.9997	0.0040	0.0400	0.9997	0.0902	0.2149
Oil recovery factor (%)	XGBoost	0.9999	0.0025	0.0331	0.9999	0.0209	0.0919
CO₂ sequestration volume (SCF)	RF	0.9997	4.9256 × 10¹⁵	2.6539 × 10⁷	0.9998	1.600 × 10¹⁶	6.4029 × 10⁷
CO₂ sequestration volume (SCF)	ANN	0.9992	7.7241 × 10¹⁷	2.450 × 10⁷	0.9992	1.0042 × 10¹⁷	1.6493 × 10⁸
CO₂ sequestration volume (SCF)	XGBoost	0.9994	7.7241 × 10¹⁴	1.5528 × 10⁷	0.9998	2.490 × 10¹⁶	8.4057 × 10⁷

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Mepaiyeda, B.; Ezeh, M.; Olafadehan, O.; Oladipupo, A.; Adebayo, O.; Osaro, E. Integration of Machine Learning and Feature Analysis for the Optimization of Enhanced Oil Recovery and Carbon Sequestration in Oil Reservoirs. ChemEngineering 2026, 10, 1. https://doi.org/10.3390/chemengineering10010001
