Insights into the Application of Machine Learning in Reservoir Engineering: Current Developments and Future Trends

Abstract: In the past few decades, the machine learning (or data-driven) approach has been broadly adopted as an alternative route to scientific discovery, creating many opportunities and challenges. In the oil and gas sector, subsurface reservoirs are heterogeneous porous media involving a large number of complex phenomena, which makes their characterization and dynamic prediction a real challenge. This study provides a comprehensive overview of recent research that has employed machine learning in three key areas: reservoir characterization, production forecasting, and well test interpretation. The results show that machine learning can automate and accelerate many reservoir engineering tasks with an acceptable level of accuracy, leading to more efficient and cost-effective decisions. Although machine learning has produced promising results at this stage, several crucial challenges still need to be addressed, such as data quality and data scarcity, the absence of physical constraints in machine learning algorithms, and the joint modelling of multiple data sources and formats. The significance of this research is that it demonstrates the potential of machine learning to revolutionize the oil and gas sector by providing more accurate and efficient solutions to challenging problems.


Introduction
Reservoir engineering is an interdisciplinary field that integrates mechanics, geology, physics, mathematics and computing as research tools to economically recover hydrocarbon resources from underground formations. Modelling and estimating such multiphysics, multiscale systems with conventional analytical or numerical simulation tools inevitably encounters serious challenges and introduces several sources of uncertainty [1]. In addition, investigating reservoir behavior usually involves intractable inverse problems, which typically admit multiple solutions and require complex theories and sophisticated algorithms. In recent decades, a great number of sensor-based tools have been adopted in the field to automatically collect substantial amounts of data every day [2]. Fully exploiting the information underlying such valuable data to resolve these bottlenecks is where machine learning (ML) comes into play. ML enables a system to automatically learn and improve from prior information without being explicitly programmed [3]; it is widely used to extract highly nonlinear multi-factor interactions between numerous inputs and outputs (i.e., regression), as well as for pattern recognition, computer vision (CV), natural language processing (NLP) and discovering governing partial differential equations. In fact, state-of-the-art ML algorithms have even surpassed human-level performance in some specific tasks, such as AlphaGo in the game of Go [4] and residual networks (ResNets) in image recognition [5]. With improvements in computing architecture, ML can also efficiently extract underlying information from the exploding volume of real-world data collected in petroleum industry applications. As a powerful data-driven technique, ML is being widely adopted to assist and improve our understanding of drilling [6], production and reservoir behavior [7]. In particular, ML has been most widely used in reservoir engineering and has achieved excellent results, such as the
prediction of permeability, porosity and tortuosity [8], prediction of shale gas production [9], reservoir characterization [10], 3D digital core reconstruction [11], well test interpretation [12], rapid production optimization of shale gas [13,14], well-log processing [15] and history matching [16]. In this work, a systematic review of ML methods applied to the different reservoir fields, with an emphasis on production forecasting, well test analysis and reservoir characterization, is presented by reviewing the relevant literature published in recent years. In addition, despite the tremendous success of ML algorithms in a variety of reservoir applications that are computationally prohibitive or cannot be well modelled by physical principles, many challenges and opportunities remain for the future, and these are also outlined in this work.

Application Status of ML in Reservoir Engineering
Based on the available data, ML algorithms can be categorized into supervised, unsupervised, semi-supervised and reinforcement learning [17,18]. In reinforcement learning, algorithms learn by interacting with the environment and receiving feedback in the form of rewards or punishments. Unsupervised learning is applied to discover underlying patterns and obtain a more abstract representation of unlabeled data. Semi-supervised learning is employed when obtaining labeled data is expensive or time-consuming but large amounts of unlabeled data are available; the algorithm utilizes the labeled data to extract information and then applies this knowledge to label the unlabeled data. By far, supervised machine learning is the most prevalent in reservoir engineering applications, where trainable parameters are progressively updated for regression or classification purposes based on input-output pairs. ML can automate and accelerate many reservoir engineering tasks, resulting in more efficient and cost-effective operations, as summarized in Table 1, and it has been found to achieve a high level of accuracy and to improve predictions in areas where traditional techniques have failed. This paper mainly reviews the current application status of supervised ML in three areas: production prediction, well test analysis and reservoir characterization.

Production Prediction
Production forecasting plays an essential role in optimizing well construction strategies, such as drilling, stimulation and enhanced hydrocarbon recovery processes [34].
A great number of investigations have employed ML to predict the cumulative production of oil and gas. Kong et al. [21] used the extreme gradient boosting regressor (XGBoost) as the base model and then applied a linear regressor as the meta-model to ensemble multiple base models. A minor improvement (R² improved from 0.79 to 0.80) was observed in predicting cumulative production, but the bias of the stacked model was mitigated compared to that of the individual models. Wang et al. [19] trained and compared ML algorithms, including linear regression (LR), artificial neural networks (ANN), gradient-boosted decision trees (GBDT) and random forests (RF), to predict the first-year barrels of oil equivalent (BOE) of infill wells. Using the coefficient of determination (R²) and mean absolute error (MAE) as the performance metrics, the ensemble methods (RF and GBDT) provided superior predictions to the other methods, with an R² of 0.79 and an MAE of 31.8 kBbl. Support vector regression (SVR) and Gaussian process regression (GPR) were used by Guo et al. [24] to predict early oil and gas production with a best R² of 0.83, and sensitivity analysis indicated that fluid volume and total organic carbon (TOC) are the most essential features for predicting well production. With the gas-oil ratio (GOR), upstream and downstream pressures, and choke size as input variables, support vector machines (SVM) and RF were implemented [22] to predict surface oil rates in high gas-oil-ratio formations; all the data-driven models provided better estimates (R² > 0.9) than empirical correlations. Khan et al. [35] also employed adaptive neuro-fuzzy inference systems (ANFIS), SVM and ANN to investigate the oil production rate in artificial gas lift wells, explaining approximately 99% of the variance in the data.
In fact, predicting production dynamics with acceptable accuracy is of greater importance to reservoir operators, and production curve estimation is a more challenging problem than predicting a single point (i.e., cumulative production). As shown in Figure 1, monthly well production in unconventional reservoirs fluctuates greatly due to frequent shut-in operations, which increases the complexity of applying ML methods to production sequences. The autoregressive strategy, i.e., predicting future values from past values, is widely used in time series analysis [36][37][38]. For example, Duan et al. [25] integrated the autoregressive integrated moving average (ARIMA) model and RTS smoothing to predict gas well production. Fan et al. [26] proposed a combined model that captures the linear component of the daily production sequence with ARIMA and the nonlinear part with a Long Short-Term Memory (LSTM) network. In addition, the specific architectures of recurrent neural networks (RNNs), such as LSTM and encoder-decoder architectures, render them inherently appropriate for modelling sequential information, such as monthly production and bottom-hole pressure (BHP). A particle swarm optimization-assisted LSTM model was proposed by Song et al. [27] to infer the daily oil rate of fractured horizontal wells. Zha et al. [28] developed a CNN-LSTM model to extract important features automatically and capture sequence dependence, which was used to predict monthly natural gas production with a MAPE of 7.7%. Moreover, in another study [39], short-term forecasts of oil production were demonstrated with the DeepAR and Prophet time series models; the results imply that ML approaches can fail to capture long-term trends and that data-driven models should be retrained regularly. Zhong et al.
[40] used a conditional deep convolutional generative neural network and the material balance method to develop a proxy model for assessing the production rate of waterflooding as a function of the permeability field and time. The results indicate that the predictions of the proposed model agree closely with the outcomes of a commercial simulator. To summarize, ML has been widely adopted to estimate the early-stage cumulative production and production dynamics of conventional and unconventional reservoirs, with R² values ranging from approximately 0.8 to 0.95, and neural networks are the most prevalent models due to their flexibility in processing different data formats and their powerful nonlinear mapping capacity. However, the production forecasting problem involves data in multiple formats, such as time series and tabular data. Processing multiple data formats simultaneously is currently challenging for ML models because different types of data have different characteristics; integrating multiple data formats is therefore an active area of research, and new architectures are being developed to address this limitation. In addition, the availability of real-world well data can be a major obstacle to developing accurate ML models because of incompletely collected features and the limited number of available samples, resulting in poor generalization of the ML model.

Well Test Analysis
In well test analysis, an input impulse (typically a change in flow rate) is applied to the reservoir and the corresponding response (typically a variation in pressure) is measured [41]. As is well known, the response is governed by a set of subsurface properties such as porosity, permeability, well communication, formation damage (i.e., the skin coefficient), boundaries, and fracture geometries [42]. Based on the collected information and analytical models that link the response to petrophysical properties, model parameters and assumptions analogous to the subsurface conditions can be inferred by optimal curve fitting [43,44]. ML has also proven feasible in well test analysis. One study implemented a CNN to estimate mobility ratios, dimensionless radius, wellbore storage and skin effects in radial composite formations, with pressure buildup/drawdown data and pressure derivative information as input parameters [31]. Leveraging log-log plots (pressure and the corresponding derivative), another study used a CNN to model a dimensionless variable comprising permeability, storage and skin effect in an infinite reservoir [32]. Chu et al. [20] applied multi-layer CNNs and fully connected neural networks (FCNN) to classify well testing plots, obtaining mean F1 scores of 0.91 and 0.81, respectively. Dong et al. [33] also used a one-dimensional CNN to interpret well test data automatically; the well-trained model could identify homogeneous, dual-porosity, radial composite and finite-conductivity vertically fractured models and invert the associated parameters with reasonable accuracy. Xue et al.
[12] extracted the slopes of 40 segments to characterize the pressure derivative curve and used these slopes to train an RF to identify the water invasion pattern (bottom or edge water) in gas reservoirs. They also integrated the RF regressor with an ensemble Kalman filter to investigate permeability, aquifer size ratio and gas-water contact depth. Pandey et al. [45] used a genetic algorithm (GA) for feature selection and hyperparameter tuning to improve the performance of an ANN in identifying and characterizing homogeneous reservoirs; the feasibility and superiority of the GA-optimized ANN were demonstrated by applying the well-trained model to nine simulated cases and one real case. S. Wang and Chen [30] revealed that LSTM is competent at interpreting the correlation between pressure and flow rate data from hydraulically fractured tight reservoirs without requiring mathematical models; the authors concluded that the LSTM is able to capture well shut-ins accurately and is more robust to noise. Nagaraj et al. [29] developed a Siamese neural network (SNN), composed of a CNN and an LSTM, for recognizing 14 different reservoir models; the SNN achieved an accuracy of 93% in identifying the correct model as the top recommendation.
Well test analysis is heavily driven by both mathematics and data: mathematical models based on physical principles are used to analyze the data collected during well tests in order to estimate the performance of the well and the properties of the reservoir from which it produces. ML has been increasingly applied in well test interpretation to improve efficiency and accuracy, but the majority of current work has completely ignored the physical laws, leading to poor generalization and a lack of interpretability.

Reservoir Characterization
Reservoir characterization is a heavily data-driven problem that integrates seismic, logging and core analysis data to improve the understanding of subsurface properties such as porosity, saturation, permeability and pressure-volume-temperature (PVT) behavior [46]. The inherent heterogeneity of reservoirs makes the estimation of these subsurface properties very difficult [47], and the relations between geophysical data and the expected properties can fluctuate significantly from place to place. Therefore, conventional geostatistical approaches such as kriging and co-kriging can hardly identify the underlying correlations. Advances in ML algorithms have facilitated the analysis of data from well logs, seismic surveys and other sources to improve the understanding of subsurface geology and fluid distribution.
A great number of studies have inferred reservoir properties inversely from seismic information because of its wide-ranging coverage. For instance, RNNs and Monte Carlo (MC) simulation were leveraged by Grana et al. [48] to classify facies from seismic data; the results reveal that RNNs perform better if the training set is sufficiently large, while MC can give equivalent performance and quantify uncertainty better if the prior information is specified. Moreover, Liu et al. [49] also used an RNN architecture for seismic reservoir characterization. With seismic data, the source wavelet and low-frequency prior porosity as input variables, an unsupervised convolutional neural network was developed by Feng et al. [50] to forecast porosity with limited labelled data. By introducing biased dropout and dropconnect strategies to address the overfitting problem, Liu et al. [10] extended the extreme learning machine (ELM) to simultaneously estimate several important properties such as lithofacies, shale content, porosity and saturation; the authors concluded that the proposed method has better generalization performance and a more efficient training process. Another work, by Lee et al. [51], evaluated the statistical relationships between total organic carbon (TOC) in unconventional reservoirs and seismic parameters by combining ML and statistical rock physics. Chen et al. [52] proposed four physical constraints (spatial, continuity, gradient and category) and incorporated them into the RF algorithm to predict reservoir quality; a significant improvement in prediction accuracy was observed in terms of the F1 score.
Although seismic data cover a large area and can provide property assessments that span the entire reservoir, they carry limited information [53][54][55]. Logging data have high resolution but low coverage, and core data provide the most accurate information and the highest resolution, but their availability is limited. For instance, Katterbauer et al. [56] incorporated electrical and acoustic image logging data to classify fractures and estimate the degree of fracturing. Using 100 shale scanning electron microscope images, Tian and Daigle [57] applied ML-based automated object detection to identify and characterize microfractures. Several studies have focused on integrating various data sources to comprehensively characterize the reservoir, typically referred to as integration modelling [58,59]. Anifowose et al. [23] combined seismic data and wireline attributes to forecast permeability by employing six state-of-the-art ML algorithms; the authors found that SVM outperformed the others and that the depth-matching strategy made a significant difference in estimating production capacities. Priezzhev and Stanislav [60] compared the generalization of conventional seismic inversion methods and several ML techniques using seismic attributes and well logs to predict reservoir properties. Investigating the distributions of lithological properties is useful for identifying potential hydrocarbon-rich regions. Dixit et al. [61] utilized ascendant hierarchical clustering (AHC), self-organizing maps (SOM), ANN and multi-resolution graph-based clustering (MRGC), with best R² values of 0.85, 0.74, 0.90 and 0.68, respectively, to predict lithofacies by integrating core data and well logs.
Overall, ML can be a powerful tool for reservoir characterization. However, seismic, logging and core analysis data are susceptible to errors and noise introduced during the collection, aggregation or annotation stages, and small deviations in features can result in significant changes in ML estimates due to the inherent black-box nature of these models. Therefore, high-quality data are required to make accurate analyses and reliable decisions with the assistance of more advanced techniques.

Future Trends for ML in Reservoir Engineering
In spite of the remarkable achievements of ML in current reservoir applications, such as characterization and performance investigation, it is still in its infancy and has great potential for further improvement, as stated in Section 2.

Data Quality and Quantity
At present, most studies have been devoted to proposing new strategies to improve model performance, ignoring the fact that the quality of the model output depends not only on the model architecture but also, to a large extent, on the quality of the data [62]. Even the most advanced ML algorithms cannot deliver reasonable results without guaranteed data quality and quantity. Data quality is not simply data accuracy; it also encompasses completeness and consistency [63]. Completeness denotes the absence of missing values, and consistency refers to whether the data records follow a uniform specification. Data accuracy can be improved in the future with the assistance of more advanced sensors and mathematical algorithms (e.g., denoising and outlier detection). Ensuring completeness requires determining in advance what information must be collected, in order to prevent both the collection of redundant information and the absence of important parameters. Moreover, data consistency is most affected by human factors, so standardizing the data recording process, including the use of the same units and labeling methods, is crucial.
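As a minimal example of the outlier-screening step mentioned above, an interquartile-range (IQR) filter can flag suspect records before they reach a model. The gauge readings below are synthetic, and the 1.5×IQR threshold is a common rule of thumb rather than a field standard.

```python
import statistics

# Synthetic daily-rate gauge readings with two obvious bad records
# (a sensor spike and a shut-in zero mixed into a stable series).
rates = [48.2, 47.9, 49.1, 48.5, 250.0, 47.7, 48.8, 0.0, 49.4, 48.1]

q1, _, q3 = statistics.quantiles(rates, n=4)   # quartiles
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr

clean   = [r for r in rates if lo <= r <= hi]
flagged = [r for r in rates if not (lo <= r <= hi)]
print(flagged)   # records to inspect before training
```

Flagged records should be inspected rather than silently dropped: a zero rate may be a legitimate shut-in (as in Figure 1) rather than an error, which is exactly why data quality needs domain knowledge and not only automatic filters.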
In addition, acquiring very large datasets is challenging in reservoir engineering due to limitations of cost, technology and data-sharing policies. To overcome the shortage of samples, the construction of data-sharing platforms should be actively promoted. Moreover, the advent of few-shot learning, transfer learning and federated learning offers promising solutions that may partially address this challenge [64][65][66]. With transfer learning, a model is pre-trained on big data from a reservoir with similar conditions and then slightly fine-tuned to enhance its performance in the target reservoir. However, transfer learning may only be applicable to target domains with sufficient similarity, and some investigations have been conducted at the well scale [67]. Federated learning aims to train a model through local training and parameter sharing without direct access to the data sources, so data privacy and legal compliance are ensured. In addition, inspired by the human capacity to learn and generalize from small samples, few-shot learning aims to learn from scarce data with the assistance of strategies such as data augmentation, metric learning and meta-learning. In summary, drawing reasonable inferences from sparse data is essential for reservoir systems; current research is still in its early stages, and more work is required in the future.

Fusion of Multiple Data Sources
In reservoir engineering, data at different scales, in different domains and in different formats are accumulated: for example, there are large numbers of computerized tomography (CT) scans at the core scale, time series data such as monthly fluid production, and a wide variety of tabular data at the well scale. Different data sources have different resolutions and provide different information. For instance, seismic data are available over large areas and can provide an evaluation of petrophysical properties across the entire reservoir, but they carry only limited information. Logging data have relatively high resolution but suffer from sparse, site-specific coverage. Core data provide the most accurate information and the highest resolution, but their acquisition is limited and often prohibitively expensive. Fusing multiple sources of input information within ML models is therefore expected to improve model generalization, enhance the resolution of reservoir modelling and alleviate the bottleneck of data scarcity. Accordingly, reasonable stacking strategies for various modules, or multimodal learning architectures, are anticipated to become available in the near future to process various inputs simultaneously. Figure 2 shows an illustration of the integration of multiple data sources in reservoir engineering.
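A deliberately simple early-fusion sketch of the idea above: each non-tabular source is summarized into fixed-length statistics and concatenated with the tabular attributes into one feature vector. All values and field names are illustrative; multimodal architectures learn such representations with dedicated modules (e.g., CNNs for images, RNNs for sequences) instead of hand-crafting them.

```python
import statistics

# Illustrative multi-source record for one well: tabular attributes,
# a monthly-rate time series, and a (downsampled) log curve.
tabular = {"depth_m": 2450.0, "porosity": 0.12, "net_pay_m": 14.0}
monthly_rates = [310.0, 265.0, 231.0, 204.0, 183.0, 166.0]
gamma_log = [45.0, 62.0, 58.0, 71.0, 49.0]

# Early fusion: summarize each non-tabular source into fixed-length
# statistics, then concatenate everything into one feature vector.
def summarize(series):
    return [series[0], series[-1],
            statistics.mean(series), statistics.pstdev(series)]

features = (list(tabular.values())
            + summarize(monthly_rates)
            + summarize(gamma_log))
print(len(features), features)
```

Hand-crafted summaries discard most of the sequence information, which is precisely the limitation that motivates the multimodal learning architectures discussed above.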

Coupling Physics Laws with ML
Due to the capital-intensive nature of the oil and gas industry, a small deviation in decision making may result in a significant loss of manpower, resources and funds. Therefore, the interpretability of ML models is one of the most important yet challenging issues hindering the implementation of ML in engineering. In conventional ML practice for modelling physical phenomena, prior information from physical laws, based on fundamental principles such as conservation laws, monotonicity and symmetry, as well as empirical rules, is completely discarded, which results in poor generalization performance, especially in the small-data regime [68]. Consequently, even state-of-the-art black-box methods are unable to provide physically consistent results and lack generalizability to out-of-sample cases [69]. Intuitively, applying extra physical constraints to a data-driven approach enables the resulting model to benefit from both the data and the laws of physics and constrains the search space, thus granting more general inference with significantly fewer training samples than either traditional methods or plain ML. A simple coupling strategy is shown in Figure 3: analytical/empirical models and machine learning algorithms (e.g., neural networks) each predict the desired variables, and their outputs are then stacked and fed into a simple data-driven model to obtain the final output. A more promising solution to this challenge could be physics-informed neural networks (PINN), in which automatic differentiation is employed to calculate the derivatives of the neural network output with respect to the input coordinates and model parameters. PINNs have been successfully employed for data-driven solution of partial differential equations, data-driven discovery of governing equations and fitting potential many-body energy surfaces [68,70,71], as such networks are constrained to respect any conservation principles, symmetries and differentiability properties stemming from the physical laws [68]. Therefore, PINNs are expected to have great potential for future applications in complex systems, such as reservoirs, where large amounts of data and various applicable physical models exist.
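The physics-informed idea can be reduced to a toy example: fit a one-parameter rate model q(t) = exp(-c·t) to sparse data while penalizing the residual of an assumed decline law dq/dt = -k·q at collocation points, so the total loss combines a data term and a physics term. The decline constant k, the data, the λ weight and the grid-search "training" are all illustrative assumptions; real PINNs use neural networks, automatic differentiation and gradient-based optimizers.

```python
import math

# Assumed decline law dq/dt = -k*q with k = 0.5 (illustrative),
# sparse noisy data, and a candidate model q(t) = exp(-c*t).
k, lam = 0.5, 1.0
t_data = [0.0, 1.0, 3.0]
q_data = [1.00, 0.66, 0.20]
t_coll = [0.25 * i for i in range(17)]   # collocation points on [0, 4]

def total_loss(c):
    q = lambda t: math.exp(-c * t)
    data = sum((q(t) - qd) ** 2
               for t, qd in zip(t_data, q_data)) / len(t_data)
    # ODE residual dq/dt + k*q = (k - c) * q at the collocation points
    phys = sum(((k - c) * q(t)) ** 2 for t in t_coll) / len(t_coll)
    return data + lam * phys

# One-parameter "training": grid search over the decline rate c.
grid = [i / 1000 for i in range(0, 1001)]
c_best = min(grid, key=total_loss)
print(c_best, total_loss(c_best))
```

The physics term vanishes only when the model obeys the assumed law (c = k), so it pulls the fit toward physically consistent solutions even where data are sparse, which is the mechanism behind the improved small-data generalization described above.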

Conclusions
This study provides a comprehensive review of the ML approaches that have been employed in the field of reservoir engineering and highlights their application status and challenges. It is evident that ML has the potential to provide accurate results and improve reservoir engineering tasks, but challenges remain, and more research is needed to overcome them. The key conclusions drawn from this study are listed below:
1. Machine learning (ML) techniques have numerous applications in reservoir engineering with acceptable accuracy, including the estimation of reservoir properties, well test interpretation, and the investigation of production behavior.
2. A variety of machine learning algorithms have been adopted in the field of reservoir engineering, among which neural networks (e.g., FCNN, RNN and its variant LSTM, and CNNs) are the most popular models because of their powerful nonlinear mapping capacity and flexibility in handling different data formats.
3. The current application of ML in reservoir engineering is still in its infancy, and further research is needed to enhance the ability to draw reliable inferences from sparse data and to develop strategies for integrating data from multiple sources/formats.
4. More attention should be given to the integration of physical laws with current data-driven models for the purpose of improving model interpretability and generalization, and PINN is a promising approach to address this problem.

Citation: Wang, H.; Chen, S. Insights into the Application of Machine Learning in Reservoir Engineering: Current Developments and Future Trends.

Figure 1. Monthly production of some wells in the Duvernay Formation.

Figure 2. Illustration of integrating multiple data sources.

Figure 3. A simple strategy to integrate analytical/empirical models with data-driven models.

Table 1. Commonly used ML models in reservoir engineering.