# New Hybrid Approach for Developing Automated Machine Learning Workflows: A Real Case Application in Evaluation of Marcellus Shale Gas Production


## Abstract


R^{2} correlation) and Mean Square Error (i.e., MSE). For this purpose, actual field data obtained from 1567 gas wells in the Marcellus Shale, with 121 features from reservoir, drilling, completion, stimulation, and operation, are tested using the different proposed workflows. The proposed new hybrid workflow is then used to evaluate the type well used for evaluation of Marcellus Shale gas production. In conclusion, our automated hybrid approach showed significant improvement over the other proposed workflows on both scoring metrics. The new hybrid approach provides a practical tool that supports automated model and hyperparameter selection; it is tested using real field data and can be applied to different engineering problems using artificial intelligence and machine learning. The new hybrid model is tested in a real field and compared with conventional type wells developed by field engineers. The type well of the field is found to be very close to the P50 predictions for the field, which indicates that the completion design performed by the field engineers was highly successful. The results also show that the field average production could have been improved by 8% if shorter cluster spacing and higher proppant loading per cluster had been used during the frac jobs.

## 1. Introduction

## 2. Methodology

#### 2.1. Data Pre-Processing

- (a) Primary scanning
- (b) Detachment of characteristic features
- (c) Detachment of unchangeable features
- (d) Handling of missing values
- (e) Data transformation
- (f) Removal of outliers
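The pre-processing steps listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure used in the study: steps (b) and (c) are assumed to amount to dropping named columns, median imputation stands in for step (d), min-max scaling for step (e), and an interquartile-range rule for step (f). All function names and column names are hypothetical.

```python
import statistics

def drop_columns(rows, names):
    """Steps (b)/(c): drop the named columns from every record."""
    return [{k: v for k, v in row.items() if k not in names} for row in rows]

def impute_median(rows, column):
    """Step (d), assumed choice: replace missing values (None) with the median."""
    median = statistics.median(r[column] for r in rows if r[column] is not None)
    return [{**r, column: median if r[column] is None else r[column]} for r in rows]

def min_max_scale(rows, column):
    """Step (e), assumed choice: rescale one column to the range [0, 1]."""
    values = [r[column] for r in rows]
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid division by zero for constant columns
    return [{**r, column: (r[column] - low) / span} for r in rows]

def remove_outliers_iqr(rows, column, k=1.5):
    """Step (f), assumed choice: keep rows within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles([r[column] for r in rows], n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [r for r in rows if low <= r[column] <= high]

def preprocess(rows, id_columns, numeric_columns):
    """Apply the steps in the order (b)/(c), (d), (e), (f)."""
    rows = drop_columns(rows, id_columns)
    for column in numeric_columns:
        rows = impute_median(rows, column)
        rows = min_max_scale(rows, column)
    for column in numeric_columns:
        rows = remove_outliers_iqr(rows, column)
    return rows
```

Calling `preprocess(rows, {"well_id"}, ["cluster_spacing_ft", "proppant_lbs"])` on a list of per-well dictionaries returns the cleaned records. In practice the choice of imputation, transformation, and outlier rule should be guided by domain expertise for each feature group.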

#### 2.2. Algorithms Configuration

- (a) Grid/random search

- (b) Bayesian search and optimization

- Import the distributions of hyperparameters to be searched.
- Import the objective function (commonly the loss function), the acquisition function (as a criterion for selecting the next set of hyperparameters from the current set), and the surrogate model.
- Within a set of hyperparameters selected from the imported distributions, the procedure is conducted using a Bayes loop, as follows:
  - Fit the surrogate model.
  - Compute the objective function and maximize the acquisition function.
  - Apply the model with the intended set of hyperparameters to the data.
  - Update the surrogate model to decide which set of hyperparameters from their distributions to select next.
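The Bayes loop above can be illustrated with a small, self-contained sketch. It assumes a single scalar hyperparameter, a Gaussian Process surrogate with a squared-exponential kernel, and Expected Improvement (EI) as the acquisition function, maximized over a fixed candidate grid. The function names, the kernel length scale, and the noise term are illustrative assumptions, and the objective is any user-supplied score to be maximized (for example, a negative loss).

```python
import math

def rbf(a, b, length=0.2):
    """Squared-exponential covariance between two scalar hyperparameters."""
    return math.exp(-((a - b) ** 2) / (2.0 * length ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit_surrogate(xs, ys, noise=1e-4):
    """Fit the GP surrogate; returns a function giving posterior mean and std."""
    mean_y = sum(ys) / len(ys)
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    def predict(x_new):
        k_star = [rbf(a, x_new) for a in xs]
        weights = solve(K, k_star)  # K^{-1} k_*, valid because K is symmetric
        mu = mean_y + sum(w * (y - mean_y) for w, y in zip(weights, ys))
        var = rbf(x_new, x_new) - sum(w * k for w, k in zip(weights, k_star))
        return mu, math.sqrt(max(var, 0.0))
    return predict

def expected_improvement(mu, sigma, best):
    """EI acquisition for maximization, relative to the best score so far."""
    if sigma < 1e-12:
        return 0.0
    z = (mu - best) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (mu - best) * cdf + sigma * pdf

def bayes_search(objective, candidates, initial, n_iter=10):
    """Run the Bayes loop and return the best (hyperparameter, score) pair."""
    xs = list(initial)
    ys = [objective(x) for x in xs]
    for _ in range(n_iter):
        predict = fit_surrogate(xs, ys)            # fit / update the surrogate
        best = max(ys)
        x_next = max(candidates,                   # maximize the acquisition
                     key=lambda x: expected_improvement(*predict(x), best))
        xs.append(x_next)
        ys.append(objective(x_next))               # evaluate the model score
    i = max(range(len(ys)), key=ys.__getitem__)
    return xs[i], ys[i]
```

For example, `bayes_search(score, [i / 100 for i in range(101)], [0.0, 0.5, 1.0])` searches a normalized hyperparameter on a 101-point grid, where `score` is a hypothetical function that trains the model and returns its validation score. Production use would normally rely on an established optimization library rather than this hand-written surrogate.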

- (c) Genetic Programming

- (d) Tree-based Pipeline Optimization (TPOT)

- Within a generation’s population, each candidate pipeline is evaluated for its fitness to the problem.
- Based on the fitness results, the best pipelines are stored as a new population that undergoes crossover (governed by the crossover probability) to create the next potential generation of pipelines.
- All created pipelines are tested for their performance on the provided data, using the pre-set scoring metric(s).
- If the results are satisfactory, the created pipelines’ data is passed to the next generation.
- If no created pipeline performs satisfactorily, mutation is applied to the created pipelines to expand the number of pipelines (governed by the mutation probability), and data from all created pipelines (from both crossover and mutation) is passed to the next generation.
- TPOT terminates when the stopping criteria are satisfied (commonly, when the maximum number of generations is reached). The representative pipeline from the final generation is the most suitable machine learning pipeline for the given problem.
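The evolutionary loop described above can be sketched with a toy example. This is not TPOT itself: a pipeline is reduced to three categorical choices, the fitness function is a hypothetical stand-in for a cross-validated score, and crossover and mutation are applied independently with fixed probabilities, which simplifies the conditional mutation step described in the list. All names and rate values are illustrative assumptions.

```python
import random

# Hypothetical search space: one option per pipeline stage.
GENES = {
    "scaler": ["none", "minmax", "standard"],
    "selector": ["all", "top_k", "variance"],
    "model": ["linear", "tree", "boosting"],
}

def toy_fitness(pipeline):
    """Hypothetical stand-in for a cross-validated score (higher is better)."""
    bonus = {"standard": 0.05, "top_k": 0.03, "boosting": 0.10}
    return 0.6 + sum(bonus.get(choice, 0.0) for choice in pipeline.values())

def random_pipeline(rng):
    """Draw one random option for every stage."""
    return {gene: rng.choice(options) for gene, options in GENES.items()}

def crossover(a, b, rng):
    """Build a child by taking each stage from one of the two parents."""
    return {gene: rng.choice([a[gene], b[gene]]) for gene in GENES}

def mutate(pipeline, rng):
    """Re-draw the option of one randomly chosen stage."""
    gene = rng.choice(list(GENES))
    return {**pipeline, gene: rng.choice(GENES[gene])}

def evolve(fitness, population_size=8, generations=5,
           crossover_rate=0.1, mutation_rate=0.9, seed=0):
    """Evolve pipelines for a fixed number of generations; return the fittest."""
    rng = random.Random(seed)
    population = [random_pipeline(rng) for _ in range(population_size)]
    for _ in range(generations):               # stop criterion: max generations
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: population_size // 2]   # keep the best pipelines
        children = [parents[0]]                    # elitism: never lose the best
        while len(children) < population_size:
            child = dict(rng.choice(parents))
            if rng.random() < crossover_rate:
                child = crossover(child, rng.choice(parents), rng)
            if rng.random() < mutation_rate:
                child = mutate(child, rng)
            children.append(child)
        population = children
    return max(population, key=fitness)
```

Calling `evolve(toy_fitness)` returns the best pipeline dictionary found. Because the fittest pipeline is always carried over, the best score cannot decrease from one generation to the next.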

#### 2.3. Design of Workflows

#### 2.4. Scoring Metrics for the Designed Workflows

Pearson correlation (i.e., R^{2} correlation) and Mean Square Error (i.e., MSE) are selected as the two scoring metrics to evaluate the performance of all studied workflows. These metrics are selected for their compatibility with all levels of complexity in the designed workflows. MSE is slightly modified to negative MSE; however, this modification merely improves the seamlessness between pre-processing and modeling for a few runs and does not affect the model evaluation.

- Pearson correlation (R^{2} correlation)

- Mean Square Error (MSE)
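A minimal sketch of the two scoring metrics follows. It assumes that R^{2} takes the usual coefficient-of-determination form, one minus the ratio of the residual sum of squares to the total sum of squares about the mean of the true values, and that MSE is the mean of the squared residuals over the n samples. The function names are illustrative, and the negative-MSE helper reflects the sign convention in which a larger score is better.

```python
def mean_squared_error(y_true, y_pred):
    """MSE: average squared difference between true and predicted values."""
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n

def negative_mse(y_true, y_pred):
    """Negative MSE, so that a higher score always means a better model."""
    return -mean_squared_error(y_true, y_pred)

def r2_score(y_true, y_pred):
    """R^2 (assumed form): 1 - SS_res / SS_tot about the mean of y_true."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot
```

With true values `[1.0, 2.0, 3.0]` and predictions `[1.0, 2.0, 4.0]`, the residual sum of squares is 1 and the total sum of squares is 2, so R^{2} is 0.5 and MSE is 1/3.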

## 3. Analysis and Discussion

R^{2} and MSE scoring metrics, as shown in Table 2. Besides performance, Table 2 reveals another notable observation when the modules in the native TPOT workflow and the reduced TPOT workflow are compared: the use of human domain expertise during pre-processing is comparable to the coupled decomposition and feature-selection modules in TPOT. Although TPOT is intrinsically designed to transform and select the optimal features, the dedicated involvement of domain expertise is shown in this study to contribute a similar effect while preserving the physics of the problem.

## 4. Result

R^{2} and MSE as scoring metrics. In this paper, the results for 540 days of cumulative gas production are presented and discussed; however, similar results are obtained with 180 and 360 days of cumulative gas production as the target parameter.

#### 4.1. Final Modelling Outcomes from the Four Designed Workflows

#### 4.2. Validation of the Four Designed Workflows on the Test Set

R^{2} and lower MSE values compared to workflows 1 and 2. Building on workflow 3, the use of Bayesian optimization for hyperparameter selection as a pipeline-end supplement to TPOT leads to a higher R^{2} and lower MSE values in workflow 4 compared to workflow 3.

#### 4.3. Evaluation of Marcellus Shale Gas Production Type Well

## 5. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## Nomenclature

| Symbol | Description |
|---|---|
| $\operatorname{argmax}_{X} F$ | The maximum value of F under control of variables X |
| Algo | Algorithm |
| AI | Artificial Intelligence |
| AML | Automatic Machine Learning |
| EA | Evolutionary Algorithm |
| EI | Expected Improvement |
| GP | Genetic Programming |
| KNN | K Nearest Neighbour |
| NPV | Net Present Value |
| P10, P50, P90 | Uncertainty probabilistic quantification (i.e., exceedance probability) |
| SMBO | Sequential Model-Based Optimization |
| TPOT | Tree-based Pipeline Optimization Tool |
| $\mu$ | Mean function in Gaussian Processes |
| $K$ | Covariance (i.e., kernel) function in Gaussian Processes |
| $m(x_{i}),\ i \in [1, N]$ | Mean value of the random variable $x_{i}$ |
| $n$ | Sample size to compute the error metrics |
| $\mathcal{N}(f \mid \mu, K)$ | Normal distribution of random function $f$ with mean function $\mu$ and covariance kernel $K$ |
| $p(f \mid x)$ | Prior conditional function modeled by Gaussian Processes |
| $x_{1:j-1}$ | Hyperparameter sets at all the previous steps |
| $x, x_{new}$ | Hyperparameter set at the current step |
| $x^{+}$ | Hyperparameter set at the previous step |
| $y_{i}$ | The true value of the output variable |
| $Y_{i}$ | Average of the true values of the input variables |
| $\widehat{y}_{i}$ | The predicted value of the output variable |

## Appendix A

**Figure A1.** Model parameter distributions, from top left to right (number of clusters per stage; cluster spacing; water loading per cluster; proppant loading per cluster; 100 mesh; 40/50 mesh; 30/50 mesh; average rate per cluster; time in-line).


| Group | Examples of Features | Number of Features |
|---|---|---|
| G0 - Well API | API, Well number, Well Index | 7 |
| G1 - Trajectory | MD, TVD, Easting, Northing | 10 |
| G2 - Perforation | Cluster Spacing, Perforation Stage | 9 |
| G3 - Injection schedule | Proppant loading, Water loading, Proppant meshes, Fracture Gradient | 17 |
| G4 - Logging and Geomechanics | Impedance, Young Modulus, Clay Volume, Gas Content | 27 |
| G5 - Operations | Casing Pressure, Tubing Pressure | 24 |
| G6 - Production | Cumulative/normalized production data (90/180/360/540 days) | 20 |
| G7 - Others | BTU, frac hit | 7 |

| | Automation | Domain-Expertise |
|---|---|---|
| R^{2} | 0.721 | 0.702 |
| MSE | 1800 | 1801 |

| | Workflow 1 | Workflow 2 |
|---|---|---|
| R^{2} | 0.685 | 0.702 |
| MSE | 2832 | 2756 |

| | Workflow 3 | Workflow 4 |
|---|---|---|
| R^{2} | 0.729 | 0.733 |
| MSE | 2963 | 2543 |
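The relative change between Workflow 3 and Workflow 4 can be worked out directly from the tabulated values above. The helper name below is illustrative, and the calculation is simple arithmetic on the reported numbers rather than a new result.

```python
def relative_change(old, new):
    """Fractional change from old to new, e.g. -0.10 means a 10% reduction."""
    return (new - old) / old

# Values taken from the Workflow 3 / Workflow 4 table above.
workflow3 = {"r2": 0.729, "mse": 2963}
workflow4 = {"r2": 0.733, "mse": 2543}

mse_change = relative_change(workflow3["mse"], workflow4["mse"])
r2_change = relative_change(workflow3["r2"], workflow4["r2"])
print(f"MSE change: {mse_change:.1%}, R2 change: {r2_change:.2%}")
```

On these numbers, the MSE falls by roughly 14% from Workflow 3 to Workflow 4, while R^{2} rises by about 0.5%.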

| | Treating Lateral Length (ft) | # Cluster per Stage | Cluster Spacing (ft) | Water Loading per Cluster (bbl/cluster) | Proppant Loading per Cluster (lbs/cluster) | 100 Mesh (%) | 40/70 Mesh (%) | Young’s Modulus (MMpsi) | Poisson’s Ratio | Water Saturation (%) | Porosity | Gas Content (Scf/ton) | BTU | Average Well Spacing (ft) | Objective: 540 Days Cumulative Gas Production (Normalized) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Max | 12323 | 10 | 119.813 | 2932.967 | 155,908.75 | 1 | 1 | 4.604 | 0.254 | 0.83 | 0.44 | 1674.953 | 1300 | 1501 | 1.621 |
| Min | 1614 | 2 | 19.834 | 674.916 | 24,039.091 | 0 | 0 | 2.459 | 0.11 | 0.009 | 0.103 | 0 | 993 | 438.868 | 0 |
| Mean | 5385.699 | 4.865 | 44.294 | 1559.612 | 67,024.683 | 0.585 | 0.333 | 3.116 | 0.181 | 0.117 | 0.229 | 487.909 | 1107.637 | 1039.387 | 0.143 |
| Std | 2070.061 | 0.611 | 16.085 | 446.712 | 21,826.928 | 0.242 | 0.218 | 0.429 | 0.029 | 0.098 | 0.035 | 143.885 | 85.478 | 255.413 | 0.099 |

| | # Cluster per Stage | Cluster Spacing (ft) | Water Loading per Cluster (bbl/cluster) | Proppant Loading per Cluster (lbs/cluster) | 100 Mesh (%) | 40/70 Mesh (%) | 30/50 Mesh (%) | Time In-Line, TIL (days) | Average Rate per Cluster | Perforation per Stage | Objective: 540 Days Cumulative Gas Production (Normalized) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Type well | 5 | 44 | 1569 | 67,374 | 0.58 | 0.34 | 0.06 | 85 | 19 | 41 | 1.13 |
| P10 | 3 | 60 | 2170 | 84,177 | 0.59 | 0.25 | 0.16 | 21 | 19 | 40 | 1.06 |
| P50 | 5.0 | 60 | 1524 | 77,381 | 0.54 | 0.46 | 0 | 102 | 19.00 | 40 | 1.13 |
| P90 | 5.0 | 30 | 1208 | 79,840 | 0.56 | 0.25 | 0 | 75 | 20.00 | 40 | 1.22 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Van Pham, V.; Fathi, E.; Belyadi, F.
New Hybrid Approach for Developing Automated Machine Learning Workflows: A Real Case Application in Evaluation of Marcellus Shale Gas Production. *Fuels* **2021**, *2*, 286-303.
https://doi.org/10.3390/fuels2030017
