Hybrid Mechanistic–Data-Driven Virtual Metering Models and Methodologies for Conventional Gas Fields

Wang, Minhao; Wang, Zhenjia; Chen, Gangping; Zhou, Jun; Luo, Jian; Qin, Fang; Wu, Yue; Zhou, Pan; Lin, Chuqi

doi:10.3390/modelling7030099

by Minhao Wang ¹, Zhenjia Wang ², Gangping Chen ¹, Jun Zhou ³,*, Jian Luo ¹, Fang Qin ¹, Yue Wu ³, Pan Zhou ³ and Chuqi Lin ³

¹ Surface Engineering Design Center, PetroChina Southwest Oil & Gasfield Company, Chengdu 610000, China
² PetroChina Southwest Oil & Gasfield Company, Chengdu 610000, China
³ Petroleum Engineering School, Southwest Petroleum University, Chengdu 610500, China
* Author to whom correspondence should be addressed.

Modelling 2026, 7(3), 99; https://doi.org/10.3390/modelling7030099 (registering DOI)

Submission received: 6 April 2026 / Revised: 7 May 2026 / Accepted: 15 May 2026 / Published: 19 May 2026

Abstract

Virtual flow metering (VFM) serves as an effective alternative to traditional physical flow meters, significantly reducing gas-field metering costs and operational complexity. However, conventional VFM typically employs a single modeling approach and therefore fails to address metering requirements across varying production conditions and data types. Focusing on wellhead choke equipment, four mechanistic models (MModels) based on choke-flow dynamics are constructed using piecewise linear regression, alongside six machine learning models. Hyperparameters are optimized via grid search and cross-validation, establishing a hybrid mechanistic and data-driven multi-model VFM method for gas wells. Systematic testing utilizes field data from gas wells in the Southwest Oil and Gas Field, with the Shapley additive explanations (SHAP) method quantifying feature contributions. MModel results indicate superior overall performance by the temperature-difference piecewise linear model, yielding a training R² of 0.91 and a mean test error of 4.59%. Under different valve-position conditions, the downstream-temperature piecewise linear model demonstrates better predictive capability when the valve position is equal to 100, whereas the valve-position piecewise linear model achieves higher accuracy when the valve position is less than 100. Machine learning model results reveal that among ten feature parameters, “Date” and “Valve Position Indication” contribute most significantly to prediction accuracy, accounting for over 50% of cumulative contribution in the XGBoost (extreme gradient boosting) and CatBoost (categorical boosting) models. Notably, the XGBoost model exhibits optimal predictive performance, achieving a training R² of 0.979 and a mean test error of merely 0.13%. Random sampling results show coefficient of variation values below 0.1 for all metrics, demonstrating strong robustness. The proposed method provides an effective technical solution and theoretical support for gas-field VFM.

Keywords:

natural gas; virtual flow metering; mechanistic model; data driven; machine learning; neural network

1. Introduction

1.1. Motivation

Virtual flow metering (VFM) estimates oil-well and gas-well production rates using software algorithms and mathematical models applied to existing measurement parameters, such as pressure, temperature, and wellhead data. As a supplement or alternative to physical metering devices, VFM technology is seeing increasingly widespread application in upstream oil- and gas-field metering scenarios [1]. VFM offers advantages including low investment and maintenance costs, convenient construction and deployment, ease of integration with other systems, and real-time metering capabilities. Consequently, it can partially replace costly online analyzers with suboptimal performance, representing a metering solution with broad application prospects.

VFM technology was initially proposed in the 1990s for oil- and gas-field development. Over two decades of evolution, it has expanded to offshore fields, integrating seamlessly with flow assurance and pipeline management systems to become a critical advanced technology for offshore operations [2]. International technology firms have developed various online monitoring and management systems for flow assurance in deepwater fields, achieving successful deployment and favorable outcomes in regions such as the North Sea, the Gulf of Mexico, and West Africa.

The accelerating energy transition and the rise in renewable energy impose significant transformation pressures on the petroleum industry. In response to these challenges and technological shifts, virtual well metering has emerged as a vital solution for cost reduction and efficiency improvement [3]. Concurrently, rapid advancements in the Internet of Things, big data, cloud computing, and artificial intelligence provide robust technical support for metering innovation. Nevertheless, research on virtual gas-well metering remains nascent, with limited theoretical studies and field applications. Consequently, this study comprehensively investigates gas-well VFM systems, analyzing the principles and applicability of data-driven, mechanistic, and hybrid models. A high-precision, highly adaptable metering system is established for conventional gas fields in southwest China, offering references to address limitations of traditional metering methods, meet digital transformation demands, and resolve metering difficulties in complex environments.

1.2. Literature Review

VFM technology originated internationally and has evolved into a relatively mature technical framework. Nations with advanced petroleum industries, such as Norway, the UK, and the USA, have achieved significant progress in offshore VFM [4]. For instance, the Norwegian oil industry developed FlowManager to monitor and optimize multiphase flow in offshore fields. Utilizing a dynamic multiphase-flow simulator, this system enables real-time well production monitoring and prediction. Internationally, VFM technology has been widely applied in offshore fields, particularly in regions such as the North Sea, the Gulf of Mexico, and West Africa. In contrast, VFM research and application in China started relatively later and has mainly been promoted in the context of digital oilfield construction and production metering optimization. Subsequently, major corporations, like CNPC and Sinopec, have actively promoted these technologies, integrating them into digital transformation and smart oilfield construction. Based on modeling principles, existing VFM techniques fall into three categories: mechanistic model-based, data-driven, and hybrid model-based approaches [5]. To clarify the position of the present work, these existing approaches are compared below in terms of model principles, data requirements, interpretability, adaptability, and engineering applicability.

Mechanistic VFM establishes physical multiphase-flow models, incorporating fundamental fluid dynamics and thermodynamics to calculate oil-, gas-, and water-flow rates using measured pressure and temperature data [6]. These methods rely on accurate physical-process descriptions, requiring detailed wellbore geometry and fluid PVT properties. Representative software includes LedaFlow and K-Spice, which simulate multiphase-flow processes and provide a theoretical basis for VFM [7]. While MModels offer clear physical significance and strong interpretability, they suffer from complexity, numerous parameters, and the necessity for timely updates when well conditions change. Crucially, model accuracy depends heavily on input precision, including exact fluid PVT data, wellbore structure, and reservoir characteristics. In practice, acquiring these parameters precisely is often difficult, or they change during production (e.g., reservoir pressure depletion or fluid composition shifts), causing “model–plant mismatch”. Furthermore, for extremely complex flow phenomena, like high-viscosity flow or specific slug flow regimes, existing closure relationships may lack accuracy, leading to prediction deviations [8]. Simultaneously, solving complex transient models typically entails heavy computational loads, potentially hindering real-time application. Compared with these conventional mechanistic VFM approaches, the present study uses choke-flow principles and piecewise linear regression to construct simplified mechanistic models based on field-accessible parameters. This design reduces the dependence on difficult-to-obtain wellbore and fluid-property parameters while retaining a certain degree of physical interpretability.

To overcome the limitations of traditional MModels, researchers have increasingly integrated intelligent algorithms for auxiliary prediction, marking an early exploration of VFM transitioning from pure physical modeling toward data fusion. Zhou et al. [9] evaluated various models (hydraulic and thermal calculation models) and correlations (empirical formulas) to assess their performance in predicting multiphase-flow rates through surface chokes, highlighting performance bottlenecks inherent in conventional methods. Addressing this issue, Ibrahim et al. [10] employed a choke model combined with a least squares support vector machine (LSSVM) algorithm to predict gas-flow rates under subsonic conditions for nozzle and orifice chokes. Kaleem et al. [11] proposed a hybrid machine learning approach for predicting multiphase-flow rates through surface chokes. They investigated common supervised models, including k-nearest neighbors, support vector regression, decision trees, random forests, gradient boosting, and extra trees, evaluating the proposed intelligent models using production data from the Norwegian Volve oil and gas field. Early studies indicate that incorporating data-driven elements enhances the predictive accuracy of standalone physical models. However, most of these studies emphasize algorithm performance under a specific modeling framework, whereas systematic comparison between multiple mechanistic models and multiple machine learning models under different gas-well operating conditions remains limited.

In recent years, rapid advancements in machine learning and artificial intelligence have spurred extensive research into data-driven VFM. This approach establishes statistical relationships between measured parameters and flow rates, utilizing historical data to train models for flow prediction. Currently, soft-sensing devices deployed in domestic oilfields predominantly rely on MModels, whereas data-driven soft sensors remain scarce in engineering practice, appearing mainly in the academic literature. Data-driven methods offer the advantage of bypassing detailed physical modeling while automatically capturing complex nonlinear relationships; however, they require substantial volumes of high-quality historical data, and model performance may degrade when data distributions shift [12]. Common tools for constructing such models include artificial neural networks (ANNs), support vector machines (SVMs), and recurrent neural networks (RNNs), along with variants like long short-term memory (LSTM) networks, which excel at processing time-series data [13]. The primary strength of data-driven models (DAModels) lies in their ability to capture intricate latent correlations that are difficult to describe precisely with formulas. Once trained, these models typically execute predictions rapidly, making them suitable for real-time monitoring [14]. Nevertheless, several fundamental drawbacks persist. First, their internal decision logic lacks interpretability and physical credibility [15]. Second, they depend heavily on large, high-quality, representative training datasets, which are often scarce in industrial settings. Most critically, their generalizability (extrapolation ability) is poor; if actual operating conditions exceed the range covered by training data, predictions become entirely unreliable. For gas wells, which represent non-stationary processes with evolving production states, DAModels face severe “concept drift”. As process statistical characteristics change over time (e.g., transition from single-phase to two-phase flow), models trained on early data become completely ineffective in later stages [16]. Compared with single data-driven VFM models reported in previous studies, this work evaluates six mainstream ensemble-learning models under the same dataset and uses SHAP analysis to compare feature contributions, thereby improving the interpretability of data-driven VFM.

To leverage the strengths of both mechanistic and data-driven approaches, researchers have recently proposed hybrid model-based VFM methods. These techniques integrate physical models with machine learning algorithms, utilizing physical laws to constrain model behavior while employing data mining to enhance adaptability and accuracy. Hotvedt et al. [17] developed a hybrid well VFM framework, establishing an interaction logic and system architecture between machine learning models for well conditions and mechanistic well models. This system achieves stable and accurate operation driven by real-time measurement data. Studies indicate that such hybrid systems deliver stable phase-specific metering outputs under real-time data-input conditions, achieving overall errors below 5% for liquid flow and 3% for gas flow, thereby improving metering accuracy. Hybrid model construction offers flexible configurations [18]. A common serial structure employs a DAModel to estimate difficult-to-measure, time-varying parameters within MModels online, such as valve flow coefficients or pipe wall roughness. Alternatively, a parallel structure uses MModels for initial prediction, followed by a DAModel that learns and predicts the residual (model error) between the mechanistic output and actual values, applying this residual as a compensation term to the mechanistic prediction [19]. Compared with these hybrid VFM studies, the present work focuses on conventional gas-field wellhead choke equipment and proposes a multi-model VFM framework that jointly considers field data availability, operating-condition differences, model accuracy, feature contribution, and robustness. In addition to comparing prediction accuracy, Monte Carlo random sampling is introduced to evaluate the stability of model performance under different data partitions.

The fundamental advantage of hybrid models lies in using physical laws to constrain model behavior, ensuring predictions remain within physically realistic bounds, which enhances robustness and interpretability. Simultaneously, real data correct model biases, allowing adaptive compensation for errors arising from mechanistic simplifications or unknown factors, thus improving accuracy. This approach is particularly suitable for complex industrial processes where physical mechanisms are partially understood, but key parameters exhibit strong time-variance or the micro-mechanisms remain unclear, such as dynamic multiphase interface evolution or reservoir heterogeneity under complex geological conditions. Selecting a modeling method essentially involves a trade-off between “prior knowledge” and “data availability”. In scenarios characterized by incomplete knowledge and imperfect data, pure MModels fail due to parameter inaccuracies, while purely DAModels collapse due to data deficiencies and physical evolution. Consequently, hybrid models represent the optimal pathway, effectively utilizing incomplete knowledge to compensate for imperfect data while using imperfect data to correct incomplete knowledge, thereby better integrating the advantages of both approaches. The calibration mechanism generates extensive high-quality state data covering a sufficient range, effectively expanding the production prediction scope of machine learning models. Furthermore, this mechanism provides rich flow-safety simulation data, with simulation results supplementing and validating the reliability of machine learning predictions [20].

Although domestic and international scholars have conducted extensive research on VFM for oil wells and gas wells, the existing literature predominantly focuses on offline data analysis or algorithmic improvements for single models, lacking research on real-time system integration for the entire wellhead production process in gas fields. Compared with previous studies, the contribution of this work lies in establishing a comparative VFM framework that includes four choke-based mechanistic piecewise linear models and six machine learning models, identifying their applicable operating conditions, interpreting feature importance using SHAP analysis, and verifying robustness through Monte Carlo random sampling. Addressing this gap, this study aims to establish VFM tailored for gas-field wellhead production processes. By transcending the limitations of traditional metering methods and integrating the advantages of hybrid modeling, this work seeks to provide new theoretical references and technical paradigms for digital construction and intelligent metering in gas fields.

1.3. Contributions

This study makes the following primary contributions to VFM models and methods for conventional gas wells:

(1) A hybrid VFM method integrating piecewise linear regression with machine learning based on choke models is proposed. Wellhead metering schemes adaptable to diverse production conditions and data types are developed, addressing limitations of traditional single-system approaches and establishing a scalable VFM architecture.
(2) Performance variation patterns of piecewise linear regression models are revealed, clarifying the advantageous distribution of different models under specific operating conditions. These findings provide empirical evidence and decision support for selecting optimal modeling strategies based on operational characteristics during gas-field production.
(3) SHAP analysis quantifies feature contributions within DAModels, while Monte Carlo random sampling assesses model robustness. This work offers a scientific basis for feature-engineering optimization and model selection, thereby enhancing the interpretability and reliability of data-driven approaches.

1.4. Paper Organization

The remainder of this paper is organized as follows. Section 2 describes the VFM system architecture and its application. Section 3 constructs the MModels. Section 4 develops the DAModels. Section 5 presents the testing and analysis of both the MModels and DAModels. Finally, Section 6 provides concluding remarks.

2. VFM System Architecture and Application

2.1. VFM Technology

Gas-field wellhead VFM represents a digital measurement methodology integrating computer technology with fluid hydraulic and thermal simulation techniques. By acquiring key process parameters such as well pressure, temperature, and choke valve opening from oil and gas data platforms, this technique estimates well flow rates and flow regimes, providing precise flow value assessments for wellheads. Based on calculation principles, implementation approaches fall into two categories: MModels and DAModels.

2.1.1. MModels

MModel-based metering predicts flow rates by establishing hydraulic and thermal multiphase-flow models for target pipelines, utilizing acquired production data as feature input vectors. Depending on the parameter acquisition location, MModels include choke models and wellbore models. Given the field constraints on parameter acquisition, obtaining internal wellbore parameters remains difficult; consequently, MModels primarily employ choke models. Figure 1 illustrates the VFM workflow based on the choke model. The choke serves as a throttling element located at the surface or downhole of gas wells. Its core function involves controlling fluid flow rate and stabilizing bottomhole pressure by restricting the flow cross-sectional area. The choke model fundamentally establishes a quantitative relationship among choke size, fluid parameters (pressure, temperature, and composition), and flow rate, grounded in throttling flow laws. As fluid passes through the choke, the sudden reduction in flow area causes a sharp velocity increase, converting pressure energy into kinetic energy and generating a throttling effect. Focusing on this “local throttling effect”, the choke model characterizes energy conversion patterns at the choke location, transforming measurable upstream and downstream parameters (pressure and temperature) into flow rates. This approach serves as the direct calculation basis for the “wellhead flow rate” in VFM applications.

2.1.2. DAModels

Figure 2 illustrates the data-driven VFM workflow. This approach relies on data training to establish a mapping between sensor input features and output flow rates, employing machine learning algorithms to infer single-well production. The process begins by partitioning the historical database into training, validation, and testing sets. Standardized preprocessing steps, including denoising, normalization, and outlier removal, eliminate dimensional discrepancies and noise interference before model training and parameter optimization. During training, the model learns the high-dimensional nonlinear mapping between sensor inputs and target flow rates through iterative learning, condensing the information in the raw data into an abstract function. Performance evaluation and parameter updates on the validation set are used to check generalizability. The trained model is then deployed to the production site, where it ingests real-time data streams and outputs predicted flow rates without explicit physical equation constraints. Furthermore, the system maintains a data-retention mechanism that feeds new real-time data back into the historical database, providing continuous support for subsequent model updates and adaptive evolution.

2.2. Overall Architecture of the VFM System

The conventional gas wellhead VFM system employs a modular architecture to achieve automated and intelligent gas-well flow prediction. The gas-field VFM system architecture based on the combination of semi-empirical and data-driven approaches is shown in Figure 3. The core of the system consists of three modules: (i) data preprocessing; (ii) model training, fitting, and correction; and (iii) metering. By incorporating both semi-empirical models and data-driven models, the system forms a multi-method and multi-model intelligent metering framework, which can adapt to different production conditions and data characteristics, thereby improving prediction accuracy and system robustness.

Field transmission data include key process parameters, such as tubing pressure, metering pressure, metering temperature, valve position indication, downstream pressure, downstream temperature, date, and time. The data preprocessing module adopts a multi-method collaborative data-quality-control strategy. Missing values are intelligently filled using time-series interpolation to ensure data continuity. For outliers, Z-score statistical detection is applied, and domain knowledge is used to determine data validity. Invalid data are removed, while valid abnormal values are corrected using interpolation. This module provides a high-quality and complete data foundation for subsequent multi-model prediction, ensuring optimal input data conditions for different prediction models.
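The preprocessing steps described above can be sketched as follows. This is a minimal illustration rather than the system's actual implementation: it assumes evenly spaced samples (so linear interpolation stands in for time-series interpolation), a Z-score threshold of 3, and no domain-knowledge validity check. The function names `interpolate_missing`, `zscore_outliers`, and `clean_series` are hypothetical.

```python
from statistics import mean, pstdev


def interpolate_missing(values):
    """Fill None gaps by linear interpolation between the nearest known samples.

    Assumes evenly spaced samples; leading or trailing gaps take the nearest
    known value.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("series has no valid samples")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


def zscore_outliers(values, threshold=3.0):
    """Return the indices whose absolute Z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


def clean_series(values, threshold=3.0):
    """Fill gaps, flag Z-score outliers, then re-interpolate the flagged points.

    The domain-knowledge validity check mentioned in the text is not modelled.
    """
    filled = interpolate_missing(values)
    flagged = set(zscore_outliers(filled, threshold))
    masked = [None if i in flagged else v for i, v in enumerate(filled)]
    return interpolate_missing(masked)
```

For example, `clean_series([10.0] * 19 + [100.0])` flags the final spike and replaces it with the nearest valid value. In practice, timestamps would drive the interpolation weights, and flagged points would be reviewed against operating records before correction.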

The model training, fitting, and correction module adopts a dual-model parallel architecture rather than a single mathematical fusion equation. The semi-empirical component and the data-driven component are independently trained and evaluated in the underlying system architecture, forming a complementary prediction foundation and providing dual support for subsequent intelligent scheduling. Based on physical principles and fluid-dynamics laws, the semi-empirical component adopts piecewise linear regression. According to data characteristics, regression strategies such as valve-position segmentation and temperature-difference segmentation are adaptively selected, and fitting parameters are optimized using intelligent optimization methods to establish prediction models with clear physical significance. Meanwhile, the data-driven component uses Monte Carlo random sampling to divide the dataset into training and testing sets. Six mainstream machine learning algorithms, including LightGBM, XGBoost, and random forest, are integrated. Through grid search and cross-validation, hyperparameter configurations are optimized to uncover complex nonlinear relationships and feature interaction effects in the data, obtaining optimal data-driven prediction models for different application scenarios.
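The Monte Carlo random sampling used for dataset partitioning can be sketched as repeated random train/test splits, with the spread of the resulting metric summarized by its coefficient of variation. The sketch below is an illustration under stated assumptions: the split fraction, number of rounds, and the stand-in model (a constant mean predictor) are hypothetical and are not the six algorithms used in the study.

```python
import random
from statistics import mean, pstdev


def monte_carlo_scores(samples, fit, score, n_rounds=100, test_fraction=0.2, seed=0):
    """Repeat a random train/test split and collect the test score of each round.

    samples: list of (features, flow_rate) pairs
    fit:     callable(train) -> model
    score:   callable(model, test) -> float
    """
    rng = random.Random(seed)
    n_test = max(1, round(len(samples) * test_fraction))
    scores = []
    for _ in range(n_rounds):
        shuffled = rng.sample(samples, k=len(samples))  # random permutation
        test, train = shuffled[:n_test], shuffled[n_test:]
        scores.append(score(fit(train), test))
    return scores


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    return pstdev(values) / mean(values)


# Stand-in model for illustration only: predicts the mean training flow rate.
def fit_mean(train):
    return mean(q for _, q in train)


def mean_relative_error(model, test):
    return mean(abs(model - q) / abs(q) for _, q in test)
```

A small coefficient of variation across rounds indicates that the metric is insensitive to how the data happen to be partitioned, which is the sense of robustness used here. Any regressor with a fit/score interface could replace the stand-in.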

The metering module adopts an intelligent adaptive metering strategy and provides both machine learning models and semi-empirical models for flow prediction. By analyzing the feature distribution of real-time field data, operating-condition stability, and data quality, the system automatically selects the optimal prediction model. For single-well metering, the system directly outputs the predicted flow rate. For multi-well metering systems, given the total flow-rate data, an intelligent normalization algorithm is used to collaboratively correct and allocate the predicted flow rates of individual wells. This approach ensures total flow conservation while optimizing the prediction accuracy of each individual well.
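The text does not specify the normalization algorithm used for multi-well allocation. One simple scheme that satisfies total flow conservation is proportional scaling, sketched below as an assumption; the function name `allocate_to_total` and the well labels are hypothetical.

```python
def allocate_to_total(predicted, measured_total):
    """Scale per-well predictions so that they sum to the measured total flow.

    predicted:      dict mapping well name -> predicted flow rate
    measured_total: metered total flow rate for the same wells
    """
    total_predicted = sum(predicted.values())
    if total_predicted <= 0:
        raise ValueError("sum of predicted flow rates must be positive")
    scale = measured_total / total_predicted
    return {well: rate * scale for well, rate in predicted.items()}


# Predictions of 40 and 60 against a metered total of 110 are both scaled by 1.1.
corrected = allocate_to_total({"W1": 40.0, "W2": 60.0}, 110.0)
```

Proportional scaling spreads the mismatch in proportion to each well's predicted rate. A scheme that weights wells by the estimated uncertainty of their models would be a natural refinement, but it is not described in the text.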

2.3. Single-Well Model Allocation and Deployment Strategy

Considering that different gas wells may differ significantly in production conditions, data acquisition completeness, and sample size, this study constructs a single-well-oriented model allocation and deployment strategy. This strategy does not couple the semi-empirical model and the data-driven model into a single mathematical expression. Instead, it first determines the candidate model set according to the availability of historical data for each well, then selects the optimal model for that well through offline training and validation, and finally deploys the selected model for online flow-rate prediction. Through this strategy, the system can maintain physical interpretability while also considering the capability of representing complex nonlinear relationships, thereby adapting to different gas-well data conditions and metering requirements.

For the i-th gas well, its historical dataset is defined as:

D_{i} = \{(x_{i,t}, q_{i,t})\}_{t=1}^{N_{i}}    (1)

where D_{i} denotes the historical modeling dataset of the i-th gas well; x_{i,t} denotes the input feature vector of the i-th gas well at time t; q_{i,t} denotes the measured flow rate at the corresponding time, L/min; and N_{i} denotes the total number of historical samples available for modeling this well.

To characterize the data completeness of different gas wells, the available feature set of the i-th gas well is defined as:

A_{i} = \{P_{1}, P_{2}, T_{1}, T_{2}, V, \Delta P, \Delta T, \dots\}    (2)

\Delta P = P_{1} - P_{2}    (3)

\Delta T = T_{1} - T_{2}    (4)

where P_{1} denotes the pressure upstream of the choke valve, bar; P_{2} denotes the pressure downstream of the choke valve, bar; T_{1} denotes the temperature upstream of the choke valve, K; T_{2} denotes the temperature downstream of the choke valve, K; V denotes the valve position indication; \Delta P denotes the pressure differential, bar; and \Delta T denotes the temperature differential, K.

Based on single-well data availability, the candidate model set for the i-th gas well is defined as:

\Phi_{i} = \{m \mid R(m) \subseteq A_{i}\}    (5)

where \Phi_{i} denotes the candidate model set that can be constructed for the i-th gas well; m denotes a candidate model, which can be a semi-empirical piecewise model or a data-driven model; and R(m) denotes the input feature set required by model m.

For gas wells with pressure-differential, valve-position-indication, downstream-temperature, or temperature-differential information, a pressure-differential piecewise model, valve-position-indication piecewise model, downstream-temperature piecewise model, and temperature-differential piecewise model can be constructed, respectively. For gas wells with relatively complete production parameters and sufficient sample size, data-driven models such as random forest, LightGBM, XGBoost, and CatBoost can also be further established. Therefore, the candidate model set of the i-th gas well can be expressed as:

\Phi_{i} = \{M_{\Delta P}, M_{V}, M_{T_{2}}, M_{\Delta T}, M_{RF}, M_{LGBM}, M_{XGB}, M_{Cat}, \dots\} \cap \varphi_{i}    (6)

where M_{\Delta P} denotes the pressure-differential piecewise model; M_{V} denotes the valve-position-indication piecewise model; M_{T_{2}} denotes the downstream-temperature piecewise model; M_{\Delta T} denotes the temperature-differential piecewise model; M_{RF} denotes the random forest model; M_{LGBM} denotes the LightGBM model; M_{XGB} denotes the XGBoost model; M_{Cat} denotes the CatBoost model; and \varphi_{i} denotes the model set that can be established for the i-th gas well under actual data conditions.

In the offline stage, each model in the candidate model set is trained and validated separately, and unified evaluation metrics are used to measure prediction performance. The optimal deployment model for the i-th gas well is defined as:

m_{i}^{*} = \arg\min_{m \in \Phi_{i}} E_{i}(m)    (7)

where m_{i}^{*} denotes the final selected optimal model for the i-th gas well, and E_{i}(m) denotes the validation error or comprehensive evaluation metric of model m on the i-th gas well.
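Equations (5) and (7) amount to a subset filter followed by an arg-min, which can be sketched as follows. The required-feature sets in `REQUIRED_FEATURES`, the model labels, and the validation errors are illustrative assumptions, not values from the study.

```python
# Hypothetical R(m): the input features each candidate model needs.
REQUIRED_FEATURES = {
    "M_dP": {"dP"},
    "M_V": {"V"},
    "M_T2": {"T2"},
    "M_dT": {"dT"},
    "M_XGB": {"P1", "P2", "T1", "T2", "V"},
}


def derive_features(measured):
    """Build A_i: add dP and dT when both of their inputs are measured (Eqs. (3)-(4))."""
    available = set(measured)
    if {"P1", "P2"} <= available:
        available.add("dP")
    if {"T1", "T2"} <= available:
        available.add("dT")
    return available


def candidate_models(available, required=REQUIRED_FEATURES):
    """Eq. (5): keep every model whose required features are a subset of A_i."""
    return {m for m, needs in required.items() if needs <= available}


def select_model(candidates, validation_error):
    """Eq. (7): pick the candidate with the smallest validation error."""
    if not candidates:
        raise ValueError("no model can be built from the available features")
    return min(sorted(candidates), key=lambda m: validation_error[m])


# A well without temperature sensors can only support the dP and V models.
available = derive_features({"P1", "P2", "V"})
candidates = candidate_models(available)
best = select_model(candidates, {"M_dP": 0.08, "M_V": 0.05})
```

Sorting the candidates before taking the minimum makes tie-breaking deterministic. A sample-size threshold for admitting data-driven models, which the text mentions qualitatively, could be added as a further filter.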

The core of the above model allocation and deployment strategy is to conduct differentiated modeling and optimal allocation according to the data conditions of different gas wells. For gas wells with complete temperature, pressure, and valve-position information, both semi-empirical models and machine learning models can be constructed simultaneously, and the optimal scheme can be selected through offline validation. For gas wells lacking downstream temperature but retaining pressure differential or valve position indication, the corresponding semi-empirical sub-models can be constructed as a degraded alternative. For gas wells with relatively complete production parameters and sufficient sample size, data-driven models can be preferentially adopted.

2.4. VFM System Application Scenarios

2.4.1. New Well VFM

Figure 4 illustrates the application scheme for new well VFM. Given the absence of historical metering data for new wells, the system initializes models using historical data from mature wells equipped with identical choke valves and similar gas compositions. This approach enables flow rate estimation for new wells, though further validation of accuracy remains necessary.

2.4.2. Mature Well VFM

Figure 5 depicts the application scheme for mature well VFM. Models for mature wells are initialized primarily through direct acquisition of historical production data. Where such data are unavailable, the system alternatively utilizes historical metering data from wells sharing identical choke specifications and comparable gas compositions.

2.4.3. VFM for Wells with Rotational Metering Devices

Certain well stations employ rotational metering devices, preventing continuous data transmission from individual wells. Figure 6 presents the corresponding VFM application scheme. Rotational metering provides partial historical data to initialize single-well models. During the metering period for Well-1, the system applies VFM to Well-2 while using actual measurements from Well-1 for model calibration. Conversely, during the metering period for Well-2, the system applies VFM to Well-1 while calibrating with actual measurements from Well-2. This mechanism establishes a closed-loop correction cycle based on rotational field measurements, ensuring continuous and accurate flow prediction for all wells despite the lack of continuous metering infrastructure.
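The closed-loop correction cycle can be illustrated with a deliberately simple calibration rule. The text does not state how calibration is performed, so the sketch below assumes a per-well multiplicative correction factor updated by exponential smoothing whenever that well is on the physical meter. The class name `RotationalCalibrator` and the `smoothing` parameter are hypothetical.

```python
class RotationalCalibrator:
    """Track one multiplicative correction factor per well (assumed scheme)."""

    def __init__(self, wells, smoothing=0.5):
        self.factor = {well: 1.0 for well in wells}
        self.smoothing = smoothing

    def step(self, raw_predictions, metered_well, measured_rate):
        """Process one time step.

        raw_predictions: dict well -> uncorrected VFM prediction
        metered_well:    the well currently routed to the physical meter
        measured_rate:   the meter reading for that well
        Returns dict well -> reported flow rate.
        """
        # Calibrate the metered well's factor against the actual measurement.
        ratio = measured_rate / raw_predictions[metered_well]
        old = self.factor[metered_well]
        self.factor[metered_well] = (1 - self.smoothing) * old + self.smoothing * ratio

        # Report the measurement for the metered well and the corrected
        # VFM prediction for every other well.
        reported = {}
        for well, rate in raw_predictions.items():
            if well == metered_well:
                reported[well] = measured_rate
            else:
                reported[well] = rate * self.factor[well]
        return reported
```

With two wells, alternating `metered_well` between "Well-1" and "Well-2" reproduces the cycle described above: each well's factor is refreshed while it is metered and applied while it is not. Retraining the underlying model on the new measurements would be an alternative to a scalar correction.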

3. Construction of Mechanistic VFM Models

3.1. MModel Selection

The statistical results of parameters for each formula in the choke models are presented in Table 1. Upstream pressure, downstream pressure, choke size, and gas–liquid ratio represent the most frequently utilized parameters. Regardless of the specific model selected, choke size or its flow coefficient remains an essential input. However, direct field measurement of these two critical parameters proves impossible. Consequently, the primary challenge in achieving accurate flow prediction via MModels lies in the effective and precise fitting of these parameters.

Among the candidate choke models listed in Table 1, the Valve Sizing model was selected in this study because its required parameters are most consistent with the available field data. Specifically, the field dataset contains upstream pressure, downstream pressure, upstream temperature, gas specific gravity, and valve position-related information. By contrast, several other models require parameters such as fixed choke size, gas–liquid ratio, or condensate–gas ratio, which were not continuously measured or were unavailable in the field dataset. Therefore, directly applying these models would introduce additional empirical assumptions and uncertainty. Another important reason for selecting the Valve Sizing model is that it introduces the valve flow coefficient, CV, to characterize the throttling capacity of the choke valve. Although CV cannot be directly measured in the field, it can be fitted using field-accessible variables, such as valve position indicator, downstream temperature, temperature differential, and pressure differential. This is consistent with the objective of this study, namely, to construct mechanistic VFM models by combining choke-flow equations with piecewise linear regression for CV estimation. Furthermore, the Valve Sizing model is physically suitable for gas flow through a wellhead choke valve because it is based on compressible-flow and critical-flow principles. It can account for the effects of pressure ratio, upstream pressure, upstream temperature, gas specific gravity, and valve throttling capacity on gas flow rate. Thus, compared with the other candidate models, the Valve Sizing model provides a better balance between field-data availability, physical interpretability, and engineering applicability.

A Valve Sizing formula was selected as the MModel based on available field data types, as shown in Equations (8) and (9). These equations derive from principles of isentropic flow and critical flow. For compressible fluids such as natural gas (NG), flow rate depends not only on the pressure differential but, more critically, on the ratio of downstream to upstream absolute pressure. When this ratio falls below a critical threshold (p_{2} < 0.5 p_{1} in Equation (8)), gas velocity at the minimum cross-section of the flow path reaches local sonic speed, and the flow rate reaches its maximum for the given upstream conditions.

Q_{g} = 0.471 N_{2} C_{V} p_{1} \sqrt{\frac{1}{G_{g} T_{1}}}, \quad p_{2} < 0.5 p_{1}

(8)

Q_{g} = N_{2} C_{V} p_{1} \left(1 - \frac{2 \Delta p}{3 p_{1}}\right) \sqrt{\frac{\Delta p}{p_{1} G_{g} T_{1}}}, \quad p_{2} > 0.5 p_{1}

(9)

where Q_{g} denotes the gas flow rate, std L/min; C_{V} represents the flow coefficient; p_{1} indicates the upstream pressure, bar; p_{2} denotes the downstream pressure, bar; T_{1} signifies the upstream temperature, K; G_{g} stands for the gas specific gravity; N_{2} = 6950 is the numerical constant for these units; and \Delta p = p_{1} - p_{2} refers to the pressure drop, bar.
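As an illustration of how Equations (8) and (9) can be evaluated, the following minimal Python sketch switches between the critical and subcritical branches according to the pressure ratio. The function name `gas_flow_rate`, the input checks, the treatment of the exact boundary p_{2} = 0.5 p_{1}, and the operating point are assumptions made for this example and do not come from the field implementation.

```python
import math

N2 = 6950.0  # constant quoted in the text for std L/min, bar and K


def gas_flow_rate(cv, p1, p2, t1, gg, n2=N2):
    """Return Q_g in std L/min from Equations (8) and (9).

    cv: flow coefficient; p1, p2: upstream and downstream absolute pressure, bar;
    t1: upstream temperature, K; gg: gas specific gravity.
    """
    if not (p1 > 0 and 0 <= p2 <= p1 and t1 > 0 and gg > 0 and cv >= 0):
        raise ValueError("expected p1 > 0, 0 <= p2 <= p1, t1 > 0, gg > 0, cv >= 0")
    if p2 < 0.5 * p1:
        # Equation (8): critical flow, independent of downstream pressure.
        return 0.471 * n2 * cv * p1 * math.sqrt(1.0 / (gg * t1))
    # Equation (9): subcritical flow. The case p2 == 0.5 * p1 is assigned here
    # by assumption, since the text leaves the boundary itself unspecified.
    dp = p1 - p2
    return n2 * cv * p1 * (1.0 - 2.0 * dp / (3.0 * p1)) * math.sqrt(dp / (p1 * gg * t1))


# Hypothetical operating point, for illustration only.
q_critical = gas_flow_rate(cv=1.2, p1=80.0, p2=30.0, t1=310.0, gg=0.6)
q_boundary = gas_flow_rate(cv=1.2, p1=80.0, p2=40.0, t1=310.0, gg=0.6)
```

At p_{2} = 0.5 p_{1}, Equation (9) reduces to (2/3)\sqrt{0.5} ≈ 0.4714 times N_{2} C_{V} p_{1} \sqrt{1/(G_{g} T_{1})}, so the two branches agree at the boundary to within the rounding of the coefficient 0.471.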

3.2. Piecewise Linear Regression Model

Under ideal conditions, the flow coefficient CV, as an inherent geometrical property of the choke valve, should be determined from the manufacturer’s characteristic curve. However, the conventional gas wells investigated in this study have been in operation for many years. Due to long-term erosion by high-pressure gas flow, the choke valves may suffer from severe wear, and the actual effective CV may deviate substantially from the factory-calibrated theoretical value. Moreover, the original manufacturer CV curves are generally unavailable in the field records. Therefore, this study makes a compromise between theoretical rigor and engineering practicality. Based on the Joule–Thomson effect during throttling, downstream thermodynamic response parameters are used as indirect features to infer the actual flow capacity of the valve. In this context, the fitted CV becomes a “lumped equivalent flow coefficient” that absorbs thermodynamic fluctuations and valve-wear effects. Although this treatment sacrifices part of the strict mechanistic interpretability, it effectively overcomes the engineering obstacle caused by missing key equipment data.

Based on the above consideration, a piecewise linear regression model is introduced to infer CV from a single feature variable and subsequently calculate the flow rate using the Valve Sizing equation. The model aims to capture the nonlinear relationship between thermodynamic parameters and the equivalent CV through piecewise mapping. Its general expression is given in Equation (10).

y = \alpha + \beta x + \sum_{j = 1}^{k - 1} \gamma_{j} (x - c_{j})_{+}, \quad (x - c_{j})_{+} = \begin{cases} x - c_{j}, & \text{if } x > c_{j} \\ 0, & \text{otherwise} \end{cases}

(10)

where y represents the predicted value; x denotes the feature variable; k indicates the number of segments; c_{j} specifies the location of the j-th breakpoint; \alpha defines the intercept of the first segment; \beta represents the slope of the first segment; and \gamma_{j} denotes the slope change at the j-th breakpoint.

Piecewise linear regression constitutes a global optimization problem. Given a fixed number of segments, the method identifies optimal breakpoint locations to minimize the sum of squared residuals. Figure 7 illustrates the characteristics of piecewise fitting. Observation reveals that fitting performance improves as the number of segments increases; however, excessive segmentation leads to overfitting. While such models fit individual data points precisely, they lose generalizability, rendering their predictive value negligible. Consequently, this study adopts a two-segment continuous piecewise linear regression (k = 2). Under this fixed segmentation condition, the least squares criterion jointly estimates the internal breakpoint c_{1} along with the slopes and intercepts for both segments from the data.
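One simple way to carry out the joint estimation for k = 2 is to scan candidate breakpoints and solve an ordinary least-squares problem for each candidate. The sketch below takes that route in plain Python. The grid scan, the helper names, the minimum number of points per segment, and the synthetic data are assumptions made for illustration; the optimizer actually used in this study is not specified here and may differ.

```python
def hinge(x, c):
    """Positive part (x - c)_+ used in Equation (10)."""
    return x - c if x > c else 0.0


def solve_3x3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for k in range(col, 4):
                m[r][k] -= factor * m[col][k]
    sol = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        tail = sum(m[r][k] * sol[k] for k in range(r + 1, 3))
        sol[r] = (m[r][3] - tail) / m[r][r]
    return sol


def fit_fixed_breakpoint(xs, ys, c):
    """Least-squares (alpha, beta, gamma) and SSE for a fixed breakpoint c."""
    rows = [(1.0, x, hinge(x, c)) for x in xs]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    alpha, beta, gamma = solve_3x3(ata, aty)
    sse = sum((y - alpha - beta * x - gamma * hinge(x, c)) ** 2 for x, y in zip(xs, ys))
    return (alpha, beta, gamma), sse


def fit_two_segment(xs, ys, n_candidates=200, min_points=2):
    """Scan candidate breakpoints c_1 and keep the one with the smallest SSE."""
    lo, hi = min(xs), max(xs)
    best = None
    for i in range(1, n_candidates):
        c = lo + (hi - lo) * i / n_candidates
        n_right = sum(1 for x in xs if x > c)
        if n_right < min_points or len(xs) - n_right < min_points:
            continue  # keep both segments identifiable
        params, sse = fit_fixed_breakpoint(xs, ys, c)
        if best is None or sse < best[2]:
            best = (c, params, sse)
    return best


# Synthetic noise-free data with a known breakpoint at x = 6 (illustration only).
xs = [0.5 * i for i in range(41)]
ys = [1.0 + 2.0 * x - 1.5 * hinge(x, 6.0) for x in xs]
c1, (alpha, beta, gamma), sse = fit_two_segment(xs, ys)
```

Because the hinge term keeps the fitted line continuous at c_{1}, only three coefficients are estimated per candidate. The sketch assumes at least two distinct x values on each side of the breakpoint; heavily duplicated inputs would need an additional guard against a singular system.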

3.3. Application Process of MModel-Based VFM

Figure 8 depicts the application process of MModel-based VFM constructed upon mechanistic principles. This workflow employs the choke model as the flow calculation equation. Based on actual collected field data types, it offers multiple adaptive modeling methods for CV fitting to enhance model applicability and accuracy under varying data conditions. Depending on data completeness, three regression strategies are available: multivariate regression, temperature-difference-based piecewise regression, and valve-position-based piecewise regression. To optimize model parameters, parameter estimation strategies such as the least squares method, particle swarm optimization, or differential evolution algorithms can be implemented to achieve high fitting precision. Model performance undergoes a comprehensive evaluation using metrics including the coefficient of determination (R²), root mean square error (RMSE), and mean absolute error (MAE). Ultimately, the optimal CV regression model is selected according to specific application scenario requirements, providing a reliable basis for practical engineering applications.
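For reference, the evaluation metrics named above follow their standard definitions, which can be written as a short Python sketch. The function name `regression_metrics` and the sample values are hypothetical, and the mean relative error is included only because later sections report errors as percentages; it is assumed here to be taken against the measured value.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, R2 and mean relative error (%) for paired values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "MAE": sum(abs(e) for e in errors) / n,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "R2": 1.0 - (mse * n) / ss_tot,
        # Relative error is taken against the measured value (assumed non-zero).
        "MRE_percent": 100.0 * sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n,
    }


# Hypothetical measured and predicted flow rates, for illustration only.
metrics = regression_metrics([10.0, 20.0, 30.0, 40.0], [12.0, 18.0, 33.0, 39.0])
```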

4. Construction of Data-Driven VFM Models

4.1. DAModels

Given the significant nonlinear relationship between gas well production and multiple feature variables, this study employs data-driven machine learning methods to construct VFM models for gas well production. Specifically, six distinct machine learning models were selected for conventional NG well production prediction, as shown in Table 2.

The quantity and setting of hyperparameters directly influence model fitting capability, generalization performance, and computational efficiency. Due to differences in algorithmic principles and structures, the function and sensitivity of hyperparameters vary significantly across models. This study optimizes hyperparameters via grid search within predefined parameter spaces. The typical hyperparameter search spaces were set as follows: number of trees [10, 500]; maximum depth/number of leaves [3, 150]; learning rate [0.01, 0.5]; feature/sample sampling ratio [0.5, 1.0]; and L1/L2 regularization coefficients [0, 10]. After optimization, the best hyperparameter configurations of each model on the internal construction set were obtained as follows.

LightGBM, an efficient implementation of gradient-boosting trees, employs histogram-based algorithms and leaf-wise growth strategies to enhance training speed and accuracy. The optimized configuration was as follows: learning rate 0.71; 36 leaves; a minimum of nine samples per leaf; feature sampling ratio 0.98; maximum bins 1023; and L1 and L2 regularization coefficients of 0.0015 and 5.5, respectively.

XGBoost adopts a regularized gradient-boosting framework. The optimized configuration was as follows: learning rate 0.047; 132 maximum leaves; subsample ratio 0.71; column sampling ratio by level 1.0; column sampling ratio by tree 0.76; minimum child weight 1.6; and L1 and L2 regularization coefficients of 0.75 and 0.092, respectively.

XGBoost-limitdepth emphasizes the combination of depth control and regularization to prevent overfitting. The optimized configuration was as follows: learning rate 0.36; maximum depth 5; minimum child weight 0.50; subsample ratio 0.57; column sampling ratios by level and by tree of 0.73 and 0.87; and L1 and L2 regularization coefficients of 0.0051 and 8.5, respectively.

Random forest, a bagging-based decision tree ensemble, reduces variance and improves robustness through bootstrap sampling and random feature subsets. The optimized configuration was as follows: six weak learners; feature subsampling ratio 0.87; and 29 maximum leaves.

Extra trees further randomizes split thresholds beyond random forest. The optimized configuration was as follows: four decision trees; feature subsampling ratio 1.0; and 20 maximum leaves.

CatBoost adopts a gradient-boosting framework utilizing ordered target encoding and symmetric tree structures. The optimized configuration was as follows: learning rate 0.034; 201 decision trees; and 26 early stopping rounds.
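The grid search with K-fold cross-validation described in this section can be sketched generically as follows. This is a minimal illustration in plain Python, not the code used in the study: the helper names are hypothetical, and a one-dimensional ridge regression with a single penalty `lam` stands in for the gradient-boosting and bagging models so that the example stays self-contained.

```python
import itertools
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices once and deal them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def grid_search_cv(fit, predict, grid, xs, ys, k=5, seed=0):
    """Return (best_params, best_mse) by exhaustive search with K-fold CV."""
    folds = k_fold_indices(len(xs), k, seed)
    names = sorted(grid)
    best_params, best_mse = None, float("inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_mse = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(xs)) if i not in held]
            model = fit([xs[i] for i in train], [ys[i] for i in train], **params)
            errors = [(predict(model, xs[i]) - ys[i]) ** 2 for i in held_out]
            fold_mse.append(sum(errors) / len(errors))
        mean_mse = sum(fold_mse) / len(fold_mse)
        if mean_mse < best_mse:
            best_params, best_mse = params, mean_mse
    return best_params, best_mse


# Toy stand-in model (hypothetical): 1-D ridge regression with penalty `lam`.
def fit_ridge(xs, ys, lam):
    x_mean, y_mean = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / (sxx + lam)
    return slope, y_mean - slope * x_mean


def predict_ridge(model, x):
    slope, intercept = model
    return intercept + slope * x


xs = [float(i) for i in range(30)]
ys = [3.0 * x + 2.0 for x in xs]  # synthetic noise-free data
best_params, best_mse = grid_search_cv(
    fit_ridge, predict_ridge, {"lam": [0.0, 1.0, 10.0]}, xs, ys
)
```

Each hyperparameter combination is scored only on folds held out from fitting, which is what keeps the selection from rewarding configurations that merely memorize the training data.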

4.2. Data-Driven VFM Application Workflow

Machine learning effectively captures complex nonlinear interactions among multiple features by automatically learning intricate data patterns. Its core advantage lies in optimizing feature weights and model parameters via algorithms such as decision trees and ensemble learning, thereby enhancing prediction accuracy and generalizability. Figure 9 illustrates the training and optimization workflow for the machine learning models constructed. Initially, historical production data undergo cleaning and preprocessing, encompassing outlier treatment, missing-value imputation, and feature standardization. Subsequently, Monte Carlo random sampling partitions the dataset into training and validation sets to ensure statistical reliability in model evaluation. Finally, the models are fitted on the training set and assessed on the validation set, with performance evaluated via root mean square error (RMSE), mean absolute error (MAE), and the coefficient of determination (R²). This methodology facilitates optimal model selection across diverse application scenarios, adapting to complex and variable NG production conditions and gas qualities, thus providing a high-precision, robust virtual metering solution for gas-well production dynamics prediction.

5. VFM Model Test Results

5.1. Data Cleaning and Partitioning

Before model construction, strict data cleaning and quality control were performed on the raw collected data. First, based on domain knowledge, redundant records collected during gas-well shut-in periods were removed, including 1325 invalid zero-flow records. Subsequently, Z-score statistical detection was conducted for continuous physical variables, including upstream pressure, downstream pressure, upstream temperature, downstream temperature, and measured instantaneous flow rate. The outlier threshold was set as |Z| > 3. A total of 2185 abnormal data points were identified and processed, accounting for 1.1% of the total data points. For these isolated abnormal values, linear interpolation using adjacent time points was applied for smoothing correction. After the above cleaning procedure, a high-quality dataset containing 21,990 valid historical production records was finally constructed. The statistical distribution of each feature is shown in Table 3.
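The Z-score screening and interpolation step can be illustrated for a single variable with the sketch below. It is a simplified reading of the procedure rather than the exact pipeline: the function name `clean_series`, the use of the population standard deviation, the handling of flagged points at the ends of the series, and the synthetic readings are all assumptions.

```python
import statistics


def clean_series(values, threshold=3.0):
    """Flag points with |Z| > threshold and replace them by linear interpolation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumption)
    if std == 0:
        return list(values), []
    flagged = [i for i, v in enumerate(values) if abs((v - mean) / std) > threshold]
    bad = set(flagged)
    cleaned = list(values)
    for i in flagged:
        # Nearest unflagged neighbours on each side of the abnormal point.
        left = next((j for j in range(i - 1, -1, -1) if j not in bad), None)
        right = next((j for j in range(i + 1, len(values)) if j not in bad), None)
        if left is None and right is None:
            continue
        if left is None:
            cleaned[i] = values[right]
        elif right is None:
            cleaned[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            cleaned[i] = values[left] + weight * (values[right] - values[left])
    return cleaned, flagged


# Synthetic pressure-like readings with one spike, for illustration only.
readings = [10.0, 10.2, 9.8, 10.1, 9.9] * 4
readings[7] = 50.0
cleaned, flagged = clean_series(readings)
```

In this example only the spike at index 7 is flagged, and it is replaced by the midpoint of its two neighbours. In very short series a single extreme value inflates the standard deviation enough to hide itself, so the threshold of 3 is meaningful only for reasonably long records.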

In the experimental design, to comprehensively and objectively evaluate the generalizability of the model, the full cleaned dataset was randomly divided into a training set and a test set at a ratio of 70%:30%, corresponding to 15,393 training samples and 6597 test samples, respectively. During model construction and optimization, the independence of the test set was strictly maintained. Only the 70% training data were used for hyperparameter optimization combined with K-fold cross-validation. After the optimal model configuration was determined, the model was finally fitted using the full training set. In addition, to reduce the evaluation bias caused by a single random data split, Monte Carlo sampling was adopted. By repeating the above random training–testing split 100 times, the robustness and prediction boundaries of the optimal models under different combinations of physical operating states were systematically evaluated.
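A minimal sketch of the random 70:30 partition and its Monte Carlo repetition is given below. The function name and the seeding scheme are assumptions for illustration; only the record count and the split ratio are taken from the text.

```python
import random


def train_test_split_indices(n_records, train_tenths=7, seed=0):
    """Randomly split record indices; integer arithmetic avoids float rounding."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    n_train = n_records * train_tenths // 10
    return indices[:n_train], indices[n_train:]


# 21,990 cleaned records split 70:30 gives 15,393 training and 6,597 test indices.
train_idx, test_idx = train_test_split_indices(21_990, seed=0)

# Monte Carlo repetition: one independent random split per seed.
splits = [train_test_split_indices(21_990, seed=s) for s in range(100)]
```

Using one seed per repetition makes every split reproducible, so a metric that looks unusual in one round can be traced back to the exact partition that produced it.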

5.2. Analysis of VFM Model Test Results Based on MModels

5.2.1. Fitting Function Analysis of Piecewise Linear Regression Models

In MModels, flow-rate calculation hinges on fitting the flow coefficient (CV). Parameters such as valve position indicator, downstream temperature, temperature differential, and pressure differential are acquired downstream of the choke valve following the throttling effect; thus, they indirectly reflect CV magnitude. This study employed piecewise linear regression to fit CV values using these four features individually (Figure 10). Results indicate that CV does not maintain a linear relationship with increasing valve opening. Once the valve opening reaches 100, primary factors influencing downstream temperature, temperature differential, pressure differential, and CV shift from the valve diameter to the wellhead tubing pressure and production rate. Consequently, linear fitting methods fail to effectively model CV under these conditions. Furthermore, data exhibit stronger nonlinearity when downstream temperature exceeds 25 °C, temperature differential falls below 7 °C, or pressure differential is less than 1.8 MPa, leading to significantly reduced fitting accuracy.

To determine a reasonable segment number k, fitting performance for cases with k ≥ 3 was further tested. The fitting results for k = 3 are shown in Figure 11. Under fully open operating conditions, gas-well flow behavior is strongly influenced by complex external boundary conditions, causing highly nonlinear, dense, scattered clusters in certain regions. When a three-segment fitting strategy was adopted, the algorithm attempted to minimize local residuals within these scattered regions, resulting in strongly oscillating non-monotonic piecewise lines in the pressure-differential model. Similar abrupt vertical-transition anomalies also appeared in the valve-position-indicator model. In the downstream-temperature model, the third segment evolved into an almost flat line segment with extremely low slope and limited extrapolation value, while its R² improved by only 0.01 compared with the two-segment fitting result. Although the temperature-differential model did not exhibit obvious visual distortion, its R² improvement was only 0.02. Furthermore, increasing the number of segments beyond three would similarly introduce oscillatory fitting behavior in densely distributed regions of the temperature-differential data. Therefore, considering physical consistency and engineering robustness, the final model in this study was limited to a two-segment continuous linear regression model.

5.2.2. Fitting Error Analysis of Piecewise Linear Regression Models

Figure 12 illustrates the fitting errors for the piecewise linear model on the training set. Overall, piecewise linear models based on single features demonstrate poor fitting accuracy. The temperature-differential model achieves the best overall performance, yet still yields errors exceeding 5% for 75% of samples and over 10% for 45% of samples. Mean absolute errors for models based on individual features are: temperature differential (12.94%), downstream temperature (34.66%), pressure differential (16.5%), and valve position indicator (10.08%). Restricting the analysis to samples with valve openings below 100 reveals the valve-position-indicator model as superior, achieving an R² of 0.979, RMSE of 1.794, MSE of 3.22, MAE of 1.13, and a mean relative error of 2.7%, with only isolated large errors. Thus, predicting CV via valve position indicator offers higher accuracy when valve opening remains below 100.

5.2.3. Test Result Analysis of Piecewise Linear Models

Figure 13 presents the flow-rate prediction results of the piecewise linear regression models on the test set, comprising 6589 samples. The first 3347 samples correspond to data collected at a valve opening of 100. Prediction accuracy is strongly related to the valve position indicator. For samples with a valve position equal to 100, the valve position indicator reaches its upper limit and cannot effectively reflect further variations in flow rate. Under this condition, downstream thermodynamic parameters become more informative. Among the tested mechanistic models, the downstream-temperature piecewise linear model performs best, yielding a mean error of 14.81%. Conversely, for samples with valve positions less than 100, changes in valve opening directly affect the effective flow area of the choke valve. Therefore, the valve-position piecewise linear model achieves the highest accuracy in this regime, with a mean error of 1.35%. Although the temperature-difference piecewise linear model does not provide the best performance in either individual subset, it exhibits the lowest overall mean test error of 4.59%, indicating the best comprehensive predictive performance across the entire test set.

5.3. VFM Model Test Results Based on Data-Driven Approaches

5.3.1. Feature Contributions

SHAP analysis was employed to interpret machine learning “black-box” models. Grounded in game theory, SHAP analysis quantifies feature contributions to production prediction by calculating Shapley values, assessing feature importance while providing both global importance metrics and local prediction explanations. The initial feature-importance analysis results using all input features are shown in Figure 14. Despite variations in feature contributions across different machine learning models due to distinct learning mechanisms and hyperparameter configurations, “Date” and “Valve Position Indication” consistently rank as the top two contributors in most models. At first glance, this result appears to be consistent with the actual production behavior of gas wells, because gas-well output generally declines with production time. However, further analysis reveals that tree-based models such as CatBoost, LightGBM, and random forest exhibit an abnormally high dependence on the “Date” feature, with its contribution exceeding 50% in some cases. This extremely high contribution indicates the strong interpolation capability of tree-based models when handling time-series data. Under random train–test partitioning, the models may exploit neighboring temporal indices, resulting in potential label leakage, rather than truly learning the physical mapping between gas-well operating states and flow rate.

To eliminate the spurious correlation caused by temporal interpolation, the Date feature was removed, and SHAP-based feature-contribution analysis was performed again. The results are shown in Figure 15. The comparison indicates that, after removing the temporal feature, the models shift their learning focus to physically meaningful driving factors. Valve position indication, pressure differential, and upstream temperature become the dominant features in multiple models. It is particularly noteworthy that, when the Date feature was included, pressure differential ranked among the least important features in LightGBM and random forest. After removing Date, however, pressure differential became one of the most influential features. This significant reconstruction of feature importance not only fundamentally reduces the risk of data leakage but also demonstrates that the models without temporal features can more accurately capture the hydrodynamic and thermodynamic mechanisms of the wellhead throttling process.
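To make the notion of a Shapley-value contribution concrete, the sketch below computes exact Shapley values for a small model by enumerating feature coalitions, with features outside a coalition replaced by a baseline value. This is a didactic simplification and not the SHAP library's tree-specific algorithm; the toy linear model, the baseline, and the single-sample contribution shares are hypothetical.

```python
import itertools
import math


def shapley_values(model, x, baseline):
    """Exact Shapley values of `model` at `x` relative to a `baseline` input."""
    n = len(x)

    def value(subset):
        # Features outside the coalition are replaced by their baseline value.
        return model([x[i] if i in subset else baseline[i] for i in range(n)])

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = math.factorial(size) * math.factorial(n - size - 1) / math.factorial(n)
            for subset in itertools.combinations(others, size):
                members = set(subset)
                phi[i] += weight * (value(members | {i}) - value(members))
    return phi


def contribution_shares(phi):
    """Share of each feature in the total absolute Shapley value."""
    total = sum(abs(p) for p in phi)
    return [abs(p) / total for p in phi]


# Hypothetical three-feature linear model, for illustration only.
def toy_model(v):
    return 2.0 * v[0] - 3.0 * v[1] + 0.5 * v[2] + 1.0


phi = shapley_values(toy_model, x=[4.0, 1.0, 10.0], baseline=[1.0, 2.0, 6.0])
shares = contribution_shares(phi)
```

For a linear model each Shapley value equals the coefficient times the feature's deviation from the baseline, and the values sum to the difference between the prediction and the baseline prediction. Exhaustive enumeration scales exponentially with the number of features, so it is practical only for small examples such as this one.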

5.3.2. Model Test Result—With Date Feature Included

To comprehensively evaluate the performance of various machine learning models for gas wellhead virtual metering tasks, this paper systematically tested six mainstream ensemble learning models: random forest, LightGBM, XGBoost, extra trees, CatBoost, and depth-limited XGBoost-limitdepth. Model performance was assessed using MAE, MSE, RMSE, and R². Figure 16 shows the fitting results on the training set. XGBoost outperformed other models with MSE = 0.08 × 10⁸, RMSE = 2843.7, MAE = 1696.83, and R² = 0.979. Among 15,393 samples, only seven exhibited relative errors exceeding 5%, with a mean relative error of 0.36%. LightGBM and XGBoost-limitdepth demonstrated performance close to XGBoost, both achieving R² values of 0.968. Although LightGBM yielded lower MSE, RMSE, and MAE overall compared to XGBoost-limitdepth, the latter achieved a smaller mean relative error (0.65%) and fewer high-error samples. Analysis suggests LightGBM fits most samples with high precision but deviates significantly under rare abnormal conditions. XGBoost-limitdepth effectively mitigates overfitting through depth restriction, trading slight average accuracy reduction for more stable error distribution and enhanced generalizability.

Figure 17 presents the machine learning model test results. Among 6597 test samples, XGBoost achieved the highest prediction accuracy with a mean relative error of only 0.23%. The remaining models ranked by performance from highest to lowest were LightGBM, XGBoost-limitdepth, CatBoost, random forest, and extra trees, consistent with the training set rankings. XGBoost effectively controls model complexity and suppresses overfitting via built-in L1 and L2 regularization terms. Its gradient-boosting architecture progressively learns complex nonlinear relationships and high-order interaction effects between production parameters (pressure, temperature, and valve position indicator) and output. Additionally, the model approximates the loss function using second-order Taylor expansion and adaptively adjusts weak learner weights, maintaining excellent prediction robustness, even amidst data noise and outliers.

5.3.3. Model Test Result—Date Feature Removed

Because tree-based models can easily exploit the Date feature for temporal interpolation, the test performance may be artificially inflated when Date is retained. Therefore, in this section, the Date feature was removed, and the machine learning models were retrained and retested using only physically meaningful production variables. The results show that, after removing the temporal feature, all models still maintain good fitting capability during the training stage, with R² values greater than 0.95.

The prediction results on the test set are shown in Figure 18. XGBoost achieves the lowest mean relative error of 5.39%. Compared with the model retaining the Date feature, its mean relative error increases by 5.16 percentage points. This change confirms that the extremely high accuracy of the original model was largely caused by excessive interpolation along the time series and, therefore, represents spurious accuracy. In contrast, the results obtained after removing the Date feature more realistically reflect the generalizability of the models based on physical production parameters. Further comparison of the evaluation metrics shows that XGBoost achieves the smallest mean relative error, while CatBoost obtains the highest R². This indicates that XGBoost provides more stable overall prediction under normal operating conditions, whereas CatBoost may have advantages in handling a small number of extreme operating conditions. Further analysis of the prediction deviations shows that the original historical data have an obvious sample gap in the production-rate interval of 80 × 10⁴–100 × 10⁴ m³/d. As a result, the models could not sufficiently learn the nonlinear mapping relationship in this interval, which is one of the important objective reasons for the increased local prediction error.

5.3.4. Model Robustness Analysis Based on Monte Carlo Sampling

Single train–test split methods exhibit significant limitations, as different data partitioning strategies substantially affect machine learning model performance evaluation, introducing considerable randomness and instability. This evaluation uncertainty primarily stems from variations in training set composition, which alter learned feature relationships. Consequently, a single split fails to comprehensively reflect true model generalizability under varying operating conditions or determine which partition best represents practical application performance. To address this issue, this study adopted Monte Carlo sampling, systematically evaluating model robustness through 100 random data splits (Figure 19). Across 100 groups of training and test sets generated by different partitions, the machine learning models displayed distinct robustness characteristics. The extra trees model exhibited substantial fluctuation in test performance, with an MSE ranging from 5.15 × 10⁸ to 1.30 × 10⁹, indicating significant sensitivity to data partitioning. In contrast, XGBoost demonstrated markedly narrower fluctuation ranges across multiple samplings, with coefficients of variation for all metrics remaining below 0.1, reflecting superior robustness.
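The coefficient of variation used as the robustness criterion can be computed as in the following sketch. The function names, the use of the sample standard deviation, and the metric values are assumptions for illustration and are not results from this study.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean of a metric across splits."""
    return statistics.stdev(values) / statistics.fmean(values)


def is_robust(metric_runs, limit=0.1):
    """True when every metric's coefficient of variation stays below `limit`."""
    return all(coefficient_of_variation(runs) < limit for runs in metric_runs.values())


# Hypothetical metric values from five repeated splits (not the paper's results).
stable = {
    "R2": [0.95, 0.96, 0.97, 0.96, 0.96],
    "RMSE": [2800.0, 2900.0, 2850.0, 2750.0, 2950.0],
}
unstable = {"MSE": [5.0e8, 9.0e8, 1.3e9, 6.0e8, 1.1e9]}
```

Here `is_robust(stable)` holds while `is_robust(unstable)` does not, since the spread of the second set is large relative to its mean. The ratio is meaningful only for metrics whose mean is well away from zero.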

However, when the Date feature is included, random Monte Carlo sampling still cannot fully eliminate temporal interpolation effects. Because test samples may have neighboring training samples in time, the narrow metric distributions in Figure 19 mainly indicate that the Date-related interpolation effect is stable across different random splits. Therefore, these results should not be regarded as sufficient evidence of model robustness under genuinely unseen operating conditions. To evaluate the true robustness of the models under reduced temporal leakage, Monte Carlo sampling was further conducted after removing the Date feature. The results are shown in Figure 20. Compared with Figure 19, the absolute performance metrics decrease to a more realistic level after removing the temporal interpolation effect, and the fluctuation ranges across different sampling rounds become wider. This moderate dispersion of performance reflects the increased difficulty of learning flow-rate prediction when the models rely only on complex physical operating variables rather than temporal dependence. Nevertheless, even under this stricter testing condition, the core metrics of models such as XGBoost remain within an engineering-acceptable range over 100 random tests, and no extreme performance collapse is observed. These results suggest that the revised models retain practical predictive capability after eliminating the main source of label leakage.

6. Conclusions

A piecewise linear regression MModel based on the choke valve model and multiple DAModels were constructed for conventional gas wellhead VFM systems. Using field data from a conventional gas well in the Southwest Oil and Gas Field, the predictive performance of the MModels and DAModels was systematically evaluated. The main conclusions are as follows:

(1): Guided by choke valve mechanism principles, four piecewise linear regression models were developed using the valve position indicator, temperature differential, pressure differential, and downstream temperature, respectively. Additionally, six data-driven machine learning models were established via grid search optimization, forming a comprehensive VFM model framework.
(2): The MModel analysis indicates that the temperature-differential piecewise model achieved optimal performance on both training and test sets (R² = 0.91, mean error 4.59%). Under conditions where valve opening equals 100%, the downstream temperature piecewise model yielded lower prediction errors (14.3%). Conversely, when valve opening remains below 100%, the valve-position-indicator piecewise model demonstrated the best fitting accuracy, with a mean error of only 1.35%.
(3): Feature importance analysis reveals that the Date feature contributes most significantly, exceeding 50% in CatBoost and random forest models. The valve position indicator follows, with contributions ranging between 14% and 25%. By contrast, Time feature contributions remain below 1% across all models. Production date and valve position indicator therefore exert the dominant influence on production prediction when the Date feature is retained.
(4): DAModels exhibit superior overall performance, with all models achieving R² values above 0.95 and 99% of samples showing prediction errors below 5%. The XGBoost model stands out, with an R² of 0.979 and a mean error of merely 0.13%. Monte Carlo random sampling validation confirms its minimal performance metric fluctuations, with coefficients of variation for all indicators remaining below 0.1, demonstrating exceptional prediction accuracy and robustness.
(5): Nevertheless, the proposed models still have limitations under changing operating conditions. The current evaluation is based on historical data from a specific gas well and mainly reflects the operating range covered by this dataset. When reservoir pressure, gas composition, choke characteristics, sensor conditions, or flow regimes change significantly, the relationship between input features and flow rate may shift. Under such circumstances, both mechanistic models and data-driven models may suffer from reduced accuracy. Therefore, for long-term field deployment, continuous performance monitoring, periodic model updating, and recalibration using newly collected production data are required to maintain prediction reliability. Future work will focus on adaptive updating strategies and broader validation using data from more wells and more diverse operating conditions.

Author Contributions

Conceptualization, M.W., Z.W. and J.Z.; methodology, M.W., Z.W. and J.Z.; software, J.L., G.C. and F.Q.; validation, J.L. and F.Q.; formal analysis, M.W., Z.W., G.C. and J.Z.; investigation, F.Q.; resources, F.Q. and Y.W.; data curation, F.Q. and Y.W.; writing—original draft preparation, M.W., Z.W., J.Z., Y.W., P.Z. and C.L.; writing—review and editing, J.Z.; visualization, M.W.; supervision, M.W.; project administration, J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

All data generated or analyzed during this study are included in this published article.

Conflicts of Interest

Authors Minhao Wang, Zhenjia Wang, Gangping Chen, Jian Luo and Fang Qin were employed by PetroChina Southwest Oil and Gas Field Company (China). The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Gong, J.; Kang, Q.; Wu, H.; Li, X.; Shi, B.; Song, S. Application and prospects of multi-phase pipeline simulation technology in empowering the intelligent oil and gas fields. J. Pipeline Sci. Eng. 2023, 3, 100127.
Rayner, D.; Warda, A.A.B.; Adriana, M.; Afra, A.; Amna, S.A.; Fatima, B.; Azubuike, O.; Steven, H.; Sara, B.; Lukas, B.; et al. Advanced Well Test Advisory System and Virtual Flow Metering to Improve Production Allocation and Well Surveillance. In Proceedings of the Abu Dhabi International Petroleum Exhibition and Conference, Abu Dhabi, United Arab Emirates, 4–7 November 2024.
Ursini, F.; Rossi, R.; Castelnuovo, L.; Perrone, A.; Bendari, A.; Pollero, M. The benefits of virtual meter applications on production monitoring and reservoir management. In Proceedings of the SPE Reservoir Characterisation and Simulation Conference and Exhibition, SPE, Abu Dhabi, United Arab Emirates, 17–19 September 2019.
Cheng, B.; Li, Q.P.; Wang, J.; Qing, W. Virtual subsea flow metering technology for gas condensate fields and its application in offshore China. In International Conference on Offshore Mechanics and Arctic Engineering; American Society of Mechanical Engineers: New York, NY, USA, 2018; Volume 8, p. 51296.
Bikmukhametov, T.; Jäschke, J. First principles and machine learning virtual flow metering: A literature review. J. Pet. Sci. Eng. 2020, 184, 106487.
Sandnes, A.T.; Grimstad, B.; Kolbjørnsen, O. Multi-task learning for virtual flow metering. Knowl.-Based Syst. 2021, 232, 107458.
Jadid, K.M. Performance Evaluation of Virtual Flow Metering Models and Its Application to Metering Backup and Production Allocation. Ph.D. Thesis, Louisiana State University and Agricultural & Mechanical College, Baton Rouge, LA, USA, 2017.
Baba, Y.D.; Chat, A.; Aliyu, A.; Okereke, N.N.; Ogunyemi, A.; Owolabi, J.O. Study of High Viscous Multiphase Flow Using OLGA Flow Simulator. J. Eng. Technol. 2019, 4, 8713.
Zhou, T.; Kabir, C.S.; Hoadley, S.F.; Hasan, A.R. Probing rate estimation methods for multiphase flow through surface chokes. J. Pet. Sci. Eng. 2018, 169, 230–240.
Nejatian, I.; Kanani, M.; Arabloo, M.; Bahadori, A.; Zendehboudi, S. Prediction of natural gas flow through chokes using support vector machine algorithm. J. Nat. Gas Sci. Eng. 2014, 18, 155–163.
Kaleem, W.; Tewari, S.; Fogat, M.; Martyushev, D.A. A hybrid machine learning approach based study of production forecasting and factors influencing the multiphase flow through surface chokes. Petroleum 2024, 10, 354–371.
Convery, O.; Smith, L.; Gal, Y.; Hanuka, A. Uncertainty quantification for virtual diagnostic of particle accelerators. Phys. Rev. Accel. Beams 2021, 24, 074602.
Mohammadmoradi, P.; Hessam, M.M.; Apostolos, K. Data-driven production forecasting of unconventional wells with apache spark. In Proceedings of the SPE Western Regional Meeting, Garden Grove, CA, USA, 22–26 April 2018.
Hotvedt, M.; Grimstad, B.; Imsland, L. A hybrid data-driven, mechanistic virtual flow meter—A case study. IFAC-PapersOnLine 2020, 53, 11692–11697.
Li, J.; Lin, Y.; An, C.F.; Li, X.; Xiao, T.H.; Hou, L.Y.; Qi, J.N.; Gong, J. KGPM: A Knowledge-Guided Predictive Model for Virtual Metering in Gas Wells. ACS Omega 2024, 9, 51570–51579.
Liang, B.; Liu, J.; Kang, L.X.; Jiang, K.; You, J.Y.; Jeong, H.; Meng, Z. A novel framework for predicting non-stationary production time series of shale gas based on BiLSTM-RF-MPA deep fusion model. Pet. Sci. 2024, 21, 3326–3339.
Hotvedt, M.; Bjarne, A.G.; Lars, S.I. Passive learning to address nonstationarity in virtual flow metering applications. Expert Syst. Appl. 2022, 210, 118382.
Mishra, S.; Parag, K.; Devina, R. Multiphase Virtual Flow Metering: A Step Change in Production Management. In Proceedings of the Offshore Technology Conference Asia, Kuala Lumpur, Malaysia, 27 February–1 March 2024.
Ghosh, D.; Hermonat, E.; Mhaskar, P.; Snowling, S.; Goel, R. Hybrid modeling approach integrating first-principles models with subspace identification. Ind. Eng. Chem. Res. 2019, 58, 13533–13543.
Mu, J.C.; Shane, M.A.; Jie, O.Y.; Wu, H.F. Single well virtual metering research and application based on hybrid modeling of machine learning and mechanism model. J. Pipeline Sci. Eng. 2023, 3, 100111.

Figure 1. VFM workflow based on the choke model.

Figure 2. Data-driven VFM workflow.

Figure 3. Gas-field VFM system architecture integrating mechanistic and data-driven approaches.

Figure 4. New well VFM scheme.

Figure 5. Mature well VFM scheme.

Figure 6. VFM scheme for wells equipped with rotational metering devices.

Figure 7. Fitting plots for different numbers of segments: (a) 1 segment; (b) 2 segments; (c) 3 segments; (d) 4 segments.

Figure 8. Application process of MModel.

Figure 9. Machine learning model training workflow.

Figure 10. Piecewise linear fitting functions: (a) valve position indicator; (b) valve outlet temperature; (c) temperature differential; (d) pressure differential.

Figure 11. Piecewise linear fitting functions (k = 3): (a) valve position indicator; (b) valve outlet temperature; (c) temperature differential; (d) pressure differential.

Figure 12. Piecewise linear fitting errors: (a) all samples; (b) samples with valve opening < 100.

Figure 13. Piecewise linear model fitting results on test set.

Figure 14. Feature contributions with the Date feature included.

Figure 15. Feature contributions without the Date feature.

Figure 16. Machine learning model fitting results on training set: (a) extra trees; (b) LGBM; (c) XGBoost-limitdepth; (d) random forest; (e) XGBoost; (f) CatBoost.

Figure 17. Machine learning model prediction results on test set: (a) CatBoost; (b) LGBM; (c) extra trees; (d) random forest; (e) XGBoost; (f) XGBoost-limitdepth.

Figure 18. Machine learning model prediction results on the test set after removing the Date feature: (a) CatBoost; (b) extra trees; (c) random forest; (d) XGBoost-limitdepth; (e) XGBoost; (f) LGBM.

Figure 19. Performance evaluation of models on test sets via random sampling with the Date feature included: (a) MSE; (b) RMSE; (c) MAE; (d) R².

Figure 20. Performance evaluation of models on test sets via random sampling after removing the Date feature: (a) MSE; (b) RMSE; (c) MAE; (d) R².

Table 1. Statistics of nozzle model parameters.

Formula	Upstream Pressure	Downstream Pressure	Upstream Temperature	Gas Specific Gravity	Valve Size	CV	Gas–Liquid Ratio	Condensate–Gas Ratio
Valve Sizing	●	●	●	●		●
H. Al-Attar	●	●			●		●
Nasriani/Kalantari Asl	●	●			●		●
Seidi and Sayahi	●	●			●		●
Bokhamseen	●				●			●
Nasriani	●	●			●		●

Note: ● indicates that the parameter is included in the formula.

Table 2. Categories of machine learning model hyperparameters.

Model	Hyperparameter Categories
LightGBM	Number of trees, number of leaves, minimum samples per leaf, learning rate, maximum bins, feature sampling ratio, L1 regularization coefficient, L2 regularization coefficient
XGBoost	Number of trees, maximum number of leaves, minimum sum of instance weights in child node, learning rate, sample sampling ratio, feature sampling ratio per level, feature sampling ratio per tree, L1 regularization coefficient, L2 regularization coefficient
Extra trees	Number of trees, maximum number of features, maximum number of leaves
Random forest	Number of trees, maximum number of features, maximum number of leaves
CatBoost	Early stopping rounds, learning rate, number of trees
XGBoost-limitdepth	Number of trees, maximum tree depth, minimum sum of instance weights in child node, learning rate, sample sampling ratio, feature sampling ratio per level, feature sampling ratio per tree, L1 regularization coefficient, L2 regularization coefficient
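
The hyperparameter categories in Table 2 are tuned by grid search with cross-validation. The procedure is model-agnostic and can be sketched in plain Python. This is an illustration only: a toy k-nearest-neighbour regressor stands in for the tree ensembles, and the grid, the fold count, and the synthetic data are assumptions rather than the settings used in the study.

```python
import itertools
import random


def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x (1-D input)."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def kfold_mse(data, params, n_folds=5):
    """Mean validation MSE of the toy model over n_folds contiguous folds."""
    fold_size = len(data) // n_folds
    scores = []
    for f in range(n_folds):
        val = data[f * fold_size:(f + 1) * fold_size]
        train = data[:f * fold_size] + data[(f + 1) * fold_size:]
        errs = [(knn_predict(train, x, params["k"]) - y) ** 2 for x, y in val]
        scores.append(sum(errs) / len(errs))
    return sum(scores) / len(scores)


def grid_search(data, grid, n_folds=5):
    """Score every combination in the grid and return (best_params, best_mse)."""
    names = list(grid)
    best = None
    for combo in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, combo))
        score = kfold_mse(data, params, n_folds)
        if best is None or score < best[1]:
            best = (params, score)
    return best


# Synthetic stand-in data (hypothetical): a noisy quadratic. The x values are
# drawn independently, so contiguous folds are already random subsets.
rng = random.Random(0)
data = [(x, x * x + rng.gauss(0, 0.05)) for x in (rng.uniform(0, 1) for _ in range(200))]

best_params, best_mse = grid_search(data, {"k": [1, 3, 5, 10, 25, 50]})
print(best_params, round(best_mse, 5))
```

With a tree ensemble, the grid would instead hold the categories from Table 2 (for example number of trees and maximum number of leaves), and `itertools.product` would enumerate every combination in the same way.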

Table 3. Data distribution.

Label	Name	Symbol	Unit	Range	Mean
$X_{1}$	Date	$D_{1}$	/	1/12/2024–9/3/2025	/
$X_{2}$	Time	$D_{2}$	/	0:00:00–23:54:00	/
$X_{3}$	Wellhead tubing pressure	W	MPa	7.61–23.17	12.44
$X_{4}$	Upstream pressure	P₁	MPa	7.52–23.17	12.39
$X_{5}$	Upstream temperature	T₁	°C	10.83–41.74	36.16
$X_{6}$	Valve position indicator	CV	/	3.69–100	72
$X_{7}$	Downstream pressure	P₂	MPa	7.32–14.24	10.27
$X_{8}$	Downstream temperature	T₂	°C	0.28–32.95	28.54
$X_{9}$	Pressure differential	ΔP	MPa	0.01–12.45	2.17
$X_{10}$	Temperature differential	ΔT	°C	2.17–25.28	7.61
$Y$	Instantaneous flow rate	F	m³/d	36,497.79–1,481,612.25	1,109,911.08
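
Because the models are evaluated only on the operating range summarized in Table 3, a deployed model may benefit from flagging inputs that fall outside that range before a prediction is trusted. The following is a minimal sketch of such a check. The ranges are copied from Table 3, while the key names, the function, and the sample records are hypothetical and not part of the authors' workflow.

```python
# Observed ranges copied from Table 3; the key names are hypothetical.
TABLE3_RANGES = {
    "wellhead_tubing_pressure_mpa": (7.61, 23.17),
    "upstream_pressure_mpa": (7.52, 23.17),
    "upstream_temperature_c": (10.83, 41.74),
    "valve_position": (3.69, 100.0),
    "downstream_pressure_mpa": (7.32, 14.24),
    "downstream_temperature_c": (0.28, 32.95),
    "pressure_differential_mpa": (0.01, 12.45),
    "temperature_differential_c": (2.17, 25.28),
}


def out_of_range(record, ranges=TABLE3_RANGES):
    """Return the names of features that are missing or outside the observed range."""
    flagged = []
    for name, (low, high) in ranges.items():
        value = record.get(name)
        if value is None or not (low <= value <= high):
            flagged.append(name)
    return flagged


# Hypothetical sample built from the Table 3 means: nothing is flagged.
typical = {
    "wellhead_tubing_pressure_mpa": 12.44,
    "upstream_pressure_mpa": 12.39,
    "upstream_temperature_c": 36.16,
    "valve_position": 72.0,
    "downstream_pressure_mpa": 10.27,
    "downstream_temperature_c": 28.54,
    "pressure_differential_mpa": 2.17,
    "temperature_differential_c": 7.61,
}
print(out_of_range(typical))  # → []

# Hypothetical sample with an upstream pressure above the observed maximum.
drifted = dict(typical, upstream_pressure_mpa=25.0)
print(out_of_range(drifted))  # → ['upstream_pressure_mpa']
```

A flagged sample does not mean the prediction is wrong; it only signals that the model is being asked to extrapolate beyond the data it was fitted on.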

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Wang, M.; Wang, Z.; Chen, G.; Zhou, J.; Luo, J.; Qin, F.; Wu, Y.; Zhou, P.; Lin, C. Hybrid Mechanistic–Data-Driven Virtual Metering Models and Methodologies for Conventional Gas Fields. Modelling 2026, 7, 99. https://doi.org/10.3390/modelling7030099

