1. Introduction
Due to rising global energy demand and increasing extraction difficulties, late-stage oilfields face depleted formation energy and insufficient fluid supply. Intermittent oil extraction technology, as an effective means of addressing this issue, can significantly improve oil well extraction efficiency and reduce energy consumption costs by periodically controlling the production and shutdown processes of oil wells.
However, traditional pumping control still suffers from several well-known inefficiencies. Specifically, conventional beam-pump (sucker rod) systems exhibit very low energy efficiency, with overall system efficiency as low as 12–23% [
1]. Large-scale statistical analysis of nearly 45,000 wells in China further shows that under low daily production conditions—typical of late-stage or depleted wells—the normalized energy consumption (kW·h per ton per 100 m lift) increases sharply, meaning a substantial portion of input electrical energy is wasted as mechanical losses, friction, unbalanced torque, and pump-off empty pumping [
2]. These results quantitatively confirm that traditional continuous pumping control entails substantial energy losses and inefficiencies. Intermittent pumping relies heavily on precise scheduling, as the timing of pump start-up and shutdown directly determines whether liquids are fully extracted, whether the pump runs dry, and overall energy consumption. Any deviation significantly impacts production efficiency and economic benefits, whereas continuous pumping does not exhibit this cyclical dependency. Therefore, determining an optimal intermittent pumping schedule that balances production, energy use, and equipment lifespan remains a major challenge. Recent multi-objective optimization work on low-permeability oil wells further confirms that appropriately tuned production parameters can significantly enhance energy efficiency while maintaining liquid production, highlighting the need to systematically design intermittent control regimes rather than relying solely on empirical rules [
3].
Existing studies have provided an important theoretical foundation for the optimization of intermittent oil extraction systems. Liang et al. [
4] analyzed liquid level dynamics using power curves and manometer data, and proposed an intermittent extraction method based on liquid level recovery to determine the optimal operating time. However, their method focuses on analyzing single-well production parameters and does not adequately consider complex factors such as geological conditions. Sun et al. [
5] utilized data mining techniques to construct a mathematical model of the temporal variation in dynamic liquid level height during intermittent shutdown periods and pumping periods, and combined this with a particle swarm optimization algorithm to achieve intelligent optimization of the intermittent pumping regime, significantly enhancing oil recovery efficiency and economic benefits. However, such data-driven methods often lack deep integration with reservoir physical mechanisms, and the generalization capability and physical interpretability of the models require further enhancement. In the context of gas field development, Cai et al. [
6] proposed an optimization strategy for intermittent production wells in the Jingbian Gas Field that jointly considers single-well schedules and staggered multi-well operation under surface-network constraints, thereby improving both well-level production efficiency and gathering system stability. For shale gas wells, Fan et al. [
7] established an intermittent optimization framework based on reservoir–wellbore coupling, in which transient two-phase flow and pressure transmission are explicitly modeled to achieve more stable and efficient intermittent production.
Within the broader context of petroleum engineering practice, physically based numerical simulation has become an essential tool for analyzing complex flow processes and guiding optimized engineering design. Li et al. [
8] addressed the issue of gas hydrate formation within gas–water multiphase flows in wellbore and subsea gathering systems during deepwater turbidite reservoir development. They established a coupled multiphase flow and heat transfer numerical model for wellbore–subsea pipelines, quantitatively predicting hydrate formation risks in a deepwater gas field in the South China Sea. This provided a basis for wellbore insulation, choke control, and anti-blocking measures in the gathering system. Wu and Ansari [
9] developed an integrated numerical simulation framework encompassing CO2 geological sequestration and hydrogen storage for depleted gas reservoir reuse. This systematically assessed formation pressure response, fluid phase evolution, and potential leakage risks under different injection–production scenarios, providing quantitative support for reservoir safety evaluation and long-term capacity prediction. Cao et al. [
10] employed three-dimensional numerical modeling coupling crosslinker properties with reservoir geological conditions to investigate fracture propagation morphology and flow capacity evolution during coalbed hydraulic fracturing. Their work elucidated the influence of crosslinker type, formation stress, and temperature on fracturing effectiveness. The aforementioned studies demonstrate that numerical simulation has been extensively applied to diverse engineering challenges, including wellbore flow safety, depleted reservoir reactivation, and unconventional reservoir fracturing. This provides crucial methodological reference for incorporating physical constraints and mechanism analysis into the optimization of the intermittent pumping regimes discussed herein.
In addition to the above approaches, several studies have employed statistical optimization methods such as Response Surface Methodology (RSM) and Analysis of Variance (ANOVA) to evaluate parameter influence and establish empirical optimization models [
11]. RSM enables the efficient exploration of parameter interactions and the identification of optimal regions, whereas ANOVA statistically validates the significance of influencing factors. Nevertheless, these methods depend on simplified assumptions and have limited ability to capture the nonlinear and strongly coupled characteristics of oil well production systems. This further demonstrates the need for a more comprehensive framework that integrates physical mechanisms with data-driven modeling. Moreover, dynamic load-optimized selection charts for flexible ultra-long stroke pumping units in low-yield wells have been developed to better match pumping equipment with operating conditions, further emphasizing the role of quantitative design tools in pumping system optimization [
12].
Currently, the optimization of the interval pumping system faces numerous challenges. While traditional physical methods can reveal the underlying mechanisms of oil well production, they struggle with parameter selection and adaptability under complex operating conditions. On the other hand, data-driven methods, though effective at processing large datasets, lack physical constraints, leading to unstable predictions when data are insufficient or well conditions change. On the data-driven side, adaptive fusion-based production forecasting models for unconventional oil and gas wells have demonstrated that combining multiple base learners can markedly improve predictive accuracy and robustness in complex reservoirs [
13]. Rapid classification and diagnosis workflows for gas wells driven by production data have further shown that high-frequency production monitoring can effectively support the identification of abnormal operating states and operational decision-making [
14]. For artificial lift systems, deep recurrent neural network models have been used to predict and optimize the energy consumption of electrical submersible pump well systems, revealing that sequence-learning algorithms can accurately capture the nonlinear mapping between operating parameters and power usage and thus support energy-saving operation strategies [
15]. Existing research often uses physical models and data-driven methods in isolation, failing to leverage their synergistic advantages.
To this end, this work proposes a data–physics dual-driven optimization method. On the physics-driven side, the relationships among parameters in physical processes such as reservoir flow, wellbore flow, and pump efficiency variation are derived from physical formulas. By analyzing the correlations between the physical-model parameters and the production parameters, we identify the key parameters that reflect the essence of oil well production: geological factors (porosity, formation coefficient), physical factors (pump diameter, pump efficiency), and production factors (daily liquid production, water cut, stroke length, stroke frequency, and tubing pressure). In this work, 'physics-driven parameters' primarily refer to variables that appear directly in the reservoir seepage and wellbore flow governing equations. These include the formation coefficient and porosity reflecting the porosity–permeability relationship; formation pressure and casing pressure characterizing the reservoir energy state; and stroke length, stroke frequency, pump diameter, and pump efficiency constrained by pumping-unit kinematics and the empirical relationship between submergence and pump efficiency. Correspondingly, daily liquid production and water cut serve as data-driven indicators characterizing the production response. On the data-driven side, we collect massive historical data from the oil well production process, including liquid production volume, water cut, porosity, formation coefficient, stroke length, stroke frequency, pump diameter, and pump efficiency, among other multi-source information.
Subsequently, the data obtained from data-driven and physics-driven approaches are integrated and jointly fed into the improved CatBoost algorithm model as input. The improved CatBoost algorithm dynamically adjusts the hyperparameters through Bayesian optimization, combining a weight adaptation update mechanism with an attention mechanism. This design helps the model capture nonlinear relationships and handle noise and high-dimensional data. Additionally, the introduction of physical parameters provides the model with robust physical constraints, preventing it from producing unreasonable predictions. It ensures that our model not only possesses strong data processing capabilities but also maintains good physical interpretability and generalization ability. By combining the data–physics dual-driven approach with the improved CatBoost algorithm, our model aims to construct an efficient, precise, and adaptable intermittent oil production regime optimization system, providing strong technical support and decision-making basis for intelligent oilfield development and efficient production.
4. Data–Physics Dual-Driven Approach
4.1. Physics-Driven Method
According to the basic principles of seepage mechanics, reservoir physics, and reservoir engineering, the oil production rate is related to the flow coefficient $C$, average formation pressure $\bar{p}_e$, empirical flow exponent $n$, permeability $k$, bottomhole flowing pressure $p_{wf}$, fluid viscosity $\mu$, flow path length $L$, and flow cross-sectional area $A$; these are the physics-driven parameters.
In classical theories of seepage mechanics and reservoir engineering, physical parameters such as average formation pressure, bottomhole flow pressure, permeability, and flow cross-sectional area are key factors governing the inflow capacity of production wells. To employ a production relationship with a clear engineering basis that facilitates subsequent modeling in this work, we have selected the classical Fetkovich empirical production equation to describe the inflow performance of a single well [
30]. Its form is as follows:
$$Q = C\left(\bar{p}_e^{2} - p_{wf}^{2}\right)^{n} \tag{1}$$
where
$Q$ denotes the wellbore liquid production rate, reflecting the output capacity of the producing well;
$\bar{p}_e$ represents the average formation pressure;
$p_{wf}$ denotes the bottomhole flowing pressure;
$C$ is the flow coefficient; and
$n$ is the empirical exponent characterizing nonlinear flow behavior (e.g., $n \approx 1$ typically applies to dissolved-gas-drive reservoirs, whereas $n > 1$ is often observed in low-permeability reservoirs). This formulation was first proposed by Fetkovich based on multi-point back-pressure testing and has since been widely adopted in extensive subsequent research and engineering practice to describe the empirical relationship between production well output and pressure.
In addition, Darcy’s law [
31] shows that, for single-phase laminar flow in a porous medium, the volumetric flow rate is proportional to the permeability
k and flow cross-sectional area A, as well as to the pressure difference between reservoir pressure and bottomhole flowing pressure, and inversely proportional to fluid viscosity
μ and flow path length
L. The classical one-dimensional form of Darcy's law can be written as follows:
$$q = \frac{k A\left(\bar{p}_e - p_{wf}\right)}{\mu L} \tag{2}$$
where
$k$ is the permeability, which measures the ability of the formation to transmit fluids; the higher the permeability, the greater the formation's capacity to conduct fluid, and the larger the flow rate per unit time $q$;
$A$ is the flow cross-sectional area, i.e., the effective area through which the oil flow passes, such as the area of a seepage channel in the wellbore or formation; as the cross-sectional area increases, the flow rate per unit time increases;
$\mu$ is the fluid viscosity, which measures the resistance to flow between molecules within a fluid; the greater the viscosity, the greater the resistance to fluid flow and the lower the flow rate;
$L$ is the flow path length, i.e., the distance the fluid travels through the formation from the start of seepage to the wellbore; in deep reservoirs or highly porous formations, a longer flow path usually means a greater pressure drop and lower production.
The derivation of Equations (1) and (2) is based upon several standard assumptions in seepage mechanics:
- (1)
It is assumed that the reservoir contains a single-phase, slightly compressible liquid flow, with the porous medium being homogeneous and isotropic. Consequently, within each pumping cycle, the lithology and fluid properties within the control volume do not undergo significant variation.
- (2)
It is assumed that flow is laminar radial flow, rendering Darcy’s law applicable at the study scale.
- (3)
Porosity, permeability, and fluid viscosity are treated as near-constant values over short timeframes, whilst capillary and gravitational effects are neglected relative to viscous pressure loss.
- (4)
The system is assumed to be in a boundary-controlled (quasi-steady) flow phase, allowing the driving force to be characterized by the average formation pressure $\bar{p}_e$ and bottomhole flowing pressure $p_{wf}$, with wellbore storage and skin effects lumped into the flow coefficient $C$ of Equation (1).
Therefore, fluid production from wells can be improved by optimizing formation pressure management, lowering bottomhole flow pressure, increasing permeability, and decreasing fluid viscosity. Furthermore, Khalaf et al. [
32] proposed a method for removing wellbore storage effects from pressure curves, based on a stable deconvolution algorithm. This approach addresses the distortion of pressure curves caused by wellbore storage effects, enabling the recovery of more accurate formation pressure response curves even under noisy conditions and variable flow rates. It provides a significant analytical tool for interpreting well testing data in complex operational scenarios.
In summary, the physics-driven section of this work explicitly employs two key seepage equations: firstly, the Fetkovich-type inflow property relation (Equation (1)), which characterizes the nonlinear pressure–production relationship between average formation pressure, bottomhole flow pressure, and fluid production rate; secondly, Darcy’s law (Equation (2)), which constrains the combined influence of permeability, flow channel cross-sectional area, viscosity, and seepage path length on flow rate.
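To make these two relations concrete, the following minimal Python sketch evaluates Equations (1) and (2); all numeric values are hypothetical illustrations with assumed-consistent units, not data from this study.

```python
def fetkovich_rate(C, p_e, p_wf, n):
    """Fetkovich inflow performance, Eq. (1): Q = C * (p_e^2 - p_wf^2)^n."""
    return C * (p_e**2 - p_wf**2) ** n

def darcy_rate(k, A, p_e, p_wf, mu, L):
    """One-dimensional Darcy flow, Eq. (2): q = k*A*(p_e - p_wf) / (mu*L)."""
    return k * A * (p_e - p_wf) / (mu * L)

# Hypothetical values: pressures in MPa, C chosen so Q comes out in m^3/d
Q = fetkovich_rate(C=0.05, p_e=20.0, p_wf=8.0, n=1.0)
# Hypothetical SI values: k in m^2, A in m^2, pressures in Pa, mu in Pa*s, L in m
q = darcy_rate(k=5e-14, A=10.0, p_e=2.0e7, p_wf=8.0e6, mu=5e-3, L=100.0)
print(f"Fetkovich Q = {Q:.2f} m^3/d, Darcy q = {q:.3e} m^3/s")
```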
4.2. Data-Driven Method
Based on the 19 production parameter data collected, namely “Well Number”, “Intermittent Pumping Type”, “Block”, “Horizon”, “Gas-Oil Ratio”, “Crude Oil Viscosity”, “Stroke Length”, “Stroke Frequency”, “Pump Diameter”, “Pump Efficiency”, “Submergence”, “Daily Oil Production”, “Daily Liquid Production”, “Water Cut”, “Tubing Pressure”, “Porosity”, “Formation Coefficient”, “Flowing Bottom-Hole Pressure”, and “Pay Zone”, the Pearson analysis method is used to screen key parameters, revealing the linear correlation between equipment operating parameters and production indicators and providing data support for the optimization of the intermittent pumping system.
Pearson correlation measures the linear correlation between quantitative variables, and the formula is as follows:
$$r = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{\sqrt{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^{2}}\,\sqrt{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^{2}}}$$
where
$n$ is the sample size, i.e., the number of observations;
$x_i$ denotes the $i$-th observation of the variable $X$;
$y_i$ denotes the $i$-th observation of the variable $Y$;
$\bar{x}$ denotes the mean of the variable $X$, i.e., $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$;
$\bar{y}$ denotes the mean of the variable $Y$, $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$;
$r$ is Pearson's correlation coefficient, which takes values in the range [−1, 1] and measures the strength and direction of the linear correlation between the variables $X$ and $Y$. The closer $|r|$ is to 1, the stronger the linear correlation between the two variables; the closer $|r|$ is to 0, the weaker the linear correlation. When $r > 0$, the two variables are positively correlated, i.e., when one variable increases, the other tends to increase; when $r < 0$, the two variables are negatively correlated, i.e., when one variable increases, the other tends to decrease.
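As an illustration of this screening step, the sketch below ranks the numeric production parameters by the absolute Pearson correlation with a target column using pandas; the file name and exact column labels are assumptions.

```python
import pandas as pd

df = pd.read_csv("well_production.csv")   # hypothetical data file
target = "Daily Liquid Production"

# Pairwise Pearson correlations among numeric columns, then rank each
# parameter by the strength of its linear relation to the target
corr = df.select_dtypes("number").corr(method="pearson")
ranking = corr[target].drop(target).abs().sort_values(ascending=False)
print(ranking.head(10))
```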
According to the relationships among production parameters, dynamic liquid level, oil production, and geological factors, the most influential factors on oil production are the production parameters of the pumping-unit well [33]. The dynamic liquid level is related to submergence, pump setting depth, stroke length, and stroke frequency [34]. Oil production is related to tubing pressure, water cut, and daily oil/liquid production. Geological factors are captured by porosity and the formation coefficient. Meanwhile, physical factors during the pumping process, such as pump diameter and pump efficiency, also affect oil production. Pump efficiency increases with submergence, but only up to a reasonable limit; beyond that limit, pump efficiency decreases instead.
4.3. Data–Physics Dual-Driven
In porous media, the relationship between permeability and porosity can be described by the Kozeny–Carman equation:
$$k = \frac{\varphi^{3}}{c\,S^{2}\left(1-\varphi\right)^{2}}, \qquad S = \frac{6}{d}\ \text{for spherical grains},$$
where $k$ is permeability, $\varphi$ is porosity, $d$ is particle diameter, $S$ is the specific surface area, and $c$ is the Kozeny constant. However, in practice, permeability and porosity are often obtained by fitting well-logging data, for which a linear regression model is used:
$$k = a\varphi + b,$$
where $a$ is the slope, reflecting the sensitivity of permeability to changes in porosity, and $b$ is the intercept, reflecting the permeability at zero porosity (theoretically 0).
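As a minimal sketch of this fitting step, the snippet below estimates the slope and intercept from hypothetical log-derived samples with numpy.polyfit:

```python
import numpy as np

# Hypothetical log-derived samples: porosity (fraction), permeability (mD)
phi = np.array([0.08, 0.10, 0.12, 0.15, 0.18, 0.21])
k = np.array([0.9, 2.1, 4.0, 8.5, 15.2, 24.8])

a, b = np.polyfit(phi, k, deg=1)   # linear model k = a*phi + b
print(f"slope a = {a:.1f} mD per unit porosity, intercept b = {b:.1f} mD")
```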
Similarly, according to the pressure gradient form of Darcy's law,
$$\nabla p = \frac{\mu q}{k A},$$
where $\nabla p$ is the pressure gradient. Since the strokes of the pumping unit directly determine the flow rate per unit time $q$,
$$q = S \cdot N \cdot A,$$
where $N$ is the stroke frequency, $A$ is the flow cross-sectional area, and $S$ is the stroke length.
Substituting the unit-time flow equation into the pressure gradient equation gives
$$\nabla p = \frac{\mu\,S\,N}{k}.$$
According to this formula, increasing the stroke length increases the displacement per stroke and thus the flow rate per unit time, which raises the pressure gradient and increases the fluid flow velocity. Increasing the stroke frequency raises the number of working cycles of the pumping unit per unit time and directly increases the flow rate, further raising the pressure gradient and the wellbore flow resistance. Thus, stroke length, stroke frequency, and the pressure gradient satisfy
$$\nabla p \propto S \cdot N.$$
When both stroke length and stroke frequency increase, the pressure gradient rises, i.e., the pressure loss per unit length becomes significant. Because the higher stroke length and stroke frequency increase the flow velocity, the fluid inertial force is enhanced, causing the flow resistance to rise. The decay of reservoir energy over a short production period can be written as
$$\Delta p(t) = \Delta p_{0} - \frac{\mu\,q\,L}{k\,A},$$
and substituting the stroke length and stroke frequency equation $q = S \cdot N \cdot A$ yields
$$\Delta p(t) = \Delta p_{0} - \frac{\mu\,S\,N\,L}{k},$$
where $\Delta p_{0}$ is the initial differential pressure and $L$ is the flow distance. High stroke length and high stroke frequency accelerate the decay of the differential pressure, and the bottomhole pressure drops rapidly. If the pressure decays too quickly, the reservoir energy cannot be replenished in time, eventually leading to unstable production. Under high-frequency pumping, the flow rate varies significantly, readily causing pump suction emptying and wellbore fluid accumulation.
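The qualitative conclusion, namely that a larger stroke length or stroke frequency steepens the pressure gradient and accelerates the decay of the driving pressure difference, can be verified symbolically. The sketch below uses sympy under the simplified relations derived above (symbols only, no field data):

```python
import sympy as sp

mu, k, S, N, L, dp0 = sp.symbols("mu k S N L dp0", positive=True)

grad = mu * S * N / k   # pressure gradient after substituting q = S*N*A
dp_t = dp0 - grad * L   # remaining driving pressure over flow distance L

# Both partial derivatives are negative: increasing S or N depletes
# the differential pressure faster
print(sp.diff(dp_t, S))   # -L*N*mu/k
print(sp.diff(dp_t, N))   # -L*S*mu/k
```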
Therefore, combining physical parameters with production parameters has better theoretical support.
Based on the calculation using the Pearson correlation coefficient formula, five factor combinations with relatively high correlations were selected: Stroke Length, Daily Liquid Production, Water Cut, Porosity, and Formation Coefficient. The correlation analysis of various factors in the intermittent pumping system is shown in
Figure 1.
Considering both physics-driven parameters and data-driven parameters comprehensively, the liquid production is influenced by various production parameters. Finally, a combination of nine parameters is selected: Formation Coefficient, Daily Liquid Production, Porosity, Water Cut, Tubing Pressure, Stroke Length, Stroke Frequency, Pump Diameter, and Pump Efficiency. The correlation analysis diagram of characteristic parameters is shown in
Figure 2.
5. Optimization Algorithm for the Intermittent Pumping Regime Driven by Data–Physics Models
Based on the CatBoost regression framework [
35], this intermittent pumping regime selection method realizes high-precision prediction of well operation time and intelligent recommendation of intermittent pumping regimes through systematic data preprocessing, feature construction, model training, and automated hyperparameter optimization. Throughout the modeling process, we adhere to the principles of data-driven design and engineering usability, striving to ensure prediction accuracy while improving the model's stability, generalization ability, and practical deployability.
5.1. Data Preparation
Through the combination of mathematical modeling and physical mechanism analysis, nine key characteristic parameters closely related to the intermittent pumping operating regime were screened out, namely formation coefficient, daily liquid production, porosity, water cut, tubing pressure, stroke length, stroke frequency, pump diameter, and pump efficiency. These characteristic variables portray the overall operating status of the wells from the three dimensions of geological conditions, process parameters, and equipment performance, and have strong representativeness and interpretability. The target variable is set as the "running time" of each round, i.e., the duration of continuous operation after the well is turned on a single time, aiming to realize accurate prediction of the pumping cycle.
In terms of data cleaning, the IQR (interquartile range) method is used to identify and eliminate outlier samples, and records with a high proportion of missing values are deleted to ensure the quality and consistency of the input data. In addition, in order to enhance the adaptability of the model to different operating modes, the original data are divided into two subsets based on the existing "Operating Cycle Classification" field: large cycle (labeled 1) and small cycle (labeled 2). This strategy accounts for the differences in the main factors affecting the operating time under different regimes, and modeling the subsets separately helps improve fitting accuracy and avoids interference between cycles. In the subsequent modeling, the two subsets are trained and evaluated independently to obtain more targeted optimization suggestions for the intermittent pumping regime.
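A minimal sketch of the cleaning and splitting steps is given below, assuming the records sit in a CSV file with the column names used in this paper; the file name and exact labels are illustrative assumptions.

```python
import pandas as pd

def iqr_filter(df, cols, k=1.5):
    """Keep rows lying within [Q1 - k*IQR, Q3 + k*IQR] on every given column."""
    mask = pd.Series(True, index=df.index)
    for c in cols:
        q1, q3 = df[c].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= df[c].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]

features = ["Formation Coefficient", "Daily Liquid Production", "Porosity",
            "Water Cut", "Tubing Pressure", "Stroke Length", "Stroke Frequency",
            "Pump Diameter", "Pump Efficiency"]

df = pd.read_csv("well_production.csv")              # hypothetical file
df = df.dropna(subset=features + ["Running Time"])   # drop incomplete records
df = iqr_filter(df, features)                        # remove outlier samples

# Model the two operating regimes separately
large_cycle = df[df["Operating Cycle Classification"] == 1]
small_cycle = df[df["Operating Cycle Classification"] == 2]
```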
5.2. Feature Engineering
To improve the model’s ability to capture complex nonlinear relationships, the feature engineering stage applies a third-order polynomial expansion to the original 9 features, generating up to 165 combined features. This operation effectively enhances the model’s ability to express multi-dimensional feature interactions and improves its adaptability to complex working conditions. However, the high-dimensional feature space also increases redundant information and the risk of overfitting.
In this work, the F-value is introduced as a feature significance index, and the SelectKBest algorithm is used to select the top 15 feature combinations with the most significant effect on the target variable “running time” as the final set of input variables. This retains the nonlinear information while effectively controlling the feature dimensionality, improving the generalization performance and computational efficiency of the model. In addition, all features are Z-score standardized to eliminate scale differences, which further improves the convergence speed and stability of the model during gradient iteration.
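This expansion, selection, and standardization chain maps naturally onto a scikit-learn pipeline. The sketch below is one plausible realization (the paper does not name a library, and the step ordering here is an assumption); X_train, y_train, and X_val are assumed to come from the data preparation step. Note that 165 equals the number of distinct degree-3 monomials of 9 variables, C(11,3).

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression

feature_pipe = Pipeline([
    ("poly", PolynomialFeatures(degree=3, include_bias=False)),  # nonlinear expansion
    ("select", SelectKBest(score_func=f_regression, k=15)),      # F-value screening
    ("scale", StandardScaler()),                                 # Z-score standardization
])

X_train_sel = feature_pipe.fit_transform(X_train, y_train)  # y = "running time"
X_val_sel = feature_pipe.transform(X_val)
```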
5.3. Hyperparameter Tuning Based on Bayesian Optimization
In the optimization of the intermittent pumping regime, a Bayesian optimization algorithm was introduced to tune the hyperparameters of the CatBoost model more efficiently. For the specific algorithm, see “Algorithm 1”.
| Algorithm 1 Hyperparameter optimization algorithm for CatBoost model based on Bayesian optimization |
Input:
CatBoost hyperparameter ranges (learning rate range, maximum tree depth range, iteration number range, L2 regularization factor range); objective function (F1 score or AUC value on the validation set); maximum number of iterations; convergence conditions
Procedure: function BayesianOptimization()
1: Randomly select several hyperparameter combinations and evaluate their objective function values to obtain an initial dataset S
2: Initialize the Gaussian process surrogate model GP and fit it with S
3: for i from 1 to the maximum number of iterations do
4: Use the upper confidence bound (UCB) acquisition function to select the next hyperparameter combination xnext in the hyperparameter space
5: Evaluate the objective function value ynext corresponding to xnext
6: Add (xnext, ynext) to the dataset S
7: Update the Gaussian process surrogate model GP and refit it with the updated S
8: if the convergence condition is satisfied then
9: break
10: end if
11: end for
Output: optimized hyperparameter combination (optimal learning rate, optimal maximum tree depth, optimal number of iterations, optimal L2 regularization factor) |
In this work, the convergence conditions for Bayesian optimization include the following: (1) the improvement in the objective function (F1 or AUC) becomes smaller than a predefined threshold for several consecutive iterations, indicating that further exploration yields negligible benefit; or (2) the predefined maximum number of iterations is reached. These criteria ensure computational efficiency while avoiding unnecessary evaluations.
Hyperparameter tuning based on Bayesian optimization explores the parameter space more efficiently and avoids falling into local optima by constructing a surrogate model and dynamically adjusting the hyperparameter search direction using prior knowledge and historical evaluation results.
The core of Bayesian optimization lies in constructing a surrogate model to approximate the real objective function. Bayesian optimization has been widely applied as an efficient strategy for hyperparameter tuning in machine learning models [36]. In intermittent pumping regime optimization problems, the objective function is the model’s performance metric on the validation set, such as the F1 score or AUC value. F1 and AUC are not used as final evaluation metrics of the pumping-time prediction model. Instead, they serve only as auxiliary surrogate signals within the Bayesian optimization process to guide the search toward regions where the model more reliably distinguishes valid from invalid predictions in intermediate classification-based assessments. A Gaussian process is usually chosen as the surrogate model because it captures the uncertainty of the objective function and provides a predicted mean and variance. Initially, we randomly select several hyperparameter combinations to evaluate, obtain the corresponding objective function values, and then use these data points to fit a Gaussian process model.
Next, an acquisition function is selected to determine the next most promising combination of hyperparameters. The acquisition function measures the potential contribution of each candidate hyperparameter combination to the optimization objective. Commonly used acquisition functions include Expected Improvement (EI) and the Upper Confidence Bound (UCB): EI focuses on the degree of improvement achievable over the current optimal solution, while UCB strikes a balance between exploiting known information and exploring unknown regions. In an actual well production environment, various uncertainties, such as changing geological conditions and aging equipment, affect the optimization of the operating regime. Because UCB reflects uncertainty through the magnitude of the prediction variance and takes this uncertainty into account when selecting sampling points, it improves the robustness of the optimization scheme to a certain extent, so we chose the UCB acquisition function.
The Upper Confidence Bound (UCB) acquisition function is defined as follows:
$$\mathrm{UCB}(x) = \mu(x) + k\,\sigma(x),$$
where $\mu(x)$ is the predicted mean of the surrogate model, $\sigma(x)$ is the predicted standard deviation, and $k$ is a tunable parameter controlling the exploration–exploitation balance. The first term, $\mu(x)$, encourages sampling near regions expected to yield high objective values, while the second term, $k\sigma(x)$, promotes exploration in regions with large uncertainty. In this way, the UCB criterion selects hyperparameter combinations that are not only potentially optimal but also informative about uncertain areas of the parameter space. This property makes UCB particularly suitable for well production optimization problems, where geological and equipment uncertainties can lead to significant variability in model predictions.
The Expected Improvement (EI) acquisition function measures the expected improvement that the next sampling point may bring relative to the current optimum. Assuming that the current optimal value is $f^{*}$, and the Gaussian process predicts the mean at point $x$ as $\mu(x)$ and the variance as $\sigma^{2}(x)$, the formula for EI is
$$\mathrm{EI}(x) = \mathbb{E}\left[\max\left(f(x) - f^{*} - \xi,\ 0\right)\right].$$
Under the assumption of a Gaussian distribution, the formula for EI can be further expanded as follows:
$$\mathrm{EI}(x) = \left(\mu(x) - f^{*} - \xi\right)\Phi(Z) + \sigma(x)\,\phi(Z), \qquad Z = \frac{\mu(x) - f^{*} - \xi}{\sigma(x)},$$
where $\xi$ is a parameter that controls the balance between exploration and exploitation (usually set to 0 or a small positive value), $\Phi(\cdot)$ is the cumulative distribution function (CDF) of the standard normal distribution, and $\phi(\cdot)$ is the probability density function (PDF) of the standard normal distribution.
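For reference, both acquisition functions can be written compactly. The sketch below follows the formulas above, assuming a maximization convention and a Gaussian posterior:

```python
import numpy as np
from scipy.stats import norm

def ucb(mu, sigma, k=2.0):
    """Upper Confidence Bound: predicted mean plus k predicted standard deviations."""
    return mu + k * sigma

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """Closed-form EI for a Gaussian posterior (maximization convention)."""
    sigma = np.maximum(sigma, 1e-12)   # guard against zero variance
    z = (mu - f_best - xi) / sigma
    return (mu - f_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)
```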
In each iteration, the next hyperparameter combination is selected for evaluation based on the acquisition function. The new combination and its corresponding objective function value are added to the dataset, and the Gaussian process surrogate model is updated. Through continuous iterations, the surrogate model gradually approaches the real objective function and the hyperparameter search direction is refined. This process continues until a predetermined number of iterations is reached or a convergence condition is satisfied.
Bayesian optimization is used to tune several key hyperparameters of the CatBoost model: the number of iterations, the learning rate, the tree depth, and the regularization parameter. The number of iterations controls the number of training rounds; too many may lead to overfitting, too few to underfitting. The learning rate determines how much the model updates the weights in each iteration; too high a learning rate may make training unstable, while too low a rate increases the training time. The tree depth limits the depth of each decision tree; a larger depth increases model complexity and fitting ability but can easily lead to overfitting. The regularization parameter is used to prevent overfitting; this work uses the L2 regularization parameter, which constrains the weights and keeps the model from becoming overly complex.
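A compact sketch of this tuning loop using scikit-optimize's gp_minimize is shown below. The library choice, the search ranges, and the use of validation MAE as the minimized objective are illustrative assumptions (the paper's own objective is a validation metric such as F1 or AUC); skopt's "LCB" acquisition is the minimization counterpart of UCB, and X_train, y_train, X_val, y_val are assumed to exist from earlier steps.

```python
from catboost import CatBoostRegressor
from sklearn.metrics import mean_absolute_error
from skopt import gp_minimize
from skopt.space import Real, Integer

space = [
    Real(0.01, 0.3, name="learning_rate"),
    Integer(3, 10, name="depth"),
    Integer(200, 2000, name="iterations"),
    Real(1.0, 10.0, name="l2_leaf_reg"),
]

def objective(params):
    lr, depth, iterations, l2 = params
    model = CatBoostRegressor(learning_rate=lr, depth=depth, iterations=iterations,
                              l2_leaf_reg=l2, random_seed=0, verbose=0)
    model.fit(X_train, y_train)
    return mean_absolute_error(y_val, model.predict(X_val))

# Gaussian-process surrogate with an LCB (UCB-style) acquisition function
result = gp_minimize(objective, space, n_calls=40, acq_func="LCB", random_state=0)
print("best hyperparameters:", result.x, "best validation MAE:", result.fun)
```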
As
Table 1 demonstrates, following Bayesian optimization of the hyperparameters, the CatBoost model exhibits markedly enhanced predictive performance on both the small-interval and large-interval pumping tasks. This indicates that Bayesian optimization effectively identifies optimal hyperparameter combinations tailored to each pumping regime, thereby strengthening the model's fitting capability and generalization capacity. Further analysis reveals discernible patterns in the optimal hyperparameters across regimes: the large-interval task performs better with smaller learning rates, shallower trees, and more iterations, whereas the small-interval task favors slightly higher learning rates, deeper trees, and fewer iterations. This demonstrates that Bayesian optimization can automatically balance model complexity and the training process according to task characteristics, enabling the model to better capture nonlinear relationships and dynamic variations within oil well production data.
In summary, the introduction of a hyperparameter tuning strategy incorporating Bayesian optimization not only enhances the predictive accuracy of the CatBoost model but also improves its stability and adaptability across different interval pumping regimes. This provides reliable model support for optimizing oil well interval pumping regimes. This optimization strategy offers a scientific basis for subsequent adjustments to production conditions and intelligent decision-making, while also serving as a reference for tuning ensemble learning models in similar oil and gas production forecasting tasks.
5.4. Adaptive Updating of Weights
In traditional ensemble learning methods, the weights of the base classifiers are usually equal or fixed during training. However, in the intermittent pumping regime optimization problem, different base classifiers may not contribute equally to the final prediction. In order to integrate the advantages of each base classifier more accurately, the improved CatBoost model introduces a dynamic weight-adaptive updating mechanism, which adjusts the weights of the base classifiers according to their prediction errors on the validation set: the smaller the prediction error, the better the base classifier's performance, and the larger its corresponding weight. The weight update formula is as follows:
$$w_i = \frac{1/\varepsilon_i}{\sum_{j=1}^{T} 1/\varepsilon_j},$$
where $w_i$ denotes the weight of the $i$-th base classifier and $\varepsilon_i$ denotes the prediction error of the $i$-th base classifier. The prediction error can be measured by metrics such as the mean squared error (MSE).
The MSE is calculated by the formula
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2},$$
where $y_i$ is the true value, $\hat{y}_i$ is the predicted value, and $n$ is the number of samples.
Specifically, assume there are $T$ base classifiers. During training, each base classifier is first trained on the training set, then its prediction error $\varepsilon_t$ is calculated on the validation set, and finally the weights $w_t$ of the base classifiers are updated based on the prediction errors. In the final prediction stage, the prediction results of all base classifiers are weighted and fused, i.e., the prediction of each base classifier is multiplied by its corresponding weight and the products are summed to obtain the final prediction. This weight-adaptive update mechanism enables the model to automatically adjust the contribution of each base classifier, ensuring that better-performing base classifiers have greater influence in the final ensemble prediction, thereby improving overall prediction performance. Compared to traditional fixed-weight methods, this mechanism offers significant advantages: it dynamically adjusts weights based on the actual performance of the base classifiers, enhancing the model's adaptability and robustness, while more accurately leveraging the strengths of high-performing base classifiers to improve prediction accuracy. Such dynamic weighted ensemble strategies have been shown to significantly improve predictive robustness in complex environments [37]. For the specific algorithm, see “Algorithm 2”.
| Algorithm 2 CatBoost integrated prediction algorithm based on adaptive updating of weights |
Input:
training set Dtrain = {(x1, y1), …, (xm, ym)}; validation set Dval = {(xv1, yv1), …, (xvn, yvn)}; number of base classifiers T; error metric (e.g., mean squared error, MSE)
Procedure: function AdaptiveWeightedEnsemble()
1: Initialize the weight list weights as an all-zero list of length T
2: for t from 0 to T − 1 do
3: Train the base classifier Ct on the training set
4: Predict the validation set with Ct to obtain the prediction ŷt
5: Calculate the prediction error εt of Ct on the validation set (e.g., using the MSE)
6: Update weights[t] according to the prediction error: weights[t] = 1/εt
7: end for
8: Normalize the weight list: compute the sum of the weights and divide each weights[t] by this sum
9: Initialize the final prediction final_prediction = 0
10: for t from 0 to T − 1 do
11: Get the prediction ŷt of the base classifier
12: Multiply by weights[t] and accumulate: final_prediction += ŷt × weights[t]
13: end for
Output: final prediction result final_prediction |
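A runnable sketch of Algorithm 2 follows. Here the base learners are CatBoost regressors diversified only by their random seeds, which is an illustrative choice rather than the paper's specification:

```python
import numpy as np
from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error

def adaptive_weighted_ensemble(X_tr, y_tr, X_val, y_val, T=5):
    """Train T base learners and weight each by its inverse validation MSE."""
    models, raw_w = [], []
    for t in range(T):
        m = CatBoostRegressor(iterations=500, random_seed=t, verbose=0)
        m.fit(X_tr, y_tr)
        err = mean_squared_error(y_val, m.predict(X_val))
        models.append(m)
        raw_w.append(1.0 / (err + 1e-12))      # smaller error -> larger weight
    weights = np.array(raw_w) / np.sum(raw_w)  # normalize so weights sum to 1
    return models, weights

def ensemble_predict(models, weights, X):
    """Weighted fusion: sum over t of weights[t] * prediction_t."""
    preds = np.stack([m.predict(X) for m in models])
    return weights @ preds
```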
When implementing the adaptive updating of weights, it is first necessary to prepare a validation set which is used to evaluate the prediction error of each base classifier. The size of the validation set and the method of selecting it are crucial for the accuracy of the weight update. In general, the validation set should be able to represent the entire data distribution and not overlap with the training and test sets to ensure the objectivity and accuracy of the evaluation results.
When training each base classifier, its prediction results on the validation set need to be recorded and the corresponding prediction error calculated. The error metric can be selected according to the specific problem and data characteristics; in addition to the mean squared error (MSE), indicators such as the mean absolute error (MAE) and classification accuracy can also be used. For the intermittent pumping regime optimization problem, MSE is a commonly used metric because the objective is to predict continuous variables such as the production and water cut of the wells.
After calculating the prediction error of each base classifier, its weight is updated according to the formula above. The weight-updating process can be understood as a quantitative evaluation of base classifier performance: better-performing base classifiers are given larger weights and thus play a greater role in the final ensemble prediction. The frequency of weight updates can be set flexibly according to the training progress and model convergence; generally, the weight update is performed immediately after each base classifier finishes training, so as to reflect changes in model performance in a timely manner.
In the final prediction stage, the prediction results of all base classifiers are weighted and fused. The weighted fusion can be represented as follows:
$$\hat{y} = \sum_{t=1}^{T} w_t\,\hat{y}_t,$$
where $\hat{y}$ denotes the final prediction result, $\hat{y}_t$ denotes the prediction result of the $t$-th base classifier, and $w_t$ denotes the weight of the $t$-th base classifier.
In this way, the model can fully utilize the advantages of each base classifier to improve the accuracy and reliability of the prediction results. The introduction of the weight adaptive updating mechanism enables the improved CatBoost model to better adapt to the characteristics and changes in the data when dealing with complex well data, and enhances the model’s generalization ability and stability.
5.5. Dynamic Regulation of Parameters Under the Attention Mechanism
When constructing an integrated learning model with multiple levels, the improved CatBoost model introduces an attention mechanism to regulate the parameters of the next level in order to better utilize the information from the previous-level model so that the next level model can learn more purposefully. The core idea of the attention mechanism is to let the model automatically learn the importance of each base classifier output and use it as the weight of the input features of the next level model. This principle is consistent with the modern formulation of attention mechanisms, where models learn data-dependent importance weights to enhance representation learning [
38]. Specifically, prior to the training of the model at level
t, the outputs of the models at the previous
t − 1 levels are weighted and fused, with the weights determined by the attention mechanism. The fused results are then used as part of the input features of the level t model, thus influencing the parameter learning of the level
t model. The formula for the attention weights is as follows:
$$\alpha_{t,j} = \frac{\exp\left(e_{t,j}\right)}{\sum_{k=1}^{t-1}\exp\left(e_{t,k}\right)},$$
where $\alpha_{t,j}$ denotes the attention weight of the level-$t$ model on the output of the $j$-th previous-level model, and $e_{t,j}$ denotes the corresponding un-normalized energy value. The energy value $e_{t,j}$ is usually generated by a small neural network or a simple linear transformation, which aims to measure the relevance of the previous-level model output to the current-level model's target task.
Taking the multi-level model for the optimization of the intermittent pumping regime as an example, the first level of the model may mainly learn the influence of the basic geological characteristics of the wells on production. These basic geological features include the porosity, permeability, and thickness of the oil formation, which are the fundamental factors affecting the production capacity of a well. The first-level model outputs preliminary prediction results by learning the relationship between these features and oil well production. Then, the attention weights of the output of the first-level model are calculated through the attention mechanism, and these weights reflect the relevance of the first-level model's output to the target task of the second-level model.
Next, the outputs of the first-level model are weighted and fused together with the original input features as the input features of the second-level model. During the learning process, the second-level model can pay more attention to the information related to the dynamic production data of the wells in the first-level model, such as the daily oil production and water cut changes of the wells. These dynamic production data reflect the actual performance of oil wells under different production regimes and are important for predicting the production effect under different intermittent pumping regimes. In this way, the second-level model can more accurately predict the production, water cut, and other key indicators of oil wells under different intermittent pumping regimes, thus providing a scientific basis for formulating a reasonable intermittent pumping regime. The advantage of the attention mechanism is that it enables the next-level model to pay more attention to the useful information in the previous-level model and ignore irrelevant or noisy information, thus improving the model's learning ability and generalization performance. At the same time, it realizes the deep fusion of multi-level features and gradually extracts higher-level feature representations, which further improves the model's understanding and prediction of complex oil well data. For the specific algorithm, see “Algorithm 3”.
| Algorithm 3 multi-level CatBoost integrated learning algorithm based on attention mechanism |
Input:
raw well data collection D; number of model levels n; energy-module configuration energy_config (e.g., number of neural network layers, parameters)
Procedure: function AttentionBasedMultiLevelEnsemble(D, n, energy_config)
1: Initialization: previous-level outputs prev_outputs = ∅; current input data current_input = D
2: for i from 1 to n do
3: if i == 1 then
4: Train the level-i model on current_input and calculate its output
5: else
6: Build the energy calculation module: initialize the small neural network Ei according to energy_config, whose input dimension is the number of features of the previous-level outputs and whose output is a scalar energy value
7: Convert the energy values into attention weights via softmax
8: Weighted-fuse the previous-level outputs using the attention weights
9: Concatenate the fused features with the original input features to form the level-i input
10: Train the level-i model and calculate its output
11: end if
12: Append the level-i output to prev_outputs
13: end for
Output: the final prediction of the level-n model |
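The sketch below is one simplified, runnable reading of Algorithm 3: instead of the paper's learned energy network, each previous-level output is scored by its negative training MSE (an assumption made to keep the example self-contained), converted to attention weights by softmax, and the fused output is appended to the original features for the next level.

```python
import numpy as np
from catboost import CatBoostRegressor

def softmax(e):
    e = e - np.max(e)   # numerical stability
    return np.exp(e) / np.sum(np.exp(e))

def attention_stack(X, y, X_new, n_levels=3):
    """Multi-level CatBoost ensemble with softmax-attention fusion of prior outputs."""
    prev_fit, prev_new = [], []
    cur_X, cur_new = X, X_new
    for i in range(n_levels):
        model = CatBoostRegressor(iterations=300, random_seed=i, verbose=0)
        model.fit(cur_X, y)
        prev_fit.append(model.predict(cur_X))
        prev_new.append(model.predict(cur_new))
        # Energy of each previous output: negative training MSE (stand-in for
        # the paper's energy network); higher energy = more relevant output
        energy = np.array([-np.mean((y - p) ** 2) for p in prev_fit])
        alpha = softmax(energy)
        fused_fit = sum(a * p for a, p in zip(alpha, prev_fit))
        fused_new = sum(a * p for a, p in zip(alpha, prev_new))
        # The next level sees the original features plus the fused output
        cur_X = np.column_stack([X, fused_fit])
        cur_new = np.column_stack([X_new, fused_new])
    return prev_new[-1]   # prediction of the level-n model
```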
In implementing the attention mechanism, it is first necessary to design an energy computation module for generating the un-normalized energy values of the previous-level model outputs. This module can be a small neural network or a simple linear transformation; its central purpose is to measure the relevance of the previous-level model output to the target task of the current-level model. For example, in the intermittent pumping regime optimization problem, the energy value can be computed by a single-layer neural network whose inputs are the output features of the previous-level model and whose output is a scalar energy value.
After calculating the energy values, they are converted into attention weights by the softmax function:
$$\alpha_{j} = \frac{\exp\left(e_{j}\right)}{\sum_{k}\exp\left(e_{k}\right)}.$$
The softmax function normalizes the energy values into a probability distribution, ensuring that all attention weights sum to 1. This makes the weighted fusion process more stable and rational.
Next, the output of the previous-level model is multiplied by the corresponding attention weights to obtain a weighted feature representation. Then, these weighted features are concatenated with the original input features to form the input features of the next-level model. During training, this model is not only able to learn the information in the original input features but can also utilize the weighted features of the previous-level model's output, capturing the complex relationships in the data more comprehensively.
During the training process of the multi-level model, each level of the model can adopt a similar attention mechanism to gradually fuse the output information of the previous level of the model. This level-by-level feature fusion allows the model to gradually extract higher-level feature representations to further enhance the understanding and prediction of complex data.
Through the synergy of the weight-adaptive updating mechanism and the attention mechanism, the improved CatBoost model is able to better mine the relationships between features, improve prediction accuracy, and enhance stability and adaptability when dealing with complex oil well data. In practical applications, by reasonably setting the model structure and parameters, the improved CatBoost model can provide stronger support for the optimization of the intermittent pumping regime of oil wells and help improve the recovery rate and economic efficiency of oilfields.
6. Optimization Results of the Intermittent Pumping Regime
In order to comprehensively evaluate the performance of the proposed method, this work selected various classical machine learning models, such as K-Nearest Neighbors (KNN), Naive Bayes, Least Absolute Shrinkage and Selection Operator (Lasso), Random Forest, Support Vector Machine, and XGBoost, as well as combinations of LightGBM with models such as XGBoost and CatBoost, and compared them with the improved CatBoost model (Ours) in a comparative experiment. The experiments focus on two key metrics in both large-interval and small-interval pumping scenarios: the mean absolute error (MAE) and the accuracy. The MAE reflects the average difference between the predicted and true values; the smaller its value, the more accurate the prediction. The accuracy reflects the proportion of correct predictions; a higher accuracy means better model performance. The MAE is computed as follows:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|,$$
where $y_i$ is the true value, $\hat{y}_i$ is the predicted value, and $n$ is the sample size.
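For clarity, the two metrics can be computed as follows; the paper does not state an explicit accuracy formula for a continuous target, so the tolerance-based definition below is an assumption.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between true and predicted running times."""
    return np.mean(np.abs(y_true - y_pred))

def accuracy_within(y_true, y_pred, tol=0.10):
    """Fraction of predictions within a relative tolerance of the true value
    (assumed accuracy definition for a continuous target)."""
    return np.mean(np.abs(y_true - y_pred) <= tol * np.abs(y_true))
```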
6.1. Ablation Study
To validate the effectiveness of each component, we conducted ablation experiments.
The ablation results for both large-interval and small-interval production prediction tasks demonstrate that each proposed component contributes positively to model performance. Starting from the CatBoost baseline, the integration of cubic polynomial feature expansion consistently reduces MAE and increases accuracy, indicating that enhanced nonlinear feature interactions are beneficial for capturing production trends. The addition of F-value feature screening further improves performance by removing redundant variables and reducing noise, which stabilizes the learned model. Introducing the weight adaptation mechanism yields another notable performance gain in both MAE reduction and accuracy improvement. This confirms that differentiating the importance of high-error samples helps the model better handle complex or unstable production patterns. The subsequent attention mechanism further enhances the model’s ability to focus on the most influential multivariate relationships, providing clear incremental improvements in both large-interval and small-interval scenarios. Finally, Bayesian hyperparameter optimization produces the best overall results, demonstrating its effectiveness in finding optimal configurations across different data granularities.
Overall, the trends are consistent across both datasets: each component leads to monotonic, step-wise improvements, and the full model achieves the lowest MAE and highest accuracy. These results validate that the proposed enhancements are not merely heuristic additions but mutually reinforcing modules that substantially strengthen CatBoost’s predictive capability under different interval settings.
6.2. Comparison with Baseline and Ensemble Models
The performance of each model in different inter-pumping scenarios is shown in
Table 2.
The comparative experimental results with other algorithms are presented in
Table 3.
In order to deeply analyze the prediction effect of the improved model on the running time of each well number under different inter-pumping scenarios, we plotted the comparison of the running time and predicted running time of large inter-pumping wells and small-interval pumping wells. These visualization charts can intuitively present the dynamic changes and differences between the actual and predicted running times.
In the comparison chart for large-interval pumping wells (
Figure 3), we can observe the correspondence and fluctuation trend between the running time and the predicted running time for different well numbers, and clearly judge the accuracy of the model in predicting each well number in the large-interval pumping wells scenario. The comparison graph of small-interval pumping wells (
Figure 4) further shows the model's time-prediction ability for each well number in the small-interval pumping scenario. By comparing the trends of the two curves, the model's ability to capture and predict the temporal changes in small-interval pumping can be effectively evaluated. These graphs provide an intuitive and powerful visual reference for model performance evaluation.
In terms of model performance, the improved CatBoost model (Ours) performs well through comparison experiments with a variety of classical machine learning models and model combinations. In the large-interval pumping scenario, its average absolute error is only 0.2659, with an accuracy of 0.9805; in the small-interval pumping scenario, the average absolute error is 1.0100, with an accuracy of 0.9422. This shows that the model has higher accuracy than other models in predicting the pumping operation time and recommending the pumping strategy, which can more accurately grasp the pumping pattern of oil wells and provide a reliable basis for actual production.
7. Limitations
Although the proposed data–physics dual-driven model demonstrates strong predictive capability and practical value, several limitations still remain and should be addressed in future work:
- (1)
Dependence on data quality and representativeness.
Although physical constraints reduce overfitting risk, the data-driven component still heavily relies on the accuracy and completeness of historical production data. Noise, sensor drift, and missing values may degrade model performance, especially in low-frequency or irregularly monitored wells.
- (2)
Simplified physical mechanism expressions.
The physical-driven part integrates seepage mechanics, pressure relationships, and pump efficiency, but these formulations inevitably simplify complex subsurface processes. Effects such as multiphase flow transitions, formation heterogeneity, wellbore multiphase interactions, and dynamic reservoir coupling are only partially represented, which may limit model applicability in highly heterogeneous reservoirs.
- (3)
Computational cost of enhanced CatBoost model.
The improved CatBoost incorporates polynomial features, attention-based multi-level modeling, Bayesian optimization, and weight adaptation mechanisms. These enhancements significantly increase computational demand during model training, which may not be suitable for low-resource environments or require longer training cycles for real-time deployment.