A Data-Efficient Surrogate Model via Simplified Feature Extraction and Pre-Training for Automatic History Matching

Qin, Yisen; Li, Huayu; Meng, Xiangling; He, Xiao; Zhang, Jinding; Zhang, Haijun

doi:10.3390/pr14101635

by Yisen Qin ¹, Huayu Li ¹,*, Xiangling Meng ², Xiao He ², Jinding Zhang ³ and Haijun Zhang ²

¹ Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum (East China), Qingdao 266580, China

² School of Computer and Communication, University of Science and Technology Beijing, Beijing 100083, China

³ School of Petroleum Engineering, China University of Petroleum (East China), Qingdao 266580, China

* Author to whom correspondence should be addressed.

Processes 2026, 14(10), 1635; https://doi.org/10.3390/pr14101635

Submission received: 18 April 2026 / Revised: 13 May 2026 / Accepted: 15 May 2026 / Published: 19 May 2026

(This article belongs to the Special Issue Application of Artificial Intelligence in Oil and Gas Engineering)

Abstract

Automatic history matching traditionally depends on a large number of time-consuming numerical simulations, which makes the overall workflow computationally expensive. Deep learning-based surrogate models provide an efficient alternative, but their predictive performance often relies on large labeled datasets, whose generation through reservoir simulation remains costly. To alleviate this issue, we propose a data-efficient surrogate modeling framework for automatic history matching. The framework consists of two components. First, the reservoir parameter field is reformulated as a flattened representation and processed using one-dimensional convolutions. This representation provides a direct connection between parameter encoding and production-sequence prediction while maintaining competitive forecasting accuracy. Second, an autoencoder is pre-trained on unlabeled parameter realizations, and the learned encoder is then used to initialize the surrogate model for supervised regression, thereby improving the utilization of inexpensive, unlabeled data. The proposed framework is evaluated on the three-dimensional Brugge benchmark reservoir model. Results show that the one-dimensional representation achieves competitive predictive accuracy with shorter training time. In addition, the pre-training strategy is particularly beneficial when labeled simulation data are limited. Overall, the proposed framework improves the data efficiency of surrogate-assisted automatic history matching and reduces the dependence on extensive labeled simulations in the Brugge benchmark.

Keywords: history matching; surrogate model; deep learning; pre-training; autoencoder

1. Introduction

Data assimilation optimally merges observational data with numerical model simulations to reduce system uncertainty, thereby enhancing the model’s predictive capabilities [1]. Consequently, it is widely used in various fields such as meteorology, hydrology, and geology [2,3,4]. In petroleum engineering, this concept is applied through automatic history matching [5], which seeks to infer geological parameters by matching simulation results to historical production data. This process not only facilitates more reliable prediction of future reservoir dynamics but also enables computational evaluation of different development strategies, making it a key component of closed-loop reservoir management. Nevertheless, a comprehensive data assimilation workflow typically requires hundreds to thousands of numerical simulations in the forward process. For large-scale engineering problems, this computational burden is often prohibitively expensive, thereby limiting the broader application of such methods.

Surrogate models are an effective strategy for mitigating this computational burden. They establish fast input-output mappings that provide accurate approximations of expensive numerical simulations [6,7]. These techniques fall into two groups: physics-driven and data-driven approaches. Physics-driven methods include projection-based reduced-order models [8,9] and multi-fidelity models [10,11]. These methods rest on strong simplifying assumptions, which may restrict their applicability. Conversely, data-driven approaches capture nonlinear input-output mappings by identifying patterns directly from the data [12]. Traditional methods such as Gaussian processes [13,14], radial basis functions [15], and polynomial chaos expansions [16,17] are recognized for their computational efficiency. However, their utility is generally limited to low-dimensional and relatively simple reservoir models, as their performance often deteriorates on high-dimensional and strongly nonlinear problems [18].

The rapid development of deep learning and data-driven methods in other engineering fields also provides useful context for surrogate modeling. In structural and infrastructure engineering, such methods have been applied to nonlinear response prediction, such as temperature-induced bearing displacement prediction of long-span bridges using DCNN-LSTM models [19]. They have also been used for response reconstruction, such as reconstructing structural acceleration responses under environmental temperature effects using CNN-BiGRU with a squeeze-and-excitation module [20]. In addition, recent studies have reviewed data processing and behavior monitoring techniques for dam health monitoring [21], and missing measurement data recovery methods have been investigated in structural health monitoring [22]. These studies illustrate the broad use of deep learning and data-driven techniques for prediction, reconstruction, monitoring, and data recovery in complex engineering systems. This broader progress provides useful methodological motivation for reservoir surrogate modeling, where the objective is to construct an efficient mapping from high-dimensional geological parameter fields to dynamic production responses.

In recent years, there has been a rapid development of data-driven, deep learning-based surrogate models [23,24,25,26]. Tang et al. [27] utilized a residual U-Net to establish a nonlinear mapping from geological parameters to saturation and pressure fields, thereby predicting reservoir dynamics. In a similar vein, Zhang et al. [28] introduced a dual-channel surrogate based on a densely connected encoder-decoder network to predict saturation and pressure fields from both spatial and vector data. However, these models do not directly provide the production data necessary for history matching, necessitating an additional step involving the application of Peaceman’s equation, which requires storing a substantial volume of intermediate data.

To overcome this limitation, Ma et al. [29] proposed an end-to-end framework. Their model initially extracts spatial features using two-dimensional densely connected convolutional layers and subsequently employs stacked recurrent neural networks (RNNs) to perform regression directly on the production time-series data, thus significantly simplifying the implementation process. Building upon this framework, Zhang et al. [30] adapted the Vision Transformer (ViT) architecture [31] by replacing the RNN with a multi-layer Transformer decoder [32] for the temporal regression module. More recently, Zhang et al. [33] designed a dual surrogate framework that explicitly models the surrogate predictive error and incorporates this metric into the prediction and optimization processes. Furthermore, Zhang et al. [34] enhanced the surrogate accuracy by developing a fully Transformer-based encoder-decoder, replacing the convolutional and RNN modules with a unified architecture for processing both spatial and temporal features.

Although existing deep learning-based surrogate models have shown promising performance in automatic history matching, several challenges still remain. One major limitation is their reliance on large labeled training datasets, whose generation through numerical simulation is computationally expensive. In addition, many existing architectures first extract spatial features using two-dimensional convolutions or Transformer-based encoders and then pass them to a separate temporal regression module. Such a design usually requires an additional feature transformation step between spatial encoding and production sequence prediction through, for example, flattening, pooling, or feature replication across time steps. As a result, feature extraction and temporal regression are often implemented as two relatively separate stages, which may increase model complexity and training cost. By contrast, generating reservoir parameter realizations is relatively inexpensive. These realizations therefore provide an abundant source of unlabeled data that can potentially be exploited to improve data efficiency.

In this work, we develop a data-efficient surrogate modeling framework to address these challenges. First, we reformulate the two-dimensional grid input into a flattened representation and employ one-dimensional convolutions for feature extraction. This representation provides a direct interface between parameter encoding and downstream production sequence prediction, avoiding an additional spatial-to-temporal feature transformation stage. Unlike feature-replication-based image-to-sequence designs, the proposed encoder directly generates a sequence of learned feature representations from the flattened parameter field, so that the temporal regression module receives non-replicated feature inputs before sequence modeling. Second, to better exploit inexpensive unlabeled parameter realizations, we introduce a self-supervised pre-training stage based on an autoencoder and use the pre-trained encoder to initialize the surrogate model for supervised regression.

The proposed framework is evaluated on the large-scale three-dimensional Brugge benchmark and further incorporated into a surrogate-assisted automatic history matching workflow with adaptive differential evolution. Experimental results demonstrate that the proposed architecture, when combined with autoencoder-based pre-training, improves data efficiency and achieves competitive predictive performance, particularly when labeled simulation data are limited.

The contributions of this work can be summarized as follows:

A flattened input representation combined with one-dimensional convolutions is introduced to provide a more direct spatial-to-temporal feature interface for production prediction. Compared with feature-replication-based designs, this architecture reduces intermediate transformation operations and provides non-replicated, learned feature inputs to the temporal regression module.
The proposed surrogate model achieves competitive predictive accuracy while requiring shorter training time than the baseline architecture in the Brugge benchmark.
A pre-training strategy based on unlabeled parameter realizations is incorporated to improve the data efficiency of surrogate modeling under limited labeled simulation data.

The remainder of this paper is organized as follows. Section 2 introduces the surrogate-based automatic history matching method. Section 3 presents the proposed surrogate architecture and pre-training strategy. Section 4 describes the experimental setup and reports the corresponding results. Section 5 and Section 6 provide the discussion and conclusions, respectively.

2. Surrogate-Based Automatic History Matching

2.1. Objective Function and PCA-Based Dimensionality Reduction

Automatic history matching encompasses both a forward and an inverse process. In the forward process, the observed data are expressed as follows in Equation (1):

$$d_{obs} = G(m) + \varepsilon, \tag{1}$$

where $m$ denotes the model parameter vector, $G(\cdot)$ represents the forward simulation process, which can be conducted using a high-fidelity reservoir simulator, and $\varepsilon$ symbolizes the observation error vector.

According to Bayesian theory [35], the posterior probability of the model parameters, denoted as $f(m \mid d_{obs})$, is defined in Equation (2):

$$f(m \mid d_{obs}) \propto f(d_{obs} \mid m)\, f(m), \tag{2}$$

where $f(d_{obs} \mid m)$ is the likelihood function that quantifies the misfit between the predictions of the forward model and the observed data, and $f(m)$ represents the prior probability density function of the model parameters.

Assuming that both the observation errors and the prior probability density are governed by Gaussian distributions, the posterior probability density function can be expressed as shown in Equation (3):

$$f(m \mid d_{obs}) \propto \exp\left[-\frac{1}{2}(d_{obs} - G(m))^{T} C_{D}^{-1} (d_{obs} - G(m)) - \frac{1}{2}(m - m_{pr})^{T} C_{M}^{-1} (m - m_{pr})\right], \tag{3}$$

where $C_{D}$ is the observation error covariance matrix, $C_{M}$ is the prior model covariance matrix, and $m_{pr}$ is the prior mean vector. Maximizing the posterior probability as stated in Equation (3) is equivalent to minimizing the objective function, which is detailed in Equation (4):

$$O(m) = \frac{1}{2}(d_{obs} - G(m))^{T} C_{D}^{-1} (d_{obs} - G(m)) + \frac{1}{2}(m - m_{pr})^{T} C_{M}^{-1} (m - m_{pr}). \tag{4}$$

In Equation (4), the first term minimizes the misfit between the observed data and the forward model predictions and the second term regularizes the solution to adhere to the prior model distribution.
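To make the two terms of Equation (4) concrete, the following minimal sketch evaluates the objective with NumPy. The linear map `A` is only a stand-in for the reservoir simulator $G(\cdot)$, and the noise level and prior settings are hypothetical values chosen for illustration; none of them come from the study.

```python
import numpy as np

def objective(m, d_obs, forward, c_d_inv, m_pr, c_m_inv):
    """Evaluate O(m) as written in Equation (4)."""
    r_d = d_obs - forward(m)   # data misfit d_obs - G(m)
    r_m = m - m_pr             # deviation from the prior mean
    return 0.5 * r_d @ c_d_inv @ r_d + 0.5 * r_m @ c_m_inv @ r_m

# Toy linear forward model standing in for the simulator (assumption).
A = np.array([[1.0, 2.0], [0.0, 1.0], [1.0, 0.0]])
forward = lambda m: A @ m

m_true = np.array([0.5, -1.0])
d_obs = forward(m_true)            # noise-free synthetic observations
c_d_inv = np.eye(3) / 0.1**2       # inverse of C_D for a hypothetical std of 0.1
m_pr = np.zeros(2)                 # hypothetical prior mean
c_m_inv = np.eye(2)                # inverse of C_M for a unit prior covariance

o_true = objective(m_true, d_obs, forward, c_d_inv, m_pr, c_m_inv)
o_prior = objective(m_pr, d_obs, forward, c_d_inv, m_pr, c_m_inv)
```

At `m_true` the data misfit vanishes and only the prior term remains, whereas at the prior mean the prior term vanishes and only the data misfit remains, which is the trade-off the objective balances.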

In the context of optimization algorithms applied to the history matching problem, it is customary to initially reduce the dimensionality of high-dimensional reservoir parameters. The reduced-dimensional vector is termed the latent vector. Throughout the iterative optimization, updates are applied exclusively to these latent vectors, which are subsequently reconstructed back to the original, full-dimensional space. In this study, we employed principal component analysis (PCA) to achieve this dimensionality reduction. To ensure that the analysis is insensitive to the mean of the data ensemble, the $N_{r}$ prior realizations of the reservoir parameters are first centered as demonstrated in Equation (5):

$$M_{c} = \frac{1}{\sqrt{N_{r} - 1}} \begin{bmatrix} m_{1} - \bar{m} & m_{2} - \bar{m} & \cdots & m_{N_{r}} - \bar{m} \end{bmatrix}, \tag{5}$$

where $M_{c} \in \mathbb{R}^{N_{m} \times N_{r}}$ is the centered ensemble matrix, $m_{i} \in \mathbb{R}^{N_{m} \times 1}$ represents the i-th prior realization, and $\bar{m} \in \mathbb{R}^{N_{m} \times 1}$ indicates the mean of the prior realizations.

Given that the number of prior samples $N_{r}$ is typically far smaller than the parameter dimension $N_{m}$, we employed singular value decomposition (SVD) on the matrix $M_{c}$, which can be expressed as

$$M_{c} = U \Sigma V^{T}, \tag{6}$$

where $U \in \mathbb{R}^{N_{m} \times N_{m}}$ and $V \in \mathbb{R}^{N_{r} \times N_{r}}$ denote the left and right singular matrices, respectively, and $\Sigma \in \mathbb{R}^{N_{m} \times N_{r}}$ is the diagonal matrix containing the singular values.

New realizations can subsequently be generated as detailed in Equation (7):

$$m^{sample} = U_{l} \Sigma_{l} \xi + \bar{m}, \tag{7}$$

where $U_{l}$ consists of the first $l$ columns of $U$, $\Sigma_{l}$ is the diagonal matrix composed of the $l$ largest singular values, and $\xi$ is the low-dimensional latent vector for $m^{sample}$, following a standard normal distribution. Accordingly, the objective function is reformulated in terms of the latent vector $\xi$, as indicated in Equation (8):

$$O(m) = \frac{1}{2}(d_{obs} - G(m))^{T} C_{D}^{-1} (d_{obs} - G(m)) + \frac{1}{2} \xi^{T} \xi. \tag{8}$$
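The PCA steps in Equations (5) to (7) can be sketched with NumPy as follows. The ensemble here is random data with hypothetical sizes (`n_m`, `n_r`, `l`), not the Brugge realizations, and the thin SVD is used because only the leading columns of $U$ are needed. The last lines illustrate why the prior term of Equation (4) becomes $\frac{1}{2} \xi^{T} \xi$ in Equation (8): within the retained subspace, the latent vector can be recovered from $m^{sample} - \bar{m}$ by projecting onto $U_{l}$ and dividing by the singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_m, n_r, l = 50, 20, 5                      # hypothetical sizes
prior = rng.normal(size=(n_m, n_r))          # columns are prior realizations m_i

# Equation (5): centered and scaled ensemble matrix.
m_bar = prior.mean(axis=1, keepdims=True)
M_c = (prior - m_bar) / np.sqrt(n_r - 1)

# Equation (6): thin SVD, keeping the l leading components.
U, s, Vt = np.linalg.svd(M_c, full_matrices=False)
U_l, S_l = U[:, :l], np.diag(s[:l])

def sample_realization(xi):
    """Equation (7): map a latent vector xi to a full parameter vector."""
    return U_l @ S_l @ xi + m_bar[:, 0]

xi = rng.standard_normal(l)
m_sample = sample_realization(xi)

# Recover the latent vector and evaluate the regularization term of Equation (8).
xi_recovered = (U_l.T @ (m_sample - m_bar[:, 0])) / s[:l]
regularization = 0.5 * xi @ xi
```

Because `M_c @ M_c.T` equals the sample covariance of the ensemble, sampling $\xi$ from a standard normal distribution yields realizations whose covariance approximates the prior covariance in the retained directions.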

2.2. Adaptive Differential Evolution

Upon completing the forward process, the automatic history matching inversion is conducted using an optimization algorithm. In this study, we utilize the adaptive differential evolution with optional external archive (JADE) algorithm [36], which has demonstrated efficacy in previous automatic history matching studies [33,37].

JADE dynamically adjusts the scaling factor $F$ and the crossover rate $CR$. For each individual within a generation, distinct pairs of $(F, CR)$ are generated by sampling from respective distributions:

$$CR_{i} = \mathrm{randn}_{i}(\mu_{CR}, 0.1), \tag{9}$$

$$F_{i} = \mathrm{randc}_{i}(\mu_{F}, 0.1), \tag{10}$$

where $CR_{i}$ is sampled from a normal distribution with mean $\mu_{CR}$ and standard deviation 0.1. Concurrently, $F_{i}$ is derived from a Cauchy distribution with location parameter $\mu_{F}$ and scale parameter 0.1. The parameters $\mu_{CR}$ and $\mu_{F}$ are initially set at the commencement of the optimization and are adaptively updated at the end of each generation:

$$\mu_{CR} = (1 - c) \cdot \mu_{CR} + c \cdot \mathrm{mean}_{A}(S_{CR}), \tag{11}$$

$$\mu_{F} = (1 - c) \cdot \mu_{F} + c \cdot \mathrm{mean}_{L}(S_{F}), \tag{12}$$

where $S_{CR}$ and $S_{F}$ represent the set of parameters corresponding to individuals that were successfully updated during the selection operation, $c \in (0, 1)$ is a positive constant, and $\mathrm{mean}_{A}(\cdot)$ and $\mathrm{mean}_{L}(\cdot)$ denote the arithmetic and Lehmer means, respectively.
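The sampling and update rules in Equations (9) to (12) can be sketched as below. The truncation of $CR_{i}$ to $[0, 1]$, the regeneration of non-positive $F_{i}$, and the cap of $F_{i}$ at 1 follow the usual description of JADE and are assumptions here, since the text does not spell them out. The starting means of 0.5 and the constant `c = 0.1` are likewise illustrative values.

```python
import numpy as np

def sample_cr_f(mu_cr, mu_f, n, rng):
    """Draw one (CR_i, F_i) pair per individual, Equations (9) and (10)."""
    cr = np.clip(rng.normal(mu_cr, 0.1, size=n), 0.0, 1.0)
    f = mu_f + 0.1 * rng.standard_cauchy(size=n)
    # Assumed handling of out-of-range F: redraw if F <= 0, cap at 1.
    while np.any(f <= 0):
        bad = f <= 0
        f[bad] = mu_f + 0.1 * rng.standard_cauchy(size=bad.sum())
    return cr, np.minimum(f, 1.0)

def lehmer_mean(values):
    """Lehmer mean sum(F^2) / sum(F), which favors larger F values."""
    values = np.asarray(values, dtype=float)
    return np.sum(values**2) / np.sum(values)

def update_means(mu_cr, mu_f, s_cr, s_f, c=0.1):
    """Equations (11) and (12); means stay unchanged if nothing succeeded."""
    if len(s_cr) > 0:
        mu_cr = (1 - c) * mu_cr + c * np.mean(s_cr)
    if len(s_f) > 0:
        mu_f = (1 - c) * mu_f + c * lehmer_mean(s_f)
    return mu_cr, mu_f

rng = np.random.default_rng(1)
cr, f = sample_cr_f(0.5, 0.5, 100, rng)
mu_cr, mu_f = update_means(0.5, 0.5, s_cr=[0.2, 0.4], s_f=[1.0, 3.0], c=0.1)
```

In the small example, the arithmetic mean of the successful crossover rates is 0.3 and the Lehmer mean of the successful scaling factors is 2.5, so the updated means are 0.48 and 0.70.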

Regarding the archive operation, an initially empty archive set $A$ is established. During each selection step, any parent solution replaced by a more successful trial vector is added to $A$. Should the archive size surpass a predefined threshold (equivalent to the population size in this study), individuals are randomly removed from $A$. The mutation operator in the JADE algorithm is formulated as in Equation (13):

$$v_{i,g} = x_{i,g} + F_{i} \cdot (x_{best,g}^{p} - x_{i,g}) + F_{i} \cdot (x_{r1,g} - \tilde{x}_{r2,g}), \tag{13}$$

where $x_{i,g}$ denotes the i-th individual in the g-th generation, $v_{i,g}$ indicates the corresponding mutant vector, $F_{i}$ is the scaling factor associated with the i-th individual, and $x_{best,g}^{p}$ is an individual selected randomly from the top p percentile of the current population based on fitness. Given the current population, expressed as the set $P$, $x_{r1,g}$ is randomly selected from the set $P$, ensuring that it is distinct from $x_{i,g}$, and $\tilde{x}_{r2,g}$ is chosen from the augmented set $P \cup A$, also ensuring that it is distinct from both $x_{i,g}$ and $x_{r1,g}$.
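A minimal sketch of the archive update and the mutation in Equation (13) is given below. The function names, the interpretation of `p` as a fraction of the population, and the use of a minimized fitness are assumptions made for illustration. The archive is stored as an array of shape `(k, dim)` so that it can be stacked under the population to form $P \cup A$.

```python
import numpy as np

def mutate(pop, fitness, archive, f, p=0.1, rng=None):
    """Mutation of Equation (13) for every individual (minimization)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(pop)
    n_best = max(1, int(round(p * n)))
    best_idx = np.argsort(fitness)[:n_best]            # indices of the top individuals
    union = np.vstack([pop, archive]) if len(archive) > 0 else pop
    mutants = np.empty_like(pop)
    for i in range(n):
        pbest = pop[rng.choice(best_idx)]
        # r1 comes from P, r2 from the union of P and A; both differ from i.
        r1 = rng.choice([k for k in range(n) if k != i])
        r2 = rng.choice([k for k in range(len(union)) if k not in (i, r1)])
        mutants[i] = pop[i] + f[i] * (pbest - pop[i]) + f[i] * (pop[r1] - union[r2])
    return mutants

def update_archive(archive, replaced_parents, max_size, rng):
    """Add replaced parents, then randomly drop entries above max_size."""
    if len(replaced_parents) > 0:
        archive = np.vstack([archive, replaced_parents])
    while len(archive) > max_size:
        archive = np.delete(archive, rng.integers(len(archive)), axis=0)
    return archive

rng = np.random.default_rng(0)
pop = rng.normal(size=(8, 3))                # toy population of latent vectors
fitness = np.sum(pop**2, axis=1)             # toy objective values
archive = np.empty((0, 3))                   # initially empty archive A
mutants = mutate(pop, fitness, archive, np.full(8, 0.5), rng=rng)
archive = update_archive(archive, pop[:5], max_size=3, rng=rng)
mutants_with_archive = mutate(pop, fitness, archive, np.full(8, 0.5), rng=rng)
```

Drawing the second difference vector from the archive as well as the population reuses recently replaced parents, which adds diversity to the search directions without extra objective evaluations.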

The automatic history matching workflow comprising the surrogate model preparation and the optimization iteration is shown in Algorithms 1 and 2.

Algorithm 1 Data generation and model training.

Input: the prior reservoir parameter $m^{prior}$, the number of training samples $N_{sample}$, the latent vector dimension $l$, the initialized surrogate model $f_{surrogate}^{init}$
1. Apply Equation (6) on the centered matrix of $m^{prior}$.
2. Store the $U_{l}, \Sigma_{l}, \bar{m}$ from step 1.
for i = 1, …, $N_{sample}$ do
3. Sample a random vector $\xi_{i}$ from a standard normal distribution.
4. Generate $m_{i}^{sample}$ by $U_{l}, \Sigma_{l}, \bar{m}, \xi_{i}$ using Equation (7).
5. Run numerical simulator with $m_{i}^{sample}$ to get production data $d_{i}^{sample}$.
end for
6. Construct the training dataset $D_{train} = \{m^{sample}, d^{sample}\}$.
7. Train the $f_{surrogate}^{init}$ using $D_{train}$ to obtain $f_{surrogate}$.
Output: $U_{l}$, $\Sigma_{l}$, $\bar{m}$, $f_{surrogate}$

Algorithm 2 History matching based on offline surrogate with JADE.

Input: $U_{l}$, $\Sigma_{l}$, $\bar{m}$, $f_{surrogate}$, the observed data $d_{obs}$, the number of iterations $N_{iter}$, the population size $NP$, the initial $\mu_{CR}$, $\mu_{F}$, $A$
1. Generate initial population $\xi^{prior} = \{\xi_{1}, \dots, \xi_{NP}\}$.
for i = 1, …, $N_{iter}$ do
2. Get the coefficients $CR_{i}, F_{i}$ using Equations (9) and (10).
for j = 1, …, $NP$ do
3. Generate $m_{j}^{prior}$ by $U_{l}$, $\Sigma_{l}$, $\bar{m}$, $\xi_{j}^{prior}$ using Equation (7).
4. Get $d_{j}^{pred}$ using $f_{surrogate}(m_{j}^{prior})$.
5. Calculate the objective value $O_{j}^{prior}$ by $d_{obs}$.
end for
6. Generate the $\xi_{trial}$ through mutation and crossover and reconstruct $m^{trial}$ using Equation (7).
7. Use $f_{surrogate}(m^{trial})$ to obtain the prediction and calculate $O^{trial}$ by $d_{obs}$.
8. Obtain $\xi_{posterior}$ through $O^{prior}$ and $O^{trial}$.
9. $\xi_{prior} = \xi_{posterior}$.
10. Update $\mu_{CR}$, $\mu_{F}$ and $A$.
end for
Output: $\xi_{posterior}$

3. Methodology

3.1. Simplified Feature Extraction Design

The end-to-end surrogate model for the reservoir simulator can be conceptualized as a mapping function $f$, as expressed in Equation (14):

$$f : X \in \mathbb{R}^{H \times W \times (D \times N_{p})} \to Y \in \mathbb{R}^{T \times N_{f}}, \tag{14}$$

where $X$ and $Y$ denote the input and output of the surrogate model, respectively. The variables $H$, $W$, and $D$ represent the discrete spatial resolutions in three dimensions. The symbol $N_{p}$ refers to the number of parameter field types, such as permeability and porosity, $T$ denotes the total number of simulation timesteps, and $N_{f}$ represents the number of production data types, including oil and water production rates.

A common strategy in existing studies is to represent the three-dimensional parameter field as a two-dimensional image. In such models, the x and y directions are treated as spatial dimensions, while the z direction is regarded as the channel dimension of the surrogate model input. Accordingly, two-dimensional convolutional networks are commonly used for feature extraction. One advantage of this representation is its flexibility, as it allows parameter fields of different sizes to be processed within a unified framework. However, a practical challenge arises when linking the extracted spatial features to the dynamic production sequences. To handle this issue, existing models often introduce an additional transformation step between spatial encoding and temporal prediction, such as flattening, pooling, or replicating spatial features across timesteps. As a result, feature extraction and time series regression are usually implemented as two relatively separate stages, which may increase model complexity and training costs.

To improve the efficiency of surrogate modeling for automatic history matching, we build upon the architecture of [29] and introduce a flattened one-dimensional feature extraction design together with an autoencoder-based pre-training strategy. The overall structure of the proposed model is shown in Figure 1, with our primary modifications detailed as follows:

The two-dimensional spatial grid defined on the x-y plane is reshaped into a one-dimensional sequence. For example, a parameter field of size $(139, 48, 9)$ is reformulated as an input matrix of size $(139 \times 48, 9)$ . This transformation provides a direct interface between the input parameter representation and the temporal regression module. It also avoids a separate spatial-to-temporal feature conversion stage. Although such reshaping may weaken the explicit preservation of local spatial adjacency, its effectiveness for production forecasting is evaluated empirically in Section 4. It should also be noted that the proposed representation does not explicitly model geological structure continuity in the same way as two-dimensional or three-dimensional convolutional encoders. In this study, the surrogate model is designed for well-production forecasting, where the target variables are integrated dynamic responses of the reservoir rather than spatial pressure or saturation fields. Therefore, the flattened representation is used as a practical feature interface for production-sequence prediction. Nevertheless, for reservoirs with stronger channelized structures or more complex geological continuity, the loss of explicit spatial adjacency may have a greater impact and should be further investigated.
Based on the flattened representation, one-dimensional convolution is employed for feature extraction. In our implementation, the number of convolutional filters is set to the number of simulation time steps, T. Together with the flattened input representation, this design provides a direct feature interface for the subsequent time-series regression module. For clarity, a feature replication strategy can be conceptually written as

$S_{rep} = [z, z, \dots],$

where $z$ denotes a global spatial feature vector extracted by the spatial encoder. In such a design, the temporal regression module receives repeated copies of the same global feature before sequence modeling. In the proposed design, the feature sequence is instead generated as

$S_{prop} = [s_{1}, s_{2}, \dots],$

where each $s_{i}$ is produced by the one-dimensional convolutional encoder from the flattened parameter representation. Therefore, the temporal regression module receives non-replicated learned feature inputs, while the direct feature interface also reduces the need for an additional spatial-to-temporal transformation stage. The overall effectiveness of this one-dimensional feature extraction design is evaluated through the comparative experiments presented in Section 4.

Overall, the proposed design focuses on a production-forecasting surrogate in which the parameter representation is directly connected to the temporal regression module. The goal is not to replace spatial encoders in all reservoir modeling tasks but to examine whether a flattened representation can provide an effective and practical alternative for history matching.
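The reshaping step and the feature-replication baseline described above can be illustrated with a short NumPy sketch. The row-major flattening order, the toy value of `T`, and the mean pooling used to build the global feature `z` are assumptions for illustration; the text does not specify them. The sketch also makes the adjacency trade-off visible: neighbors along the second grid axis remain adjacent in the flattened sequence, while neighbors along the first axis end up `W` positions apart.

```python
import numpy as np

H, W, D = 139, 48, 9          # example field size quoted in the text
T = 4                         # toy number of time steps (assumption, kept small)

# Fill the field with a running index so the mapping is easy to inspect.
field = np.arange(H * W * D, dtype=np.float64).reshape(H, W, D)

# Assumed row-major flattening of the two grid axes: cell (i, j) -> row i * W + j.
flat = field.reshape(H * W, D)

def row_of(i, j):
    """Row index of grid cell (i, j) in the flattened representation."""
    return i * W + j

# Distance in the flattened sequence between neighboring cells.
gap_second_axis = row_of(10, 6) - row_of(10, 5)
gap_first_axis = row_of(11, 5) - row_of(10, 5)

# Feature replication: one global feature vector z copied to every time step.
# Mean pooling is a hypothetical stand-in for a spatial encoder.
z = flat.mean(axis=0)
S_rep = np.tile(z, (T, 1))    # shape (T, D); every row is the same vector
```

Every row of `S_rep` is identical, so a temporal module fed with it must create all time dependence itself, which is the contrast the proposed non-replicated feature sequence is meant to address.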

After feature extraction, a dropout layer [38] is applied for regularization. The resulting feature representation is then fed into a recurrent time series regression module. In this study, we adopt an LSTM [39] because of its effectiveness in modeling long production sequences. Finally, a time-distributed dense layer is used to generate the predicted production data at all time steps.

3.2. Pre-Training Strategy

As discussed in Section 2, the PCA-based method enables the generation of new parameter realizations at relatively low computational cost, thereby providing a large amount of unlabeled data without corresponding simulation outputs. To make use of these data, and inspired by the work of [40], we introduce a self-supervised pre-training stage based on an autoencoder. In this stage, the autoencoder is trained on the generated parameter realizations before the subsequent supervised training of the surrogate model.

3.2.1. Encoder and Decoder

The architecture of the autoencoder is illustrated in Figure 2. Before being fed into the network, the input data undergoes two preprocessing steps: (1) it is resized following the strategy of [33] and (2) it is then reshaped into a one-dimensional form along the x and y directions, which serves as the final input to the autoencoder.

To maintain consistency with the surrogate model, the autoencoder is also constructed using one-dimensional convolutional layers. After the dimensionality adjustment, the data is fed into the encoder, which has $D \times N_{p}$ input channels. For a single-layer autoencoder, the t-th feature map of the encoder, denoted by $h_{t}$, is defined by Equation (15):

$$h_{t} = \sigma\left(\sum_{c = 1}^{C} x_{c} * W_{c,t} + b_{t}\right), \tag{15}$$

where $x_{c}$ denotes the c-th input channel, $W_{c,t}$ denotes the weight of the t-th convolutional filter applied to the c-th input channel, $*$ represents the convolution operator, $b_{t}$ is the bias term associated with the t-th filter, and $\sigma$ denotes the nonlinear activation function, which is chosen as the rectified linear unit (ReLU) in this study. In our implementation, the number of filters is set to the number of timesteps, and the filter index $t$ therefore ranges from 1 to $T$.
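Equation (15) can be written out directly in NumPy for a single layer, as sketched below. The sizes, the random weights, the kernel width, and the absence of padding are toy assumptions, and the operator is implemented as cross-correlation, as is conventional in deep learning frameworks. Reading each of the $T$ feature maps as the input for one time step (the transpose at the end) is one plausible interpretation of the text, not a confirmed detail of the authors' network.

```python
import numpy as np

def conv1d_relu(x, w, b, stride=1):
    """Feature maps h_t = ReLU(sum_c x_c * W_{c,t} + b_t), as in Equation (15).

    x: (L, C) flattened parameter field, w: (K, C, T) filters, b: (T,) biases.
    Returns an array of shape (L_out, T); column t holds the feature map h_t.
    No padding is applied, so L_out = floor((L - K) / stride) + 1.
    """
    k = w.shape[0]
    # windows[l, c, j] = x[l + j, c], with shape (L - K + 1, C, K) before striding.
    windows = np.lib.stride_tricks.sliding_window_view(x, k, axis=0)[::stride]
    pre_activation = np.einsum("lck,kct->lt", windows, w) + b
    return np.maximum(pre_activation, 0.0)

rng = np.random.default_rng(0)
L, C, K, T = 64, 9, 3, 5                 # toy sizes (assumptions)
x = rng.normal(size=(L, C))
w = rng.normal(size=(K, C, T))
b = np.zeros(T)

h = conv1d_relu(x, w, b, stride=2)       # stride 2 halves the sequence length
S_prop = h.T                             # assumed reading: one row per filter index t
```

Unlike a replicated global feature, the rows of `S_prop` come from different filters and therefore differ from one another.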

After the input data is transformed into a low-dimensional latent representation by the multi-layer encoder, it is passed to the decoder for reconstruction. The decoder restores the latent representation to the original data dimensions through a series of transposed convolutional layers. For a specific decoder layer, the t-th output feature map $y_{t}$ is defined by Equation (16):

$$y_{t} = \sigma\left(\sum_{c = 1}^{C} h_{c} * \tilde{W}_{c,t} + b_{t}\right), \tag{16}$$

where $h_{c}$ denotes the latent feature generated by the encoder and $\tilde{W}_{c,t}$ denotes the weight of the transposed convolutional kernel applied to the c-th input channel to produce the t-th output channel.

3.2.2. Transition Block

Because the output of the autoencoder does not have the same dimensions as the original input, the reconstruction loss cannot be computed directly. Therefore, a transition block is introduced to map the reconstructed features back to the original input space. This block reverses the preprocessing procedure through three steps: (1) reshaping the one-dimensional feature vector back into a two-dimensional spatial representation; (2) applying a two-dimensional convolution with the same padding to restore the original channel dimension without changing the spatial size; and (3) resizing the result to recover the original parameter dimensions.

3.2.3. Training Framework

An overview of the training framework is shown in Figure 3. The training process consists of two stages: (1) the autoencoder is pre-trained on a large set of prior realizations to learn useful feature representations and (2) the weights of the pre-trained encoder are used to initialize the feature extraction layers of the surrogate model, which is then fine-tuned for the downstream time series regression task.

Different from conventional convolutional autoencoders, the proposed model uses strided convolutions with a stride of 2 for downsampling instead of pooling layers. Before each downsampling operation, a resizing step is applied to standardize the feature dimensions, and the feature size is then reduced through strided convolution. This implementation keeps the encoder structure consistent with the surrogate model described in Section 3.1.

Pre-training the encoder on a large set of readily generated parameter realizations allows the model to make use of inexpensive unlabeled data before the downstream supervised regression stage [41]. In practice, this strategy is simple to implement and introduces only limited additional computational cost, while providing performance improvements in the experiments.

4. Case Study

4.1. Experimental Setup

This subsection describes the training settings and implementation details used in this study. We apply min-max normalization separately to the parameter fields and the production data, as defined in Equation (17):

$$\hat{D} = \begin{cases} \dfrac{D - \min(D)}{\max(D) - \min(D)} & \text{if } \max(D) \neq \min(D), \\ 0 & \text{otherwise}, \end{cases} \tag{17}$$

where $D$ denotes the original data, $\hat{D}$ denotes the normalized data, and $\min(\cdot)$ and $\max(\cdot)$ denote the minimum and maximum operators, respectively. For the production data, these values are computed over the entire time series of each sample.
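A minimal NumPy sketch of Equation (17), including the constant-data branch, is shown below. The `axis` argument and the array layout `(samples, timesteps, channels)` are assumptions; normalizing along the time axis is one way to read "over the entire time series of each sample".

```python
import numpy as np

def min_max_normalize(d, axis=None):
    """Equation (17): scale to [0, 1], returning 0 where max equals min."""
    d = np.asarray(d, dtype=float)
    d_min = d.min(axis=axis, keepdims=True)
    d_max = d.max(axis=axis, keepdims=True)
    span = d_max - d_min
    safe_span = np.where(span == 0, 1.0, span)   # avoid division by zero
    return np.where(span == 0, 0.0, (d - d_min) / safe_span)

# Global normalization, e.g., for a parameter field.
field_norm = min_max_normalize([2.0, 4.0, 6.0])

# Per-sample normalization along the time axis (assumed layout: N, T, channels).
production = np.array([[[0.0], [5.0], [10.0]],
                       [[3.0], [3.0], [3.0]]])   # second sample is constant
production_norm = min_max_normalize(production, axis=1)
```

The constant second sample maps to zeros instead of producing a division by zero, which matches the second branch of the equation; this case can occur, for example, for a well with a flat rate over the simulated period.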

The surrogate model and the autoencoder were implemented using the TensorFlow framework [42]. The detailed network architectures are listed in Table 1 and Table 2, respectively.

The models were trained using the Adam optimizer [43] with a batch size of 32 for 200 epochs. The training objective was to minimize the mean squared error ($MSE$) loss, as defined in Equation (18):

$$MSE = \frac{1}{N} \sum_{i = 1}^{N} (y_{i}^{sim} - y_{i}^{sur})^{2}, \tag{18}$$

where $N$ denotes the total number of samples, $y_{i}^{sim}$ denotes the i-th high-fidelity simulation output, and $y_{i}^{sur}$ denotes the corresponding prediction produced by the surrogate model.

The predictive performance of the model was evaluated using two standard metrics: the coefficient of determination ($R^{2}$) and the root mean squared error ($\mathrm{RMSE}$), defined as follows:

$$R^{2} = 1 - \frac{\sum_{i=1}^{N} \left( y_{i}^{\mathrm{sim}} - y_{i}^{\mathrm{sur}} \right)^{2}}{\sum_{i=1}^{N} \left( y_{i}^{\mathrm{sim}} - \bar{y} \right)^{2}}, \tag{19}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_{i}^{\mathrm{sim}} - y_{i}^{\mathrm{sur}} \right)^{2}}, \tag{20}$$

where $\bar{y}$ denotes the mean of $y^{\mathrm{sim}}$. A higher $R^{2}$ value and a lower $\mathrm{RMSE}$ value indicate better predictive accuracy of the surrogate model.
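The loss and the two metrics in Equations (18)–(20) can be computed as in the following minimal NumPy sketch. The function names are illustrative, and the sketch assumes that all outputs are pooled into flat arrays before the sums are taken; the article does not state how the values are aggregated across wells and timesteps.

```python
import numpy as np


def mse(y_sim, y_sur):
    """Mean squared error, Equation (18)."""
    y_sim = np.asarray(y_sim, dtype=float)
    y_sur = np.asarray(y_sur, dtype=float)
    return float(np.mean((y_sim - y_sur) ** 2))


def rmse(y_sim, y_sur):
    """Root mean squared error, Equation (20)."""
    return float(np.sqrt(mse(y_sim, y_sur)))


def r2_score(y_sim, y_sur):
    """Coefficient of determination, Equation (19).

    Undefined when y_sim is constant, because the denominator is zero.
    """
    y_sim = np.asarray(y_sim, dtype=float)
    y_sur = np.asarray(y_sur, dtype=float)
    ss_res = np.sum((y_sim - y_sur) ** 2)
    ss_tot = np.sum((y_sim - y_sim.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)
```

Because $\mathrm{RMSE}$ is the square root of $\mathrm{MSE}$, minimizing the training loss in Equation (18) also minimizes the reported $\mathrm{RMSE}$.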

The experiments were executed on a workstation equipped with an NVIDIA GeForce RTX 4090 GPU (NVIDIA Corporation, Santa Clara, CA, USA) and an Intel Core i9-13900HX CPU (Intel Corporation, Santa Clara, CA, USA).

4.2. Dataset

The case study is based on the three-dimensional Brugge reservoir model [44], as shown in Figure 4. The dataset contains an ensemble of 104 geological realizations, each consisting of $9 \times 48 \times 139 = 60{,}048$ gridblocks, of which 44,550 are active. The uncertainty is represented by several key reservoir parameters, including porosity; permeability in the x, y, and z directions; and the net-to-gross ratio (NTG).

The reservoir model contains 10 injection wells and 20 production wells. The production data cover a 10-year period and include the oil production rate (OPR) and water production rate (WPR) for the 20 production wells, as well as the bottomhole pressure (BHP) for all operating wells. The simulations were performed using the Schlumberger Eclipse black oil simulator under a two-phase (oil-water) flow setting with 253 timesteps.

In our experiments, the output data include only the oil and water production rates from the 20 production wells. Because injection wells are not included in the surrogate outputs, the term "well" refers to production wells only in the following discussion. Each sample therefore has an output vector of $253 \times (20 + 20) = 10{,}120$ values. The observational data are generated by adding Gaussian noise with a standard deviation equal to 3% of the corresponding true production value.
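The output size and the noise model described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name `add_observation_noise`, the seed handling, and the use of the absolute value of the true rate to keep the standard deviation non-negative are all choices made for this sketch.

```python
import numpy as np

N_TIMESTEPS = 253
N_PRODUCERS = 20
# One OPR series and one WPR series per production well.
N_OUTPUTS = N_TIMESTEPS * (N_PRODUCERS + N_PRODUCERS)  # 10,120 values per sample


def add_observation_noise(y_true, rel_std=0.03, seed=0):
    """Add zero-mean Gaussian noise whose std is rel_std times each true value."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true, dtype=float)
    sigma = rel_std * np.abs(y_true)
    # Scale standard-normal draws so that zero rates receive exactly zero noise.
    return y_true + sigma * rng.standard_normal(y_true.shape)
```

With this proportional noise model, periods with zero water production stay noise-free, while high-rate periods receive proportionally larger perturbations.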

To generate the dataset for offline training, we adopt the PCA-based method described in Section 2.1. The latent dimension of each parameter field is set to 100 according to the energy criterion, under which the selected principal components retain 95% of the total variance of the original parameter ensemble. In total, 2400 samples are generated, of which 200 are used for validation, 200 are reserved for testing, and the remaining 2000 are used to construct the training sets. All generated samples were randomly shuffled before partitioning. Because the task is a regression problem with no discrete class labels, stratified sampling was not applied. We conducted five repeated random splits using different random seeds. For each split, the validation and test sets each contained 200 samples, and the remaining samples were used to construct training sets of different sizes. The mean and standard deviation of $\mathrm{RMSE}$ and $R^{2}$ over the five splits are reported. In addition, 10,000 extra realizations are generated for pre-training.
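The energy criterion and the repeated random partitioning can be sketched as below. Both functions are hypothetical helpers written for illustration: the sketch assumes that the variance retained by the leading principal components is measured through the squared singular values of the centered ensemble, and that training subsets of different sizes are drawn as prefixes of the shuffled training pool.

```python
import numpy as np


def n_components_for_energy(realizations, energy=0.95):
    """Smallest number of principal components retaining `energy` of the variance.

    realizations: array of shape (n_realizations, n_features).
    """
    x = np.asarray(realizations, dtype=float)
    centered = x - x.mean(axis=0, keepdims=True)
    # Squared singular values are proportional to the variance per component.
    s = np.linalg.svd(centered, compute_uv=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(ratio, energy) + 1)


def split_indices(n_total=2400, n_val=200, n_test=200, seed=0):
    """Shuffle sample indices and split them into training pool, validation, test."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_total)
    val = perm[:n_val]
    test = perm[n_val:n_val + n_test]
    train_pool = perm[n_val + n_test:]
    return train_pool, val, test


# Five repeated splits with different seeds; e.g., a 1000-sample training set
# is taken as the first 1000 indices of each training pool.
splits = [split_indices(seed=s) for s in range(5)]
train_1000 = [train_pool[:1000] for train_pool, _, _ in splits]
```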

4.3. Performance of the Proposed Surrogate Architecture

To evaluate the performance of the proposed surrogate architecture, we conducted a comparative study between two models. The baseline model described in [29] is denoted as Model1, while the proposed model is denoted as Model2.

For training, the initial learning rate was set to 0.001 and was adjusted dynamically according to the $R^{2}$ score on the validation set. To ensure a fair comparison, all other experimental settings and hyperparameters were kept the same.
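The article does not specify the adjustment rule, so the following sketch shows one plausible scheme only: a reduce-on-plateau schedule that lowers the learning rate when the validation $R^{2}$ stops improving. The class name and the values of `factor`, `patience`, and `min_lr` are assumptions made for illustration.

```python
class ReduceLROnR2Plateau:
    """Lower the learning rate when the validation R^2 stops improving.

    Hypothetical scheme: factor, patience, and min_lr are illustrative values,
    not settings reported in the article.
    """

    def __init__(self, lr=1e-3, factor=0.5, patience=10, min_lr=1e-6):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("-inf")
        self.wait = 0

    def step(self, val_r2):
        """Update the state with the latest validation R^2 and return the new rate."""
        if val_r2 > self.best:
            # Higher R^2 is better, so any increase resets the patience counter.
            self.best = val_r2
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.wait = 0
        return self.lr
```

In a Keras training loop, a similar effect can usually be obtained with the built-in `ReduceLROnPlateau` callback by monitoring a validation $R^{2}$ metric with `mode="max"`.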

The representative validation curves of the two surrogate models are presented in Figure 5, and the statistical comparison over five repeated random data splits is summarized in Table 3. The training curves show that both $\mathrm{RMSE}$ and $R^{2}$ gradually converge within 200 epochs. For all evaluated training set sizes, Model2 achieves better performance than Model1. The relatively small standard deviations in Table 3 indicate that the observed performance differences are not strongly dependent on a specific random data partition. In particular, according to the mean values in Table 3, Model2 trained with 1000 samples achieves performance comparable to Model1 trained with 2000 samples. These results suggest that the proposed model can achieve competitive predictive accuracy with fewer computationally expensive reservoir simulations.

Figure 6 compares the total training time, including validation, of the two models under different training set sizes. Model2 consistently requires less training time than Model1. This saving is mainly attributed to the use of one-dimensional convolutions and the removal of intermediate spatial-to-temporal transformation operations in Model2. To further support the comparison of computational efficiency, Table 4 reports the number of trainable parameters and FLOPs of the two surrogate models. The number of parameters denotes the total trainable weights in the surrogate model, reflecting the model size and storage requirement. FLOPs denote the approximate number of floating-point operations required for one forward pass with a batch size of 1, reflecting the theoretical computational cost of model inference. Compared with Model1, Model2 reduces the number of parameters from 0.328 M to 0.074 M and decreases the FLOPs from 0.267 G to 0.058 G. These results provide additional evidence that the proposed architecture has lower model complexity.
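To make the two complexity measures concrete, the sketch below counts the trainable parameters and approximate forward-pass FLOPs of a single one-dimensional convolutional layer. It is a generic illustration and does not reproduce the architectures in Table 1 or the values in Table 4. The helper names are hypothetical, and the sketch assumes the convention of two floating-point operations per multiply-accumulate; some tools report multiply-accumulates instead, which halves the count.

```python
def conv1d_params(kernel_size, in_channels, out_channels, bias=True):
    """Trainable weights of a Conv1D layer: one kernel per output channel plus biases."""
    return kernel_size * in_channels * out_channels + (out_channels if bias else 0)


def conv1d_output_length(length, kernel_size, stride=1, padding="same"):
    """Output sequence length for 'same' or 'valid' padding."""
    if padding == "same":
        return -(-length // stride)  # ceiling division
    return (length - kernel_size) // stride + 1


def conv1d_flops(length_out, kernel_size, in_channels, out_channels):
    """Approximate forward FLOPs for batch size 1, ignoring the bias additions."""
    macs = length_out * kernel_size * in_channels * out_channels
    return 2 * macs


# Reduction ratios implied by the values quoted from Table 4.
param_ratio = 0.328 / 0.074  # about 4.4 times fewer parameters in Model2
flops_ratio = 0.267 / 0.058  # about 4.6 times fewer FLOPs in Model2
```

A stride of 2 halves the output length under "same" padding, which in turn halves the FLOPs of every subsequent layer.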

The test performance of Model1 and Model2 is presented in Figure 7. Across the three training set sizes (500, 1000, and 2000), the median (P50) $R^{2}$ values of Model1 are 0.9786, 0.9873, and 0.9911, while those of Model2 are consistently higher, reaching 0.9850, 0.9913, and 0.9931. These results indicate that Model2 achieves better predictive performance than the baseline model under all evaluated data settings.

To further strengthen the comparison with structurally different surrogate architectures, we additionally conducted a controlled experiment based on the improved Vision Transformer surrogate model (IVIT) in [30]. Two Transformer-based variants were compared. The first one is the original IVIT model, while the second one, denoted as IVIT_seq, replaces the original feature interface of IVIT with the proposed flattened one-dimensional convolutional encoder. The Transformer-based temporal regression module and other training settings were kept the same. Therefore, this comparison is intended to isolate the influence of the proposed encoder interface within a Transformer-based surrogate framework.

As shown in Table 5, replacing the original IVIT feature interface with the proposed encoder consistently improves the Transformer-based surrogate under all evaluated training set sizes. Compared with IVIT, IVIT_seq achieves higher mean $R^{2}$ and lower mean $\mathrm{RMSE}$. For example, when 1000 labeled samples are used, the mean $R^{2}$ increases from 0.9902 to 0.9927, while the mean $\mathrm{RMSE}$ decreases from 0.0357 to 0.0308. Since the Transformer-based temporal regression module and training settings are kept unchanged, this controlled comparison indicates that the proposed encoder interface can improve the feature representation before temporal regression.

4.4. Performance of the Pre-Training Strategy

When a pre-trained encoder is transferred to the downstream regression task, two common strategies can be considered: freezing the encoder parameters or fine-tuning them [45]. As shown in Figure 8, directly freezing the encoder leads to inferior regression performance. Therefore, the fine-tuning strategy is adopted in all subsequent experiments.

Before evaluating the downstream regression performance, the reconstruction behavior of the autoencoder is first examined. Figure 9 shows the reconstruction loss curves under different numbers of pre-training samples. The loss decreases rapidly during the early epochs and then gradually converges, indicating that the autoencoder can effectively learn compact representations of the generated parameter fields. In addition, larger pre-training sets generally lead to lower reconstruction loss, suggesting improved reconstruction capability with more unlabeled samples. However, a lower reconstruction loss does not necessarily guarantee better downstream prediction accuracy, because the autoencoder is optimized to reconstruct input parameter fields rather than directly predict production sequences. Therefore, the effect of the pre-training sample size on the final surrogate performance is further evaluated in Figure 10.

Figure 10 shows the effect of the pre-training dataset size on the final model performance, evaluated using a fixed training set of 1000 labeled samples. As the number of pre-training realizations increases from 1000 to 5000, the validation $R^{2}$ score improves from 0.9898 to 0.9913. However, when the pre-training dataset is further expanded to 10,000 samples, the validation $R^{2}$ score decreases to 0.9896. This result suggests that the benefit of pre-training is not strictly monotonic with respect to the size of the unlabeled dataset. One possible explanation is that the autoencoder is optimized for input reconstruction rather than production sequence prediction. Therefore, a larger pre-training dataset may not always translate into more useful feature representations for the downstream surrogate task. This behavior can be interpreted as a form of over-specialization to the reconstruction objective rather than direct overfitting in the supervised regression stage.

Table 6 summarizes the effect of pre-training under different training set sizes, where the mean and standard deviation are reported over five repeated random data splits. In general, pre-training improves model performance, as reflected by the improved $\mathrm{RMSE}$ and $R^{2}$ values compared with the corresponding models trained from scratch. For example, the pre-trained model trained with 1000 labeled samples ($\mathrm{RMSE} = 0.0346$, $R^{2} = 0.9913$) achieves performance close to that of the model trained from scratch with 1500 labeled samples ($\mathrm{RMSE} = 0.0343$, $R^{2} = 0.9914$). This result suggests that pre-training helps narrow the performance gap caused by using fewer labeled simulation samples.

Table 7 reports the corresponding pre-training time. The results show that the pre-training time increases approximately linearly with the dataset size. Overall, the additional computational cost of the pre-training stage remains limited, which supports the practical applicability of incorporating this strategy into the surrogate modeling workflow.

4.5. Results of History Matching

We implemented the surrogate-assisted history matching workflow described in Section 2 and compared its results with those of the conventional simulation-based workflow. In the JADE algorithm, both $\mu_{CR}$ and $\mu_{F}$ were initialized to 0.5 [36], while the population size and external archive size were both set to 100. The optimization was performed for 100 generations. During the inverse modeling process, five uncertain parameter fields were updated simultaneously, each with a latent dimension of 100, resulting in a total optimization dimension of $5 \times 100 = 500$.
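For readers unfamiliar with JADE, the sketch below illustrates how the control parameters governed by $\mu_{CR}$ and $\mu_{F}$ are sampled and adapted, following the scheme in the original JADE paper [36]. It is a simplified illustration, not the implementation used in this study: the function names are hypothetical, the adaptation rate `c = 0.1` is an assumed value that is not reported in this article, and leaving the means unchanged when no trial vector succeeds is a common convention adopted here.

```python
import numpy as np

N_FIELDS, LATENT_DIM = 5, 100
N_DIM = N_FIELDS * LATENT_DIM  # 500 optimization variables
POP_SIZE = ARCHIVE_SIZE = N_GENERATIONS = 100


def sample_cr(mu_cr, size, rng):
    """Crossover rates: normal with mean mu_cr and std 0.1, truncated to [0, 1]."""
    return np.clip(rng.normal(mu_cr, 0.1, size), 0.0, 1.0)


def sample_f(mu_f, size, rng):
    """Mutation factors: Cauchy with location mu_f and scale 0.1.

    Values above 1 are truncated to 1; non-positive values are redrawn.
    """
    f = np.empty(size)
    for i in range(size):
        value = 0.0
        while value <= 0.0:
            value = mu_f + 0.1 * rng.standard_cauchy()
        f[i] = min(value, 1.0)
    return f


def update_means(mu_cr, mu_f, s_cr, s_f, c=0.1):
    """Move the means toward the parameters of successful trial vectors.

    mu_cr uses the arithmetic mean of s_cr; mu_f uses the Lehmer mean of s_f.
    """
    if len(s_cr) > 0:
        mu_cr = (1.0 - c) * mu_cr + c * float(np.mean(s_cr))
    if len(s_f) > 0:
        s_f = np.asarray(s_f, dtype=float)
        mu_f = (1.0 - c) * mu_f + c * float(np.sum(s_f ** 2) / np.sum(s_f))
    return mu_cr, mu_f
```

The Lehmer mean gives more weight to larger successful mutation factors, which counteracts the tendency of $\mu_{F}$ to shrink over the generations.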

In the final stage of the history matching workflow, we used the proposed surrogate model, which was pre-trained on 3000 unlabeled realizations and then trained on 1000 labeled samples. Figure 11 compares the history matching results of the surrogate-assisted workflow and the simulation-based workflow for individual well production data (WOPR and WWPR). In both workflows, the posterior ensemble (blue) moves closer to the observations (yellow), while the uncertainty of the prior ensemble (gray) is substantially reduced. In addition, the two workflows produce similar posterior ranges for most wells. Figure 12 presents the corresponding field-level production results (FOPR and FWPR), where the two workflows also show highly similar behavior. These results indicate that the proposed surrogate-based method can serve as an effective alternative to simulation-based history matching in this case.

Figure 13 shows the porosity fields of Layers 1, 5, and 9 in the Brugge model, comparing the prior ensemble with the posterior ensemble. In both workflows, the posterior porosity fields are closer to the reference field than the corresponding prior fields. Moreover, the surrogate-assisted and simulation-based history matching results are generally comparable.

Although the surrogate-assisted history matching results are close to those of the simulation-based approach, a small performance gap still remains because of the approximation error of the surrogate model. Figure 14 presents the final objective function values of the posterior ensembles obtained from the simulation-based workflow and the surrogate-based models. Overall, the proposed models generally achieve lower objective values than the baseline model, and the models with pre-training generally perform better than those without pre-training. In particular, the proposed model with pre-training and 1000 training samples achieves better results than the baseline model with 1500 training samples.

The additional cost of pre-training includes both the generation of pre-training samples and the pre-training process itself. As discussed in Section 3, the generation of parameter realizations is computationally inexpensive, and the pre-training stage can be completed within 30 min. Considering the observed performance improvement, this additional cost remains acceptable in practice. In this experiment, the simulation-based history matching workflow would require $100 \times 100 = 10{,}000$ numerical simulations, whereas the proposed method only requires the 1000 simulations used for surrogate training to achieve comparable history matching results.

5. Discussion

In this study, we developed a data-efficient surrogate modeling framework with autoencoder-based pre-training for automatic history matching. The framework addresses the high cost of labeled simulation data by combining two components: a flattened one-dimensional representation for production forecasting and a self-supervised pre-training stage based on unlabeled parameter realizations. The former provides a direct route from parameter encoding to production-sequence prediction, while the latter improves the use of readily generated prior realizations before supervised regression.

It should be noted that the proposed architecture is not intended to claim universal superiority over CNN-RNN or Transformer-based surrogate models. Transformer-based models may have advantages in capturing complex temporal dependencies. The main focus of this study is instead the spatial-to-temporal feature interface before the temporal regression module. By generating a non-replicated feature sequence from the flattened reservoir parameter representation, the proposed model provides a lightweight alternative to feature-replication-based image-to-sequence surrogate designs. This architectural design is further combined with autoencoder-based pre-training to improve data efficiency when labeled simulation data are limited.

The experimental results indicate that these two components are beneficial in the Brugge benchmark setting. In particular, the proposed framework achieves competitive surrogate accuracy with fewer intermediate feature transformation operations, and its advantage becomes more evident when the number of labeled training samples is limited. The pre-training strategy provides a useful initialization for the downstream supervised regression task, especially when labeled simulation data are scarce. For history matching, the resulting surrogate-assisted workflow achieves results comparable to those of the baseline model trained with a larger dataset. These findings suggest that improving data efficiency can help reduce the number of expensive simulation samples required for surrogate construction. More broadly, the proposed framework may also be combined with other strategies for further improving surrogate-assisted history matching.

Nevertheless, the proposed method still has several limitations. First, the relationship between the size of the pre-training dataset and the final model performance is not strictly monotonic, which makes the choice of pre-training sample size less straightforward. A possible explanation is that the unlabeled pre-training samples are generated from the same PCA-based prior distribution. Although increasing the number of such samples enlarges the pre-training dataset, it may not proportionally increase the diversity of geological patterns represented in the data. When the pre-training dataset becomes excessively large, the encoder may become more specialized to the reconstruction objective of the autoencoder, rather than learning representations that are most transferable to the downstream production-regression task. This mismatch between the self-supervised reconstruction objective and the supervised production-prediction objective may lead to diminishing returns or slight performance degradation after fine-tuning. Therefore, the pre-training sample size should be regarded as a tunable factor rather than a parameter that always improves performance when increased. Further investigation of pre-training data diversity, reconstruction objectives, and their relationship with downstream regression performance will be considered in future work. Second, because the model is entirely data-driven, its performance remains dependent on the distribution of the available training data, which may affect its robustness in practical history matching applications. Third, although the flattened representation is effective for production forecasting in the Brugge benchmark, its applicability to tasks that require stronger preservation of local spatial structure, such as direct prediction of spatial pressure or saturation fields, still requires further investigation. 
Therefore, future work will further investigate the sensitivity of the proposed representation to geological structure continuity and explore hybrid encoders that combine the proposed one-dimensional feature interface with spatial-structure-preserving modules.

6. Conclusions

In this work, we developed a data-efficient surrogate modeling framework for automatic history matching by combining a flattened, one-dimensional representation with an autoencoder-based pre-training strategy using unlabeled parameter realizations. The main conclusions of this study are summarized as follows:

A flattened input representation together with one-dimensional convolutions provides a direct connection between reservoir parameter encoding and temporal production prediction.
The proposed surrogate model achieves competitive predictive performance for production forecasting in the Brugge benchmark while requiring shorter training time and lower model complexity than the baseline model.
Pre-training on inexpensive unlabeled parameter realizations provides a useful initialization for supervised surrogate modeling and improves data efficiency, especially when the amount of labeled simulation data is limited.
In the Brugge history matching setting, the proposed surrogate-assisted workflow achieves results comparable to those of the baseline model while requiring fewer labeled simulation samples for surrogate construction.

Future work will investigate whether this framework can be extended to settings with stronger spatial structure requirements and whether it can be combined with other strategies, such as semi-supervised learning or physics-informed constraints, to further improve robustness and generalization.

Author Contributions

Conceptualization, H.L.; Methodology, Y.Q., H.L., X.M. and J.Z.; Software, H.Z.; Validation, H.Z.; Investigation, J.Z.; Resources, X.H.; Writing—original draft preparation, Y.Q.; Writing—review and editing, X.M. and X.H.; Visualization, H.Z.; Supervision, X.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China, grant number 2023YFB3002903.

Data Availability Statement

Data are contained within the article.

Acknowledgments

The authors would like to acknowledge the technical assistance provided by the Engineer Research Center of Intelligent Supercomputing (Ministry of Education of China), which offered valuable guidance on experimental equipment operation and data processing. Additionally, we would like to thank all colleagues who provided constructive suggestions during the discussion of this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Law, K.; Stuart, A.; Zygalakis, K. Data Assimilation; Springer: Cham, Switzerland, 2015.
2. Reichle, R.H. Data assimilation methods in the Earth sciences. Adv. Water Resour. 2008, 31, 1411–1418.
3. Liu, Y.; Weerts, A.H.; Clark, M.; Hendricks Franssen, H.-J.; Kumar, S.; Moradkhani, H.; Seo, D.-J.; Schwanenberg, D.; Smith, P.; van Dijk, A.I.J.M.; et al. Advancing data assimilation in operational hydrologic forecasting: Progresses, challenges, and emerging opportunities. Hydrol. Earth Syst. Sci. 2012, 16, 3863–3887.
4. Ghorbanidehno, H.; Kokkinaki, A.; Lee, J.; Darve, E. Recent developments in fast and scalable inverse modeling and data assimilation methods in hydrology. J. Hydrol. 2020, 591, 125266.
5. Oliver, D.S.; Chen, Y. Recent progress on reservoir history matching: A review. Comput. Geosci. 2011, 15, 185–221.
6. Asher, M.J.; Croke, B.F.W.; Jakeman, A.J.; Peeters, L.J.M. A review of surrogate models and their application to groundwater modeling. Water Resour. Res. 2015, 51, 5957–5973.
7. Samadian, D.; Muhit, I.B.; Dawood, N. Application of data-driven surrogate models in structural engineering: A literature review. Arch. Comput. Methods Eng. 2025, 32, 735–784.
8. He, J.; Xie, J.; Wen, X.-H.; Chen, W. An alternative proxy for history matching using proxy-for-data approach and reduced order modeling. J. Pet. Sci. Eng. 2016, 146, 392–399.
9. Xiao, C.; Lin, H.-X.; Leeuwenburgh, O.; Heemink, A. Surrogate-assisted inversion for large-scale history matching: Comparative study between projection-based reduced-order modeling and deep neural network. J. Pet. Sci. Eng. 2022, 208, 109287.
10. Thenon, A.; Gervais, V.; Le Ravalec, M. Multi-fidelity meta-modeling for reservoir engineering: Application to history matching. Comput. Geosci. 2016, 20, 1231–1250.
11. Santoso, R.; He, X.; Alsinan, M.; Figueroa Hernandez, R.; Kwak, H.; Hoteit, H. Multi-fidelity Bayesian approach for history matching in reservoir simulation. In Proceedings of the SPE Middle East Oil and Gas Show and Conference, Sanabis, Bahrain, 28 November–1 December 2021; SPE: Richardson, TX, USA, 2021.
12. Xue, L.; Li, D.; Dou, H. Artificial intelligence methods for oil and gas reservoir development: Current progresses and perspectives. Adv. Geo-Energy Res. 2023, 10, 65–70.
13. Hamdi, H.; Couckuyt, I.; Sousa, M.C.; Dhaene, T. Gaussian processes for history-matching: Application to an unconventional gas reservoir. Comput. Geosci. 2017, 21, 267–287.
14. Rana, S.; Ertekin, T.; King, G.R. An efficient assisted history matching and uncertainty quantification workflow using Gaussian processes proxy models and variogram based sensitivity analysis: GP-VARS. Comput. Geosci. 2018, 114, 73–83.
15. Jeong, J.; Park, E. Theoretical development of the history matching method for subsurface characterizations based on simulated annealing algorithm. J. Pet. Sci. Eng. 2019, 180, 545–558.
16. Patel, R.G.; Jain, T.; Trivedi, J.J. Polynomial-chaos-expansion based integrated dynamic modelling workflow for computationally efficient reservoir characterization: A field case study. In Proceedings of the SPE Europec Featured at EAGE Conference and Exhibition, Paris, France, 12–15 June 2017; SPE: Richardson, TX, USA, 2017.
17. Oladyshkin, S.; Class, H.; Nowak, W. Bayesian updating via bootstrap filtering combined with data-driven polynomial chaos expansions: Methodology and application to history matching for carbon dioxide storage in geological formations. Comput. Geosci. 2013, 17, 671–687.
18. Mo, S.; Zabaras, N.; Shi, X.; Wu, J. Deep autoregressive neural networks for high-dimensional inverse problems in groundwater contaminant source identification. Water Resour. Res. 2019, 55, 3856–3881.
19. Huang, M.; Zhang, J.; Hu, J.; Ye, Z.; Deng, Z.; Wan, N. Nonlinear modeling of temperature-induced bearing displacement of long-span single-pier rigid frame bridge based on DCNN-LSTM. Case Stud. Therm. Eng. 2024, 53, 103897.
20. Huang, M.; Wan, N.; Zhu, H. Reconstruction of structural acceleration response based on CNN-BiGRU with squeeze-and-excitation under environmental temperature effects. J. Civ. Struct. Health Monit. 2025, 15, 985–1003.
21. Deng, Z.; Gao, Q.; Huang, M.; Wan, N.; Zhang, J.; He, Z. From data processing to behavior monitoring: A comprehensive overview of dam health monitoring technology. Structures 2025, 71, 108094.
22. Zhang, J.; Huang, M.; Wan, N.; Deng, Z.; He, Z.; Luo, J. Missing measurement data recovery methods in structural health monitoring: The state, challenges and case study. Measurement 2024, 231, 114528.
23. Tang, M.; Liu, Y.; Durlofsky, L.J. Deep-learning-based surrogate flow modeling and geological parameterization for data assimilation in 3D subsurface flow. Comput. Methods Appl. Mech. Eng. 2021, 376, 113636.
24. Wang, N.; Chang, H.; Zhang, D. Inverse modeling for subsurface flow based on deep learning surrogates and active learning strategies. Water Resour. Res. 2023, 59, e2022WR033644.
25. Jiang, Z.; Tahmasebi, P.; Mao, Z. Deep residual U-net convolution neural networks with autoregressive strategy for fluid flow predictions in large-scale geosystems. Adv. Water Resour. 2021, 150, 103878.
26. Wang, S.; Xiang, J.; Wang, X.; Feng, Q.; Yang, Y.; Cao, X.; Hou, L. A deep learning based surrogate model for reservoir dynamic performance prediction. Geoenergy Sci. Eng. 2024, 233, 212516.
27. Tang, M.; Liu, Y.; Durlofsky, L.J. A deep-learning-based surrogate model for data assimilation in dynamic subsurface flow problems. J. Comput. Phys. 2020, 413, 109456.
28. Zhang, K.; Wang, X. The prediction of reservoir production based proxy model considering spatial data and vector data. J. Pet. Sci. Eng. 2022, 208, 109694.
29. Ma, X.; Zhang, K.; Wang, J.; Yao, C.; Yang, Y.; Sun, H.; Yao, J. An efficient spatial-temporal convolution recurrent neural network surrogate model for history matching. SPE J. 2022, 27, 1160–1175.
30. Zhang, D.; Li, H. Efficient surrogate modeling based on improved vision transformer neural network for history matching. SPE J. 2023, 28, 3046–3062.
31. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
32. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
33. Zhang, J.; Zhang, K.; Zhang, L.; Zhou, W.; Liu, C.; Liu, P.; Fu, W.; Chen, X.; Bian, Z.; Yang, Y.; et al. An offline data-driven dual-surrogate framework considering prediction error for history matching. Comput. Geosci. 2024, 192, 105680.
34. Zhang, J.; Kang, J. An efficient transformer-based surrogate model with end-to-end training strategies for automatic history matching. Geoenergy Sci. Eng. 2024, 240, 212994.
35. Bernardo, J.M.; Smith, A.F.M. Bayesian Theory; John Wiley & Sons: Chichester, UK, 2009.
36. Zhang, J.; Sanderson, A.C. JADE: Adaptive differential evolution with optional external archive. IEEE Trans. Evol. Comput. 2009, 13, 945–958.
37. Zhang, L.; Cui, C.; Ma, X.; Sun, Z.; Liu, F.; Zhang, K. A fractal discrete fracture network model for history matching of naturally fractured reservoirs. Fractals 2019, 27, 1940008.
38. Hinton, G.E.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R.R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv 2012, arXiv:1207.0580.
39. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
40. Masci, J.; Meier, U.; Cireşan, D.; Schmidhuber, J. Stacked convolutional auto-encoders for hierarchical feature extraction. In Proceedings of the International Conference on Artificial Neural Networks, Espoo, Finland, 14–17 June 2011; Springer: Berlin/Heidelberg, Germany, 2011.
41. Li, P.; Pei, Y.; Li, J. A comprehensive survey on design and application of autoencoder in deep learning. Appl. Soft Comput. 2023, 138, 110176.
42. TensorFlow Developers. TensorFlow. Zenodo 2022.
43. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
44. Peters, E.; Arts, R.J.; Brouwer, G.K.; Geel, C.R.; Cullick, S.; Lorentzen, R.J.; Chen, Y.; Dunlop, K.N.B.; Vossepoel, F.C.; Xu, R.; et al. Results of the Brugge benchmark study for flooding optimization and history matching. SPE Reserv. Eval. Eng. 2010, 13, 391–405.
45. Asano, Y.M.; Rupprecht, C.; Vedaldi, A. A critical analysis of self-supervision, or what we can learn from a single image. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020; University of Oxford: Oxford, UK, 2020.

Figure 1. Architecture of the proposed model featuring 1D convolutional layer.

Figure 2. Architecture of autoencoder with encoder, decoder, and transition block.

Figure 3. Training framework with pre-training strategy.

Figure 4. Permeability field for Brugge model (x direction).

Figure 5. Comparison of (a) $\mathrm{RMSE}$ and (b) $R^{2}$ for baseline model (Model1) and proposed model (Model2) across different training set sizes on the validation set.

Figure 6. Comparison of training time (including validation) for baseline model (Model1) and proposed model (Model2) across different training set sizes.

Figure 7. Comparison of prediction results (WWPR) for baseline model (left) and proposed model (right) across different training set sizes. (a) 500 samples; (b) 1000 samples; (c) 2000 samples. Black line denotes simulation results; red line represents predictions of surrogate.

Figure 8. Comparison of (a) $\mathrm{RMSE}$ and (b) $R^{2}$ for frozen encoder strategy and fine-tuned encoder strategy (using 1000 training samples and 3000 pre-training samples).

Figure 9. Reconstruction loss curves of the autoencoder under different numbers of pre-training samples.

Figure 10. Comparison of RMSE and $R^{2}$ values for different numbers of pre-training samples (1000, 3000, 5000 and 10,000), with surrogate model trained on fixed 1000 samples.

Figure 11. History matching results of production data from surrogate-based workflow (left) and simulation-based workflow (right). Gray region represents prior range; blue region represents posterior range; yellow dots represent observed data. (a–c) show OPR for wells 5, 10, and 15, respectively; (d–f) show WPR for same wells, respectively.

Figure 12. History matching results of production data in field from surrogate-based workflow (left) and simulation-based workflow (right). (a) FOPR; (b) FWPR. Gray region represents prior range; blue region represents posterior range; yellow dots represent observed data.

Figure 13. History matching results of porosity fields. (a) Layer 1; (b) layer 5; (c) layer 9. First column represents reference porosity fields; second column represents prior mean of porosity fields; third column represents posterior mean of porosity fields using surrogate-based method; fourth column represents posterior mean of porosity fields using simulation-based method.

Figure 14. Comparison of objective function values for posterior ensemble across different surrogate models and training sample sizes. Model1 denotes baseline; Model2 represents proposed model. Number outside parentheses indicates number of training samples for each surrogate model; number inside parentheses denotes number of pre-training samples.

Table 1. Architecture of the proposed surrogate model.

Parts	Layers	Output Size
Input	–	$(B, 45, 48, 139)$
	resize	$(B, 45, 64, 256)$
	reshape	$(B, 45, 16384)$
FE	Convolution 1D	$(B, 253, 8192)$
	Convolution 1D	$(B, 253, 4096)$
	Convolution 1D	$(B, 253, 2048)$
	Convolution 1D	$(B, 253, 1024)$
	Linear Layer	$(B, 253, 512)$
	Dropout	$(B, 253, 512)$
TSR	LSTM block	$(B, 253, 100)$
	Linear Layer	$(B, 253, 40)$
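
The output sizes in Table 1 can be checked with a small shape trace. The sketch below is illustrative only: the table reports output sizes but not kernel sizes, strides, or padding, so the stride of 2 and the 'same' padding used here are assumptions chosen because they reproduce the halving of the sequence length at each 1D convolution. The function names are hypothetical.

```python
import math

def conv1d_out_len(length, stride=2):
    """Output length of a 1D convolution with 'same' padding (assumed)."""
    return math.ceil(length / stride)

def trace_shapes(batch="B", stride=2):
    """Return (layer name, output shape) pairs mirroring Table 1."""
    shapes = [
        ("input", (batch, 45, 48, 139)),
        ("resize", (batch, 45, 64, 256)),
    ]
    length = 64 * 256  # flatten the two resized spatial axes
    shapes.append(("reshape", (batch, 45, length)))
    channels = 253  # channel count after the first convolution, from Table 1
    for i in range(1, 5):
        # Assumption: each convolution halves the flattened length.
        length = conv1d_out_len(length, stride)
        shapes.append((f"conv1d_{i}", (batch, channels, length)))
    # Widths of the remaining layers are copied from Table 1.
    for name, width in [("linear", 512), ("dropout", 512),
                        ("lstm", 100), ("linear_out", 40)]:
        shapes.append((name, (batch, channels, width)))
    return shapes

shapes = trace_shapes()
for name, shape in shapes:
    print(f"{name:<11}{shape}")
```

Under these assumptions the flattened length falls from 16384 to 1024 over the four convolutions, matching the FE rows of the table.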

Table 2. Architecture of autoencoder.

Parts	Layers	Output Size
Input	–	$(B, 45, 48, 139)$
	resize	$(B, 45, 64, 256)$
	reshape	$(B, 45, 16384)$
Encoder	Convolution 1D	$(B, 253, 8192)$
	Convolution 1D	$(B, 253, 4096)$
	Convolution 1D	$(B, 253, 2048)$
	Convolution 1D	$(B, 253, 1024)$
Decoder	Transpose Convolution 1D	$(B, 253, 2048)$
	Transpose Convolution 1D	$(B, 253, 4096)$
	Transpose Convolution 1D	$(B, 253, 8192)$
	Transpose Convolution 1D	$(B, 253, 16384)$
Transition	reshape	$(B, 253, 64, 256)$
	Convolution 2D	$(B, 45, 64, 256)$
	resize	$(B, 45, 48, 139)$

Table 3. Accuracy comparison between the baseline model (Model1) and the proposed model (Model2) across different training set sizes. The values are reported as mean ± standard deviation over five repeated random data splits. Best mean results are highlighted in bold.

Sample Number	RMSE (Model1)	RMSE (Model2)	$R^{2}$ (Model1)	$R^{2}$ (Model2)
500	$0.0563 \pm 0.0005$	$0.0495 \pm 0.0007$	$0.9777 \pm 0.0003$	$0.9824 \pm 0.0004$
1000	$0.0473 \pm 0.0006$	$0.0396 \pm 0.0009$	$0.9843 \pm 0.0003$	$0.9897 \pm 0.0004$
1500	$0.0429 \pm 0.0004$	$0.0343 \pm 0.0008$	$0.9872 \pm 0.0002$	$0.9914 \pm 0.0003$
2000	$0.0391 \pm 0.0005$	$0.0321 \pm 0.0007$	$0.9895 \pm 0.0003$	$0.9929 \pm 0.0003$
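
As a reading aid for Table 3, the relative RMSE reduction of Model2 over Model1 can be computed directly from the reported means. This is a minimal sketch: the dictionary and function names are hypothetical, and the calculation uses the means only, ignoring the reported standard deviations.

```python
# RMSE means copied from Table 3: training set size -> (Model1, Model2)
RMSE_MEANS = {
    500: (0.0563, 0.0495),
    1000: (0.0473, 0.0396),
    1500: (0.0429, 0.0343),
    2000: (0.0391, 0.0321),
}

def relative_reduction(baseline, proposed):
    """Fraction by which `proposed` lowers the baseline error."""
    return (baseline - proposed) / baseline

reductions = {
    n: relative_reduction(model1, model2)
    for n, (model1, model2) in RMSE_MEANS.items()
}

for n, r in reductions.items():
    print(f"{n} samples: RMSE reduced by {100 * r:.1f}%")
```

On these means the reduction lies between roughly 12% and 20% across the four training set sizes.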

Table 4. Comparison of model complexity between the baseline model (Model1) and the proposed model (Model2).

Metric	Model1	Model2
Number of parameters (M)	0.328	0.074
FLOPs (G)	0.267	0.058

Table 5. Controlled comparison between the original IVIT model and the IVIT model with the proposed encoder interface across different training set sizes. The values are reported as mean ± standard deviation over five repeated random data splits. Best mean results are highlighted in bold.

Sample Number	RMSE (IVIT)	RMSE (IVIT_seq)	$R^{2}$ (IVIT)	$R^{2}$ (IVIT_seq)
500	$0.0414 \pm 0.0004$	$0.0393 \pm 0.0006$	$0.9869 \pm 0.0002$	$0.9882 \pm 0.0003$
1000	$0.0357 \pm 0.0005$	$0.0308 \pm 0.0009$	$0.9902 \pm 0.0003$	$0.9927 \pm 0.0004$
1500	$0.0334 \pm 0.0003$	$0.0267 \pm 0.0008$	$0.9914 \pm 0.0001$	$0.9945 \pm 0.0003$

Table 6. RMSE and $R^{2}$ results with different numbers of surrogate model training samples and pre-training samples. Values are reported as mean ± standard deviation over five repeated random data splits. Best mean results are highlighted in bold.

		Pre-Training Sample Number
Sample Number	Metric	0	1000	3000	5000	10,000
500	RMSE	$0.0495 \pm 0.0007$	$0.0447 \pm 0.0006$	$0.0452 \pm 0.0007$	$0.0459 \pm 0.0008$	$0.0468 \pm 0.0009$
500	$R^{2}$	$0.9824 \pm 0.0004$	$0.9854 \pm 0.0003$	$0.9850 \pm 0.0002$	$0.9846 \pm 0.0004$	$0.9839 \pm 0.0005$
1000	RMSE	$0.0396 \pm 0.0009$	$0.0372 \pm 0.0007$	$0.0371 \pm 0.0008$	$0.0346 \pm 0.0007$	$0.0378 \pm 0.0009$
1000	$R^{2}$	$0.9897 \pm 0.0004$	$0.9899 \pm 0.0004$	$0.9900 \pm 0.0004$	$0.9913 \pm 0.0003$	$0.9896 \pm 0.0005$
1500	RMSE	$0.0343 \pm 0.0008$	$0.0328 \pm 0.0007$	$0.0334 \pm 0.0007$	$0.0342 \pm 0.0008$	$0.0332 \pm 0.0008$
1500	$R^{2}$	$0.9914 \pm 0.0003$	$0.9922 \pm 0.0002$	$0.9920 \pm 0.0003$	$0.9916 \pm 0.0004$	$0.9920 \pm 0.0003$
2000	RMSE	$0.0321 \pm 0.0007$	$0.0310 \pm 0.0006$	$0.0315 \pm 0.0008$	$0.0303 \pm 0.0006$	$0.0311 \pm 0.0006$
2000	$R^{2}$	$0.9928 \pm 0.0002$	$0.9930 \pm 0.0003$	$0.9929 \pm 0.0004$	$0.9934 \pm 0.0003$	$0.9930 \pm 0.0001$
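
The RMSE rows of Table 6 can be scanned programmatically to find which pre-training set size gives the lowest mean error for each training set size. The sketch below is a hedged illustration: the names are hypothetical, and it compares means only, so differences that fall within the reported standard deviations should not be over-interpreted.

```python
# Column order of Table 6: number of pre-training samples.
PRETRAIN_SIZES = (0, 1000, 3000, 5000, 10000)

# RMSE means copied from Table 6: training set size -> one value per column.
RMSE_BY_TRAIN_SIZE = {
    500: (0.0495, 0.0447, 0.0452, 0.0459, 0.0468),
    1000: (0.0396, 0.0372, 0.0371, 0.0346, 0.0378),
    1500: (0.0343, 0.0328, 0.0334, 0.0342, 0.0332),
    2000: (0.0321, 0.0310, 0.0315, 0.0303, 0.0311),
}

def best_pretrain_size(rmse_row):
    """Return (pre-training size, RMSE) for the lowest mean RMSE in a row."""
    best_index = min(range(len(rmse_row)), key=rmse_row.__getitem__)
    return PRETRAIN_SIZES[best_index], rmse_row[best_index]

best = {n: best_pretrain_size(row) for n, row in RMSE_BY_TRAIN_SIZE.items()}

for n, (size, rmse) in best.items():
    print(f"{n} training samples: best with {size} pre-training samples "
          f"(RMSE {rmse:.4f})")
```

In every row the lowest mean RMSE occurs with some pre-training rather than none, and the best pre-training size is not the largest one.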

Table 7. Training time of autoencoder with different pre-training samples.

Pre-Training Sample Number	1000	3000	5000	10,000
Training time (s)	218.05	511.69	847.98	1564.82
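
To put the Table 7 timings on a common scale, the training time can be divided by the number of pre-training samples. This is a simple arithmetic sketch with hypothetical variable names. The source does not explain the trend, so reading the decreasing per-sample cost as a fixed overhead spread over more samples is only one possible interpretation.

```python
# Autoencoder training times copied from Table 7:
# number of pre-training samples -> seconds.
PRETRAIN_TIME_S = {
    1000: 218.05,
    3000: 511.69,
    5000: 847.98,
    10000: 1564.82,
}

seconds_per_sample = {n: t / n for n, t in PRETRAIN_TIME_S.items()}

for n, s in seconds_per_sample.items():
    print(f"{n} samples: {s:.3f} s per sample")
```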


Share and Cite

MDPI and ACS Style

Qin, Y.; Li, H.; Meng, X.; He, X.; Zhang, J.; Zhang, H. A Data-Efficient Surrogate Model via Simplified Feature Extraction and Pre-Training for Automatic History Matching. Processes 2026, 14, 1635. https://doi.org/10.3390/pr14101635

