According to the petrochemical analysis, the reservoir of the Feixianguan Formation is characterized by complex lithology and poor physical properties. The logging response analysis shows that a complete set of conventional logging curves normally indicates the lithologies and the physical properties of the reservoir well. In light of the lithologic characteristics, the rock volume physical model can be simplified to five parts, namely clay, gypsum, calcite, dolomite and porosity [1,4]. To solve for five unknown quantities, at least five curves are needed to set up a system of simultaneous equations. However, of the twelve wells in this area, only five old wells (X1–X5) had complete conventional logging curves; the remaining seven old wells (X6–X12) had only the AC, GR, RT and RXO curves. These wells could therefore provide only an underdetermined system of four equations in five unknown quantities, which could not meet the basic requirements for quantitatively solving the parameters of the complex carbonate reservoir. Quantitative processing of the complex lithology was thus unrealistic unless a pseudo-curve could be constructed to extend the underdetermined system to a determined one.
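The difference between the five-curve and four-curve cases can be illustrated with a minimal sketch. The response matrix and the volume fractions below are hypothetical placeholders (randomly generated, not real tool responses), and the variable names are chosen for illustration only. With five equations the volumes are recovered uniquely; with four, a different set of volumes reproduces the same log readings.

```python
import numpy as np

rng = np.random.default_rng(0)

# HYPOTHETICAL 5x5 response matrix: one row per logging curve, one column per
# component (clay, gypsum, calcite, dolomite, porosity). Not real tool responses.
# The strong diagonal guarantees that the matrix is nonsingular.
A5 = rng.uniform(0.1, 1.0, size=(5, 5)) + 5.0 * np.eye(5)

v_true = np.array([0.10, 0.05, 0.50, 0.30, 0.05])  # assumed volume fractions
logs5 = A5 @ v_true                                 # synthetic log readings

# Five curves, five unknowns: the system has a unique solution.
v_solved = np.linalg.solve(A5, logs5)

# Four curves only: drop one row, leaving 4 equations in 5 unknowns.
A4, logs4 = A5[:4], logs5[:4]
rank4 = np.linalg.matrix_rank(A4)

# The last right-singular vector spans the null space of A4, so adding any
# multiple of it to the volumes leaves the four log readings unchanged.
null_vec = np.linalg.svd(A4)[2][-1]
v_other = v_true + 0.1 * null_vec

print("unique solution recovered:", np.allclose(v_solved, v_true))
print("rank of the 4-curve system:", rank4)
print("alternative volumes fit equally well:", np.allclose(A4 @ v_other, logs4))
```

In this sketch the fifth row stands in for the curve that the pseudo-curve construction is meant to supply.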
3.1. Selection of the Reconstruction Curves
To reduce error transfer and the uncertainty of the pseudo-curve, we had to choose between two logging curves, CNL and DEN, for the curve construction [5,6,7]. This section describes how the reconstruction curve was selected. Firstly, by analyzing the crossplots of the core porosity from the cored section of well X1 against the three porosity curves (Figure 2), we found that the CNL curve correlated most strongly with the core porosity, with a correlation coefficient of 0.71; the AC curve correlated less strongly, and the DEN curve the least, with a correlation coefficient of only 0.4. The CNL curve therefore reflected the physical properties of the reservoir better.
Secondly, we analyzed the relationships between the logging curves. It is known that the GR curve is sensitive to clay, that the AC, DEN and CNL curves reflect both the lithologies and the physical properties of the reservoir, and that the RT curve reflects the oil/gas-bearing properties [8,9]. We correlated the CNL curve and the DEN curve of well X1, in turn, with the other curves (Figure 3). The correlation coefficients of the GR, AC and RT curves with the CNL curve were significantly higher than those with the DEN curve. The reason is that the high-density drilling fluid caused borehole enlargement and induced fractures in some well sections, which affected the AC and DEN curves [10,11,12]. Therefore, it was more appropriate to reconstruct the CNL curve than the DEN curve.
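Both screening steps in this section rest on the correlation coefficient between pairs of curves. As a minimal sketch, assuming the Pearson coefficient is the one intended, candidate curves can be ranked against a target as follows. The function name `rank_by_correlation` and the data are hypothetical; the data are synthetic and are not the well X1 measurements.

```python
import numpy as np

def rank_by_correlation(target, curves):
    """Return (name, r) pairs sorted by |Pearson r| with the target, descending."""
    scores = {name: float(np.corrcoef(target, values)[0, 1])
              for name, values in curves.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)

# SYNTHETIC data for illustration only: three noisy proxies of a porosity series.
rng = np.random.default_rng(1)
core_porosity = rng.uniform(1.0, 8.0, size=200)
curves = {
    "CNL": core_porosity + rng.normal(0.0, 1.0, size=200),    # least noise
    "AC": core_porosity + rng.normal(0.0, 3.0, size=200),
    "DEN": -core_porosity + rng.normal(0.0, 12.0, size=200),  # most noise
}

ranking = rank_by_correlation(core_porosity, curves)
for name, r in ranking:
    print(f"{name}: r = {r:+.2f}")
```

Sorting by the absolute value matters here because a porosity-sensitive curve such as DEN can correlate negatively with porosity while still being informative.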
3.2. Selection of Construction Method
Old-well curve reconstruction is essentially a regression problem. Many studies have reported curve reconstruction methods, including curve fitting, BP neural networks and support vector machines [5,13,14,15]. These traditional methods share the same basic idea: firstly, a certain number of parameter eigenvalues are extracted to form the training sample data set; secondly, a mathematical algorithm is used to establish the corresponding curve prediction model [16]. The key issues of this technology are the convergence rate and the error analysis.
On the training data set, these traditional methods tend to fit well (good self-judging ability) but predict poorly [17]. Even when some methods appear to achieve good results, the accuracy of the constructed curve is still too low for logging interpretation, which leads to non-unique or even wrong interpretations [18,19]. There are two reasons for this: firstly, a lack of feature information leads to a loss of detail when the parameter eigenvalues are extracted; secondly, a lack of training data leads to model over-fitting. Only a machine learning method with high generalization performance can resolve this dilemma [20]. Given that the log values at an unknown depth should be close to those at nearby depths, we chose the deep learning model BI-LSTM, which can extract contextual sequence information, to construct the pseudo-curve.
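One common way to expose the nearby-depth context to a sequence model is to cut the logs into overlapping depth windows. The sketch below is only an assumption about how such windows might be built; the source does not state the window length or sampling used, and `make_depth_windows` and `half_width` are hypothetical names.

```python
import numpy as np

def make_depth_windows(inputs, target, half_width):
    """Build (window, target) pairs centred on each depth sample.

    inputs: array of shape (n_depths, n_curves), e.g. GR, AC and RT columns.
    target: array of shape (n_depths,), e.g. the measured CNL curve.
    Returns X with shape (n_windows, 2 * half_width + 1, n_curves) and
    y with shape (n_windows,), where y[i] is the target at the window centre.
    """
    n_depths = inputs.shape[0]
    centres = range(half_width, n_depths - half_width)
    X = np.stack([inputs[c - half_width:c + half_width + 1] for c in centres])
    y = target[half_width:n_depths - half_width]
    return X, y

# Toy example: 10 depth samples of 3 input curves, windows of 5 samples.
toy_inputs = np.arange(30.0).reshape(10, 3)
toy_target = np.arange(10.0)
X, y = make_depth_windows(toy_inputs, toy_target, half_width=2)
print(X.shape, y.shape)  # (6, 5, 3) (6,)
```

Each window contains samples both above and below the centre depth, which is the kind of two-sided context a bidirectional network is designed to use.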
BI-LSTM is an extension of the recurrent neural network (RNN) that builds on the common LSTM. It splits the one-way LSTM into two directions, the positive and the negative timing directions, and connects the two networks to the same output, so that the forward and the reverse information both contribute to the output of the current node [21,22]. BI-LSTM compensates for the limitations of the common neural network: it solves the problem of “context semantic loss” caused by the lack of connections between the input layers, and it adds a reverse pass to the RNN, so that the final result is the stacking of the forward RNN and the reverse RNN [23,24,25]. In addition, owing to the LSTM unit, the inability of the RNN to handle long-term dependency information is also overcome [26,27].
Figure 4 shows the schematic diagram of the BI-LSTM network structure.
The algorithm can be expressed as:

$$o_t = g\left(V s_t + V' s'_t\right) \tag{1}$$

$$s_t = f\left(U x_t + W s_{t-1}\right) \tag{2}$$

$$s'_t = f\left(U' x_t + W' s'_{t+1}\right) \tag{3}$$

where $x_t$ is the input value of the BI-LSTM network at time $t$; $o_t$ is the output value of the BI-LSTM network at time $t$; $s_t$ is the value of the forward hidden layer at time $t$; $s_{t-1}$ is the value of the forward hidden layer at time $t-1$; $V$ is the weight of the output layer in the forward calculation; $g$ is the corresponding activation function; $U$ is the weight of the input layer; $f$ is the corresponding activation function; $W$ is the corresponding weight of the hidden layer; $V'$ is the weight of the output layer in the reverse calculation; $U'$ is the weight of the input layer in the reverse calculation; $W'$ is the weight of the hidden layer in the reverse calculation; $s'_t$ is the output of the LSTM memory unit in the reverse calculation at time $t$; and $s'_{t+1}$ is the output of the LSTM memory unit in the reverse calculation at time $t+1$, the preceding step of the reverse pass. The output value of the BI-LSTM network at time $t$ is obtained by combining the forward and reverse outputs.
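The forward and reverse recurrences described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions: it uses a plain recurrent update with weights U, W, V (forward) and U′, W′, V′ (reverse, written `U_r`, `W_r`, `V_r`), omits the LSTM gates entirely, and assumes `tanh` for f and the identity for g. The layer sizes and random weights are hypothetical.

```python
import numpy as np

def bidirectional_forward(x, U, W, V, U_r, W_r, V_r, f=np.tanh, g=lambda z: z):
    """Run a simple bidirectional recurrence over a sequence.

    x: array of shape (T, n_in). Returns (outputs, forward states, reverse states).
    """
    T, n_hidden = x.shape[0], W.shape[0]
    s_fwd = np.zeros((T, n_hidden))
    s_rev = np.zeros((T, n_hidden))

    prev = np.zeros(n_hidden)
    for t in range(T):                  # forward pass: t = 0 .. T-1
        prev = f(U @ x[t] + W @ prev)
        s_fwd[t] = prev

    nxt = np.zeros(n_hidden)
    for t in range(T - 1, -1, -1):      # reverse pass: t = T-1 .. 0
        nxt = f(U_r @ x[t] + W_r @ nxt)
        s_rev[t] = nxt

    # Each output combines the forward and reverse states of the same step.
    outputs = g(s_fwd @ V.T + s_rev @ V_r.T)
    return outputs, s_fwd, s_rev

# Hypothetical sizes: 3 input curves (e.g. GR, AC, RT), 4 hidden units, 1 output.
rng = np.random.default_rng(0)
T, n_in, n_hidden, n_out = 5, 3, 4, 1
U = rng.normal(0.0, 0.5, (n_hidden, n_in))
W = rng.normal(0.0, 0.5, (n_hidden, n_hidden))
V = rng.normal(0.0, 0.5, (n_out, n_hidden))
U_r = rng.normal(0.0, 0.5, (n_hidden, n_in))
W_r = rng.normal(0.0, 0.5, (n_hidden, n_hidden))
V_r = rng.normal(0.0, 0.5, (n_out, n_hidden))

x = rng.normal(size=(T, n_in))
outputs, s_fwd, s_rev = bidirectional_forward(x, U, W, V, U_r, W_r, V_r)
print(outputs.shape)  # (5, 1)
```

Because the reverse states are computed from the end of the sequence backwards, the output at an early step also depends on later inputs, which a one-way recurrence cannot capture.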
3.3. Pseudo-Curve Reconstruction Experiment
Given the influence of the reservoir lithologies, the physical properties and the oil/gas-bearing properties on the logging curves, we used three logging curves, GR, AC and RT, as the input curves for CNL reconstruction. Taking the logging data of well X1 as the training sample and well X5 as the test sample, experiments were conducted using multiple linear regression (MLR), BP neural network, support vector machine (SVM) and BI-LSTM [5,14,28]. The reconstruction methods were compared using three test parameters: the correlation coefficient (R), the root mean square error (RMSE) and the determination coefficient (R²). The closer R² is to 1, the more accurate the estimate.
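The calculation methods are as follows.

SSres is the sum of squares of residuals, which can be calculated by Equation (4):

$$SS_{res} = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 \tag{4}$$

where $y_i$ is the real sample data, $\bar{y}$ is the sample average value, $\hat{y}_i$ is the estimated data and $n$ is the number of samples.

SStot is the total sum of squares, which can be calculated by Equation (5):

$$SS_{tot} = \sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2 \tag{5}$$

R² is the determination coefficient, which can be calculated by Equation (6):

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} \tag{6}$$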
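As a minimal sketch, the three test parameters can be computed as follows. The residual and total sums of squares and R² follow Equations (4) to (6) as described in the text; the RMSE and Pearson R formulas are the standard definitions and are assumed here, since the text does not give them explicitly. The function name `evaluate` and the sample values are hypothetical.

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Return the correlation coefficient R, the RMSE and R^2 for a prediction."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # Equation (4)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # Equation (5)
    r2 = 1.0 - ss_res / ss_tot                       # Equation (6)
    rmse = np.sqrt(ss_res / y_true.size)             # assumed standard definition
    r = np.corrcoef(y_true, y_pred)[0, 1]            # assumed Pearson correlation
    return {"R": float(r), "RMSE": float(rmse), "R2": float(r2)}

# Toy values: SSres = 0.10 and SStot = 5.0, so R^2 = 1 - 0.10 / 5.0 = 0.98.
scores = evaluate([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print(scores)
```

Note that R² computed this way can be negative on a test well when the prediction is worse than simply using the mean of the measured curve, whereas R only measures how well the trends agree.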
The experimental results of well X1 and well X5 are shown in Table 3.
Figure 5 shows the effect of the regression algorithms in well X1. Combined with Table 3, the self-judging curve, that is, the curve fitted back to the training set, was found to be consistent with the measured curve. Comparing the algorithms, we drew the following conclusions: for the correlation coefficient (R), all of the regression algorithms exceeded 0.95, showing good results; for the root mean square error (RMSE), BI-LSTM was the smallest at only 0.153 and multiple linear regression (MLR) was the largest at 0.799, with the BP neural network and the support vector machine in between; for the determination coefficient (R²), SVM and BI-LSTM were the best and MLR was the worst. In brief, all three parameters indicated that BI-LSTM was the optimal algorithm for the pseudo-curve construction.
Figure 6 shows the prediction effect of the different algorithms in well X5. The main observations are as follows: for most well sections, the predicted curves were consistent with the measured curve in trend, but for some borehole enlargement sections, the curves predicted by multiple linear regression and the BP algorithm were diametrically opposite to the measured curve. The three-parameter analysis showed that the prediction effect of these two algorithms was poor: R was less than 0.7, RMSE was greater than 1.2 and R² was less than 0.6. In comparison, although the prediction effect of the support vector machine was fairly good, BI-LSTM showed greater advantages: R was significantly higher, RMSE was only 0.96 and R² reached 0.71. Through the BI-LSTM algorithm, we obtained a pseudo-neutron curve. The frequency distribution range of this pseudo-neutron curve was narrower than that of the original curve; its values were lower than the original curve in the high-gamma segment and higher in the low-gamma segment. After standardization, the pseudo-neutron curve was consistent with the original curve in the low-gamma segment, but there was still a large error in the high-gamma segment. Importantly, despite these large errors, the processing results did not interfere with the identification of the reservoirs or the calculation of the reservoir parameters, mainly because the high-gamma segment was not the favorable reservoir.
Overall, the results show that the BI-LSTM network was the best of the tested methods for pseudo-curve construction and greatly improved the prediction precision and accuracy. The BI-LSTM algorithm reconciled the internal relationships between the different logging curves, took into account the dependency of the sample sequences in the logging domain, handled the long-term dependence problem and made full use of the data information.
3.4. The Impact of Data Volume on Predicted Results
For a machine learning algorithm, in general, the more training samples are available, the more regularity information about the data set can be learned [4,29]. As training tasks are carried out and experience accumulates, the model acquires a stronger generalization ability. To discuss the influence of the amount of data on the prediction result of BI-LSTM, we added the data of wells X1, X2, X3 and X4 to the training data in turn and used the data of well X5 to test the effect.
It can be seen from Figure 7 and Table 4 that, as the training data increased, R and R² of the pseudo-curve prediction showed a rising trend, whereas RMSE did not, suggesting that the prediction accuracy of BI-LSTM was improving. It is worth noting, however, that BI-LSTM was not affected by the amount of data alone: when wells X1 and X2 were taken as the training data set, R² increased significantly from 0.71 to 0.76 but, as the training data of wells X3 and X4 were added, the growth of R² gradually slowed down.
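The protocol of this experiment, growing the training set well by well and always scoring on well X5, can be sketched as a simple loop. To keep the example self-contained, a least-squares linear model stands in for BI-LSTM and the wells are synthetic; neither reflects the actual model or data of this study, and all function names are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with an intercept; a simple stand-in for BI-LSTM."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef

def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def make_synthetic_well(rng, n_samples=300):
    """SYNTHETIC (GR, AC, RT) -> CNL data; not real well logs."""
    X = rng.normal(size=(n_samples, 3))
    y = X @ np.array([0.6, 0.3, -0.2]) + rng.normal(0.0, 0.3, size=n_samples)
    return X, y

rng = np.random.default_rng(0)
wells = {name: make_synthetic_well(rng) for name in ["X1", "X2", "X3", "X4", "X5"]}
X_test, y_test = wells["X5"]

results = []
for k in range(1, 5):  # train on X1, then X1-X2, X1-X3, X1-X4
    names = ["X1", "X2", "X3", "X4"][:k]
    X_train = np.vstack([wells[w][0] for w in names])
    y_train = np.concatenate([wells[w][1] for w in names])
    coef = fit_linear(X_train, y_train)
    score = r2_score(y_test, predict_linear(coef, X_test))
    results.append((names, len(y_train), score))

for names, n_train, score in results:
    print(f"train on {'+'.join(names)} ({n_train} samples): R2 on X5 = {score:.3f}")
```

Keeping the test well fixed while only the training set grows is what allows the change in R² to be attributed to the amount of training data.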