Well Logging Based Lithology Identification Model Establishment Under Data Drift: A Transfer Learning Method

Recent years have witnessed the growing application of machine learning technologies to well logging-based lithology identification. Most of the existing work assumes that the well loggings gathered from different wells share the same probability distribution; however, variations in sedimentary environment and well-logging technique can cause the data drift problem, i.e., data from different wells follow different probability distributions. Therefore, a model trained on old wells does not perform well in predicting the lithologies of newly drilled wells, which motivates us to propose a transfer learning method named the data drift joint adaptation extreme learning machine (DDJA-ELM) to increase the accuracy of an old model applied to new wells. In this method, three key components, i.e., the projected maximum mean discrepancy, joint distribution domain adaptation, and manifold regularization, are incorporated into the extreme learning machine. As found experimentally on multiple wells in the Jiyang Depression, Bohai Bay Basin, DDJA-ELM significantly increases the accuracy of an old model when identifying the lithologies in new wells.


Introduction
Well logging data have grown dramatically over the past few decades due to the widespread deployment of oil wells and the rapid development of sensing technology. This large amount of data not only brings more opportunities to understand the subsurface but also poses greater challenges for the interpretation of logging data [1,2]. Well logging provides an objective and continuous method with which to observe the properties of the rocks through which the drill bit passes and to describe the deposition process quantitatively. As a bridge between surface geophysical survey and subsurface geology, well logging is an effective and irreplaceable method for understanding reservoir characteristics. As conventional reservoirs dry up, oil and gas companies are turning to unconventional exploration and development in shale and low-permeability reservoirs, posing further challenges for logging interpretation [3].
In traditional logging interpretation, lithology determination, porosity, and permeability calculations are performed by specialists in exploration and geology with specialized knowledge.
Lithology identification is a fundamental problem in well logging interpretation and is of considerable significance in petroleum exploration engineering. It is the basis of reservoir parameter calculation (such as porosity, shale volume, and permeability) and geological research (such as formation correlation, sedimentary modeling, and favorable zone prediction). Due to the complexity of reservoir geological conditions, the uncertainty of exploration data, and the inconsistency of expert experience, the results of lithology identification depend mainly on expertise. With the increasing diversity of logging data, the traditional interpretation methods that rely on human experience show shortcomings and limitations. As a result, researchers are turning to more advanced data analysis methods for breakthroughs in lithology identification.
In recent years, with the rapid development of machine learning technology, its application in lithology identification has attracted considerable attention [4]. In order to determine lithology types accurately and effectively, a large body of work based on different machine learning methods has emerged [5]. For example, G. Askari [6] used satellite remote sensing data to extract the lithological information of the Deh-Molla sedimentary succession by principal component analysis. Al-Anazi et al. [7] proposed a support vector machine-based classification method with fuzzy-theory-based feature selection to recognize discriminative features and improve lithology recognition performance; their results show that, compared with discriminant analysis and probabilistic neural networks, support vector machines can identify the different lithologies of heterogeneous sandstone reservoirs more accurately. Wang et al. [8] proposed a novel back propagation (BP) model by modifying the self-adapting algorithm and activation function, which proved effective in predicting the lithologies of the Kela-2 gas field. Xie et al. [9] evaluated five typical machine learning methods, namely naive Bayes, support vector machines, artificial neural networks, random forest, and gradient tree boosting, on the formation lithology identification data of the Daniudi and Hangjinqi gas fields. Their experimental results show that the ensemble methods, i.e., random forest and gradient tree boosting, have lower prediction errors, with gradient tree boosting achieving the highest accuracy among the five methods. In order to reduce exploration uncertainty, Bhattacharya et al. [10] compared different methods of mudstone facies classification and prediction from conventional logging data, and applied them to the Devonian Bakken and Mahantango-Marcellus formations in North America.
In that study, support vector machines (SVM), artificial neural networks (ANN), self-organizing maps (SOM), and multi-resolution graph-based clustering (MRGC) were compared experimentally. Dev et al. [11] analyzed the data from China's Daniudi and Hangjinqi gas fields with gradient boosted decision tree systems (i.e., XGBoost, LightGBM, and CatBoost) to study formation lithology classification, and compared their performance with random forest, AdaBoost, and gradient boosting machines. The experiments show that LightGBM and CatBoost are the preferred algorithms for lithology classification using well logging data. A large number of similar studies can be found in [12][13][14][15][16].
In addition to these direct applications, more and more scholars are focusing on how to improve existing machine learning tools to solve practical problems in lithology identification. Due to the uneven distribution of underground lithologies, there is a severe class imbalance problem in the training dataset. To address this, Deng et al. [17] introduced a borderline synthetic minority oversampling (borderline-SMOTE) technique for dealing with imbalanced data, and the results showed that this method could effectively improve the classification accuracy of SVM, especially for the minority classes. There are many free hyper-parameters in machine learning algorithms, and their settings significantly influence the performance of lithology identification. Therefore, Saporetti et al. [18] adopted an evolutionary parameter tuning strategy, combining gradient boosting (GB) with differential evolution (DE) to optimize the hyper-parameters, thereby making lithology identification more stable. In the study of [19], wavelet decomposition was used to construct multi-channel images of logging data, and the lithology identification problem based on logging curves was thereby transformed into an image segmentation problem; the feasibility of this method was verified by an application in the Daqing oilfield. Aiming at the issue of data drift between different wells, Ao et al. [20] proposed a hybrid algorithm for lithology identification that combines the mean-shift algorithm and the random forest algorithm in a prototype similarity space, pointing out that a more accurate lithology identification model can be obtained by transforming the classification problem into the prototype similarity space. Li et al. [21] proposed a semi-supervised algorithm based on generative adversarial networks, which uses logging curves as the labeled data and seismic data as the unlabeled data.
The model was trained with the Adam algorithm and uses the discriminator to identify the lithologies. Compared with the experimental results of various supervised methods, the model can effectively use unlabeled data to achieve higher prediction accuracy. Similar work can be found in [22][23][24].
Although much work has been done, a practical problem, i.e., that the well loggings from different wells differ in probability distribution, has not been taken into consideration. Hence, the model trained on old wells might not perform well on new wells. As illustrated in Figure 1, the phenomenon of data drift occurs between two wells even when they are geospatially near. In particular, applying the model trained on well $W_b$ to well $W_a$ produces many errors, as shown in Figure 1a. To suppress the data drift-induced accuracy decrease, we propose a transfer learning method named the data drift joint adaptation extreme learning machine (DDJA-ELM) to increase the accuracy of an old model applied to new wells. By incorporating the projected maximum mean discrepancy, joint distribution domain adaptation, and manifold regularization into the extreme learning machine, we realize knowledge transfer from old wells to new wells. As experimentally verified on multiple wells in the Jiyang Depression, Bohai Bay Basin, DDJA-ELM can significantly increase the accuracy of an old model when identifying the lithologies in new wells. In the remainder of the paper, Section 2 expatiates on the proposed DDJA-ELM, which is evaluated in Section 3. The last section concludes the paper.

Figure 1. Visualization [25] of the data distribution for $W_a$ and $W_b$. Gray, yellow, and green indicate mudstone, sandstone, and dolomite, respectively. Points and hollow squares indicate data from $W_a$ and $W_b$, respectively. $W_a$ and $W_b$ are geospatially near.

Notation
Let $D = \{(x_i, y_i)\}_{i=1}^{n}$ denote a dataset, where $x_i \in \mathbb{R}^d$ is a sample, $y_i \in \mathbb{R}^c$ is its label, and $n$ and $c$ are the numbers of samples and classes, respectively. In well logging-based lithology identification, $d$ generally means the number of logging types, and $c$ denotes the number of lithology types. A sample is composed of the logging values at a certain depth. Considering the problem of data drift, we use $D_S$ and $D_T$ to differentiate the datasets with drift; i.e., $D_S$ and $D_T$ represent the source dataset for training and the target dataset to predict, respectively. Specifically, the source dataset is labeled and the target dataset is unlabeled, with $n_s$ and $n_t$ samples accordingly, and $n_s + n_t = n_{s+t}$, where $x^s_i$ and $x^t_i$ are source- and target-dataset samples, respectively, and $y^s_i$ is the source-dataset label of $x^s_i$. By defining $X_S = [x^s_1; \ldots; x^s_{n_s}]$, $Y_S = [y^s_1; \ldots; y^s_{n_s}]$, and $X_T = [x^t_1; \ldots; x^t_{n_t}]$, we have $D_S = \{(X_S, Y_S)\}$ and $D_T = \{X_T\}$.

Problem Definition and Formulation
In general, a classifier $f(x)$ trained on $D$ classifies well those samples that satisfy the independent and identically distributed (i.i.d.) assumption. However, the classifier $f(x)$ trained on $D_S$ might not perform well on $D_T$ because data drift means the i.i.d. assumption does not hold; i.e., the knowledge learned from $D_S$ cannot transfer to $D_T$. In this case, it is necessary to add new constraint conditions while learning the classifier $f$ in order to achieve the expected performance. Specifically, the expected performance can be summarized as the following constraint conditions: (i) minimizing the structural risk to avoid overfitting; (ii) minimizing the data drift between the source dataset and the target dataset; (iii) minimizing the prediction inconsistency within the target dataset. Taking the ELM as the basic classifier $f$, according to structural risk minimization (SRM) and regularization techniques [26], the classifier $f$ can be denoted as

$$f = \arg\min_{f}\; \ell(f(x;\beta), y) + \mathcal{R}(f), \qquad (1)$$

where the first term represents the empirical loss on samples (i.e., describing how well the model fits the training data), and the second term represents the regularization term (i.e., the formulation of the constraint conditions). Thus, combined with the constraint conditions mentioned above, the objective function is formulated mathematically as

$$\min_{\beta}\; \ell(f(x;\beta), y) + \|\beta\|^2 + \lambda\,\Omega(D_S, D_T;\beta) + \gamma\, M(D_T;\beta), \qquad (2)$$

where the empirical loss term $\ell(f(x;\beta), y)$ and the structural risk regularization term $\|\beta\|^2$ capture the accuracy and complexity of the ELM model. The term $\Omega(D_S, D_T;\beta)$ indicates the extent of data drift between $D_S$ and $D_T$. Additionally, we introduce the manifold regularization term $M(D_T;\beta)$ to improve the prediction consistency within the target dataset. $\lambda$ and $\gamma$ are the corresponding regularization parameters.
In the remainder of this section, we will expatiate on each term of the objective function.

ELM
Recent years have witnessed the development of a promising machine learning model, the extreme learning machine (ELM). ELM is an artificial neural network with a single hidden layer, which was first proposed by Huang et al. [27] and has found applications in many domains, such as robotic perception, hyperspectral image classification, lithology identification, and human activity recognition [28][29][30][31][32]. Compared with support vector machines and other artificial neural networks, ELM has significant advantages in generalization performance and training time. In addition, many variants of ELM have been investigated, including semi-supervised ELM, multi-kernel ELM, rough ELM, one-class ELM, etc. [33][34][35][36].
According to the objective function (2), we first describe the mathematical model of ELM as follows [37]:

$$\min_{\beta}\; \frac{1}{2}\|\beta\|^2 + \frac{C}{2}\sum_{i=1}^{n_{s+t}} j_i \|e_i\|^2, \quad \text{s.t. } h(x_i)\beta = y_i^{\mathsf T} - e_i^{\mathsf T},\; i = 1, \ldots, n_{s+t}, \qquad (3)$$

where $e_i \in \mathbb{R}^c$ represents the error vector with respect to the $i$-th training sample, and $\sum_{i=1}^{n_{s+t}} \|e_i\|^2$ is the sum of prediction errors. The tradeoff coefficient $C$ is used to balance the contribution of the two terms. Since the target dataset $D_T$ is unlabeled, $j_i = 1$ if $i \le n_s$ and $j_i = 0$ otherwise; $y_i$ equals a zero vector if $i > n_s$. The output of the hidden layer is $h(x) = [g(w_1 \cdot x + b_1), \ldots, g(w_z \cdot x + b_z)]$, where $z$ is the number of hidden neurons; $w_j \in \mathbb{R}^d$ and $b_j \in \mathbb{R}$ are the $j$-th randomly generated weight vector and bias constant; and $g(\cdot)$ is a piecewise continuous nonlinear activation function, such as the sigmoid function $g(x) = 1/(1 + e^{-x})$. In this paper, the sigmoid function is used as the activation function.
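To make the basic model concrete, the following minimal sketch trains a plain ELM by drawing the random hidden layer and solving the resulting ridge-regularized least-squares problem in closed form. The function names and default hyper-parameter values are our own illustration, not the authors' code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def elm_train(X, Y, z=50, C=1e3, seed=0):
    """Minimal ELM sketch: random hidden layer plus closed-form output weights.
    X: (n, d) features; Y: (n, c) one-hot labels; z hidden neurons; tradeoff C."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], z))   # random input weights (never trained)
    b = rng.standard_normal(z)                 # random biases (never trained)
    H = sigmoid(X @ W + b)                     # hidden-layer output matrix
    # Overdetermined closed form: beta = (I_z / C + H^T H)^{-1} H^T Y
    beta = np.linalg.solve(np.eye(z) / C + H.T @ H, H.T @ Y)
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Predict class indices by taking the arg-max of the network output."""
    return np.argmax(sigmoid(X @ W + b) @ beta, axis=1)
```

Only `beta` is learned; the random hidden layer is fixed, which is what makes ELM training a single linear solve rather than an iterative optimization.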

Weighted ELM
Additionally, since each lithology in dataset $D$ usually contains a different number of samples, appropriate weights should be assigned to the error vectors to address the sample-imbalance issue, so (3) is rewritten as

$$\min_{\beta}\; \frac{1}{2}\|\beta\|^2 + \frac{C}{2}\sum_{i=1}^{n_{s+t}} \omega_i\, j_i\, \|e_i\|^2, \quad \text{s.t. } h(x_i)\beta = y_i^{\mathsf T} - e_i^{\mathsf T}, \qquad (4)$$

where $\omega_i$ is the weight of the $i$-th sample, set inversely proportional to the size of the class to which the sample belongs. In matrix form,

$$\min_{\beta}\; \frac{1}{2}\|\beta\|^2 + \frac{C}{2}\,\mathrm{tr}\big((Y - H\beta)^{\mathsf T} W (Y - H\beta)\big), \qquad (5)$$

where $W = \mathrm{blockdiag}(W_S, 0_{n_t \times n_t}) \in \mathbb{R}^{n_{s+t} \times n_{s+t}}$; $W_S \in \mathbb{R}^{n_s \times n_s}$ is a diagonal matrix with entries $\omega_i$; $Y = [Y_S; 0_{n_t \times c}] \in \mathbb{R}^{n_{s+t} \times c}$; $H = [h(x_1); \ldots; h(x_{n_{s+t}})] \in \mathbb{R}^{n_{s+t} \times z}$; and $J = \mathrm{blockdiag}(I_{n_s \times n_s}, 0_{n_t \times n_t}) \in \mathbb{R}^{n_{s+t} \times n_{s+t}}$, where $0$ and $I$ are the zero matrix and the identity matrix of appropriate dimensions, respectively. By setting the gradient of (5) over $\beta$ to zero, we have

$$\beta - C H^{\mathsf T} W (Y - H\beta) = 0, \qquad (6)$$

where $(\cdot)^{\mathsf T}$ represents the transpose of a matrix. There are two forms of the optimal solution $\beta^*$. When the number of training samples is larger or smaller than the number of hidden neurons, $H$ has more or fewer rows than columns, resulting in an overdetermined or underdetermined least-squares problem, respectively. The corresponding closed-form solutions for (6) are

$$\beta^* = \Big(\frac{I_z}{C} + H^{\mathsf T} W H\Big)^{-1} H^{\mathsf T} W Y \quad \text{or} \quad \beta^* = H^{\mathsf T}\Big(\frac{I_{n_{s+t}}}{C} + W H H^{\mathsf T}\Big)^{-1} W Y, \qquad (7)$$

where $I_{n_{s+t}}$ and $I_z$ are the identity matrices of dimensions $n_{s+t}$ and $z$.
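The weighted variant can be sketched as follows, under the common assumption that each sample's weight is the inverse of its class count (the paper's exact weighting scheme may differ); the helper names are hypothetical:

```python
import numpy as np

def class_balance_weights(y):
    """One weight per sample, inversely proportional to its class frequency
    (a common choice for weighted ELM)."""
    counts = np.bincount(y)
    return 1.0 / counts[y]

def welm_beta(H, Y, w, C=1e3):
    """Closed-form weighted-ELM solution (overdetermined case, as in (7)):
    beta = (I_z / C + H^T W H)^{-1} H^T W Y, with W = diag(w)."""
    z = H.shape[1]
    HW = H.T * w  # H^T @ diag(w) without materializing the diagonal matrix
    return np.linalg.solve(np.eye(z) / C + HW @ H, HW @ Y)
```

With uniform weights this reduces exactly to the unweighted ELM solution, which is a convenient sanity check when implementing it.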

Data Drift Adaptation
In general, the data distribution properties can be used to statistically describe the correlation among the samples $(x_i, y_i) \in D$; thus, the data drift-induced distribution discrepancy can be estimated by statistical criteria. In this paper, we utilize the projected maximum mean discrepancy (PMMD) criterion to measure the distribution discrepancy [38]. The regularization term $\Omega(D_S, D_T;\beta)$ with PMMD is formulated as

$$\Omega_P(D_S, D_T;\beta) = \Big\| \frac{1}{n_s}\sum_{i=1}^{n_s} h(x^s_i)\beta - \frac{1}{n_t}\sum_{j=1}^{n_t} h(x^t_j)\beta \Big\|^2. \qquad (8)$$

Further, (8) can be rewritten in matrix form as

$$\Omega_P(D_S, D_T;\beta) = \mathrm{tr}(\beta^{\mathsf T} H^{\mathsf T} M_P H \beta), \qquad (9)$$

where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix. The elements of the matrix $M_P \in \mathbb{R}^{n_{s+t} \times n_{s+t}}$ are calculated by

$$(M_P)_{pq} = \begin{cases} 1/n_s^2, & p, q \le n_s \\ 1/n_t^2, & p, q > n_s \\ -1/(n_s n_t), & \text{otherwise.} \end{cases} \qquad (10)$$

According to (8), the samples $x^s_i$ and $x^t_j$ are transformed from the feature space to the mapping space by $h(\cdot)$, where the distribution discrepancy between the source and target datasets can be reduced by adjusting $\beta$. Thus, the modified WELM, named the data drift adaptation WELM (DDA-WELM) classifier, can be obtained by introducing the regularization term $\Omega_P(D_S, D_T;\beta)$ to adapt to the data drift.
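Since $M_P$ in (10) depends only on $n_s$ and $n_t$, it can be built as a single rank-one outer product. A small sketch (the helper name is ours):

```python
import numpy as np

def pmmd_matrix(n_s, n_t):
    """Build M_P for the PMMD term tr(beta^T H^T M_P H beta):
    entries are 1/n_s^2 (source-source), 1/n_t^2 (target-target),
    and -1/(n_s * n_t) (cross), exactly as in (10)."""
    e = np.concatenate([np.full(n_s, 1.0 / n_s), np.full(n_t, -1.0 / n_t)])
    return np.outer(e, e)
```

Writing $M_P = e e^{\mathsf T}$ also makes it obvious that every row of $M_P$ sums to zero, so a constant projection incurs no PMMD penalty.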
However, the overall accuracy is increased at the cost of the accuracies of some classes, because DDA-WELM reduces the distribution discrepancy between the two datasets as a whole. Consequently, we modify the PMMD into the joint PMMD (JPMMD), which aims to reduce the distribution discrepancy between the corresponding classes of the two datasets. The regularization term $\Omega(D_S, D_T;\beta)$ with JPMMD is formulated as

$$\Omega_J(D_S, D_T;\beta) = \sum_{k=1}^{c}\Big\| \frac{1}{n_s^{(k)}}\sum_{y^s_i = k} h(x^s_i)\beta - \frac{1}{n_t^{(k)}}\sum_{\tilde y^t_j = k} h(x^t_j)\beta \Big\|^2, \qquad (11)$$

where $y^s_i$ is the real label of $x^s_i$ and $\tilde y^t_j$ is the pseudo label of $x^t_j$ generated by the DDA-WELM classifier; $n_s^{(k)}$ and $n_t^{(k)}$ are the number of source-dataset samples belonging to class $k$ and the number of target-dataset samples whose pseudo labels are $k$, respectively.
Similarly, (11) can be rewritten in matrix form as

$$\Omega_J(D_S, D_T;\beta) = \mathrm{tr}(\beta^{\mathsf T} H^{\mathsf T} M_J H \beta), \qquad (12)$$

where $M_J = \sum_{k=1}^{c} M_k$ and the elements of each $M_k \in \mathbb{R}^{n_{s+t} \times n_{s+t}}$ are computed by

$$(M_k)_{pq} = \begin{cases} 1/(n_s^{(k)})^2, & p, q \le n_s \text{ and } y^s_p = y^s_q = k \\ 1/(n_t^{(k)})^2, & p, q > n_s \text{ and } \tilde y^t_{p-n_s} = \tilde y^t_{q-n_s} = k \\ -1/(n_s^{(k)} n_t^{(k)}), & p \le n_s,\ q > n_s \text{ and } y^s_p = \tilde y^t_{q-n_s} = k\ (\text{or vice versa}) \\ 0, & \text{otherwise.} \end{cases} \qquad (13)$$

Thus, the data drift joint adaptation WELM (DDJA-WELM) classifier can be obtained by introducing the regularization term $\Omega_J(D_S, D_T;\beta)$, with pseudo labels supplied by the DDA-WELM classifier, to improve the WELM.
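The class-conditional matrix $M_J = \sum_k M_k$ in (13) can be assembled analogously to $M_P$, with one signed indicator vector per class. A sketch assuming integer-coded labels (helper name is ours):

```python
import numpy as np

def jpmmd_matrix(y_s, y_t_pseudo, n_classes):
    """Build M_J for the JPMMD term tr(beta^T H^T M_J H beta).
    y_s: true source labels; y_t_pseudo: pseudo labels from the
    previous-stage classifier (e.g., DDA-WELM)."""
    n_s, n_t = len(y_s), len(y_t_pseudo)
    M = np.zeros((n_s + n_t, n_s + n_t))
    for k in range(n_classes):
        e = np.zeros(n_s + n_t)
        sk = (y_s == k)          # source samples of class k
        tk = (y_t_pseudo == k)   # target samples pseudo-labeled k
        if sk.sum() > 0:
            e[:n_s][sk] = 1.0 / sk.sum()
        if tk.sum() > 0:
            e[n_s:][tk] = -1.0 / tk.sum()
        M += np.outer(e, e)      # M_k = e_k e_k^T, summed over classes
    return M
```

A class that is absent from either dataset simply contributes a one-sided indicator; in a full implementation one might instead skip such classes, which is a design choice the paper does not spell out.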

Manifold Regularization
Manifold regularization is widely used to improve the smoothness of predictions and to prevent the classifier from cutting through high-density regions [39]. Specifically, we introduce the manifold regularization $M(D_T;\beta)$ to assign smooth labels within the target dataset and make the classifier more adaptable to the target dataset.
The manifold regularization $M(D_T;\beta)$ is formulated as

$$M(D_T;\beta) = \sum_{i,j=1}^{n_t} a_{i,j}\, \big\| h(x^t_i)\beta - h(x^t_j)\beta \big\|^2, \qquad (14)$$

where the similarity $a_{i,j}$ between samples $x^t_i$ and $x^t_j$ is calculated by

$$a_{i,j} = \begin{cases} \exp\big(-\|x^t_i - x^t_j\|^2 / (2\sigma^2)\big), & x^t_j \in \mathcal{N}(x^t_i) \\ 0, & \text{otherwise,} \end{cases} \qquad (15)$$

where $\mathcal{N}(x^t_i)$ is the set of $\kappa$-nearest neighboring samples of $x^t_i$ under the Euclidean distance in feature space, and $\sigma > 0$ is the width of the Gaussian kernel. Additionally, (14) can be rewritten in matrix form as

$$M(D_T;\beta) = \mathrm{tr}(\beta^{\mathsf T} H^{\mathsf T} L H \beta), \qquad (16)$$

where $L = \mathrm{blockdiag}(0_{n_s \times n_s}, L_T) \in \mathbb{R}^{n_{s+t} \times n_{s+t}}$, and $L_T = D - A$ is the Laplacian matrix of the target dataset, with $A = [a_{i,j}]$ and $D$ the diagonal degree matrix, $D_{ii} = \sum_j a_{i,j}$. Thus, the classifiers DDA-WELM and DDJA-WELM can be upgraded to DDA-S2WELM and DDJA-S2WELM by introducing the manifold regularization term $M(D_T;\beta)$. Moreover, the regularization term $\Omega_J(D_S, D_T;\beta)$ with pseudo labels generated by the DDA-S2WELM classifier can further improve the data-drift adaptation performance of the DDJA-WELM and DDJA-S2WELM classifiers.
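The target-dataset Laplacian in (16) can be sketched as follows. Symmetrizing the $\kappa$-NN graph is our assumption (the paper does not state how asymmetric neighborhoods are handled), and the helper name is ours:

```python
import numpy as np

def target_laplacian(Xt, k=5, sigma=1.0):
    """Graph Laplacian L_T = D - A on the target data, per (15)-(16).
    A uses a Gaussian kernel over the k nearest neighbours (Euclidean);
    k and sigma are assumed hyper-parameters."""
    n = Xt.shape[0]
    d2 = ((Xt[:, None, :] - Xt[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    A = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]                  # skip the point itself
        A[i, nbrs] = np.exp(-d2[i, nbrs] / (2 * sigma ** 2))
    A = np.maximum(A, A.T)                                 # symmetrise the kNN graph
    return np.diag(A.sum(1)) - A                           # degree matrix minus adjacency
```

By construction every row of $L_T$ sums to zero, so constant predictions over the target dataset incur no manifold penalty, which is exactly the smoothness behavior (14) encodes.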

Solution of Objective Function
The solution of the objective function (2) is introduced in this section. In particular, the regularization term $\Omega(D_S, D_T;\beta)$ takes the form of either $\Omega_P(D_S, D_T;\beta)$ with PMMD or $\Omega_J(D_S, D_T;\beta)$ with JPMMD, so we discuss the two cases in the following.

Solution of Objective Function with Ω P
By incorporating (5), (9), and (16) into (2), we have

$$\min_{\beta}\; \frac{1}{2}\|\beta\|^2 + \frac{C}{2}\,\mathrm{tr}\big((Y - H\beta)^{\mathsf T} W (Y - H\beta)\big) + \frac{\lambda}{2}\,\mathrm{tr}(\beta^{\mathsf T} H^{\mathsf T} M_P H \beta) + \frac{\gamma}{2}\,\mathrm{tr}(\beta^{\mathsf T} H^{\mathsf T} L H \beta). \qquad (17)$$

By setting the gradient of (17) with respect to $\beta$ to zero,

$$\beta - C H^{\mathsf T} W (Y - H\beta) + \lambda H^{\mathsf T} M_P H \beta + \gamma H^{\mathsf T} L H \beta = 0. \qquad (18)$$

According to (18), the closed-form solution of the optimal $\beta$ is

$$\beta^* = \big(I_z + C H^{\mathsf T} W H + \lambda H^{\mathsf T} M_P H + \gamma H^{\mathsf T} L H\big)^{-1} C H^{\mathsf T} W Y, \qquad (19)$$

where $z$ is the number of hidden neurons. The solution with $\Omega_J$ follows analogously by replacing $M_P$ with $M_J$ in (17)-(19).
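The closed-form solution (19) is a single regularized linear solve; a sketch with hypothetical argument names is given below. `M` stands for either $M_P$ or $M_J$, and `W` is the block-diagonal sample-weight matrix:

```python
import numpy as np

def ddja_beta(H, W, Y, M, L, C=1e3, lam=1e6, gam=1e5):
    """Closed-form solution sketch for (19):
    beta = (I_z + C H^T W H + lam H^T M H + gam H^T L H)^{-1} C H^T W Y."""
    z = H.shape[1]
    A = (np.eye(z)
         + C * H.T @ W @ H        # weighted empirical-loss term
         + lam * H.T @ M @ H      # PMMD / JPMMD drift-adaptation term
         + gam * H.T @ L @ H)     # manifold-smoothness term
    return np.linalg.solve(A, C * H.T @ W @ Y)
```

Setting `M` and `L` to zero matrices and `W` to the identity recovers the plain ridge-ELM solution, which is a useful check that the extra regularizers are wired in correctly.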

Experimental Verification
In this section, we conduct extensive experiments to verify the effectiveness of our method, using well-logging data collected from multiple wells in the Jiyang Depression, Bohai Bay Basin. The experimental datasets and settings are described first. Then, the performance and the impact of each regularization term are shown in detail. The analysis of hyper-parameter sensitivity is presented last.

Experimental Settings
As shown in Table 1, the experimental datasets are composed of three datasets collected from different regions in the Jiyang Depression, Bohai Bay Basin, containing 3, 2, and 2 wells, respectively (the wells in Figure 1 do not appear in the experimental datasets). Their relative positions are shown in Figure 2. Table 2 describes the data drift statistically by the maximum mean discrepancy (MMD) [40]. In one experiment, we set well A as the training data and well B as the testing data, thereby verifying the effectiveness of our method through A → B. Likewise, the other experiments are denoted as A → C, B → A, B → C, C → A, C → B, D → E, E → D, F → G, and G → F. Considering the different ranges of measurements, the sample values are transformed to [0, 1] by min-max normalization. In the experiments, we adopt the index Recall (i.e., the number of correctly classified samples of a class divided by the number of samples in that class) to calculate the classification accuracy, and the average recall (Macro-R) to represent the overall classification accuracy.

Figure 3 exhibits the logging curves, core (i.e., label), classification performance before the B → A transfer (i.e., using the model trained on data B to directly predict data A), and classification performance after the B → A transfer. It is observed that transfer learning could eliminate the data drift-induced accuracy loss. As shown in Figures 4-6, the classification performances of applying the classifiers ELM, S2ELM, DDA-ELM, DDA-S2ELM, DDJA-ELM, and DDJA-S2ELM to datasets 1, 2, and 3 are presented, respectively. It can be clearly observed from Figures 4a-c,g-i, 5a,c, and 6a,c that the accuracy of each class gradually increases with the introduction of the DDJA regularization terms, especially the accuracies of Si and Sh, which increase from 0 to more than 80% in Figures 4i and 6a,c, respectively.
Additionally, the ELM and S2ELM classifiers, which lack the DDJA regularization term, are insufficient for tasks with data drift.
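The preprocessing and evaluation metric used in these experiments can be sketched as follows (column-wise min-max scaling and macro-averaged recall; function names are our own):

```python
import numpy as np

def min_max_normalize(X, eps=1e-12):
    """Scale each logging curve (column) to [0, 1], as done before training.
    eps guards against constant columns."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo + eps)

def macro_recall(y_true, y_pred):
    """Macro-averaged recall (Macro-R): the unweighted mean of per-class recalls,
    so minority lithologies count as much as majority ones."""
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(recalls))
```

Note that in a transfer setting the min and max should be taken consistently (e.g., per well or jointly), since normalizing source and target wells with different ranges can itself introduce apparent drift.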

Experimental Results
Since the ELM-based classifiers use random weight and bias parameters, we conduct multiple experiments with different random seeds to generate different weights and biases, which yield the experimental results in Figures 4d-f,j-l, 5b,d, and 6b,d. According to the results, we have the following observations: (i) The overall Macro-R increases step by step, and the DDJA-S2ELM classifiers achieve the highest accuracy. Moreover, compared with the ELM, the Macro-R of DDJA-S2ELM is increased to at least 88% on dataset 1 (B → C and C → A), 80% on dataset 2 (D → E), and 69% on dataset 3 (F → G and G → F). Additionally, comparing the ELM classifier with the DDJA-S2ELM classifier, the Macro-R of our method is increased by 52% on dataset 1 (C → B), 10% on dataset 2 (E → D), and 42% on dataset 3 (F → G). (ii) Although the Macro-R is not increased on dataset 1 (A → C) and dataset 2 (D → E), the standard deviation is kept at a lower level; thus, stability is improved by introducing DDJA-S2ELM. (iii) Comparing the experimental results on dataset 3 with those on datasets 1 and 2, the ELM classifiers achieve only 27% (F → G) and 35% (G → F). The performance is significantly enhanced by our method when the data drift-induced accuracy decrease surges.

Parameters' Sensitivity
In this section, we analyze the sensitivity of the key hyper-parameters of DDJA-S2ELM: the trade-off coefficient C, the contribution coefficient λ of the DDJA regularization term, and the contribution coefficient γ of the semi-supervised regularization term. By analyzing the Macro-R over different settings of these hyper-parameters, a recommended configuration is given. To avoid repetition, we only show the results of B → C.

In Figure 7, we vary the hyper-parameter C from 100 to 10,000,000. It can be seen that the maximum and minimum Macro-R first increase and then decrease with increasing C, indicating that a small C incurs under-fitting and a large C causes over-fitting. Additionally, when C is set too small or too large, the overall accuracies either vary over a large range (a highest of 94.9% versus a lowest of 41.1% in Figure 7a) or stay only slightly above 70% (Figure 7f). According to Figure 7c,d, with C = 1000 and C = 10,000, respectively, the overall Macro-R is held steady (a highest of 96.9% and a lowest of 73.8% in Figure 7c) and the accuracies present a gradually increasing trend. Therefore, it is important to configure a moderate C first.

Figure 6. The results of performing our method on dataset 3. In (a) and (b), we show the recalls for the classes of Mu, Sa, and Sh and their macro average recall by transferring F to G. In (c) and (d), we show those by transferring G to F.

The influence of adjusting λ and γ is examined with C fixed. It can be seen from Figure 7 that the maximum Macro-R of 96.9% is achieved in Figure 7c when λ/C = 10,000 and γ/C = 100 (in Figure 7c, λ_max/C = 10,000, γ_max/C = 100, λ_min/C = 10, and γ_min/C = 0.1). Additionally, the Macro-R almost always first increases and then decreases as γ grows under fixed λ, as shown in Figure 7a-c. Moreover, these results show that the maximum is almost always achieved when λ is set larger than γ by two to four orders of magnitude. Since γ controls the contribution of the semi-supervised regularization term, which is based on the manifold assumption, and that assumption is invalid when the dataset suffers from data drift, we introduce the DDJA regularization term to suppress the drift; hence, the coefficient λ should be larger than γ.

According to the analysis of C, λ, and γ above, we conclude that: (i) C should first be searched over a relatively large range to find a moderate value; (ii) λ should be set larger than γ by two to four orders of magnitude. Suggested settings are C ∈ [10^3, 10^6], λ/C ∈ [10^3, 10^4], and γ/C ∈ [10^2, 10^3]. Furthermore, we use the Sobol method [41] (in our experiment, the Python tool "SALib" is employed, which can be found at https://salib.readthedocs.io/en/latest/index.html) to implement a global sensitivity analysis in which all parameters are varied simultaneously over the entire parameter space. The sensitivities shown in Table 3 demonstrate that: (i) C and λ contribute most compared with γ; (ii) C contributes slightly more than λ; and (iii) the interactions between these parameters are weak, so they are relatively independent.

Conclusions
In this paper, we have investigated well logging-based lithology identification under data drift and proposed a transfer extreme learning machine method to handle it. Based on the projected maximum mean discrepancy (PMMD) criterion and the extreme learning machine, we introduce a joint PMMD criterion and propose the DDJA-ELM to minimize the data drift between the source dataset and the target dataset. Additionally, in order to improve the prediction consistency within the target dataset, manifold regularization is introduced to promote the DDJA-ELM to the DDJA-S2ELM. Extensive experiments have validated the high and stable accuracy of our method.