A New Scheme to Improve the Performance of Artificial Intelligence Techniques for Estimating Total Organic Carbon from Well Logs

Abstract: Total organic carbon (TOC), a critical geochemical parameter of organic shale reservoirs, can be used to evaluate the hydrocarbon potential of source rocks. However, obtaining TOC through core analysis in geochemical experiments is costly and time-consuming. Therefore, in this paper, a TOC prediction model was built by combining data from a case study in the Ordos Basin, China, and core analysis with artificial intelligence techniques. First, the sample data were optimized with a fuzzy c-means clustering (FCM) method based on the simulated annealing algorithm (SA) and the genetic algorithm (GA), named the SAGA-FCM method. Then, a back propagation neural network (BPNN), a least square support vector machine (LSSVM), and a least square support vector machine based on the particle swarm optimization algorithm (PSO-LSSVM) were built on the optimized data. The results show that the models constructed from the optimized core sample data perform much better in both training and validation accuracy than the models constructed from the original data. In addition, the $R^2$ and MRSE of the PSO-LSSVM model are 0.9451 and 1.1883, respectively, which shows that models established with the optimal dataset of core samples have higher accuracy. This study shows that the quality of the sample data dramatically affects the prediction of the intelligence model and that the PSO-LSSVM model can represent the relationship between well log data and TOC; thus, PSO-LSSVM is a powerful tool for estimating TOC.


Introduction
The correct evaluation of source rock plays an important role in oil and gas exploration, and evaluating the abundance of organic matter in the source rock is an essential part of it. Source rock evaluation involves many parameters that reflect the physical characteristics of the source rock, and the total organic carbon (TOC) content is a basic and important index that represents the abundance of organic matter [1][2][3]. Although laboratory core analysis is the most direct method of obtaining the TOC content, it is costly and time-consuming and yields only limited TOC data, so it is difficult to meet the current demands of source rock evaluation. With the rapid development of unconventional oil and gas exploration, continuous and accurate estimation of the TOC is necessary. Well logging is characterized by high vertical resolution and continuous data. Therefore, TOC content prediction based on logging parameters has attracted more and more researchers [4][5][6].
Many achievements have been made through continuous improvements in the prediction of the TOC content [7][8][9]. Laboratory experiments and analyses, from which limited TOC data can be collected, are still necessary for the evaluation of source rock [10,11]. Organic matter in source rock has specific geophysical logging responses, so there is a certain relationship between the TOC content and logging parameters such as the neutron, natural gamma, density, resistivity, and acoustic time difference logs. Beers et al. [12] and Schmoker [13] calculated the TOC content using the natural gamma well log, which was found to be suitable for source rock that is rich in radioactive elements. Schmoker and Hester [14], Meyer and Nederlof [15], and Decker et al. [16] calculated the TOC content using the density log curve, which could not predict the TOC content accurately because there was no strong correlation between the density log curve and the TOC content. Autric and Dumesnil [17] calculated the TOC content using the acoustic time difference well log, which gave a better prediction when there was a strong correlation between the acoustic time difference and the TOC content. TOC prediction equations established from a single well log were strongly influenced by the physical differences between study areas, so the use of such an empirical equation does not favor accurate prediction of the TOC content. Guo et al. [7], Hu et al. [5], Wang et al. [18], and Zhao et al. [19] calculated the TOC content by combining the resistivity with neutron porosity well logs, an approach characterized by simplicity and convenience. However, the source rock maturity and the background value of the TOC content differed between study areas and were found to have a significant impact on the prediction.
Zhao et al. [20] and Kamali et al. [8] defined a clay content curve with the density and neutron porosity well logs, and then overlaid this curve with the natural gamma curve in order to calculate the TOC content. This method was found to be better than ∆logR in the same study field. In order to improve on the prediction of the TOC content from a single well log, Heidari et al. [4] selected multiple well logs to establish a multiple linear regression equation for the prediction of the TOC content. However, it was difficult to determine the related parameters because of the non-linear relationships among the well logs. There is a complicated non-linear function relationship between the logging information and the TOC content. Therefore, a simple linear regression can hardly approximate the real function relationship, and as a result it cannot predict the TOC content accurately from the well logs. In recent years, artificial intelligence has attracted wide attention from researchers and has been applied in many areas of research [21][22][23]. Research has shown that artificial intelligence methods can closely approximate nonlinear implicit functions. Moreover, existing research has revealed that artificial intelligence methods are practical for predicting the TOC content from well logs. Relational models between the logging parameters and the TOC content have been established with neural networks in order to predict the TOC content more accurately [6,8,9,24,25,26]. The algorithm and kernel function involved are closely related to the prediction accuracy. However, prediction models of the TOC content based on neural network methods cannot be established without real data. Because the function relationship is learned from real data, the prediction accuracy of the established models depends largely on those data.
In summary, no matter how perfect the artificial intelligence learning algorithm is, it cannot approximate the real function relationship between the logging parameters and the TOC content without real data.
Considering the influence of the sample data on artificial intelligence modeling, it is necessary to process the sample data before modeling. Fuzzy or inauthentic data should be deleted, and the sample data that accurately reflect the real function relationship should be retained for modeling. Fuzzy c-means clustering (FCM) is a clustering algorithm in which the grade of membership determines the degree to which each data point belongs to a certain cluster [27,28]. In order to improve the classification accuracy of the fuzzy c-means clustering algorithm, this study combined the simulated annealing algorithm with the genetic algorithm in the fuzzy c-means clustering, so as to classify the sample data and obtain high-quality data. In actual production, only limited TOC content data are available from coring experiments. Therefore, for this small sample, the TOC content prediction model was established with a least square support vector machine (LSSVM) based on the sample data before and after the optimization. Then, in order to improve the prediction accuracy, a least square support vector machine model based on particle swarm optimization (PSO-LSSVM) was established for the prediction of the TOC content. At the same time, a BP neural network model was established for comparative analysis, and a new method for the prediction of the TOC content was proposed.

Theory and Methodology
In this study, $X = \{x_1, x_2, \cdots, x_n\}$ was assumed to be the data samples; $c$ ($2 \le c \le n$) was the number of types of data samples; $\{A_1, A_2, \cdots, A_c\}$ were the types; $U$ was the similarity classification matrix; $\{v_1, v_2, \cdots, v_c\}$ were the cluster centers of the types; and $\mu_k(x_i)$ was the degree of membership of $x_i$ to $A_k$, abbreviated as $\mu_{ik}$. Then, the expression of the objective function $J_b$ was as follows:

$$J_b(U, v) = \sum_{i=1}^{n} \sum_{k=1}^{c} (\mu_{ik})^{b} (d_{ik})^{2} \quad (1)$$

$$d_{ik} = d(x_i - v_k) = \sqrt{\sum_{j=1}^{m} (x_{ij} - v_{kj})^{2}} \quad (2)$$

in which $d_{ik}$ represents the Euclidean distance between the ith sample $x_i$ and the center point of the kth type, $m$ indicates the number of the sample's characteristics, and $b$ is the weighted parameter, with $b > 1$. The fuzzy c-means clustering method searches for the classification that minimizes the function value $J_b$. It requires that the membership values of one sample to all clusters sum to 1:

$$\sum_{k=1}^{c} \mu_{ik} = 1, \quad i = 1, 2, \cdots, n$$

Equations (3) and (4) were used to calculate the grade of membership of $x_i$ to $A_k$ and the $c$ cluster centers $\{v_k\}$, respectively:

$$\mu_{ik} = \frac{1}{\sum_{j=1}^{c} \left( d_{ik} / d_{ij} \right)^{2/(b-1)}} \quad (3)$$

$$v_k = \frac{\sum_{i=1}^{n} (\mu_{ik})^{b} x_i}{\sum_{i=1}^{n} (\mu_{ik})^{b}} \quad (4)$$

If it was assumed that $I_k = \{ i \mid 2 \le c < n;\ d_{ik} = 0 \}$, then for all $i \in I_k$, $\mu_{ik} = 0$. The cluster centers and grades of membership were repeatedly updated with Equations (3) and (4). When the algorithm converged, the cluster centers and the grade of membership of each sample to each type were obtained, and the division of the fuzzy clusters was completed. Although FCM searches quickly, it is a local search algorithm and is particularly sensitive to the initial values of the cluster centers [29]. Therefore, if the initial values are not properly selected, the algorithm falls into a local minimum.
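The alternating update of cluster centers and memberships can be sketched in a few lines of NumPy. This is a minimal illustration of plain FCM under stated assumptions, not the authors' implementation: the function name `fcm`, the random initialization, the stopping rule on the change of $J_b$, the small floor on the distances, and the toy data are all hypothetical choices.

```python
import numpy as np

def fcm(X, c=2, b=2.0, max_iter=100, tol=1e-6, seed=0):
    """Plain fuzzy c-means: alternate center and membership updates."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Random initial memberships; each row is scaled to sum to 1.
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)
    J_prev = np.inf
    for _ in range(max_iter):
        W = U ** b
        # Cluster centers: membership-weighted means of the samples.
        V = (W.T @ X) / W.sum(axis=0)[:, None]
        # Euclidean distance of every sample to every center.
        D = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)
        D = np.fmax(D, 1e-12)  # assumed guard against zero distances
        # Membership update, equivalent to 1 / sum_j (d_ik / d_ij)^(2/(b-1)).
        inv = D ** (-2.0 / (b - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
        # Objective function J_b for the current partition.
        J = float(np.sum((U ** b) * D ** 2))
        if abs(J_prev - J) < tol:
            break
        J_prev = J
    return U, V, J

# Toy data (hypothetical): two well-separated groups of three points.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
U, V, J = fcm(X)
labels = U.argmax(axis=1)  # hard label = cluster with the larger membership
```

Because the result depends on the random start, a different `seed` can swap the cluster labels; this sensitivity to initialization is the weakness that the simulated annealing and genetic steps are meant to address.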

Methodology of the Simulated Annealing Algorithm (SA)
The simulated annealing algorithm has been successfully applied to combinatorial optimization, because the global optimal solution, or a nearly global optimum, can be found by simulating the annealing process of high-temperature objects [30]. The process of the simulated annealing algorithm was as follows:
(1) $S_0$ was chosen as the initial state, and $S(0) = S_0$ was set; the initial temperature was set to $T$, and $i = 0$ was set;
(2) $T = T_i$ was set; the Metropolis sampling algorithm was called with $T$ and $S_i$, and the state $S$ that it returned was taken as the current solution, $S_i = S$;
(3) The temperature was lowered according to a cooling schedule (the SAGA-FCM procedure in this study uses $T_{i+1} = kT_i$);
(4) The termination conditions were checked; if they were satisfied, the algorithm proceeded to step (5); otherwise, it returned to step (2);
(5) The current solution $S_i$ was taken as the optimum solution and output, and the process was terminated.
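The five steps above can be sketched as a short loop. This is an illustrative sketch only: the geometric cooling rule, the default parameter values, the quadratic toy cost, the uniform neighbor move, and the extra bookkeeping of the best state seen so far are assumptions made for the example, not details given in the text.

```python
import math
import random

def simulated_annealing(cost, neighbor, s0, t0=100.0, k=0.9,
                        t_end=1e-3, n_inner=50, seed=0):
    rng = random.Random(seed)
    s, t = s0, t0                      # step (1): initial state and temperature
    best = s
    while t > t_end:                   # step (4): termination check
        for _ in range(n_inner):       # step (2): Metropolis sampling at temperature t
            cand = neighbor(s, rng)
            delta = cost(cand) - cost(s)
            # Always accept improvements; accept worse states with exp(-delta / t).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                s = cand
                if cost(s) < cost(best):
                    best = s           # assumed convenience: remember the best state
        t *= k                         # step (3): geometric cooling (assumed schedule)
    return best                        # step (5): output the solution

# Toy problem (hypothetical): minimize (x - 3)^2 starting far from the optimum.
best = simulated_annealing(
    cost=lambda x: (x - 3.0) ** 2,
    neighbor=lambda x, rng: x + rng.uniform(-1.0, 1.0),
    s0=-10.0,
)
```

At high temperature almost every move is accepted, so the search roams widely; as the temperature falls, worse moves are rejected more often and the state settles near a minimum.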

Methodology of the Genetic Algorithm (GA)
(1) Encoding mode: In the genetic clustering algorithm, the parameters to be optimized were the c initial cluster centers, encoded in binary. Each chromosome consisted of c cluster centers. For an m-dimensional sample vector, the number of variables to be optimized was c × m. Assuming that each variable used k-bit binary coding, the chromosome was a binary code string of length c × m × k;
(2) Fitness function: This was the scale used to weigh the pros and cons of the individuals, analogous to how well an organism is adapted to its environment. Each individual took $J_b$ from Equation (1) as the objective function; the smaller $J_b$ was, the larger the fitness value of the individual was. Therefore, the ranking-based fitness assignment function FintV = ranking($J_b$) was employed;
(3) Selection operator: Stochastic universal sampling was used;
(4) Crossover operator: The single-point crossover operator was used;
(5) Mutation operator: Genes were mutated with a certain probability, and the genes to be mutated were selected stochastically. If the selected gene was encoded as 1, it became 0; otherwise, it became 1.
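The encoding, crossover, and mutation operators described above can be illustrated with plain Python lists. This is a sketch under assumptions: the helper names, the choice of k = 10 bits per variable, the decoding range [0, 1] (plausible for normalized data), and the mutation probability are hypothetical and are not taken from the paper.

```python
import random

def decode(bits, k, lo, hi):
    """Decode a chromosome of k-bit genes into real values in [lo, hi]."""
    values = []
    for start in range(0, len(bits), k):
        gene = bits[start:start + k]
        integer = int("".join(map(str, gene)), 2)
        values.append(lo + (hi - lo) * integer / (2 ** k - 1))
    return values

def single_point_crossover(a, b, rng):
    """Swap the tails of two parents after one random cut point."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(bits, p_m, rng):
    """Flip each gene (1 -> 0, 0 -> 1) with probability p_m."""
    return [1 - g if rng.random() < p_m else g for g in bits]

rng = random.Random(0)
c, m, k = 2, 9, 10          # 2 clusters, 9 logs, 10 bits per variable (k is assumed)
length = c * m * k          # chromosome length c x m x k
parent_a = [rng.randint(0, 1) for _ in range(length)]
parent_b = [rng.randint(0, 1) for _ in range(length)]
child_a, child_b = single_point_crossover(parent_a, parent_b, rng)
mutant = mutate(child_a, p_m=0.01, rng=rng)
centers = decode(mutant, k, 0.0, 1.0)   # c x m decoded center coordinates
```

With two clusters and nine logs, each chromosome decodes to 18 coordinates, that is, two candidate cluster centers in the nine-dimensional log space.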

Flowchart of the SAGA-FCM Algorithm
The flow of the fuzzy c-means clustering algorithm based on the simulated annealing genetic algorithm is shown in Figure 1. Logging parameters were used as the characteristic indexes of the sample data in order to classify the sample data by quality. The procedures were as follows:
(1) The control parameters were initialized, including the weighted index b of the fuzzy c-means clustering algorithm, the maximum number of iterations N, the termination tolerance D of the objective function, the population size sizepop, the maximum number of generations MAXGEN, the crossover probability $p_c$, the mutation probability $p_m$, the initial annealing temperature $T_0$, the cooling coefficient k, and the terminal temperature $T_{end}$;
(2) The c cluster centers were randomly initialized, and the initial population Chrom was generated; for each set of cluster centers, Equation (3) was used to calculate the grade of membership of each sample, and the fitness value $f_i$ of each individual was calculated, where $i = 1, 2, \cdots, sizepop$;
(3) The loop counter was set to gen = 0;
(4) The genetic operations of selection, crossover, and mutation were applied to the population Chrom, and Equations (3) and (4) were used to calculate the c cluster centers and the grade of membership of each sample for the newly generated individuals, as well as the fitness value $f_i'$ of each new individual. If $f_i' > f_i$, the old individual was replaced by the new one; otherwise, the new individual was accepted with a probability of $p = \exp\left( (f_i' - f_i)/T \right)$, and the old individual was abandoned;
(5) If gen < MAXGEN, then gen = gen + 1 was set and the procedure returned to Step (4); otherwise, it proceeded to Step (6);
(6) If $T_i < T_{end}$, the algorithm ended successfully and returned the global optimal solution; otherwise, the cooling operation $T_{i+1} = kT_i$ was performed, and the procedure returned to Step (3).
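The two annealing ingredients of this procedure, the acceptance test in Step (4) and the cooling loop in Step (6), can be isolated in a small sketch. It assumes the Metropolis form of the acceptance probability, exp((f_new - f_old) / T), which never exceeds 1 when the new fitness is not larger than the old one. The function names and the numerical values of T0, k, and T_end are hypothetical; the values of Table 4 are not reproduced here.

```python
import math
import random

def accept_new(f_old, f_new, T, rng):
    """Step (4): keep a fitter individual; otherwise accept it with
    probability exp((f_new - f_old) / T), which is at most 1 here."""
    if f_new > f_old:
        return True
    return rng.random() < math.exp((f_new - f_old) / T)

def cooling_schedule(T0, k, T_end):
    """Step (6): temperatures visited by T_{i+1} = k * T_i until T_i < T_end."""
    temps, T = [], T0
    while T >= T_end:
        temps.append(T)
        T *= k
    return temps

rng = random.Random(0)
temps = cooling_schedule(T0=100.0, k=0.8, T_end=1.0)   # assumed example values
better = accept_new(1.0, 2.0, temps[0], rng)            # fitter: always accepted
worse_cold = accept_new(2.0, 1.0, 1e-9, rng)            # worse at near-zero T: rejected
```

At each temperature in `temps` the genetic loop of Steps (3) to (5) would run for MAXGEN generations, so the total work is roughly the number of temperatures multiplied by MAXGEN and the population size.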

Methodology of the Least Square Support Vector Machine
A support vector machine (SVM) is a machine learning method proposed by Vapnik [31]. Based on statistical learning theory, SVM adopts the structural risk minimization principle to improve the generalization ability on small samples, and it avoids defects of neural networks such as long training time, randomness of the training results, and over-learning. Therefore, SVMs are widely used to build complicated non-linear models.
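LSSVM is a derivative of SVM that was proposed by Suykens [32], who introduced least square estimation into SVMs. In contrast to the inequality constraints and quadratic programming of the standard SVM, LSSVM solves a system of linear equations, which simplifies the operation process and improves calculation speed and accuracy. In the LSSVM, the square-error term is the optimization target, and an equality constraint is the constraint condition. This study used the regression form of LSSVM, and its main principles are as follows. For the training sample $\{(x_i, y_i)\}_{i=1}^{N}$ (with size N), in which $x_i \in \mathbb{R}^{n}$ is the input and $y_i \in \mathbb{R}$ is the output, the linear regression function in the low-dimensional space is as follows:

$$f(x) = \omega^{T} x + b \quad (5)$$

in which $\omega$ is the weight vector and $b$ is the offset. The regression function of this sample in the high-dimensional feature space is as follows:

$$f(x) = \omega^{T} \varphi(x) + b \quad (6)$$

in which the nonlinear transformation $\varphi(x)$ is the mapping from the low-dimensional space to the high-dimensional space. For LSSVM, in accordance with the structural risk minimization principle, the square-error loss function is selected as the optimization target, and the regression problem becomes the following quadratic optimization problem:

$$\min_{\omega, b, \xi} J(\omega, \xi) = \frac{1}{2} \omega^{T} \omega + \frac{c}{2} \sum_{i=1}^{N} \xi_i^{2} \quad (7)$$

in which $\xi_i$ is the slack variable and $c$ is the regularization parameter. Its constraint condition is as follows:

$$y_i = \omega^{T} \varphi(x_i) + b + \xi_i, \quad i = 1, 2, \cdots, N \quad (8)$$

In order to solve the optimization problem, a Lagrange function is introduced:

$$L(\omega, b, \xi, \alpha) = J(\omega, \xi) - \sum_{i=1}^{N} \alpha_i \left[ \omega^{T} \varphi(x_i) + b + \xi_i - y_i \right] \quad (9)$$

in which $\alpha_i$ represents the Lagrange multiplier. The following equations are obtained from the KKT (Karush-Kuhn-Tucker) optimality conditions:

$$\frac{\partial L}{\partial \omega} = 0 \Rightarrow \omega = \sum_{i=1}^{N} \alpha_i \varphi(x_i), \quad \frac{\partial L}{\partial b} = 0 \Rightarrow \sum_{i=1}^{N} \alpha_i = 0, \quad \frac{\partial L}{\partial \xi_i} = 0 \Rightarrow \alpha_i = c\,\xi_i, \quad \frac{\partial L}{\partial \alpha_i} = 0 \Rightarrow \omega^{T} \varphi(x_i) + b + \xi_i - y_i = 0 \quad (10)$$

The kernel function is defined as follows:

$$K(x_i, x_j) = \varphi(x_i)^{T} \varphi(x_j) \quad (11)$$

By eliminating $\omega$ and $\xi_i$ in Equation (10), the quadratic optimization problem is transformed into solving the linear Equation (12):

$$\begin{bmatrix} 0 & \mathbf{1}^{T} \\ \mathbf{1} & K + c^{-1} I \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix} \quad (12)$$

in which $\mathbf{1} = (1, \cdots, 1)^{T}$, $\alpha = (\alpha_1, \cdots, \alpha_N)^{T}$, $y = (y_1, \cdots, y_N)^{T}$, $I$ is the identity matrix, and $K$ is the kernel matrix with entries $K(x_i, x_j)$. The $\alpha$ and $b$ in the linear equations can be solved by least squares; then, the regression function of LSSVM is obtained as follows:

$$f(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i) + b \quad (13)$$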
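Because LSSVM training reduces to one linear system in the offset b and the multipliers α, it can be sketched directly with NumPy. This is a minimal sketch, not the authors' code: it assumes the regularization term is written as (c/2) Σ ξ², so that the system matrix contains K + I/c, and a Gaussian kernel of the form exp(-‖x - x'‖² / (2σ²)). The function names, the toy sine data, and the values c = 1000 and σ = 0.2 are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and B (assumed 2*sigma^2 form)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def lssvm_fit(X, y, c, sigma):
    """Solve [[0, 1^T], [1, K + I/c]] [b; alpha] = [0; y] for b and alpha."""
    N = X.shape[0]
    K = rbf_kernel(X, X, sigma)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(N) / c
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]          # offset b, multipliers alpha

def lssvm_predict(X_train, b, alpha, sigma, X_new):
    """Regression function f(x) = sum_i alpha_i K(x, x_i) + b."""
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b

# Toy data (hypothetical): one smooth input-output relationship.
X = np.linspace(0.0, 1.0, 20).reshape(-1, 1)
y = np.sin(2.0 * np.pi * X[:, 0])
b, alpha = lssvm_fit(X, y, c=1000.0, sigma=0.2)
y_hat = lssvm_predict(X, b, alpha, sigma=0.2, X_new=X)
```

Two properties of the solution serve as sanity checks: the multipliers sum to zero, and the training residual of each sample equals its multiplier divided by c.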

Methodology Parameter Optimization of the LSSVM
Particle swarm optimization (PSO) is a global optimization algorithm [33]. Based on a simulation of the foraging process of bird flocks, the method finds the optimal solution through indirect communication among individuals.
In the D-dimensional solution space, each candidate solution of the optimization problem is regarded as one "particle" in the space, and m particles compose a community. For particle i, $x_i = (x_{i1}, x_{i2}, \cdots, x_{iD})$ is the current position; $v_i = (v_{i1}, v_{i2}, \cdots, v_{iD})$ is the current flying speed; $p_i = (p_{i1}, p_{i2}, \cdots, p_{iD})$ is the optimal position found by the particle up to the current iteration; and $p_g = (p_{g1}, p_{g2}, \cdots, p_{gD})$ is the optimal position found by the entire particle swarm up to the current iteration. Each particle follows these optimal positions to search the solution space. The update equations for the speed and position of particle i are as follows:

$$v_{id}^{k+1} = v_{id}^{k} + c_1 r_1 \left( p_{id}^{k} - x_{id}^{k} \right) + c_2 r_2 \left( p_{gd}^{k} - x_{id}^{k} \right) \quad (14)$$

$$x_{id}^{k+1} = x_{id}^{k} + v_{id}^{k+1} \quad (15)$$

in which $v_{id}^{k}$ and $x_{id}^{k}$ are the speed and position of particle i in the dth dimension at the kth iteration, respectively; $c_1$ and $c_2$ are the positive acceleration coefficients (or learning factors); $r_1$ and $r_2$ are two random numbers in [0, 1]; and $p_{id}^{k}$ and $p_{gd}^{k}$ are the individual optimal position of particle i and the global optimal position of the entire community in the dth dimension, respectively.
In order to improve the learning ability and generalization ability of LSSVM, this study used a PSO algorithm to realize the global optimization of the LSSVM parameters, and the optimization process is shown in Figure 2.
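The speed and position updates of PSO can be sketched in pure Python. This is a generic illustration on a toy objective rather than the authors' tuning code: in the PSO-LSSVM setting each particle would hold a candidate pair of LSSVM parameters and the objective would be a validation error. The function name, the velocity clamp `v_max`, the position bounds, and all default values are assumptions added for the example.

```python
import random

def pso(objective, dim, bounds, n_particles=20, n_iter=100,
        c1=1.5, c2=1.5, v_max=1.0, seed=0):
    rng = random.Random(seed)
    lo, hi = bounds
    x = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    p_best = [xi[:] for xi in x]                 # best position of each particle
    p_cost = [objective(xi) for xi in x]
    g = min(range(n_particles), key=lambda i: p_cost[i])
    g_best, g_cost = p_best[g][:], p_cost[g]     # best position of the swarm
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Speed update: pull toward the personal and the global best.
                v[i][d] += (c1 * r1 * (p_best[i][d] - x[i][d])
                            + c2 * r2 * (g_best[d] - x[i][d]))
                v[i][d] = max(-v_max, min(v_max, v[i][d]))   # assumed safeguard
                # Position update, kept inside the search bounds.
                x[i][d] = max(lo, min(hi, x[i][d] + v[i][d]))
            cost = objective(x[i])
            if cost < p_cost[i]:
                p_best[i], p_cost[i] = x[i][:], cost
                if cost < g_cost:
                    g_best, g_cost = x[i][:], cost
    return g_best, g_cost

# Toy objective (hypothetical): squared distance from the origin.
sphere = lambda p: sum(t * t for t in p)
g_best, g_cost = pso(sphere, dim=2, bounds=(-5.0, 5.0))
```

The clamp on the speed is a common practical safeguard when no inertia weight is used, since the raw update can otherwise let speeds grow without bound.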

Methodology of the Back-Propagation Neural Network
An artificial neural network is made up of interconnected artificial neurons. The network essentially realizes a mapping function from input to output, and mathematical theory has proven that an artificial neural network can realize any complicated non-linear mapping [34][35][36][37][38][39]. A BP neural network is a network trained with the error back-propagation (BP) learning algorithm. It corrects the connection weights and thresholds in the various layers of the network from back to front according to the differences between the actual output and the expected output, and then propagates the signal from front to back again. Repeating this process minimizes the error [34,37,38]. In this method, the unknown system is regarded as a black box: the input and output data of system samples are used to train the BP neural network to express the unknown function. Defining this unknown function amounts to finding the minimum of the error function. Training applied the core sample data repeatedly until the minimum error was obtained. At that point, the connection weights of the various layers and the thresholds of the neurons in each layer, along with other information obtained through training, were saved as knowledge, and the training ended. Then, the output of the system was predicted with the trained BP neural network.

Figure 2. Flow diagram of the LSSVM parameter optimization using the particle swarm optimization algorithm.

Study Area
In recent years, PetroChina, Sinopec, and Shell, among others, have carried out extensive research and production operations in the Ordos Basin. Some studies have shown that the organic matter in most areas of the basin is at the mature stage [40][41][42][43][44]. The Ro (vitrinite reflectance) value ranges from 0.85% to 1.20%; Types I and II kerogen are the main types in this study area, and the TOC content of the organic-rich shale ranges from 0.23% to 32.86%. The position of the Ordos Basin in China is marked with a red rectangle in Figure 3a, and Well E1 is located in the north-central part of the Ordos Basin, as shown in Figure 3b. The logging equipment is the logging system of COSL (China Oilfield Services Limited, Beijing, China); the logging suite of Well E1 includes caliper, spontaneous potential, natural gamma-ray spectroscopy, array acoustic, dual lateral resistivity, lithology density, and neutron porosity logs, and all the logging curves are of excellent quality. Table 1 shows the logging parameter data of Well E1, along with the TOC content data that were analyzed in the core experiment.

Data Analysis
The relationship between the logging parameters and the TOC content differs dramatically between areas. It cannot be guaranteed that the empirical equations established in previous studies to calculate the TOC content achieve the same prediction performance elsewhere [12,17]. In different study areas, the function relationship between the logging parameters and the TOC content has also been found to differ [5,7].
Therefore, in order to define the relationship between the well logs and the TOC content, this study obtained simple linear regression relationships between each well log and the TOC content through a cross-plot analysis of the samples. The coefficient of determination ($R^2$) was used as the index to judge whether the correlation between each well log and the TOC content was strong or not. Figure 4 shows the cross plots between the logging parameters and the TOC content of Well E1. The spontaneous potential, gamma ray, acoustic time difference, resistivity, uranium, and neutron porosity logs are positively correlated with the TOC content, with coefficients of determination of 0.3713, 0.1809, 0.4212, 0.5075, 0.2668, and 0.3330, respectively. In contrast, the potassium, thorium, and density logs are negatively correlated with the TOC content, with coefficients of determination of 0.1296, 0.0465, and 0.3962, respectively. Among the simple linear regressions of the well logs and the TOC content, the resistivity curve has the largest coefficient of determination, while the thorium curve has the smallest. In addition, this study conducted a correlation analysis to calculate the correlation between each well log and the TOC content. The calculation equation is as follows:

$$r = \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i} (x_i - \bar{x})^{2} \sum_{i} (y_i - \bar{y})^{2}}} \quad (16)$$

in which $r$ is the correlation coefficient; $\bar{x}$ and $\bar{y}$ are the average values of the two parameters being compared; and $x_i$ and $y_i$ are the corresponding observation values at the ith coring sample point. Table 2 shows the correlation coefficient matrix obtained by calculation. It can be seen that the correlation coefficient between the resistivity curve and the TOC content was high (0.7124).
In addition, the spontaneous potential, acoustic time difference, density, neutron porosity, uranium, and gamma ray curves also have high correlation coefficients with the TOC content.
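The correlation coefficient and its link to the coefficient of determination can be checked with a few lines of Python. For a simple linear regression with one predictor, $R^2$ equals the square of the correlation coefficient, which is consistent with the reported resistivity values (0.7124 squared is approximately 0.5075). The sample values below are made up for illustration and are not data from Well E1.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical resistivity readings and TOC values, for illustration only.
rt = [12.0, 35.0, 20.0, 80.0, 55.0]
toc = [1.1, 3.0, 2.4, 6.5, 3.9]
r = pearson_r(rt, toc)
r2 = r ** 2   # coefficient of determination of the simple linear fit
```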
In summary, the cross-plot and correlation coefficient analyses showed that there was no one-to-one function relationship between any of the aforementioned well logs and the TOC content. However, the sensitivity of the different well logs to the TOC content differed significantly. The log response characteristics of the TOC content analyzed in this study are shown in Table 3.

Well Logs Physical Interpretation
Spontaneous potential and resistivity
(1) Because the stratum rich in organic carbon has a higher degree of mineralization than the surrounding rock, the potential differences resulting from diffusion and adsorption between the drilling fluid and the interlayer water increase.
(2) The organic matter contained in the source rock is a non-conductive medium, so the enrichment of organic matter increases the resistivity.
Natural gamma ray and spectral gamma
(1) The TOC content influences the natural gamma ray log value because the source rock has fine grains and large specific surface areas, and its organic matter strongly adsorbs radioactive elements.
(2) The potassium and thorium contents are associated with clay minerals, so there is only a weak correlation between the potassium and thorium logs and the TOC content.

Sonic logs
Organic matter has a high acoustic time difference, so source rock that contains organic matter shows an abnormally high acoustic time difference.

Density logs
Since solid-state organic matter is light compared with the surrounding rock, with a density close to that of water, strata with high TOC generally have low density.

Compensated neutron logs
The hydrocarbons in the source rock are rich in hydrogen, which leads to an abnormally high neutron log value. Thus, the total organic carbon content of the source rock is closely related to the neutron log value.

Data Optimization
The 70 samples from the well were divided into two types according to the fuzzy c-means clustering analysis. One type was the data that best reflected the function relationship between the TOC content and the log curves, called the high-quality sample points. The other type was named the low-quality sample points, because these data could not reflect the function relationship between the TOC content and the log curves. The aforementioned nine types of log curves were used as the sample classification indexes. Since the log data had different dimensions and orders of magnitude, it was necessary to preprocess them through normalization to guarantee the classification effect. The normalization equation was as follows:

$$x^{*}_{n \times l} = \frac{x_{n \times l} - x_{l\min}}{x_{l\max} - x_{l\min}}, \quad l = 1, 2, \cdots, 9 \quad (17)$$

in which $x^{*}_{n \times l}$ is the index value after normalization; $x_{n \times l}$ is the lth index of the nth sample; and $x_{l\max}$ and $x_{l\min}$ represent the maximum and minimum values of the lth index over the samples, respectively.
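The normalization can be applied to a whole sample matrix at once, one column per log curve. The sketch below uses NumPy and a tiny made-up matrix with two columns (a gamma ray column and a density column); the function name and the values are hypothetical and only illustrate the column-wise min-max scaling.

```python
import numpy as np

def min_max_normalize(X):
    """Scale every column (log curve) of X to the range [0, 1]."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    # Note: a constant column would give a zero denominator; not handled here.
    return (X - x_min) / (x_max - x_min)

# Hypothetical samples: gamma ray (API) and density (g/cm3), for illustration only.
X = np.array([[60.0, 2.65],
              [120.0, 2.40],
              [90.0, 2.50]])
X_norm = min_max_normalize(X)
```

After scaling, logs with very different units contribute on a comparable scale to the Euclidean distances used by the clustering.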
Following the normalization, the sample data were classified using the fuzzy c-means clustering method based on the genetic simulated annealing algorithm. The control parameters of the algorithm in this study are shown in Table 4. In Table 4, b is the weighted index, which controls the distribution of the grade of membership and the fuzzy degree of the clusters; N represents the maximum number of iterations; D represents the termination tolerance of the objective function; sizepop indicates the population size; MAXGEN represents the maximum number of generations; $P_c$ is the crossover probability; $P_m$ represents the mutation probability; $T_0$ is the initial annealing temperature; k represents the cooling coefficient; and $T_{end}$ represents the end temperature.
The samples were classified into high-quality and low-quality samples, and the grade of membership of each sample to these two classes was obtained through calculation. Each sample was assigned to the class for which its grade of membership was larger. Table 5 shows the matrix of the grade of membership for the samples. The HQ and PQ values of each sample point were calculated by the SAGA-FCM method, as listed in Table 5. If the HQ value is less than the PQ value, the sample is classified as low-quality data. As can be seen from Table 5, there were 61 high-quality sample points in total, while the remaining nine sample points were low-quality sample points.
A cross-plot analysis and a correlation coefficient analysis were carried out for the 61 high-quality samples. As shown in Table 6, compared with the analysis results of the sample data before optimization (Figure 4 and Table 2), both the coefficients of determination and the correlation coefficients were greatly improved. This study compared the coefficient of determination before and after the optimization of the sample data. The change rate of $R^2$ was calculated by the following equation:

$$G = \frac{R_a^{2} - R_b^{2}}{R_b^{2}} \times 100\% \quad (18)$$

in which $R_a^{2}$ is the $R^2$ of the optimized samples, $R_b^{2}$ is the $R^2$ of the original samples, and $G$ is the change rate of $R^2$ before and after the optimization of the sample data.
When G > 0, the optimization is effective; when G < 0, the optimization is invalid. As shown in Figure 5, the coefficients of determination ($R^2$) of the simple linear regressions between the well logs and the TOC content were greatly improved after the optimization of the sample data. For example, the natural gamma ray curve had the largest change in the coefficient of determination (the change rate of $R^2$ for GR-TOC is 45.4395%). Since there was almost no correlation between the thorium curve and the TOC content, its coefficient of determination showed a negative change (the change rate of $R^2$ for TH-TOC is −31.6129%). The coefficients of determination for the spontaneous potential, acoustic time difference, resistivity, uranium, potassium, density, and compensated neutron curves all showed positive changes. Taking SP-TOC in Figure 5 as an example, the $R^2$ of SP-TOC is 0.3713 with the original sample data and 0.4146 with the optimized sample data, so the change rate of $R^2$ for SP-TOC calculated by Equation (18) is 11.6617%. Likewise, the change rates of $R^2$ for DTC-TOC, RT-TOC, U-TOC, KTH-TOC, DEN-TOC, and CNL-TOC were 12.7018%, 15.9803%, 9.8951%, 17.2068%, 16.4311%, and 11.5315%, respectively. The results illustrate that the sample data optimization is generally effective.
Abbreviations: HQ = high-quality; PQ = poor-quality; Y = yes, meaning the sample belongs to the high-quality type; N = no, meaning the sample belongs to the poor-quality type.
Figure 5. Change rate scatter of the $R^2$ after sample data optimization.
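The SP-TOC figure quoted above can be reproduced with a one-line function, assuming the change rate is the relative change of $R^2$ with respect to the original samples, expressed in percent. The function name is hypothetical; the two input values are the SP-TOC coefficients of determination reported in the text.

```python
def change_rate(r2_after, r2_before):
    """Relative change of R^2 (in percent) after sample optimization."""
    return (r2_after - r2_before) / r2_before * 100.0

# SP-TOC values reported in the text: 0.3713 before, 0.4146 after optimization.
g_sp = change_rate(0.4146, 0.3713)
```

The result agrees with the reported 11.6617% to the quoted precision, and a negative value, as for the thorium curve, flags a log whose linear fit became worse after the optimization.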

Model Establishment
After the correlation analysis between the TOC content and the log curves and the optimization of the sample data, this study established three types of TOC content prediction models based on the sample data for comparison: the least square support vector machine (LSSVM) model, the least square support vector machine model based on particle swarm optimization (PSO-LSSVM), and the back propagation neural network (BPNN) model.

LSSVM and PSO-LSSVM Models
The nine types of log curves were regarded as the related logging characteristic parameters of the TOC content, and then the LSSVM method was used to establish the non-linear model between the logging parameters and the TOC content. Meanwhile, the input data (x i ) and output data (y i ) of the model were the logging parameters and TOC content, respectively. Finally, this non-linear model was used to predict the TOC content.
The non-linear model between the logging parameters and the TOC content established with the LSSVM method was as follows:

$$\mathrm{TOC} = f(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i) + b \quad (19)$$

The Gaussian radial basis function (RBF) was chosen as the kernel function of the model, and its expression was as follows:

$$K(x, x_i) = \exp\left( -\frac{\| x - x_i \|^{2}}{2\sigma^{2}} \right) \quad (20)$$

in which $x_i$ is the center of the kernel function and $\sigma^{2}$ is the shape parameter of the kernel function.
If the regularization parameter c in the structural risk expression in Equation (7) and the width parameter σ of the kernel function in Equation (20) are determined with the PSO algorithm, the model becomes the PSO-LSSVM model.

Back-Propagation Neural Network Model
The Kolmogorov theorem states that a three-layer neural network can approximate a continuous function to arbitrary precision. Therefore, this study established a three-layer BP neural network containing only one hidden layer. The network consisted of an input layer, a hidden layer, and an output layer. The input variables were the nine types of well logs, and therefore the number of neurons in the input layer was nine. The output variable was the TOC content, and therefore the number of neurons in the output layer was one. The optimum range [3, 13] for the number of neurons in the hidden layer was determined by the empirical Equation (21), in which H is the number of neurons in the hidden layer, I is the number of neurons in the input layer, O is the number of neurons in the output layer, and ε is a constant. The traversal method then determined that the number of hidden neurons was 8. As shown in Figure 6, this study therefore established a BP neural network model with a 9-8-1 structure. A hyperbolic tangent function was selected as the excitation function, and a learning method with a dynamic learning rate was used. The training error converged quickly over the iterations.
In the established BP neural network prediction model for calculating the TOC content, x_i represents the nine well logs in the network input layer; w_ij and a are the weight coefficient and threshold from the network input layer to the hidden layer, respectively; v_jk and b_k are the weight coefficient and threshold from the hidden layer to the output layer, respectively; tanh is the hyperbolic tangent function, which is the excitation function of the hidden layer in the network, and its domain and range are (−∞, +∞) and (−1, +1), respectively; n = 9 is the number of features in the input layer; m = 8 is the number of nodes in the hidden layer; and k = 1 is the number of nodes in the output layer.
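The forward pass of a 9-8-1 network with the symbols defined above can be sketched as follows. This is an illustration under stated assumptions rather than the study's trained model: the thresholds are assumed to be added as biases, the output node is assumed to be linear (the text names tanh only for the hidden layer), and the weights are random placeholders, not trained values.

```python
import math
import random

def bpnn_forward(x, w, a, v, b):
    """Forward pass of a one-hidden-layer network with a single output.

    x: n input values; w[i][j]: input-to-hidden weights;
    a[j]: hidden-layer thresholds (assumed additive);
    v[j]: hidden-to-output weights; b: output threshold.
    """
    hidden = [
        math.tanh(sum(w[i][j] * x[i] for i in range(len(x))) + a[j])
        for j in range(len(a))
    ]
    # Linear output node (assumption)
    return sum(v_j * h_j for v_j, h_j in zip(v, hidden)) + b

N_INPUT, N_HIDDEN = 9, 8  # 9-8-1 structure from the text
rng = random.Random(0)

# Random placeholder parameters; a trained model would supply real values
w_demo = [[rng.uniform(-1, 1) for _ in range(N_HIDDEN)] for _ in range(N_INPUT)]
a_demo = [rng.uniform(-1, 1) for _ in range(N_HIDDEN)]
v_demo = [rng.uniform(-1, 1) for _ in range(N_HIDDEN)]
b_out = 0.1

x_demo = [0.5] * N_INPUT  # hypothetical normalized log readings
toc_demo = bpnn_forward(x_demo, w_demo, a_demo, v_demo, b_out)
print(toc_demo)
```

Because tanh is bounded by (−1, +1), the output of this sketch can never differ from the output threshold by more than the sum of the absolute hidden-to-output weights.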

Model Performance
This study trained and tested the three aforementioned types of TOC content prediction models on the original sample data and on the optimized sample data, respectively. The data set was randomly divided into a training subset and a testing subset, with the training subset accounting for three quarters of the total and the testing subset for one quarter. Cross-plots of the measured TOC content against the predicted TOC content were made to analyze each model's prediction effect, and the coefficient of determination was used as the evaluation index. Figure 7 shows the model prediction effect based on the original sample data, while Figure 8 shows the model prediction effect based on the optimized sample data. Comparing Figures 7 and 8 shows that TOC prediction improved markedly with the optimized sample data. The coefficients of determination of the training and testing parts of the LSSVM model increased from 0.8706 and 0.8715 to 0.9457 and 0.9427, while those of the BPNN model increased from 0.8464 and 0.8857 to 0.9307 and 0.9324. This comparison demonstrates the importance of a favourable data set to an artificial intelligence learning machine. Overall, among the three types of TOC content prediction models in the study, the PSO-LSSVM model had the largest coefficient of determination, followed by the LSSVM model, while the BPNN model had the smallest.
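The random three-quarters/one-quarter division described above can be sketched as follows. The function name, the fixed seed, and the integer sample identifiers are assumptions made for illustration; the study does not state how the random split was generated.

```python
import random

def split_train_test(samples, test_fraction=0.25, seed=0):
    """Randomly split samples into (training, testing) subsets."""
    rng = random.Random(seed)   # fixed seed so the split is repeatable
    shuffled = samples[:]       # copy, leave the caller's list untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical sample identifiers standing in for core samples
sample_ids = list(range(100))
train_ids, test_ids = split_train_test(sample_ids)
print(len(train_ids), len(test_ids))
```

Fixing the seed keeps the same samples in each subset across runs, which makes the comparison between models trained on the original and optimized data easier to reproduce.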
In addition to the coefficient of determination, the root-mean-square error (RMSE) and the variance accounted for (VAF) were used as indexes to compare the models' prediction effects. These indexes measure how close a model's prediction is to the actual value. The RMSE weighs the deviation between the predicted value and the actual value: the better the prediction effect, the smaller the RMSE. It is calculated by Equation (23). The VAF is usually used to evaluate the accuracy of a model by comparing the predicted value with the actual value, and it is calculated by Equation (24). In Equations (23) and (24), pt_i represents the model prediction data, mt_i represents the actually measured data, and n indicates the number of samples used for network training or testing. Table 7 shows the calculated RMSE and VAF results. With training on the optimized sample data, the RMSE became smaller and the VAF became larger, which indicates a better prediction effect. In addition, the RMSE of the PSO-LSSVM model was the smallest for both the original and the optimized sample data, which indicates that this model had a better prediction effect than the LSSVM and BPNN models.
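The three evaluation indexes can be computed as in the sketch below. It assumes the commonly used definitions, since Equations (23) and (24) are not reproduced in this section: RMSE as the square root of the mean squared difference between mt_i and pt_i, VAF as (1 − var(mt − pt) / var(mt)) × 100%, and R² as 1 − SS_res / SS_tot. The study may have computed R² differently, for example as a squared correlation coefficient. The sample values are hypothetical.

```python
import math

def mean(values):
    return sum(values) / len(values)

def variance(values):
    """Population variance."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def rmse(mt, pt):
    """Root-mean-square error between measured (mt) and predicted (pt)."""
    return math.sqrt(mean([(m - p) ** 2 for m, p in zip(mt, pt)]))

def vaf(mt, pt):
    """Variance accounted for, in percent (assumed standard definition)."""
    residuals = [m - p for m, p in zip(mt, pt)]
    return (1.0 - variance(residuals) / variance(mt)) * 100.0

def r2(mt, pt):
    """Coefficient of determination as 1 - SS_res / SS_tot (assumed form)."""
    m_bar = mean(mt)
    ss_res = sum((m - p) ** 2 for m, p in zip(mt, pt))
    ss_tot = sum((m - m_bar) ** 2 for m in mt)
    return 1.0 - ss_res / ss_tot

# Hypothetical measured TOC and a prediction with a constant +0.5 bias
mt_demo = [1.0, 2.0, 3.0, 4.0]
pt_demo = [1.5, 2.5, 3.5, 4.5]
print(rmse(mt_demo, pt_demo), vaf(mt_demo, pt_demo), r2(mt_demo, pt_demo))
```

Under these definitions a constant bias leaves VAF at 100% while RMSE and R² still register the error, which is one reason to report the indexes together rather than relying on a single one.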

Model Validation
The aforementioned analysis showed that the prediction effect of the intelligent model was greatly influenced by the quality of the training data set. This study therefore used the optimized sample data set to train the LSSVM, PSO-LSSVM, and BPNN models and then used each model to predict the TOC of Well E1.
To allow a visual comparison of the prediction quality, Figure 9 details the comparison between the predicted TOC values and the actually measured TOC values for the different models. The left three curves in Figure 9 are the well logs, while the right three curves are the corresponding model-predicted TOC curves. The purple dashed line represents the TOC prediction result of the LSSVM model, the green dashed line represents that of the PSO-LSSVM model, and the blue dashed line represents that of the BPNN model. Comparing the right three curves in Figure 9 shows that the prediction result of the PSO-LSSVM model was more consistent with the measured TOC of the core samples. Figure 10 shows the comparison between the measured TOC and the model-predicted TOC, in which the PSO-LSSVM model again had the better prediction effect. In addition, the R², RMSE, and VAF results calculated in Table 8 confirm that the prediction result of the PSO-LSSVM model was more consistent with the measured TOC (R² = 0.9451, RMSE = 0.3383). The PSO-LSSVM model was therefore more suitable for TOC prediction than the other methods.

Conclusions
In recent years, artificial intelligence techniques have become an effective tool in oil and gas exploration, making up for the shortcomings of traditional methods of evaluating TOC. In this study, high-quality sample data were first distinguished from the sample data set from the Ordos Basin, China, by using the fuzzy c-means clustering algorithm (FCM) in combination with the simulated annealing algorithm (SA) and the genetic algorithm (GA), named the SAGA-FCM method. Then, the original sample data and the optimized sample data (high-quality data) were analyzed by correlation analysis and linear regression, with R² as the evaluation index. The results showed that the correlation between the well logging parameters and TOC was better for the optimized sample data. Next, LSSVM, PSO-LSSVM, and BPNN models for TOC prediction were built on the original sample data and on the optimized sample data, respectively. According to error analysis using the R², RMSE, and VAF criteria, the intelligence models based on the optimized sample data performed much better in both training and validation accuracy, because they could genuinely reflect the functional relationship between the well logging parameters and TOC. Finally, the models established in this study were compared. TOC could be predicted more accurately with the PSO-LSSVM model than with the LSSVM and BPNN models, as shown both by the visual comparison between the prediction results and the measured TOC data and by the error analysis (R², RMSE, and VAF).