Lithofacies Identification from Wire-Line Logs Using an Unsupervised Data Clustering Algorithm

Hasan, Md Monjur Ul; Hasan, Tanzeer; Shahidi, Reza; James, Lesley; Peters, Dennis; Gosine, Ray

doi:10.3390/en16248116


by Md Monjur Ul Hasan, Tanzeer Hasan, Reza Shahidi, Lesley James *, Dennis Peters and Ray Gosine

Faculty of Engineering and Applied Science, Memorial University of Newfoundland, St. John’s, NL A1C 5S7, Canada

* Author to whom correspondence should be addressed.

Energies 2023, 16(24), 8116; https://doi.org/10.3390/en16248116

Submission received: 15 October 2023 / Revised: 4 December 2023 / Accepted: 15 December 2023 / Published: 17 December 2023

(This article belongs to the Special Issue Artificial Intelligence/Machine Learning Applications in the Oil and Gas Industry)


Abstract:

Stratigraphic identification from wire-line logs and core samples is a common method for lithology classification. This traditional approach is considered superior, despite its significant financial cost. Artificial neural networks and machine learning offer alternative, cost-effective means for automated data interpretation, allowing geoscientists to extract insights from data. At the same time, supervised and semi-supervised learning techniques are commonly employed, requiring a sufficient amount of labeled data to be generated through manual interpretation. Typically, there are abundant unlabeled geophysical data while labeled data are scarcer. Supervised and semi-supervised techniques partially address the cost issue. An underutilized class of machine-learning-based methods, unsupervised data clustering, can perform consonant classification by grouping similar data without requiring known results, presenting an even more cost-effective solution. In this study, we examine a state-of-the-art unsupervised data clustering algorithm called piecemeal clustering to identify lithofacies from wire-line logs, effectively addressing these challenges. The piecemeal clustering algorithm groups similar wire-log signatures into clusters, determines the number of clusters present in the data, and assigns each signature to one of the clusters, each of which represents a lithofacies. To evaluate the performance, we tested the algorithm on publicly released data from ten wells drilled in the Hugoton and Panoma fields of southwest Kansas and northwest Oklahoma, respectively. The data consist of two major groups: marine and non-marine facies. The study herein is centered around addressing two fundamental research questions regarding the accuracy and practicality of the piecemeal clustering algorithm. The algorithm successfully identified nine distinct clusters in our dataset, aligning with the cluster count observed in previously published works employing the same data. 
Regarding mapping accuracy, the results were notable, with success rates of 81.90% and 45.20% with and without considering adjacent facies, respectively. Further detailed analysis of the results was conducted for individual types of facies and independently for each well. These findings suggest the algorithm’s precision in characterizing the geological formations. To assess its performance, a comprehensive comparative analysis was conducted, encompassing other data clustering algorithms, as well as supervised and semi-supervised machine learning techniques. Notably, the piecemeal clustering algorithm outperformed alternative data clustering methods. Furthermore, despite its unsupervised nature, the algorithm demonstrated competitiveness by yielding results comparable to, or even surpassing, those obtained through supervised and semi-supervised techniques.

Keywords:

lithofacies identification; unsupervised learning; data clustering; wire-line log interpretation

1. Introduction

Lithofacies identification is an essential tool in geologic investigation, where a rock type or class is assigned to a specific rock sample based on measured properties [1]. The accuracy of this identification process directly affects reservoir evaluation and development plan formulation [2,3]. The traditional approach to lithofacies identification uses core and thin-section analysis from cored wells matched with well logs [4,5,6]. However, the collection of core samples and thin sections is often constrained by time and cost. As such, indirect estimation is often required to identify lithofacies [1]. Wire-line log measurements, such as gamma ray, neutron porosity, and resistivity, are widely used to identify lithofacies for non-cored wells and intervals. However, the manual interpretation of wire-line log measurements constitutes a massive data analysis challenge for skilled interpreters and is therefore difficult to manage for many wire-line logs [7]. The process often misses valuable information and results in increased cost and decreased efficiency. Additionally, the process often introduces errors and multiplicities [8]. Therefore, a fast, accurate, and automated approach that reduces the need for expert involvement would be beneficial. To address these challenges, different AI- and non-AI-based computational techniques and algorithms have long been used. These approaches include support vector machines (SVMs) [9], k-nearest neighbors (k-NN) [1], fuzzy logic [10,11], artificial neural networks (ANNs), and machine learning [4,12,13,14]. The fundamental goal of these methods is to use quick, repetitive calculations with complex equations to find the spatial and mathematical correlations between wire-line log data and lithofacies.

Artificial neural networks (ANNs) and machine learning (ML) techniques aid researchers in extracting valuable insights from datasets. These techniques are particularly effective in handling non-linearities inherent in wire-line log data [15,16]. ML, leveraging data-driven decision-making, efficiently captures information from wire-line logs without manual rule creation. With increasing computational power, these techniques have become popular for identifying lithofacies. The accuracy of a machine learning model hinges on its training phase, and techniques are categorized by how they are trained: supervised, semi-supervised, and unsupervised [4,17]. In supervised learning, where models are trained with annotated data, challenges arise due to limited annotated data for lithofacies identification [18,19]. Training with a lower percentage of annotated data introduces random noise and hampers generalization [20]. Semi-supervised learning, utilizing both annotated and non-annotated data, offers improved accuracy when annotated data are scarce [21,22]. Unsupervised machine learning models, unlike their counterparts, require no annotated data for training, autonomously extracting hidden patterns [23,24]. Popular unsupervised techniques, such as deep convolutional autoencoders (DCAEs), k-means, and t-distributed stochastic neighbor embedding (t-SNE), have gained prominence in lithofacies identification and broader geological investigations [25,26].

Cluster analysis, widely employed in data analysis and statistics, finds significant application in lithofacies identification due to its efficacy without annotated data [8,27,28,29]. Data clustering techniques facilitate the identification of clusters within wire-line log data, with the subsequent assignment to specific rock samples based on representative log signatures or core data from core- and thin-section observations, avoiding the need to analyze every detail of each well. Clustering can be executed with or without machine learning techniques. Popular non-machine learning clustering algorithms include k-means [30], hierarchical clustering [31], and density-based clustering (e.g., DBSCAN) [32]. In contrast, self-organizing maps (SOM), a widely used unsupervised machine-learning-based algorithm, stands out for its advantages in lithofacies identification [15,33].

Lithofacies identification through data clustering is crucial for geological investigation, especially when dealing with the absence of annotated data. However, challenges arise as algorithms like k-means, hierarchical clustering, and SOM necessitate a predefined number of clusters, requiring prior knowledge about the dataset. This imposes the need for human interpretation before applying clustering algorithms, particularly when accurate cluster numbers are vital. Density-based clustering, while capable of identifying clusters, often struggles with noise and scalability issues [34]. An ideal data clustering algorithm for lithofacies identification should not only derive cluster numbers from data distribution but also be efficient, scalable, and work with large-scale high-dimensional data without dimension reduction [5]. The piecemeal clustering algorithm [35], an unsupervised learning-based method that adapts the SOM training technique and integrates it with agglomerative hierarchical and density-based clustering, is employed to address these needs. The algorithm utilizes Euclidean distance and cosine similarity, considering all properties of each input data sample. Crucially, it automatically identifies the number of lithofacies and maps data points to their best-matched lithofacies, eliminating the need for a priori knowledge.

This study explores lithofacies identification from wire-line logs using the piecemeal clustering algorithm, focusing on two key research questions: (a) Can the piecemeal clustering algorithm identify lithofacies without prior knowledge of their number? and (b) Does this algorithm produce comparable results to other data clustering methods, as well as supervised and semi-supervised machine learning techniques? To answer these questions, the piecemeal clustering algorithm was applied to wire-line data from ten wells in the Hugoton and Panoma fields in southwest Kansas and northwest Oklahoma. These wells were part of an earlier investigation into the geological and reservoir properties of the Anadarko Basin, resulting in a detailed stratigraphic framework [36] and insights into the basin’s depositional history and hydrocarbon implications [37]. Nine lithofacies were previously identified in the dataset [36]. The dataset consists of five wire-line logs recorded at half-foot depth intervals: gamma ray (GR), resistivity (ILD_log10), photoelectric effect (PE), neutron-density porosity difference (DeltaPHI), and average neutron-density porosity (PHIND). Additional geologic variables augment the dataset, with a comprehensive description provided in Section 3. The dataset was made publicly available through an SEG competition held in 2016 [19,38]. Gaussian naïve Bayes (GNB) [39], support vector machines (SVM) [40], XGBoost [38], label propagation (LP) [4,22], and self-trained LP [4,41] are some of the methods previously applied to the dataset to determine the lithofacies, and they showed promising results. In this study, the results from piecemeal clustering are compared with those results as well as with other data clustering methods from the literature.

The remainder of this paper is organized as follows. It begins by outlining the piecemeal clustering algorithm. It then introduces the wire-line log dataset and frames lithofacies identification as a clustering problem. Next, the results of our study are provided and compared with the other algorithms mentioned above. Finally, the implications of the results are discussed, followed by the conclusion.

2. Methodology

Data clustering systematically arranges data into distinct groups. The distinct groups contain data that have a high degree of similarity among the elements chosen to describe them. Elements are discernibly different between groups [34]. Mathematically interpreting data clusters in the context of lithofacies identification from wire-line logs requires the mathematical definition of the terms data point and cluster, along with other relevant concepts such as distance, representative cluster model, and membership mapping.

A data point, in this context, represents the physical properties at a certain depth in the well. Each data point is considered to be an n-dimensional vector $v$ in hyperspace as represented in Equation (1), where each of the scalar components $v_{i}$ of the vector represents one of the physical properties measured by wire-line logs, such as gamma ray (GR), resistivity (ILD_log10), and average neutron-density porosity (PHIND).

$$v = (v_{1}, v_{2}, \dots, v_{n}) \tag{1}$$

The likelihood is high that two vectors are similar if the depths from which the vectors are drawn are part of the same lithofacies. One of the ways to determine whether two vectors are similar is to calculate the distance between them. The distance between two vectors can be measured in a variety of ways. Euclidean distance, $d(p, q)$, is one of the most common methods of measuring the distance between two n-dimensional vectors $p$ and $q$, where the differences between respective components are used as in Equation (2):

$$d(p, q) = \| p - q \| = \sqrt{\sum_{i=1}^{n} (p_{i} - q_{i})^{2}} \tag{2}$$

where $p = (p_{1}, p_{2}, \dots, p_{n})$ and $q = (q_{1}, q_{2}, \dots, q_{n})$.

Euclidean distance measures the distance solely based on the magnitudes of the differences between the corresponding vector components, and discounts the vectors’ similarity in their hyperplanes. Cosine similarity, $\theta_{c}(p, q)$, on the other hand, uses the hyperplane on which two vectors lie to define the similarity between them, that is, the cosine of the angle between the two vectors, as shown in Equation (3). Here, the similarity can be used as a proxy for the distance between two vectors: the higher the similarity, the lower the distance, and vice versa.

$$\theta_{c}(p, q) = \frac{p \cdot q}{\| p \| \, \| q \|} = \frac{\sum_{i=1}^{n} p_{i} q_{i}}{\sqrt{\sum_{i=1}^{n} p_{i}^{2}} \, \sqrt{\sum_{i=1}^{n} q_{i}^{2}}} \tag{3}$$

Figure 1 shows an example of how the two distance measurements, Euclidean distance and cosine similarity, vary in two-dimensional Euclidean space. The vector $A$ is closer to $B$ than to vector $C$ if the distances are measured using Euclidean distance. Alternatively, vector $C$ is closer to $A$ than to vector $B$ if the cosine similarity is used instead, since $A$ and $C$ have a very small angular difference compared to the angle between $A$ and $B$.
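The contrast between the two measures can be reproduced numerically. The following is a minimal sketch of Equations (2) and (3) in plain Python; the three vectors are hypothetical values chosen to mimic the configuration described for Figure 1 and are not taken from the paper.

```python
import math

def euclidean_distance(p, q):
    """Equation (2): Euclidean distance between two n-dimensional vectors."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def cosine_similarity(p, q):
    """Equation (3): cosine of the angle between two non-zero vectors."""
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm_p = math.sqrt(sum(pi * pi for pi in p))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    return dot / (norm_p * norm_q)

# Hypothetical 2-D vectors (illustrative only, not the values behind Figure 1).
A = (1.0, 1.0)
B = (2.0, 0.5)
C = (4.0, 4.2)

# Euclidean distance ranks B as the nearer neighbor of A ...
print(euclidean_distance(A, B), euclidean_distance(A, C))
# ... while cosine similarity ranks C as the more similar vector to A.
print(cosine_similarity(A, B), cosine_similarity(A, C))
```

With these values, A and C point in almost the same direction but differ strongly in magnitude, which is why the two measures disagree about which vector is the closer neighbor of A.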

When using data clustering to identify lithofacies from wire-line logs, one or more clusters may be part of the same lithofacies. Each cluster is often represented using its respective cluster center, a representative model of the cluster defined by a vector with the same number of components as the data point. The data points closest to the cluster center, based on a defined distance measurement, are considered members of the cluster represented by the cluster center.
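Membership mapping as described above can be sketched in a few lines. This is an illustrative example that assumes plain Euclidean distance as the distance measurement; the function names, centers, and points are hypothetical.

```python
import math

def nearest_center(point, centers):
    """Return the index of the cluster center closest to `point`."""
    return min(range(len(centers)), key=lambda k: math.dist(point, centers[k]))

def assign_members(points, centers):
    """Map every data point to the index of its closest cluster center."""
    return [nearest_center(p, centers) for p in points]

# Hypothetical normalized 2-D log signatures and two cluster centers.
centers = [(0.2, 0.2), (0.8, 0.7)]
points = [(0.1, 0.3), (0.9, 0.8), (0.25, 0.15), (0.7, 0.6)]

labels = assign_members(points, centers)
print(labels)
```

Any other distance measurement can be substituted for `math.dist` without changing the structure of the mapping.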

2.1. Self-Organizing Maps (SOM)

The self-organizing maps (SOM) clustering algorithm starts with an initial user-defined number of clusters with their respective cluster models, which are often randomly selected [42]. The unsupervised learning iterations then move the cluster models towards the high-density regions of the data points closest to them using an iterative learning algorithm. The learning algorithm takes a cluster model and one data point at a time to measure the degree of adjustment using two factors: Euclidean distance between the pair and the learning rate. The learning rate is initially selected by the user and exponentially decreases with each iteration. At the same time, the learning algorithm uses only those data points within a certain radius from the cluster models in terms of the distance. The initial radius value typically commences at half of the maximum Euclidean distance between any two data points in the dataset, gradually and exponentially decreasing over iterations. At the end of the iterations, each of the data points is mapped to the cluster model it is closest to.
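The update loop described above can be sketched as follows. This is a simplified illustration, not the SOM implementation of [42]: it omits the grid topology of a full SOM, and the exponential decay schedule, the Gaussian weighting of the adjustment, and all parameter values are assumptions made for the example.

```python
import math

def train_centers(points, centers, learning_rate=0.5, iterations=20):
    """Move each cluster model toward nearby data points with a decaying rate and radius."""
    centers = [list(c) for c in centers]
    # Initial radius: half of the maximum pairwise Euclidean distance.
    radius0 = max(math.dist(p, q) for p in points for q in points) / 2.0
    for t in range(iterations):
        decay = math.exp(-t / iterations)  # assumed exponential schedule
        lr = learning_rate * decay
        radius = radius0 * decay
        if radius <= 0.0:
            break  # degenerate case: all points are identical
        for x in points:
            for c in centers:
                d = math.dist(x, c)
                if d > radius:
                    continue  # the point lies outside this model's current radius
                # Assumed Gaussian weighting: closer points pull the model harder.
                influence = math.exp(-(d * d) / (2.0 * radius * radius))
                for i in range(len(c)):
                    c[i] += lr * influence * (x[i] - c[i])
    return [tuple(c) for c in centers]

# Hypothetical data: two tight groups and two poorly placed initial models.
points = [(0.0, 0.0), (0.0, 0.2), (0.2, 0.0), (1.0, 1.0), (1.0, 0.8), (0.8, 1.0)]
start_centers = [(0.3, 0.3), (0.7, 0.7)]
trained = train_centers(points, start_centers)
print(trained)
```

Because the number of models is fixed before training starts, the sketch also shows why the number of clusters that SOM can return is bounded by the user's initial choice.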

2.2. Piecemeal Clustering

The key disadvantage of self-organizing maps is that the maximum number of clusters it can produce is limited to be equal to the initially selected number of cluster models. The result is also heavily dependent on how the initial models are selected. Piecemeal clustering solves this problem using a three-phase approach. The first phase, pre-clustering, selects the initial set of cluster models and their representative models (cluster centers). This phase initially produces a larger number of clusters using the agglomerative hierarchical clustering technique. Agglomerative clustering uses the concept that each data point is itself initially a cluster and therefore by extension, also a cluster center, which can be used to model its cluster. The closest clusters are iteratively merged to form new clusters with new cluster centers defined by the centroids of the member data points of newly merged clusters.

Piecemeal clustering uniquely combines Euclidean distance and cosine similarity to measure the distance between two data points or between a data point and its corresponding cluster center. Therefore, both the magnitudes of the different vector components and their alignment in the hyperplane are accounted for in the measurement. This type of measurement is a novel feature of this clustering algorithm. Figure 2 shows the pre-clustering flowchart and the description of the algorithm for this phase which are from Hasan et al. [35].

The pre-clustering phase uses a parameter T, called the cutoff threshold, to define when the phase will stop. The users can choose a value of T as an input to the algorithm, based on their domain knowledge about the dataset, to define the minimum natural or expected variation in the members that a cluster may have. The input parameter T is measured in terms of the percentage of the maximum Euclidean distance between any two data points in the dataset. It is recommended to use a lower value of T when presented with a range of natural or expected intra-cluster dissimilarities or if an actual measurement is difficult to determine. The value of the cutoff threshold is only used by the algorithm to stop the pre-clustering phase and not to reach the globally optimal clustering. At best, this phase can in general only reach a locally optimum clustering. A lower value of T ensures that the globally optimal clustering is not excluded from consideration in the later phases of the algorithm. In the later phases, the nearby small clusters will be merged to form the final set of clusters. Therefore, the algorithm is robust to small changes in the value of T.
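The role of the cutoff threshold can be illustrated with a small agglomerative sketch. It is not the pre-clustering procedure of Hasan et al. [35]: for simplicity it uses Euclidean distance only, whereas piecemeal clustering combines Euclidean distance with cosine similarity, and the function names and data are hypothetical. The stopping rule follows the description above: merging stops once the closest pair of cluster centers is farther apart than T times the maximum pairwise Euclidean distance.

```python
import math
from itertools import combinations

def centroid(members):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(members)
    return tuple(sum(m[i] for m in members) / n for i in range(len(members[0])))

def pre_cluster(points, cutoff):
    """Merge the closest clusters until the closest centers exceed the cutoff distance."""
    limit = cutoff * max(math.dist(p, q) for p, q in combinations(points, 2))
    clusters = [[p] for p in points]  # every data point starts as its own cluster
    while len(clusters) > 1:
        centers = [centroid(c) for c in clusters]
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: math.dist(centers[ij[0]], centers[ij[1]]))
        if math.dist(centers[i], centers[j]) > limit:
            break  # remaining clusters are farther apart than the threshold allows
        clusters[i] = clusters[i] + clusters[j]  # i < j, so index i stays valid
        del clusters[j]
    return clusters

# Hypothetical normalized signatures forming two tight groups.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (0.9, 1.0), (1.0, 0.9)]
clusters = pre_cluster(points, cutoff=0.2)
print(len(clusters), [len(c) for c in clusters])
```

A smaller cutoff leaves more, smaller clusters for the later phases to merge, which matches the recommendation to err toward a lower value of T.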

In the second phase, training, the algorithm uses a very similar approach to self-organizing maps (SOM) to move the cluster centers towards the high-density data regions closest to them. Unlike SOM, the cluster centers are already defined in the pre-clustering phase based on the local density of the data points, making the training phase more accurate and effective. The flowchart for this phase is shown in Figure 3 and the algorithm details are given in Hasan et al. [35].

The training phase can be adjusted by selecting the learning rate and the stopping iteration number. These two parameters carry the same meanings and purposes as for the self-organizing maps (SOM) algorithm, and a similar approach to that used in SOM can be used to find the correct values. The piecemeal algorithm suggests a trial-and-error approach to select values where the sum of the distances from the data points to their respective cluster centers is the lowest for the entire dataset. The learning algorithm of this phase disregards the perceived membership of data points (local data density) found in the pre-clustering phase and considers the global data density to adjust the positions of the cluster centers. Therefore, at the end of this phase, the clusters are different in terms of their shapes and sizes. If there is more than one cluster center found within a dense set of data points, as shown in Figure 4, the training algorithm positions the cluster centers in such a way that the next phase, post-processing, merges them together to form a single cluster. Further detail of the algorithm and mathematical formulation can be found in Hasan et al. [35].
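The trial-and-error selection criterion can be written as a small scoring function. The sketch below assumes Euclidean distance and assumes that each candidate parameter setting has already produced a set of cluster centers; the run names and coordinates are hypothetical.

```python
import math

def total_distance(points, centers):
    """Sum of the distances from each data point to its closest cluster center."""
    return sum(min(math.dist(p, c) for c in centers) for p in points)

def pick_best(points, candidates):
    """Return the key of the candidate whose centers give the lowest total distance."""
    return min(candidates, key=lambda name: total_distance(points, candidates[name]))

# Hypothetical data and the centers produced by two hypothetical training runs.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
candidates = {
    "run_a": [(0.0, 0.5), (4.0, 0.5)],
    "run_b": [(1.0, 0.5), (3.0, 0.5)],
}

best = pick_best(points, candidates)
print(best, total_distance(points, candidates[best]))
```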

3. The Dataset

The Hugoton and Panoma field dataset has been used to investigate the geological and reservoir properties of the Anadarko Basin, formed during the late Paleozoic and early Mesozoic eras, which contains sedimentary rocks deposited in various environments. Dubois et al. [36] developed a detailed stratigraphic framework for the Basin using this dataset, allowing for the identification of potential reservoirs and the reconstruction of depositional environments. Avseth et al. [37] used the dataset to investigate the depositional history of the basin and its implications for hydrocarbon exploration. Hall and Hall [38] examined the reservoir properties of the rocks in the region, such as porosity and permeability, while Hall [19] developed a depositional model for the Hugoton and Panoma fields. Dunham et al. [4] investigated the depositional history and stratigraphy of the region using the same dataset, demonstrating the value of the Hugoton and Panoma dataset for understanding the geology and hydrocarbon potential of the Anadarko Basin.

The wire-line log data in this dataset were collected from ten wells from the Hugoton and Panoma fields in southwest Kansas and northwest Oklahoma, respectively. The dataset includes wire-line log data and core samples recorded at half-foot depth increments for 4137 total data points. Dubois et al. [36] determined that there are nine lithofacies or classes in this dataset. The first three lithofacies are non-marine, while the remaining six are marine lithofacies. Each well in the dataset is accompanied by five wire-line logs, including gamma ray (GR), resistivity (ILD_log10), photoelectric effect (PE), neutron-density porosity difference (DeltaPHI), and average neutron-density porosity (PHIND). Additionally, there are two geologic variables provided for each well: the non-marine/marine (NM_M) indicator and relative formation position (RELPOS). The descriptions of these lithofacies and their depositional environments are provided in Table 1.

The NM_M indicator in the dataset is used to separate the non-marine and marine lithofacies, indicating that they may exist on separate manifolds. The relative formation position variable provides information on the vertical position of the lithofacies within the well [36]. The lithofacies identified in the dataset reflect a diverse range of depositional environments, including shallow marine, fluvial, deltaic, and aeolian. The lithofacies include sandstone, shale, siltstone, and limestone [36].

The wire-line logs in the dataset provide information on the formation properties, such as mineral content, porosity, density, and fluid saturations (water, oil, and gas). The gamma ray log measures the natural gamma radiation emitted by the rocks, which can be used to identify different lithologies. The resistivity log measures the electrical resistance of the rocks, which can provide information on the fluid content and mineralogy. The photoelectric effect log measures the absorption of gamma rays by the rocks, which can be used to estimate the mineral composition [45,46]. The neutron-density porosity difference log measures the difference between the neutron and density porosities, which can be used to estimate the porosity and lithology. Finally, the average neutron-density porosity log measures the average porosity of the rocks.

To interpret the subsurface geology and evaluate reservoir quality using the traditional method, it is important to understand the GR and DeltaPHI values in various lithofacies and depositional environments. Geologists can better comprehend the characteristics of rock formations and make more precise predictions about their suitability for hydrocarbon exploration and production by evaluating wire-line logs and other data. The higher maximum value of the GR logs for marine data points suggests that marine sedimentary sequences are more likely to contain highly radioactive minerals, such as clay minerals, organic matter, and glauconite [47]. The higher levels of inherent radioactivity of the material can contribute to increased GR measurements. On the other hand, the lower means and higher standard deviations of the GR values for the marine data points may suggest that the lithology and depositional environments of marine sedimentary sequences are more varied, resulting in more variable GR readings [48,49]. As a result of more uniform sedimentation and depositional processes, non-marine settings may have lower overall maximum gamma-ray values but more consistent GR measurements. The similar ranges of values for DeltaPHI indicate that porosity fluctuations in marine and non-marine sedimentary sequences may be comparable. The porosity values may be more variable in marine environments, as indicated by the discrepancies in mean and standard deviation between marine and non-marine data points. This may be because the lithology and depositional histories of marine sedimentary sequences might be more complex, resulting in more varied porosity properties [50]. On the other hand, non-marine settings might have more homogeneous sedimentation and depositional circumstances, which would lead to more consistent porosity characteristics [51].

4. Results

The input for the piecemeal clustering algorithm groups wire-line logs at the same well depth as a single data vector. The value for PE was missing for some depths; therefore, all logs from those depths were eliminated from the dataset. Statistical analysis was then performed on each of the wire-line logs. The analysis was initially performed for all of the depths and subsequently for data based on splitting the logs into two groups using the NM_M values: non-marine and marine. Table 2 summarizes the results. This statistical analysis was used to determine how piecemeal clustering can be applied to lithofacies identification on this dataset.
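The filtering and splitting steps can be sketched as follows. The records and their values are hypothetical, and the encoding of the NM_M indicator (1 for non-marine, any other value for marine) is an assumption made for the example rather than something stated in the text.

```python
from statistics import mean

LOGS = ("GR", "ILD_log10", "PE", "DeltaPHI", "PHIND")

def drop_incomplete(rows):
    """Keep only depths for which every wire-line log has a value."""
    return [r for r in rows if all(r.get(k) is not None for k in LOGS)]

def split_by_environment(rows, non_marine_code=1):
    """Split rows into (non_marine, marine) using the NM_M indicator (assumed encoding)."""
    non_marine = [r for r in rows if r["NM_M"] == non_marine_code]
    marine = [r for r in rows if r["NM_M"] != non_marine_code]
    return non_marine, marine

# Hypothetical records; one depth lacks a PE reading.
rows = [
    {"GR": 77.5, "ILD_log10": 0.66, "PE": 4.6, "DeltaPHI": 9.9, "PHIND": 11.9, "NM_M": 1},
    {"GR": 62.0, "ILD_log10": 0.71, "PE": None, "DeltaPHI": 4.2, "PHIND": 13.1, "NM_M": 1},
    {"GR": 35.0, "ILD_log10": 0.95, "PE": 5.1, "DeltaPHI": 1.3, "PHIND": 6.4, "NM_M": 2},
    {"GR": 120.0, "ILD_log10": 0.40, "PE": 3.4, "DeltaPHI": 7.8, "PHIND": 15.0, "NM_M": 2},
]

complete = drop_incomplete(rows)
non_marine, marine = split_by_environment(complete)
marine_gr_mean = mean(r["GR"] for r in marine)
print(len(complete), len(non_marine), len(marine), marine_gr_mean)
```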

A significant difference in the distribution between non-marine and marine data points is shown in Table 2. GR has a higher maximum value for marine data points than non-marine points, while marine data points have lower mean values with higher standard deviations. Meanwhile, DeltaPHI values are spread over a similar range, and have different means and standard deviations.

As a result of this difference in the data distributions, the piecemeal clustering algorithm was applied separately to the wire-line logs from non-marine and marine depositions. Before applying the algorithm to the datasets, each of the logs was normalized to values between 0 and 1. This prevents piecemeal clustering from being biased toward any particular log because of the magnitude of its readings.
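Scaling each log to the range 0 to 1 is commonly done with min-max normalization. The sketch below assumes that form of normalization, since the exact formula is not given here, and the gamma ray readings are hypothetical.

```python
def min_max_normalize(values):
    """Rescale a list of readings linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant log carries no contrast; map every reading to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical gamma ray readings (API units).
gr = [20.0, 60.0, 100.0, 180.0]
gr_scaled = min_max_normalize(gr)
print(gr_scaled)
```

Applying the same function to every log independently puts readings with very different physical ranges, such as GR and ILD_log10, on a common scale before distances are computed.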

The cutoff threshold, T, for marine data points was set to 0.045 (4.5%). The number of training iterations was set to 40, and the learning rate was set to $1 \times 10^{-6}$. The parameters were set by following the guidelines provided in the piecemeal clustering algorithm as described in [35]. The NM_M indicator and relative position (RELPOS) were excluded from this analysis since these are derived from the formation tops information and constrained geology. Principal component analysis (PCA) with centered and variance-based variable weights was applied before passing the data to the algorithm. The degree of separation of the signature was very narrow for marine signatures. PCA helps spread out the data [52]. The clustering produced six clusters.

The non-marine lithofacies are well separated, and consequently, it was easier to perform clustering on these data. A much higher value for T, 0.24 (24%), was used with ten iterations and a learning rate close to $1 \times 10^{-8}$ for the clustering. Piecemeal clustering produced three clusters.

To understand the accuracy of each of the predicted clusters by piecemeal clustering, each cluster representing a lithofacies was labeled using its dominating members. This was achieved by cross-referencing the members of each cluster with their respective known labeling found in the dataset. For example, the first predicted cluster contained a total of 636 members. Three-hundred and nine (309) members were labeled as non-marine sandstone (SS), 168 were labeled as non-marine coarse siltstone (CSiS), and 159 were labeled as non-marine fine siltstone (FSiS) in the known results. Therefore, SS was considered to be the dominating lithofacies; and the cluster was labeled with SS. This procedure was performed for all nine clusters.
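The labeling rule can be expressed as a majority vote over the known labels of a cluster's members. The sketch below uses the member counts quoted above for the first predicted cluster; the function name is hypothetical.

```python
from collections import Counter

def dominant_label(known_labels):
    """Return the known lithofacies label that occurs most often among cluster members."""
    return Counter(known_labels).most_common(1)[0][0]

# Known labels of the first predicted cluster, using the counts reported in the text.
cluster_1 = ["SS"] * 309 + ["CSiS"] * 168 + ["FSiS"] * 159

print(len(cluster_1), dominant_label(cluster_1))
```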

To understand the mapping accuracy of the algorithm, two types of accuracy calculations were considered: with and without adjacent facies. All the data points that were dominating cluster members in their respective clusters were considered to be correctly mapped when calculating the accuracy without considering the adjacent facies. When calculating the accuracy with the consideration of adjacent facies, the cluster members that were not part of the dominating lithofacies but part of the adjacent facies, as found in Table 1, were also included as correctly mapped data points. For example, in the first predicted cluster mentioned above, only 309 data points were considered to have been correctly mapped when calculating the accuracy without considering adjacent facies. In contrast, when calculating the accuracy by considering the adjacent facies, an additional 168 data points were added to the set of correctly mapped data points (total 477), because CSiS is adjacent to SS. Across the entire dataset, the clustering algorithm accurately mapped 45.20% of the data points when not considering adjacent facies and had an accuracy of 81.90% when considering adjacent facies.
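The two counting rules can be sketched for the first predicted cluster. The adjacency map below contains only the single relationship stated in the text (CSiS is adjacent to SS); the full adjacency list belongs to Table 1 and is not reproduced here, and the function name is hypothetical.

```python
from collections import Counter

# Only the adjacency stated in the text; the complete map is given in Table 1.
ADJACENT = {"SS": {"CSiS"}}

def correct_counts(known_labels, adjacent):
    """Return (dominant label, correct without adjacency, correct with adjacency)."""
    counts = Counter(known_labels)
    label, strict = counts.most_common(1)[0]
    relaxed = strict + sum(counts[f] for f in adjacent.get(label, ()))
    return label, strict, relaxed

cluster_1 = ["SS"] * 309 + ["CSiS"] * 168 + ["FSiS"] * 159
label, strict, relaxed = correct_counts(cluster_1, ADJACENT)
print(label, strict, relaxed)
```

Summing the strict and relaxed counts over all nine clusters and dividing by the total number of data points gives the two dataset-level accuracies.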

The number of elements from each lithofacies found within a predicted cluster corresponding to a given true cluster is shown in a confusion matrix in Figure 5. Each column represents one predicted cluster, while each row represents the number of cluster members from the known lithofacies found in each predicted cluster. For example, the aforementioned cluster is presented in column 1 in the confusion matrix. The numbers of mapped data points in each cluster are represented using gray shades, with higher numbers represented by darker gray and lower numbers by lighter gray. One predicted cluster was found for each of the nine lithofacies. Therefore, the clustering algorithm accurately predicted the number of lithofacies from the dataset without using any labeled data.

Using the confusion matrix, it was also investigated whether the clustering algorithm favors isolating specific facies solely based on the wire-line log signature. In other words, how good is the algorithm at grouping the signatures from the same facies into a single cluster? The extent to which the algorithm was able to accurately map each log signature to its respective facies was calculated, with and without considering the adjacent facies. The numerical result is shown in Table 3. The confusion matrix demonstrates that the clustering performed reasonably well for some facies types but performed poorly on others. For instance, with only a few misclassifications, piecemeal clustering was successful in identifying dolomite (D) and phylloid-algal bafflestone (BS). This can be seen in the confusion matrix for rows D and BS, whose off-diagonal elements are filled with lower-intensity gray. On the other hand, for packstone–grainstone (PS) and wackestone (WS), a large number of false positives was generated, which can also be seen in the confusion matrix, where the corresponding rows have darker off-diagonal elements. A further detailed discussion of the accuracy and effectiveness of this approach is included in Section 5.

It is important to understand how the clustering algorithm performs for each well. Table 4 lists all the wells with a breakdown of the prediction results. The names and details of each well are found in the dataset [19,38]. For each well, Table 4 presents the number of wire-line logs, the number of logs of marine and non-marine types, and the corresponding predicted logs. The predicted mapping accuracy was calculated with and without considering adjacent facies. Predicted logs for non-marine facies are shown in the “predicted non-marine log” column, separated into predictions with and without considering adjacent facies. The same is done for the marine logs in the “predicted marine logs” column. The overall accuracy, with and without consideration of adjacent facies, is given in the “Overall Accuracy” column. Note that the “Recruit F9” well is a synthetic well that was specifically created for the SEG competition.

As illustrated in Table 4, each well has a different number of logs. “Cross H Cattle” has the highest number of total logs and the highest number of non-marine logs, while “Churchman Bible” has the lowest number of total logs as well as the lowest number of non-marine logs. On the other hand, “Newbay” has the highest number of marine logs while “Shankle” and “Cross H Cattle” have the lowest.

Analyzing the known results, only “Kimzey A” and “Churchman Bible” contain all nine lithofacies. Without considering the adjacent facies, piecemeal clustering performed best on “Churchman Bible” and worst on “Shankle”. On the other hand, considering the adjacent facies, the algorithm performed best on “Kimzey A” and worst on “Luke G U”. In both scenarios, the best results came from wells containing all nine facies. Figure 6 shows an example well, “Kimzey A”, with the lithofacies predicted by previous studies along with the predictions from piecemeal clustering. In the figure, the wire-line logs are plotted vertically, with depth varying along the y axis.

5. Discussion

5.1. Prediction Effectiveness

To assess the overall effectiveness of the facies classification of the wire-line logs using piecemeal clustering, a number of measures can be considered, such as accuracy, precision, recall, and F1-score. The accuracy of the algorithm was demonstrated in the previous section, where the accuracies were calculated with and without consideration of adjacent facies.
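The difference between the two accuracy variants can be sketched as follows. With adjacent facies considered, a prediction counts as a hit if it equals the known facies or one of its neighbors. The adjacency table, the label encoding, and the function names below are hypothetical and chosen only for illustration; the adjacency actually used for the Hugoton and Panoma data is defined with the dataset and is not reproduced here.

```python
def exact_accuracy(true_labels, predicted_labels):
    """Fraction of signatures whose predicted facies equals the known facies."""
    hits = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return hits / len(true_labels)


def adjacent_accuracy(true_labels, predicted_labels, adjacent):
    """Fraction of signatures predicted as the known facies or one of its neighbors."""
    hits = sum(
        1
        for t, p in zip(true_labels, predicted_labels)
        if p == t or p in adjacent.get(t, ())
    )
    return hits / len(true_labels)


# Hypothetical adjacency: facies 0-1-2 grade into each other, facies 3 is isolated.
adjacent = {0: [1], 1: [0, 2], 2: [1], 3: []}
true_labels = [0, 1, 2, 3, 0, 2]
predicted_labels = [0, 2, 0, 3, 1, 2]

acc = exact_accuracy(true_labels, predicted_labels)
adj_acc = adjacent_accuracy(true_labels, predicted_labels, adjacent)
print(acc, adj_acc)
```

Because every exact hit is also an adjacent hit, the adjacent accuracy can never be lower than the exact accuracy.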

Precision, recall, and F1-score were calculated using the predictions that exclude adjacent facies. The precision of a class is the percentage of correct positive predictions out of all positive predictions made for that class. Precision is a useful metric when the cost of false positive predictions is very high. Among the non-marine facies, non-marine coarse siltstone (CSiS) gave the highest precision, 55.18%, while non-marine sandstone (SS) gave the lowest, 48.58%. Among the marine facies, phylloid-algal bafflestone (BS) yielded the highest precision at 51.97%, while dolomite (D) yielded the lowest at 27.89%. The overall precision of 48.98% was calculated by taking the weighted average of all precision values.

The recall, on the other hand, is the percentage of true positive predictions out of all actual members of a class. Table 3 shows the recall values for all facies. The overall recall value is calculated by taking the weighted average of all recall values, and was found to be 47.26%.

The F1-score combines both the precision and recall by taking their harmonic mean. The highest F1-score found among non-marine facies was for non-marine sandstone (SS) at 59.10%, while the lowest score found was for non-marine coarse siltstone (CSiS) at 47.85%. Among the marine facies, the highest score was found for phylloid-algal bafflestone (BS) at 63.44%, while the lowest was found for packstone–grainstone (PS) at 35.36%. The overall F1-score was 46.96%.
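The three measures can be sketched directly from a confusion matrix. The snippet below assumes that each predicted cluster has already been mapped to a facies, so that the diagonal holds the correct predictions, and that the weighted average uses the number of known members of each facies as the weight; both are assumptions for illustration, since the exact weighting is not spelled out here. The function names and the toy matrix are hypothetical.

```python
def per_class_metrics(confusion):
    """confusion[i][j]: signatures of known facies i predicted as facies j."""
    n = len(confusion)
    rows = []
    for k in range(n):
        tp = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))  # column total
        actual_k = sum(confusion[k])                          # row total (support)
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0  # harmonic mean
        rows.append(
            {"precision": precision, "recall": recall, "f1": f1, "support": actual_k}
        )
    return rows


def weighted_average(rows, key):
    """Average of a metric over classes, weighted by class support."""
    total_support = sum(r["support"] for r in rows)
    return sum(r[key] * r["support"] for r in rows) / total_support


# Toy 3-facies matrix (hypothetical counts).
confusion = [[2, 1, 0], [0, 2, 0], [1, 0, 2]]
metrics = per_class_metrics(confusion)
overall_precision = weighted_average(metrics, "precision")
overall_recall = weighted_average(metrics, "recall")
overall_f1 = weighted_average(metrics, "f1")
print(overall_precision, overall_recall, overall_f1)
```

With support weighting, the overall recall equals the plain accuracy, because each class's correct count is divided by its support and then multiplied by it again.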

The variations in precision and F1-score values among non-marine facies were observed to be lower than those for the marine facies. The difference between the highest and lowest values is significantly higher for marine facies. However, the recall values varied for both marine and non-marine facies.

Notably, the measured accuracy increases markedly when adjacent facies are taken into account. This is attributed to the limited size of the dataset relative to the substantial heterogeneity present in the field. Increasing the amount of input data is therefore expected to be essential for improving accuracy.

5.2. Comparison with Other Data Clustering Algorithms

The performance of the proposed lithofacies identification algorithm using piecemeal clustering is compared with the performances of the other aforementioned well-known data clustering algorithms in terms of clustering accuracy as well as the correct identification of the number of clusters present in the data. Subjective analyses were also used to compare the performances of the various clustering algorithms. When comparing the mapping accuracy, both accuracies with and without consideration of the adjacent facies are used. The clustering algorithms that are used in the comparison are: K-means, SOM, DBSCAN, RNN-DBSCAN, and HDBSCAN. The use of the K-means, SOM, and DBSCAN algorithms is widespread as these are more generic clustering algorithms, while the RNN-DBSCAN and HDBSCAN clustering algorithms build upon the DBSCAN algorithm while trying to overcome some of its limitations.

All algorithms compared required the setting of various parameters in order to tweak their performances. For example, K-means and SOM require the number of clusters to be specified beforehand, thus making the number of clusters a required input parameter. In the cases of these two algorithms, we assumed that the correct number of clusters was specified so that there was no chance of the number of clusters being incorrectly identified.

The DBSCAN, RNN-DBSCAN, and HDBSCAN clustering algorithms do not require the number of clusters as a known input. However, we found that when any of these three algorithms was tuned to produce the known true number of clusters, the majority of data points were identified as noise, regardless of the values of the other input parameters, and thus the resulting clusters were not comparable to those from the proposed piecemeal algorithm. Therefore, we performed a brute-force search over all possible parameter values for the DBSCAN, RNN-DBSCAN, and HDBSCAN algorithms to find the parameters giving the best results, while ensuring that the detected number of clusters was close to the true number of clusters. For example, DBSCAN uses the value $\epsilon$ to determine the clusters, RNN-DBSCAN uses neighborhood relationships (i.e., KNN-search) to determine the data density, while HDBSCAN uses minimum spanning trees where the output clusters are determined by the tree parameters $\alpha$ and $\mathit{leaf\ sizes}$.
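The brute-force search described above can be sketched generically: try every parameter combination, discard those whose cluster count is too far from the known number, and keep the best-scoring remainder. Everything in the snippet is an assumption for illustration. `gap_clustering` is a toy one-dimensional stand-in and is not DBSCAN, RNN-DBSCAN, or HDBSCAN; the score function is a placeholder where the mapping accuracy would be used; and the grid values are arbitrary.

```python
from itertools import product


def gap_clustering(points, eps, min_pts):
    """Toy 1-D stand-in for a density-based clusterer (NOT DBSCAN itself):
    sorted points closer than eps are chained together, and chains with
    fewer than min_pts members are labeled as noise (-1)."""
    if not points:
        return []
    order = sorted(range(len(points)), key=lambda i: points[i])
    chains, current = [], [order[0]]
    for prev, nxt in zip(order, order[1:]):
        if points[nxt] - points[prev] <= eps:
            current.append(nxt)
        else:
            chains.append(current)
            current = [nxt]
    chains.append(current)
    labels, next_label = [-1] * len(points), 0
    for chain in chains:
        if len(chain) >= min_pts:
            for i in chain:
                labels[i] = next_label
            next_label += 1
    return labels


def brute_force_search(points, cluster_fn, grid, score_fn, target_clusters, tolerance=1):
    """Evaluate every parameter combination; keep the best-scoring one whose
    cluster count (noise excluded) is within `tolerance` of the target."""
    names = sorted(grid)
    best = None
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        labels = cluster_fn(points, **params)
        n_clusters = len(set(labels) - {-1})
        if abs(n_clusters - target_clusters) > tolerance:
            continue
        score = score_fn(labels)
        if best is None or score > best["score"]:
            best = {"params": params, "n_clusters": n_clusters, "score": score}
    return best


def non_noise_fraction(labels):
    """Placeholder score; the study optimized mapping accuracy instead."""
    return 1 - labels.count(-1) / len(labels)


points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 9.0]
grid = {"eps": [0.05, 0.15, 6.0], "min_pts": [2, 3]}
best = brute_force_search(
    points, gap_clustering, grid, non_noise_fraction, target_clusters=2, tolerance=0
)
print(best)
```

In a real experiment, `cluster_fn` would wrap the library implementation under test and `score_fn` would compute the mapping accuracy against the known facies. The same loop also makes it easy to record the noise fraction for each parameter setting.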

Implementations of the K-means, SOM, and DBSCAN clustering algorithms from the MATLAB® machine learning toolbox were used for the comparative experiments, while the RNN-DBSCAN package [53] from the MATLAB® library was used for the RNN-DBSCAN implementation. Finally, the scikit-learn Python package [54] was used for the HDBSCAN clustering algorithm.

When generating cluster results from the compared algorithms, there were two types of optimization that could be applied: (a) optimize the algorithms to obtain the correct number of clusters and then find the mapping accuracy; or (b) optimize the mapping accuracy and see how many clusters the algorithms produce. For K-means and SOM, the number of clusters must be specified, so it is only possible to optimize them for the mapping accuracy. For the DBSCAN-based algorithms, it was possible to optimize for both. When optimizing for the correct number of clusters, all of these algorithms detected the majority of the data points as noise and produced non-comparable results. Therefore, the results optimized for the mapping accuracy were used; where more than one result provided a similar mapping accuracy, the one with the number of clusters closest to the known number was kept. The experiment was run in two ways, namely with and without splitting the dataset into marine and non-marine lithofacies. Splitting the dataset provided better results for all algorithms. Table 5 shows the outcome of the comparison. The first four rows of the table compare the accuracy of the number of clusters identified by the clustering algorithms, while the rest of the rows compare the mapping accuracy. DBSCAN, RNN-DBSCAN, and HDBSCAN also separate out some data points as noise. The last row shows how many of the data points were identified as noise by these algorithms.

The dominating facies approach previously mentioned was also applied to analyze the clusters generated by the algorithms being compared. It was found that some of the clusters generated by each of the algorithms represent the same dominating facies. At the same time, some of the lithofacies are not found as dominating facies in any of the clusters. When there is more than one cluster representing the same dominating facies, the cluster with the highest number of dominating data points is considered to be the correct mapping for the purpose of comparing accuracy. The first row of the table shows how many clusters each of the algorithms produces, and the second row shows the number of unique facies identified as dominating facies in the clusters. The number of uniquely identified marine and non-marine facies is also reported in rows three and four.
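The tie-breaking rule above can be sketched as follows. The snippet assumes that the dominating facies of a cluster is the known facies with the most members in that cluster, which is how the approach is read here; the function name and the toy counts are hypothetical.

```python
def dominating_facies_mapping(confusion):
    """confusion[f][c]: number of signatures of known facies f in cluster c.
    Returns ({facies: cluster}, [facies never dominating any cluster])."""
    n_facies = len(confusion)
    n_clusters = len(confusion[0])
    best = {}  # facies -> (dominating count, cluster index)
    for c in range(n_clusters):
        column = [confusion[f][c] for f in range(n_facies)]
        dominant = max(range(n_facies), key=lambda f: column[f])
        count = column[dominant]
        # If several clusters share a dominating facies, keep the cluster
        # that holds the most signatures of that facies.
        if dominant not in best or count > best[dominant][0]:
            best[dominant] = (count, c)
    mapping = {f: c for f, (_, c) in best.items()}
    missing = [f for f in range(n_facies) if f not in mapping]
    return mapping, missing


# Toy counts (hypothetical): 3 known facies, 4 predicted clusters.
confusion = [
    [5, 1, 4, 0],
    [1, 6, 1, 2],
    [0, 2, 1, 1],
]
mapping, missing = dominating_facies_mapping(confusion)
print(mapping, missing)
```

In this toy case, clusters 0 and 2 are both dominated by facies 0 and cluster 0 wins with five signatures, while facies 2 never dominates any cluster. The number of keys in `mapping` corresponds to the count of unique facies identified as dominating facies.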

Piecemeal clustering is the only algorithm that uniquely identified all nine facies and also generated exactly nine unique clusters. SOM and K-means produced nine clusters because that number was supplied as an input. DBSCAN could not produce nine clusters when optimized for a better mapping accuracy. RNN-DBSCAN and HDBSCAN produced 11 clusters each. Among the compared algorithms, HDBSCAN identified the most unique facies. All of the algorithms compared, except RNN-DBSCAN, identified two unique non-marine facies. For identifying marine lithofacies, HDBSCAN exhibited the best result after piecemeal clustering by finding five marine facies.

Comparing the mapping accuracy also reveals interesting facts about the other clustering algorithms. For mapping accuracy, piecemeal clustering yields the best result for marine and non-marine facies. Among the other algorithms, DBSCAN yields the best result. HDBSCAN also produces a similar accuracy to DBSCAN and, at the same time, produces a better result in terms of finding the number of facies. Both DBSCAN and HDBSCAN also produce results close to those produced by piecemeal clustering. The total and adjacent accuracy also follow a similar comparison. DBSCAN and HDBSCAN performed better than the other three algorithms and produced results close to the piecemeal clustering.

5.3. Comparison with Other Machine Learning Methods

While piecemeal clustering outperforms the data clustering algorithms mentioned above, it is important to test how the piecemeal clustering algorithm performs against known machine-learning techniques. Table 6 compares our results to those from five other machine-learning techniques. Five performance indicators are used for this evaluation: total accuracy, F1-score, marine (M) facies accuracy, non-marine (NM) facies accuracy, and total adjacent accuracy. The techniques examined in this paper are Gaussian naïve Bayes (GNB), support vector machines (SVM), extreme gradient boosting (XGB), label propagation (LP), self-trained label propagation (self-trained LP), and piecemeal clustering. The results for the other algorithms were obtained from the previously published experiment in [4].

The NM facies accuracy measures the percentage of correctly predicted wire-line log signatures from non-marine facies, while the M facies accuracy measures the percentage of correctly predicted wire-line log signatures from marine facies. The overall accuracy measures how many wire-line log signatures of marine and non-marine facies were correctly predicted. The total adjacent accuracy evaluates the proportion of log signatures predicted as either the correct facies or one of its neighboring facies, whereas the F1-score measures the balance between precision and recall. With regard to the NM facies accuracy, self-trained LP performed the best at 59.81%, followed by XGB at 54.40%, piecemeal clustering at 51.68%, LP at 51.07%, SVM (with CV) at 50.41%, and finally GNB at 49.21%. Self-trained LP is a type of label propagation that iteratively improves its accuracy using a self-training methodology; its ability to handle nonlinear relationships and to capture interactions between features likely explains why it attained the best accuracy. The powerful XGB algorithm (with default) also excels at this task. Self-trained LP also outperformed all other methods in terms of M facies accuracy at 42.22%, followed by piecemeal clustering at 39.21%, SVM with CV at 36.60%, XGB at 34.67%, LP at 34.46%, and GNB at 32.26%. Piecemeal clustering, by virtue of its clustering-based strategy that can spot patterns in the data and group related samples together, also did well on this task.

Total accuracy is a crucial parameter for assessing the overall effectiveness of algorithms. Self-trained LP in this study had the highest overall accuracy (50.92%), followed by piecemeal clustering (45.20%). XGB exhibited a performance which was similar to that of piecemeal clustering at 44.43%. SVM with CV also performed close to XGB at 43.43%. LP and GNB, on the other hand, performed the worst among all these algorithms, at 42.67% and 40.64%, respectively. Due to their capacity for handling unbalanced and noisy data as well as their capacity to learn from unlabeled data, self-trained LP and piecemeal clustering did well for this criterion.

The F1-score is another important metric that balances precision and recall. It gauges how well a model’s predictions balance positive predictive value (precision) with sensitivity (recall). For each of the machine learning methods tested in this study, F1-scores were determined. The self-trained LP had the highest F1-score in this study with 49.35%, followed by piecemeal clustering (46.96%), XGB (42.33%), SVM with CV (39.50%), LP (with default) (35.56%), and GNB (35.39%). Due to their ability to deal with unbalanced data and their capacity to learn, self-trained LP (with default) and piecemeal clustering performed well for this criterion.

Overall, piecemeal clustering has a total adjacent accuracy of 81.90%, indicating that, in most cases, the model predicted either the correct facies or one adjacent to it. While other algorithms did better than piecemeal clustering when adjacent facies were considered, piecemeal clustering was better than all other algorithms except self-trained LP when adjacent facies were not considered. In addition, the precision, recall, and F1-score for each facies show that piecemeal clustering performs better in predicting certain facies, such as non-marine sandstone (SS) and phylloid-algal bafflestone (BS), and performs poorly in predicting others, such as dolomite (D) and packstone–grainstone (PS). Therefore, it is important to consider these metrics when evaluating a model’s performance and making any decisions based on its predictions. Note that the algorithms in these comparisons are supervised and semi-supervised, and thus train themselves using known outcomes, so higher accuracy is to be expected from them. However, supervised and semi-supervised approaches are constrained by the limited availability of labeled data. Generating such data is not always feasible and, when possible, requires significant time and cost. One unique advantage of piecemeal clustering is that it clusters the dataset without requiring any labeled data and produces tight clusters. This makes piecemeal clustering more useful for wells lacking core and thin-section samples. As a result, it is a significantly less expensive option than the other methods while still producing comparable results.

5.4. Research Question Validation

The study was conducted to answer the two research questions presented in Section 1. Based on the results reported above, it is possible to assess whether these questions were answered.

The first research question was ‘is it possible to identify lithofacies using the piecemeal clustering algorithm without prior knowledge of the number of lithofacies in the data?’ The algorithm identified three non-marine lithofacies and six marine lithofacies from the dataset, one for each of the nine lithofacies in the known results. It was therefore 100% accurate in identifying the number of lithofacies. The piecemeal clustering algorithm also outperformed the other data clustering algorithms. In addition, the reported accuracy, precision, recall, and F1-score make it possible to assess the effectiveness of identifying lithofacies using the piecemeal clustering algorithm relative to other machine learning techniques when there is no prior knowledge of the number of lithofacies in the data. The precision measures the percentage of correct positive predictions of a class out of all positive predictions made for that class. The precision values vary across lithofacies: among the non-marine facies, non-marine coarse siltstone (CSiS) had the highest precision at 55.18% and non-marine sandstone (SS) had the lowest at 48.58%. Among the marine facies, phylloid-algal bafflestone (BS) had the highest precision at 51.97%, while dolomite (D) had the lowest at 27.89%. The overall precision for the piecemeal clustering algorithm, calculated as a weighted average of all precision values, was 48.98%.

The recall is the percentage of true positive predictions out of all actual members of a class. The recall values for all facies are provided in Table 3, and the overall recall value, calculated as a weighted average, was found to be 47.26% for piecemeal clustering.

The F1-score combines precision and recall by taking their harmonic mean. Among the non-marine facies, non-marine sandstone (SS) had the highest F1-score at 59.10%, while non-marine coarse siltstone (CSiS) had the lowest at 47.85%. Among the marine facies, phylloid-algal bafflestone (BS) had the highest F1-score at 63.44%, while packstone–grainstone (PS) had the lowest at 35.36%. The overall F1-score was 46.96% for piecemeal clustering.

Based on these results, the piecemeal clustering algorithm achieved reasonable precision, recall, and F1-scores for lithofacies identification without prior knowledge of the number of lithofacies in the data. While the values vary across lithofacies, the overall performance indicates that unsupervised machine learning algorithms, such as piecemeal clustering, have the potential to identify lithofacies effectively.

The second research question was ‘does the piecemeal clustering algorithm yield comparable results to other data clustering algorithms, and supervised and semi-supervised machine learning techniques?’ Based on the results, the answer to this question is positive. The results of unsupervised piecemeal clustering were compared with those obtained from different data clustering algorithms and machine learning techniques. The comparison includes the supervised algorithms Gaussian naïve Bayes (GNB), support vector machines (SVMs), and extreme gradient boosting (XGB), and the semi-supervised algorithms label propagation (LP) and self-trained label propagation (self-trained LP). These algorithms were evaluated using various performance indicators, namely total accuracy, F1-score, marine and non-marine facies accuracy, and total adjacent accuracy. The study found that piecemeal clustering performed better than the commonly used data clustering algorithms, and reasonably well compared to the other machine learning techniques in terms of accuracy, F1-score, and overall performance. This suggests that unsupervised data clustering, represented by piecemeal clustering, can provide results similar to those of supervised and semi-supervised learning algorithms for the identification of lithofacies. However, it is important to note that the evaluation and comparison were specific to the dataset and algorithms used in this particular study. The performance of different algorithms may vary depending on the dataset characteristics and the specific problem domain. It is recommended to consider these factors when selecting an appropriate algorithm for a given task.

6. Conclusions

In this study, the effectiveness of a novel unsupervised clustering algorithm, piecemeal clustering, was explored for identifying lithofacies without prior knowledge of the number of lithofacies in the data. The research questions address two important issues: (a) the feasibility of identifying lithofacies using the piecemeal clustering algorithm and (b) the comparison of the piecemeal clustering algorithm with other data clustering algorithms, supervised and semi-supervised machine learning techniques. By evaluating the algorithm’s performance using various metrics, such as accuracy, precision, recall, and F1-score, valuable insights into its capabilities were gained.

The results obtained from our experiments provide evidence that piecemeal clustering holds promise for lithofacies identification. Piecemeal clustering outperformed other commonly used data clustering algorithms and was competitive with known machine learning techniques. The evaluation included five performance indicators: total accuracy, F1-score, marine (M) facies accuracy, non-marine (NM) facies accuracy, and total adjacent accuracy. While some machine learning algorithms, such as XGB and self-trained LP (default), outperformed piecemeal clustering in specific metrics like non-marine facies accuracy and marine facies accuracy, the overall performance of piecemeal clustering was notable.

It is also important to note that the effectiveness of the piecemeal clustering algorithm varied across different lithofacies. On some lithofacies, such as non-marine sandstone (SS) and phylloid-algal bafflestone (BS), the algorithm demonstrated higher precision, recall, and F1-score, indicating that it performed better in predicting these facies. On the other hand, lithofacies like dolomite (D) and packstone–grainstone (PS) posed challenges to the algorithm, resulting in lower predictive performance. Therefore, it is crucial to consider these variations and assess the algorithm’s performance for each lithofacies category individually. The present study was conducted on a limited set of only 10 wells, because these were the data available in the public domain. This limited availability of data may, in turn, restrict how comprehensively field heterogeneity can be explored with data clustering algorithms. To improve the identification of lithofacies using data clustering techniques, further experiments on larger datasets are recommended. Such experiments are expected to improve understanding of the approach’s effectiveness in delineating the characteristics of lithofacies within diverse geological formations.

To further enhance the effectiveness of unsupervised clustering algorithms for the identification of lithofacies, several potential avenues can be explored. Firstly, incorporating additional variables or features derived from wire-line log measurements could provide valuable information and improve the algorithm’s accuracy. Parameters such as gamma ray (GR), resistivity (RT), neutron porosity (NPHI), and density (RHOB) have been widely used in lithofacies classification and could be considered as inputs to the clustering algorithm.

This study also demonstrated the potential of unsupervised data clustering algorithms for lithofacies identification using one state-of-the-art unsupervised data clustering algorithm. In the future, a more comprehensive study can also be conducted, employing other unsupervised data clustering algorithms on various wire-line log datasets. Furthermore, exploring advanced techniques for handling imbalanced and noisy data could address the challenges associated with certain lithofacies categories. Techniques such as data resampling, feature engineering, or applying ensemble methods might enhance the algorithm’s performance and accuracy, particularly for lithofacies with limited representation in the dataset. Additionally, incorporating domain-specific knowledge and geological constraints into the algorithm could contribute to better lithofacies identification. By integrating geological principles, such as stratigraphic layering, facies associations, and spatial patterns, the algorithm could leverage this information to guide the clustering process and improve the accuracy of lithofacies assignments.

In conclusion, this study demonstrates the potential of unsupervised clustering algorithms, particularly piecemeal clustering, for lithofacies identification without prior knowledge of the number of lithofacies. While the algorithm achieved competitive results compared to known machine learning techniques, its performance varied across different lithofacies categories. By considering additional variables, addressing data imbalances, and incorporating domain-specific knowledge, the algorithm’s effectiveness can be further enhanced. These findings contribute to the field of geoscience research and open up new opportunities for automated lithofacies identification in various applications, such as reservoir characterization and geological modeling. Future studies should continue to explore these avenues and refine the presented algorithm, and potentially others, to achieve even more accurate and robust lithofacies classification results.

Author Contributions

Conceptualization, M.M.U.H.; Formal analysis, M.M.U.H., T.H., R.S. and L.J.; Funding acquisition, L.J. and R.G.; Investigation, M.M.U.H. and T.H.; Methodology, M.M.U.H., R.S., D.P. and R.G.; Project administration, L.J. and D.P.; Resources, R.S., L.J., D.P. and R.G.; Software, M.M.U.H.; Supervision, R.S., L.J., D.P. and R.G.; Validation, M.M.U.H. and T.H.; Visualization, M.M.U.H. and T.H.; Writing—original draft, M.M.U.H. and T.H.; Writing—review and editing, R.S., L.J., D.P. and R.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by “Natural Sciences and Engineering Research Council of Canada (NSERC)”, “Equinor (formerly Statoil)”, and Memorial University.

Data Availability Statement

Publicly available datasets were analyzed in this study, which were made public by the Kansas Geological Survey. These data can be found here: https://www.kgs.ku.edu/PRS/petroDB.html (accessed on 30 June 2014).

Acknowledgments

The authors would like to thank Memorial University for logistical support.

Conflicts of Interest

The authors declare no conflict of interest.

References

Dubois, M.K.; Bohling, G.C.; Chakrabarti, S. Comparison of four approaches to a rock facies classification problem. Comput. Geosci. 2007, 33, 599–617.
Li, Y.; Zhao, Q.; Lyu, Q.; Xue, Z.; Cao, X.; Liu, Z. Evaluation technology and practice of continental shale oil development in China. Pet. Explor. Dev. 2022, 49, 1098–1109.
Zhao, Q.; Guo, J.; Zhang, Z. A method for judging the effectiveness of complex tight gas reservoirs based on geophysical logging data and using the L block of the Ordos Basin as a case study. Processes 2023, 11, 2195.
Dunham, M.W.; Malcolm, A.; Welford, J.K. Improved well log classification using semisupervised Gaussian mixture models and a new hyper-parameter selection strategy. Comput. Geosci. 2020, 140, 104501.
Xu, D.; Tian, Y. A Comprehensive Survey of Clustering Algorithms. Ann. Data Sci. 2015, 2, 165–193.
Riaz, M.; Banerjee, S.; Latif, K.; Liu, Z.; Ahmed, M.S.; Fathy, D. Understanding the origin of ancient carbonate ooids: Recent findings. Int. Geol. Rev. 2023.
Shahid, A.R.; Khan, S.; Yan, H. Human expression recognition using facial shape based Fourier descriptors fusion. In Proceedings of the Twelfth International Conference on Machine Vision, Amsterdam, The Netherlands, 16–18 November 2019; Osten, W., Nikolaev, D.P., Eds.; International Society for Optics and Photonics, SPIE: Bellingham, WA, USA, 2019; Volume 11433, p. 114330P.
Tian, Y.; Xu, H.; Zhang, X.Y.; Wang, H.J.; Guo, T.C.; Zhang, L.J.; Gong, X.L. Multi-resolution graph-based clustering analysis for lithofacies identification from well log data: Case study of intraplatform bank gas fields, Amu Darya Basin. Appl. Geophys. 2016, 13, 598–607.
Wang, G.; Carr, T.R.; Ju, Y.; Li, C. Identifying organic-rich Marcellus Shale lithofacies by support vector machine classifier in the Appalachian basin. Comput. Geosci. 2014, 64, 52–60.
Saggaf, M.M.; Nebrija, E.L. A fuzzy logic approach for the estimation of facies from wire-line logs. Am. Assoc. Pet. Geol. Bull. 2003, 87, 1223–1240.
Toumani, A.; Schmitz, D.; Schepers, R. Automatic determination of lithology from well logs using fuzzy classification. In Proceedings of the 56th Meeting of the European Association of Exploration Geophysicists, Vienna, Austria, 6–10 June 1994; European Association of Geoscientists & Engineers: Utrecht, The Netherlands, 1994; p. cp-47.
Baldwin, J.L.; Bateman, R.M.; Wheatley, C.L. Application of a Neural Network to the problem of mineral identification from well logs. Log Anal. 1990, 31, 279–293.
Kapur, L.; Lake, L.W.; Sepehrnoori, K.; Herrick, D.C.; Kalkomey, C.T. Facies prediction from core and log data using artificial Neural Network technology. Soc. Prof. Well Log Analysts Annu. Logging Symp. 1998, 39, 1–11.
Raeesi, M.; Moradzadeh, A.; Doulati Ardejani, F.; Rahimi, M. Classification and identification of hydrocarbon reservoir lithofacies and their heterogeneity using seismic attributes, logs data and artificial neural networks. J. Pet. Sci. Eng. 2012, 82–83, 151–165.
Klose, C.D. Self-organizing maps for geoscientific data analysis: Geological interpretation of multidimensional geophysical data. Comput. Geosci. 2006, 10, 265–277.
Vo-Thanh, H.; Amar, M.N.; Lee, K.K. Robust machine learning models of carbon dioxide trapping indexes at geological storage sites. Fuel 2022, 316, 123391.
Bhattacharya, S.; Carr, T.R.; Pal, M. Comparison of supervised and unsupervised approaches for mudstone lithofacies classification: Case studies from the Bakken and Mahantango-Marcellus Shale, USA. J. Nat. Gas Sci. Eng. 2016, 33, 1119–1133.
Bestagini, P.; Lipari, V.; Tubaro, S. A machine learning approach to facies classification using well logs. In SEG Technical Program Expanded Abstracts 2017; Society of Exploration Geophysicists: Tulsa, OK, USA, 2017; pp. 2137–2142.
Hall, B. Facies classification using machine learning. Lead. Edge 2016, 35, 818–924.
Bishop, C.M.; Nasrabadi, N.M. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006.
Chapelle, O.; Schölkopf, B.; Zien, A. Semi-Supervised Learning; The MIT Press: Cambridge, MA, USA, 2010.
Zhu, X.; Goldberg, A.B. Overview of Semi-Supervised Learning. In Introduction to Semi-Supervised Learning; Brachman, R., Rossi, F., Stone, P., Eds.; Morgan & Claypool: San Rafael, CA, USA, 2009; Volume 3, pp. 1–130.
Freund, Y. Boosting a weak learning algorithm by majority. Inf. Comput. 1995, 121, 256–285.
Mandal, P.P.; Rezaee, R. Facies classification with different machine learning algorithm–An efficient artificial intelligence technique for improved classification. Aust. Soc. Explor. Geophys. Ext. Abstr. 2019, 2019, 1–6.
Isam, S.; Yan, L.; Zhang, Q. Unsupervised Learning for Oil Drilling Data Classification. In Proceedings of the International Conference on Automation and Computing, Portsmouth, UK, 2–4 September 2021; IEEE: New York, NY, USA, 2021; pp. 1–6.
Qian, F.; Yin, M.; Liu, X.Y.; Wang, Y.J.; Lu, C.; Hu, G.M. Unsupervised seismic facies analysis via deep convolutional autoencoders. Geophysics 2018, 83, A39–A43.
Al Kattan, W.; Jawad, S.N.A.; Jomaah, H.A. Cluster analysis approach to identify rock type in Tertiary Reservoir of Khabaz oil field case study. Iraqi J. Chem. Pet. Eng. 2018, 19, 9–13.
Al-Anazi, A.; Gates, I. A support vector machine algorithm to classify lithofacies and model permeability in heterogeneous reservoirs. Eng. Geol. 2010, 114, 267–277.
Alizadeh, B.; Najjari, S.; Kadkhodaie-Ilkhchi, A. Artificial neural network modeling and cluster analysis for organic facies and burial history estimation using well log data: A case study of the South Pars gas field, Persian Gulf, Iran. Comput. Geosci. 2012, 45, 261–269. [Google Scholar] [CrossRef]
Macqueen, J. Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, Sacramento, CA, USA, 21 June–18 July 1966; pp. 281–297. [Google Scholar]
Ward, J.H. Hierarchical Grouping to Optimize an Objective Function. J. Am. Stat. Assoc. 1963, 58, 236–244. [Google Scholar] [CrossRef]
Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA, 2–4 August 1996; pp. 226–231. [Google Scholar]
Chang, H.C.; Kopaska-Merkel, D.C.; Chen, H.C. Identification of lithofacies using Kohonen self-organizing maps. Comput. Geosci. 2002, 28, 223–229. [Google Scholar] [CrossRef]
Benabdellah, A.C.; Benghabrit, A.; Bouhaddou, I. A survey of clustering algorithms for an industrial context. Procedia Comput. Sci. 2019, 148, 291–302. [Google Scholar] [CrossRef]
Hasan, M.M.U.; Shahidi, R.; Peters, D.K.; James, L.; Gosine, R. Piecemeal Clustering: A Self-Driven Data Clustering Algorithm. IEEE Access 2022, 10, 129985–130000. [Google Scholar] [CrossRef]
Dubois, M.K.; Byrnes, A.P.; Bohling, G.C.; Doveton, J.H. Multiscale Geologic and Petrophysical Modeling of the Giant Hugoton Gas Field (Permian), Kansas and Oklahoma, U.S.A. In Giant Hydrocarbon Reservoirs of the World: From Rocks to Reservoir Characterization and Modeling; Harrism, P.M., Weber, L.J., Eds.; American Association of Petroleum Geologists: McLean, VA, USA, 2006; Volume 88, pp. 307–353. [Google Scholar] [CrossRef]
Avseth, P.; Mukerji, T.; Mavko, G. Quantitative Seismic Interpretation: Applying Rock Physics Tools to Reduce Interpretation Risk; Cambridge University Press: Cambridge, MA, USA, 2005. [Google Scholar]
Hall, M.; Hall, B. Distributed collaborative prediction: Results of the machine learning contest. Lead. Edge 2017, 36, 194–280. [Google Scholar] [CrossRef]
Theodoridis, S. Machine Learning: A Bayesian and Optimization Perspective; Academic Press: Cambridge, MA, USA, 2015. [Google Scholar]
Cortes, C.; Vapnik, V. Support-vector networkst. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
Agrawala, A.K. Learning with a probabilistic teacher. IEEE Trans. Inf. Theory 1970, 16, 373–379. [Google Scholar] [CrossRef]
Kohonen, T. The Self-Organizing Map. Proc. IEEE 1990, 78, 1464–1480. [Google Scholar] [CrossRef]
Melo, F.F.; Rio, P. Prediction of photoelectric effect log and facies classification in the Panoma gas field. In Proceedings of the International Congress of the Brazilian Geophysical Society, Rio de Janeiro, Brazil, 31 March–17 June 2021; SBGF: Rio de Janeiro, Brazil, 2021. [Google Scholar]
Dunham, R.J. Classification of carbonate rocks according to depositional textures. In Classification of Carbonate Rocks–A Symposium; Ham, W.E., Ed.; AAPG Memoir, AAPG: Tulsa, OK, USA, 1962; pp. 108–121. [Google Scholar]
Alberty, M.W.; Smolen, J.J.; Goetz, J.F.; Luthi, S.M.; Patchett, J.G.; Peveraro, R.; Hancock, N.J.; Hashmy, K.; Aguilera, R. Wireline Methods. In Development Geology Reference Manual; Morton-Thompson, D., Woods, A.M., Eds.; American Association of Petroleum Geologists: Tulsa, OK, USA, 1992; Volume 10. [Google Scholar]
Hartmann, D.J.; Beaumont, E.A. Predicting Reservoir System Quality and Performance. In Treatise of Petroleum Geology / Handbook of Petroleum Geology: Exploring for Oil and Gas Traps; AAPG Special Volumes; AAPG: Tulsa, OK, USA, 1999; Volume 9, pp. 1–154. [Google Scholar]
Puckette, J.; Boardman, D.R., II; Al-Shaieb, Z. Evidence for Sea-Level Fluctuation and Stratigraphic Sequences in the Council Grove Group (Lower Permian), Hugoton Embayment, Southern Mid-Continent. Hyne N. J. 1995, 269–290. [Google Scholar]
Alqubalee, A.; Abdullatif, O.; Babalola, L.; Makkawi, M. Characteristics of Paleozoic tight gas sandstone reservoir: Integration of lithofacies, paleoenvironments, and spectral gamma-ray analyses, Rub’ al Khali Basin, Saudi Arabia. Arab. J. Geosci. 2019, 12, 344. [Google Scholar] [CrossRef]
Phujareanchaiwon, C.; Chenrai, P.; Laitrakull, K. Interpretation and Reconstruction of Depositional Environment and Petroleum Source Rock Using Outcrop Gamma-ray Log Spectrometry From the Huai Hin Lat Formation, Thailand. Front. Earth Sci. 2021, 9, 638862. [Google Scholar] [CrossRef]
Kominz, M.A.; Patterson, K.; Odette, D. Lithology Dependence of Porosity In Slope and Deep Marine Sediments. J. Sediment. Res. 2011, 81, 730–742. [Google Scholar] [CrossRef]
Selley, R.C. Concepts and Methods of Subsurface Facies Analysis. In Concepts and Methods of Subsurface Facies Analysis; U.S. Department of Energy Office of Scientific and Technical Information: Oak Ridge, TN, USA, 1978. [Google Scholar] [CrossRef]
Ben-Hur, A.; Guyon, I. Detecting Stable Clusters Using Principal Component Analysis. In Functional Genomics: Methods and Protocols; Brownstein, M.J., Khodursky, A.B., Eds.; Humana Press: Totowa, NJ, USA, 2003; pp. 159–182. [Google Scholar] [CrossRef]
Vannoy, T. RNN DBSCAN. Available online: https://www.mathworks.com/matlabcentral/fileexchange/97924-rnn-dbscan. (accessed on 30 June 2014).
McInnes, L.; Healy, J.; Astels, S. hdbscan: Hierarchical density based clustering. J. Open Source Softw. 2017, 2, 205. [Google Scholar] [CrossRef]

Figure 1. An example of how the distance measurement can vary with the choice of measurement technique. The vector B is closer to A than to D when measured with Euclidean distance, while B is more similar to D than to A when measured with cosine similarity. This figure is taken from the authors’ presentation of the piecemeal clustering algorithm [35].
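
The contrast described in the Figure 1 caption can be reproduced with a small numerical sketch. The coordinates of A, B, and D below are hypothetical values chosen for illustration only; they are not taken from the figure.

```python
import math

def euclidean(u, v):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical coordinates, assumed for illustration only.
A = (2.0, 3.2)
B = (2.0, 2.0)
D = (5.0, 5.0)

# Euclidean distance: B is nearer to A than to D.
b_closer_to_a = euclidean(B, A) < euclidean(B, D)

# Cosine similarity: B points in the same direction as D, so it is
# more similar to D than to A even though D is farther away.
b_more_similar_to_d = cosine_similarity(B, D) > cosine_similarity(B, A)

print(b_closer_to_a, b_more_similar_to_d)
```

With these values both comparisons hold, so the ranking of neighbours depends on which measure is used.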

Figure 2. Flowchart showing the pre-clustering phase of the piecemeal clustering algorithm.

Figure 3. Flowchart showing the training phase of the piecemeal clustering algorithm.

Figure 4. An illustrative result of the pre-clustering and training phases on two-dimensional data is presented. The figure depicts a set of closely located two-dimensional points. In the pre-clustering phase, these points are organized into three distinct clusters. Subsequently, the training phase refines the positions of the cluster centers, indicated by ’#’ symbols, within each of the three clusters. This figure is taken from the authors’ presentation of the piecemeal clustering algorithm [35].

Figure 5. Confusion matrix representing the piecemeal clustering result. The predicted clusters are labeled by number on the x axis, while the actual clusters are on the y axis.

Figure 6. Predicted mapping of wire-line log signatures for the Kimzey A well. The left side of the image shows the wire-line logs; the two rightmost columns show the facies in different colors. The second rightmost column indicates the true facies, while the last column shows the predicted facies.

Table 1. The facies labels and their corresponding descriptions from the wire-line log dataset.

Facies	Name	Label	Adjacent Facies	Description	Depositional Environment	Reference
1	Non-marine sandstone	SS	2	A clastic sedimentary rock mainly composed of sand-sized minerals or rock grains.	These formations typically occur in regions with a substantial accumulation of sand, such as deserts, riverbeds, or coastal areas.	[43]
2	Non-marine coarse siltstone	CSiS	1, 3	A fine-grained sedimentary rock composed of silt-sized particles.	Deposited in semi-quiet environments such as river systems, deltas, shelves, and submarine fans and basins.	[43]
3	Non-marine fine siltstone	FSiS	2	Similar to coarse siltstone, but the silt particles are smaller in size.	These sediments are deposited in calm environments adjacent to old deltas, lakes, or seashores, where calm currents result in minimal particle suspension.	[43]
4	Marine siltstone and shale	SiSh	5	This facies includes both siltstone and shale.	Formed in calm water environments where fine particles can settle out of the transporting medium (air or water) and accumulate on the surface.	[43]
5	Mudstone	MS	4, 6	A muddy carbonate rock with less than 10% grains.	Calm water and the apparent inhibition of grain-producing organisms (low-energy depositional setting).	[43,44]
6	Wackestone	WS	5, 7, 8	A mud-supported carbonate rock with more than 10% grains. In cases where grains are exceptionally large, these carbonates are designated as “floatstones”.	Characterized by calm water conditions and a restricted presence of grain-producing organisms, indicating a low-energy depositional setting.	[43,44]
7	Dolomite	D	6, 8	A grain-supported muddy carbonate rock. Packstones indicate a range of depositional properties.	A range of depositional properties. The prevalence of mud implies the involvement of lower-energy processes, while the abundance of grains indicates the influence of higher-energy processes.	[43,44]
8	Packstone–grainstone	PS	6, 7, 9	Mud-free carbonate rocks, which are grain-supported.	Moderate- to high-energy environments.	[43,44]
9	Phylloid-algal bafflestone	BS	7, 8	These organisms do not construct a framework or bind the sediments together. Instead, they provide protected areas for the sediment to accumulate by baffling the currents.	Organisms baffle currents to provide protected areas for sediments to accumulate.	[43,44]

Table 2. Magnitude variation among marine and non-marine data points.

Log	NM_M Indicator	Minimum	Maximum	Mean	Standard Deviation	Median
GR	NM	26.06	183.38	74.89	15.23	73.72
GR	M	10.15	361.15	55.81	37.17	46.98
ILD_log10	NM	−0.03	0.97	0.53	0.16	0.54
ILD_log10	M	−0.01	1.80	0.78	0.26	0.80
DeltaPHI	NM	−19.90	19.26	5.37	6.15	5.80
DeltaPHI	M	−21.83	19.31	3.49	4.10	3.10
PHIND	NM	3.95	84.40	16.76	7.70	14.90
PHIND	M	0.55	47.72	9.90	4.53	8.93
PE	NM	0.20	5.10	3.18	0.48	3.20
PE	M	1.61	8.09	4.30	0.76	4.30

Table 3. Accuracy of identification of individual facies.

Facies	Accuracy (%)	Accuracy with Adjacent (%)
Non-marine sandstone (SS)	75.55	85.09
Non-marine coarse siltstone (CSiS)	42.24	100.00
Non-marine fine siltstone (FSiS)	48.69	79.19
Marine siltstone and shale (SiSh)	41.50	48.22
Mudstone (MS)	35.28	64.72
Wackestone (WS)	32.25	79.46
Dolomite (D)	77.12	80.39
Packstone–grainstone (PS)	21.58	81.32
Phylloid-algal bafflestone (BS)	81.42	91.15
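
The two accuracy columns in Table 3 can be illustrated with a short sketch. It assumes that "accuracy with adjacent" counts a prediction as correct when it matches either the true facies or one of that facies' adjacent facies listed in Table 1; this definition is an assumption made here, and the label sequences are made up for illustration.

```python
# Adjacent facies, transcribed from the "Adjacent Facies" column of Table 1.
ADJACENT = {
    1: {2}, 2: {1, 3}, 3: {2}, 4: {5}, 5: {4, 6},
    6: {5, 7, 8}, 7: {6, 8}, 8: {6, 7, 9}, 9: {7, 8},
}

def accuracy(true_labels, pred_labels):
    """Percentage of predictions that match the true facies exactly."""
    hits = sum(t == p for t, p in zip(true_labels, pred_labels))
    return 100.0 * hits / len(true_labels)

def accuracy_with_adjacent(true_labels, pred_labels):
    """Percentage of predictions equal to the true facies or a neighbour of it."""
    hits = sum(t == p or p in ADJACENT[t]
               for t, p in zip(true_labels, pred_labels))
    return 100.0 * hits / len(true_labels)

# Made-up facies numbers, for illustration only.
true_facies = [1, 2, 2, 6, 8, 9]
pred_facies = [1, 3, 2, 8, 4, 7]

exact = accuracy(true_facies, pred_facies)                   # 2 of 6 exact matches
relaxed = accuracy_with_adjacent(true_facies, pred_facies)   # 5 of 6 within adjacency
print(f"{exact:.2f} {relaxed:.2f}")
```

Only the prediction of facies 4 for a true facies 8 is counted as wrong under the relaxed measure, because 4 is not adjacent to 8 in Table 1.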

Table 4. Prediction for each well from Piecemeal Clustering.

Well Name	Total Number of Logs	Number of Non-Marine Logs	Correctly Predicted Non-Marine Logs (Without Adjacent)	Correctly Predicted Non-Marine Logs (With Adjacent)	Number of Marine Logs	Correctly Predicted Marine Logs (Without Adjacent)	Correctly Predicted Marine Logs (With Adjacent)	Overall Accuracy (%) (Without Adjacent)	Overall Accuracy (%) (With Adjacent)
Shrimplin	468	238	118	208	230	81	169	42.52	80.56
Alexander D	463	208	123	206	255	71	184	41.90	84.23
Shankle	449	295	112	242	154	59	109	38.08	78.17
Luke G U	454	241	120	152	213	94	159	47.14	68.50
Kimzey A	431	168	102	159	263	108	251	48.72	95.13
Cross H Cattle	491	337	149	314	154	51	105	40.73	85.34
Nolan	411	190	109	186	221	64	154	42.09	82.73
Recruit F9	68	0	0	0	68	55	61	80.88	89.71
Newby	456	176	117	175	280	94	186	46.27	79.17
Churchman Bible	393	115	66	104	278	153	222	55.73	82.95
Total	4084	1968	1016	1756	2116	830	1600	45.20	81.90
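
As a reading aid for Table 4, the per-well overall accuracy appears to be the number of correctly predicted non-marine and marine logs divided by the total number of logs. The sketch below applies that assumed formula to two rows of the table; the function name is hypothetical.

```python
def overall_accuracy(total_logs, nm_correct, m_correct):
    """Overall accuracy (%) from correct non-marine and marine counts.

    Assumed formula: 100 * (nm_correct + m_correct) / total_logs.
    """
    return round(100.0 * (nm_correct + m_correct) / total_logs, 2)

# Kimzey A row of Table 4: 431 logs in total.
kimzey_without = overall_accuracy(431, 102, 108)   # without adjacent facies
kimzey_with = overall_accuracy(431, 159, 251)      # with adjacent facies

# Shrimplin row of Table 4: 468 logs in total.
shrimplin_without = overall_accuracy(468, 118, 81)
shrimplin_with = overall_accuracy(468, 208, 169)

print(kimzey_without, kimzey_with, shrimplin_without, shrimplin_with)
```

Under this assumption the computed values agree with the tabulated 48.72 and 95.13 for Kimzey A and 42.52 and 80.56 for Shrimplin.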

Table 5. Lithofacies identification result; comparison between Piecemeal Clustering and other data clustering algorithms.

Metric	Piecemeal Clustering	K-Means	SOM	DBSCAN	RNN-DBSCAN	HDBSCAN
Total number of clusters identified	9	9	9	8	11	11
Number of unique facies identified	9	5	5	5	4	7
Number of unique non-marine facies identified	3	2	2	2	1	2
Number of unique marine facies identified	6	3	3	3	3	5
Non-marine facies mapping accuracy (%)	49.21	37.37	37.39	45.73	30.06	44.61
Marine facies mapping accuracy (%)	32.26	17.66	17.42	27.48	23.09	29.51
Total accuracy (%)	40.64	27.15	27.03	36.26	26.44	36.78
Total adjacent accuracy (%)	82.59	55.25	55.24	79.58	53.38	77.69
Noise size	-	-	-	206	134	394

Table 6. Lithofacies identification result; comparison between Piecemeal Clustering and semi-supervised classification techniques. The results of semi-supervised algorithms are from [4].

Metric	GNB	SVM (with CV)	XGB (with Default)	LP (Default)	Self-Trained LP (Default)	Piecemeal Clustering
NM facies accuracy (%)	49.21	50.41	54.40	51.07	59.81	51.68
M facies accuracy (%)	32.26	36.60	34.67	34.46	42.22	39.21
Total accuracy (%)	40.64	43.43	44.43	42.67	50.92	45.20
F1-score (%)	35.39	39.50	42.33	35.56	49.35	46.96
Total adjacent accuracy (%)	82.59	84.13	83.88	82.34	88.67	81.90

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
