Shape Carving Methods of Geologic Body Interpretation from Seismic Data Based on Deep Learning

: The task of seismic data interpretation is a time-consuming and uncertain process. Machine learning tools can help to build a shortcut between raw seismic data and reservoir characteristics of interest. Recently, techniques involving convolutional neural networks have started to gain momentum. Convolutional neural networks are particularly efﬁcient at pattern recognition within images, and this is why they are suitable for seismic facies classiﬁcation and interpretation tasks. We experimented with three different architectures based on convolutional layers and compared them with different synthetic and ﬁeld datasets in terms of quality of the seismic interpretation results and computational efﬁciency. The architectures used in our study were three deep fully convolutional architectures: a 3D convolutional network with a fully connected head; a 2D fully convolutional network, and U-Net. We found the U-Net architecture to be both robust and the fastest when performing classiﬁcation at the prediction stage. The 3D convolutional model with a fully connected head was the slowest, while a fully convolutional model was unstable in its predictions.


Introduction
Building a 3D model of the subsurface based on information gained from seismic reflection data and conditioned to well data is an important step in the conventional seismicto-simulation workflow. Obtaining facies distribution in 3D helps to better understand the overall depositional environment and significantly reduce uncertainty in further stages of reservoir modeling and simulation. There are, however, well-known limitations associated with this workflow.
The conventional approach to building a discrete facies reservoir model with seismic data [1] involves interpreting seismic data, building a 3D cell model of a reservoir, distributing its properties with geostatistical algorithms, and eventually using these properties to obtain the prediction of reservoir performance. An alternative approach would be to perform seismic inversion if necessary data are available and use the results to estimate a facies distribution directly. A problem with both of these approaches is that assumptions and approximations are introduced at each stage of the impedance inversion along with interpreter bias, and the process requires significant manual work. The comparative study's purpose was to see what could be interpreted directly from the seismic traces without any inversion. Another issue is the tradeoff between reproducing statistical information from existing hard data and obtaining geologically sound results when applying geostatistical algorithms. Overall, obtaining a meaningful and realistic result in the form of a 3D reservoir model using the conventional workflow is a challenging and time-consuming task.
The task of facies distribution estimation directly from seismic data using machine learning methods has been addressed by researchers. Some of the first to apply a machine learning algorithm for reservoir characterization include [2,3]. In [3], PCA was used to reduce the dimensionality of data, and a fully connected neural network was applied to classify seismic facies. Multiple works addressing the seismic facies classification task feature application of unsupervised algorithms of various complexity [4][5][6]. Some researchers have approached the task of facies classification with linear models [7,8]. In [9][10][11] analysis and comparison of the results of different models was performed. One of the first applications of a convolutional neural network (CNN) to seismic data was published in [12]. Later, [13] applied a modified U-Net architecture to the public F3 dataset using 41 labeled sections for training. In [14], the authors classified facies without supervision using a deep convolutional autoencoder.
Deep learning architectures based on convolutional layers are one of the most popular tools that have been used to address this challenge. Fully convolutional networks (FCN) [15], dilated convolutional networks [16], and U-Net [17] are examples of deep architectures that were successfully applied to seismic data. As there are currently a variety of different architectures that show different performances on different datasets and in different settings, we perform a comparative study of the results and performance of three different architectures for seismic interpretation. This work is a comparative study to identify the advantages and disadvantages of using different kinds of state-of-the-art networks in different conditions. Though all architectures are based on convolutional layers, the networks take data of different dimensionality as input (either 2D or 3D), have a different number of trainable parameters, and behave differently during both the training and inference. The datasets they were applied to are also diverse-2D and 3D real datasets from different depositional environments-which further diversifies the experiment. In addition, one of the key challenges that need to be taken into account when applying machine learning algorithms to seismic data-that of differences in statistical distributions of inline and crossline training and test sets-is isolated and highlighted, which may be a topic for further investigations.
The three architectures tested are 2D convolutional network with dilated convolution, 3D convolutional network, and a U-Net architecture. Tests are done on three different datasets: a synthetic 2D dataset and two 3D field seismic datasets. The sensitivity analysis of architecture hyperparameters was performed with the fully convolutional architecture using accuracy as a target parameter. This work is intended as a test and assessment of supervised algorithms. The seismic interpretations obtained from human experts were taken as given and were used as labels in a supervised learning setting. The expert interpretations are themselves subject to uncertainty and human biases, but estimating the quality of human expert interpretations and uncertainty associated with them was not a focus of this work. We take the expert labels as a given and compare the performance of supervised machine learning algorithms with respect to the human expert interpretation. The following sections describe briefly the three datasets and describe the different computational tests and their results.

Materials and Methods
In this section, we provide a description of the datasets and the deep learning architectures used in the experiments.

Datasets
We used three datasets, described below, to test the different deep learning algorithms.

The Synthetic Dataset
The synthetic dataset that is featured in this study was modeled based on the Stanford VI synthetic model [18]. The dataset is an array of synthetically generated 2D normalincidence seismic sections. The Stanford VI facies model corresponds to a prograding fluvial channel system. The facies model was populated with petrophysical properties, and based on those properties, forward modeling of post-stack seismic data was performed [18]. The dataset featured 5000 training examples, 500 validations, and 500 testing examples. A single pair of training example and a corresponding label is shown in Figure 1. The numbers shown along axes are the number of samples. The facies model is simulated in depth, and the corresponding seismic data is modeled in time.
The synthetic dataset that is featured in this study was modeled based on the Stanford VI synthetic model [18]. The dataset is an array of synthetically generated 2D normalincidence seismic sections. The Stanford VI facies model corresponds to a prograding fluvial channel system. The facies model was populated with petrophysical properties, and based on those properties, forward modeling of post-stack seismic data was performed [18]. The dataset featured 5000 training examples, 500 validations, and 500 testing examples. A single pair of training example and a corresponding label is shown in Figure 1. The numbers shown along axes are the number of samples. The facies model is simulated in depth, and the corresponding seismic data is modeled in time.
(a) Synthetic seismic section (b) Facies model

The F3 Dataset
The F3 dataset [19] was obtained in the Dutch sector of the North Sea. The exploration target was the Upper Jurassic-Lower Cretaceous strata. The interpretation based on qualities of seismic reflectors was part of the F3 dataset provided by dGB Earth Sciences. These data were interpreted through points partially covering the seismic section. The dataset featured a 3D seismic cube which was used as the input data for models in this work. The inline 339 from the F3 dataset along with interpretation points and description of classes is shown in Figure 2.

The F3 Dataset
The F3 dataset [19] was obtained in the Dutch sector of the North Sea. The exploration target was the Upper Jurassic-Lower Cretaceous strata. The interpretation based on qualities of seismic reflectors was part of the F3 dataset provided by dGB Earth Sciences. These data were interpreted through points partially covering the seismic section. The dataset featured a 3D seismic cube which was used as the input data for models in this work. The inline 339 from the F3 dataset along with interpretation points and description of classes is shown in Figure 2.
The gaps were filled in the existing interpretation of the inline 339 and the crossline 500 was interpreted to perform tests on it. Final interpretation of seismic section is shown in Figure 3.
In the experiments, 85% of examples from the inline 339 were used as a training set and the remaining 15% served as a validation set. The crossline 500 were used as a test set. In Figure 4, a plot is shown that compares distributions of the inline and crossline data (IL339 and XL500) using the multidimensional scaling (MDS) technique [20]. Each point represents a single seismic trace coming from seismic sections being compared in the two-dimensional MDS space. We can see qualitatively that the inline and crossline traces had somewhat different distributions. This plot will be referred to in the following sections.

The RIPED Dataset
Another 3D post-stack real dataset was provided by the Research Institute of Petroleum Exploration and Development, PetroChina (here called the RIPED dataset). The interpretation of seismic data was performed by interpreters from RIPED and used in our experiments. The main feature of the RIPED dataset was channel bodies that can be seen on seismic sections, and especially on stratigraphic slices. Stratigraphic slices used to highlight channels were obtained by flattening the seismic cube in time on the bottom of the stratigraphic zone and extracting seismic data along constant time planes. The interpretation of the horizon used for flattening was also provided by RIPED. The main channel system is represented by channels with N-S orientation, and the additional system is oriented in the NW-SE direction. The facies identified by RIPED interpreters within the zone of interest were underwater distributary channel, estuary dam, sheet sand, and distal bar.  The unlabeled parts were either labeled as background facies or were not taken into account in the training dataset, depending on the algorithm. In the test dataset, unlabeled samples were ignored.
In the experiments, 85% of examples from the inline 339 were used as a training set and the remaining 15% served as a validation set. The crossline 500 were used as a test set. In Figure 4, a plot is shown that compares distributions of the inline and crossline data (IL339 and XL500) using the multidimensional scaling (MDS) technique [20]. Each point represents a single seismic trace coming from seismic sections being compared in the twodimensional MDS space. We can see qualitatively that the inline and crossline traces had somewhat different distributions. This plot will be referred to in the following sections.

The RIPED Dataset
Another 3D post-stack real dataset was provided by the Research Institute of Petroleum Exploration and Development, PetroChina (here called the RIPED dataset). The interpretation of seismic data was performed by interpreters from RIPED and used in our experiments. The main feature of the RIPED dataset was channel bodies that can be seen on seismic sections, and especially on stratigraphic slices. Stratigraphic slices used to highlight channels were obtained by flattening the seismic cube in time on the bottom of the stratigraphic zone and extracting seismic data along constant time planes. The interpretation of the horizon used for flattening was also provided by RIPED. The main channel system is represented by channels with N-S orientation, and the additional system is The unlabeled parts were either labeled as background facies or were not taken into account in the training dataset, depending on the algorithm. In the test dataset, unlabeled samples were ignored.  The unlabeled parts were either labeled as background facies or were not taken into account in the training dataset, depending on the algorithm. In the test dataset, unlabeled samples were ignored.
In the experiments, 85% of examples from the inline 339 were used as a training set and the remaining 15% served as a validation set. The crossline 500 were used as a test set. In Figure 4, a plot is shown that compares distributions of the inline and crossline data (IL339 and XL500) using the multidimensional scaling (MDS) technique [20]. Each point represents a single seismic trace coming from seismic sections being compared in the twodimensional MDS space. We can see qualitatively that the inline and crossline traces had somewhat different distributions. This plot will be referred to in the following sections.

The RIPED Dataset
Another 3D post-stack real dataset was provided by the Research Institute of Petroleum Exploration and Development, PetroChina (here called the RIPED dataset). The interpretation of seismic data was performed by interpreters from RIPED and used in our experiments. The main feature of the RIPED dataset was channel bodies that can be seen on seismic sections, and especially on stratigraphic slices. Stratigraphic slices used to highlight channels were obtained by flattening the seismic cube in time on the bottom of the stratigraphic zone and extracting seismic data along constant time planes. The interpretation of the horizon used for flattening was also provided by RIPED. The main channel system is represented by channels with N-S orientation, and the additional system is In an attempt to reduce the cost of seismic data labeling, a small amount of seismic data were interpreted. An example of a partially interpreted seismic section is shown in Figure 5. The "background" facies shown in the figure represent any other facies that were not interpreted as either channels or a fault zone. Eight inlines and two crosslines were interpreted similarly. In this setting, points obtained for one of the crosslines were used as a test set, and the remaining points were split into training and validation sets using an 85%/15% ratio.
Additionally, interpretation of stratigraphic slices was performed to form another dataset for training and testing. Two stratigraphic seismic slices were labeled to be used for training, while another slice was chosen as a test set. An interpreted slice extracted at the time −1034 ms is shown in Figure 6. Here, one of the interpreted slices was used as the test set, and the examples from the other two were split with a ratio of 85%/15% to form training and validation sets. In an attempt to reduce the cost of seismic data labeling, a small amount of seism data were interpreted. An example of a partially interpreted seismic section is shown Figure 5. The "background" facies shown in the figure represent any other facies that w not interpreted as either channels or a fault zone. Eight inlines and two crosslines w interpreted similarly. In this setting, points obtained for one of the crosslines were u as a test set, and the remaining points were split into training and validation sets using 85%/15% ratio. Figure 5. Partial interpretation of the IL365 from the RIPED 3D cube (a) along with the initial seis section (b). The goal of using this kind of interpretation was to test the limits of the algorithms w reducing labeling costs.
Additionally, interpretation of stratigraphic slices was performed to form anot dataset for training and testing. Two stratigraphic seismic slices were labeled to be u for training, while another slice was chosen as a test set. An interpreted slice extracted the time −1034 ms is shown in Figure 6. Here, one of the interpreted slices was used as test set, and the examples from the other two were split with a ratio of 85%/15% to fo training and validation sets. Interpretation was done by interpreters at RIPED. Though there is uncertainty associated with labels, especially where the faulted zone and channels overlap, the goal of this work was to test the supervised algorithm, so human interpretation uncertainty was not taken into account.

Architectures
Three convolutional deep neural network models were featured in this study. All three models showed good results when applied to seismic data [13,21,22]; however, they have different numbers of parameters and costs of training. To understand advantages and disadvantages of the architectures when applied to different datasets, three deep Interpretation was done by interpreters at RIPED. Though there is uncertainty associated with labels, especially where the faulted zone and channels overlap, the goal of this work was to test the supervised algorithm, so human interpretation uncertainty was not taken into account.

Architectures
Three convolutional deep neural network models were featured in this study. All three models showed good results when applied to seismic data [13,21,22]; however, they have different numbers of parameters and costs of training. To understand advantages and disadvantages of the architectures when applied to different datasets, three deep learning architectures were compared in this study, and their descriptions are given in this section.

Fully Convolutional 2D Network with Dilated Convolutional Layers
The fully convolutional 2D network with dilated convolutional layers architecture was based on [21]. In [21], the authors employ a 3D model, while in this work we used a 2D version of this model. The network consisted of two parts: the first part was based on ordinary 2D convolutional layers and the second part was built with 2D convolutional layers with a dilation factor [23] that essentially controlled the increase of the field of view of convolutional filters. Each convolutional layer was followed by a batch normalization layer [24] to stabilize the distribution of the output of the preceding layer and a ReLU activation function [25] to introduce nonlinearity. The final activation function was a softmax activation, which is suitable for a multiclass classification task and outputs the probability of each class. Since the model was fully convolutional, the input was 2D and the output was also 2D, so it mapped a seismic section to a segmented section, where each pixel had a value of the class with maximum predicted probability obtained by learning latent representations of the input. A schematic architecture is shown in Figure 7. . Stratigraphic seismic slice extracted at the time −1034 ms (a) and the corresponding label (b). Interpretation was done by interpreters at RIPED. Though there is uncertainty associated with labels, especially where the faulted zone and channels overlap, the goal of this work was to test the supervised algorithm, so human interpretation uncertainty was not taken into account.

Architectures
Three convolutional deep neural network models were featured in this study. All three models showed good results when applied to seismic data [13,21,22]; however, they have different numbers of parameters and costs of training. To understand advantages and disadvantages of the architectures when applied to different datasets, three deep learning architectures were compared in this study, and their descriptions are given in this section.

Fully Convolutional 2D Network with Dilated Convolutional Layers
The fully convolutional 2D network with dilated convolutional layers architecture was based on [21]. In [21], the authors employ a 3D model, while in this work we used a 2D version of this model. The network consisted of two parts: the first part was based on ordinary 2D convolutional layers and the second part was built with 2D convolutional layers with a dilation factor [23] that essentially controlled the increase of the field of view of convolutional filters. Each convolutional layer was followed by a batch normalization layer [24] to stabilize the distribution of the output of the preceding layer and a ReLU activation function [25] to introduce nonlinearity. The final activation function was a softmax activation, which is suitable for a multiclass classification task and outputs the probability of each class. Since the model was fully convolutional, the input was 2D and the output was also 2D, so it mapped a seismic section to a segmented section, where each pixel had a value of the class with maximum predicted probability obtained by learning latent representations of the input. A schematic architecture is shown in Figure 7.  A number of the model parameters treated as hyperparameters are shown in Table 1. The number of filters in all convolutional layers was kept constant across the model. If the max pooling parameter was "true", every other convolutional layer was followed by a max pooling layer with the 2 × 2 kernel, and after dilated convolutional layers there were upconvolutional layers to preserve image size. Telescopic architecture requires decreasing kernel size by two and increasing the number of filters by a factor of two for each convolutional layer (without the dilation factor). Training samples were constructed by extracting rectangular regions of the size equal to the "training example size" parameter value with horizontal and vertical overlaps set by the "training example overlap" value in percent from interpreted seismic sections.

The 3D Convolutional Network
This architecture consisted of 3D convolutional layers with a fully connected head. It was based on the MalenoV tool [26], which was later improved [22], with the resulting architecture shown in Figure 8. The architecture consisted of three convolutional 3D and three fully connected layers with two max pooling layers after the first and the third convolutional layers. The activation functions used were ReLU activations; the final activation was a softmax activation. Batch normalization layers were introduced after fully connected layers. Three-dimensional convolutional layers were designed to work with 3D data, while fully connected layers were capable of processing 1D vectors. Thus, to feed the output of convolutional layers to the first fully connected layer, we flattened it, making it one dimensional. The final prediction was done by a layer consisting of the only node, which, topped with a softmax function, gave the probability of each class, and the class with the maximum probability was given as the output. To leverage spatial awareness of 3D layers, around each of the points used for training or prediction a small 3D subcube of seismic data was taken to form a single training example. Prediction was done for each seismic sample individually in contrast to the model described above. However, to inform the model of each sample's surroundings, a small 3D volume of data was taken into account.  [22]. ReLU activation was used after each convolutional and fully connected layer except the last one. The final activation was softmax. The input is 3D and the output is a single value (class index).
With the sparse sampling scheme introduced in [22], every other sample was taken into account along each dimension; therefore, the initial size of 65 × 65 × 65 was reduced to 33 × 33 × 33. In this way, the computational time required was reduced significantly with a negligible decrease in accuracy.

The U-Net Architecture
The U-Net architecture [17] was used to tackle the challenge of facies classification. The implementation in Python was adopted from [27] and follows the originally proposed architecture configuration. The architecture was fully convolutional, meaning its input is 2D and output is also 2D. The architecture is shown in Figure 9. It had clear encoder and decoder paths, with the encoder reducing the size of the input and the decoder decompressing the latent space representation back to the original size. Each block except the last block had two convolutional 2D layers followed by an additional layer depending on the paths: for an encoder, it was a max pooling layer, and for a decoder, it was a transpose convolutional 2D layer [28]. Each max pooling layer reduces the size of the input by a factor of two along both height and width dimensions, which reduces the complexity of the following operations while retaining the most important details of the input. Twodimensional transpose convolution both doubles the input size along two dimensions and calculates the inverse of a convolution operation with a weight matrix, so it also learns some useful features. An additional feature of the U-Net architecture is skip connections Figure 8. The 3D convolutional architecture. Based on [22]. ReLU activation was used after each convolutional and fully connected layer except the last one. The final activation was softmax. The input is 3D and the output is a single value (class index).
With the sparse sampling scheme introduced in [22], every other sample was taken into account along each dimension; therefore, the initial size of 65 × 65 × 65 was reduced to 33 × 33 × 33. In this way, the computational time required was reduced significantly with a negligible decrease in accuracy.

The U-Net Architecture
The U-Net architecture [17] was used to tackle the challenge of facies classification. The implementation in Python was adopted from [27] and follows the originally proposed architecture configuration. The architecture was fully convolutional, meaning its input is 2D and output is also 2D. The architecture is shown in Figure 9. It had clear encoder and decoder paths, with the encoder reducing the size of the input and the decoder decompressing the latent space representation back to the original size. Each block except the last block had two convolutional 2D layers followed by an additional layer depending on the paths: for an encoder, it was a max pooling layer, and for a decoder, it was a transpose convolutional 2D layer [28]. Each max pooling layer reduces the size of the input by a factor of two along both height and width dimensions, which reduces the complexity Energies 2022, 15, 1064 9 of 23 of the following operations while retaining the most important details of the input. Twodimensional transpose convolution both doubles the input size along two dimensions and calculates the inverse of a convolution operation with a weight matrix, so it also learns some useful features. An additional feature of the U-Net architecture is skip connections that concatenate outputs from encoder path blocks with decoder path outputs, thus mixing low-level and high-level features, which is designed to enhance the result. Hyperparameter values that were considered are shown in Table 2. Rectangular patches of the size equal to a "training example size" value wer tracted from initial slices to form a training dataset with overlap equal to a "trainin ample size overlap" value.

Investigating the Validation-Test Accuracy Gap
During our experiments, we observed a clear and consistent gap between valid and test accuracy, which in some cases was significant. The phenomenon was espe prominent when working with the F3 dataset. This issue is usually caused by the d ence in distributions of test data and training data, which is indeed the case in this s As exhibited in Figure 4, despite a significant overlap between distributions, th (crossline) and training (inline) data clearly differed from each other.
To confirm that the validation/test accuracy gap is caused by differences in dist tions and to better understand how the gap is affected by this difference, the follo experiment was performed. In the comparative study, we used an inline to construc training and validation sets and a crossline to form the test set, which caused the dist tion difference. We updated the datasets by exchanging some number of examples co from inline and crossline to reduce the difference in distributions between them. In of using only examples extracted from the inline to form training and validation set substituted some of them with examples extracted from the crossline. The same proce was applied to the test set: a number of examples from the crossline that were previ used as test data and moved to the training/test set were substituted by the same nu of inline examples from those that were no longer used for the training and valid Hyperparameter values that were considered are shown in Table 2. Rectangular patches of the size equal to a "training example size" value were extracted from initial slices to form a training dataset with overlap equal to a "training example size overlap" value.

Investigating the Validation-Test Accuracy Gap
During our experiments, we observed a clear and consistent gap between validation and test accuracy, which in some cases was significant. The phenomenon was especially prominent when working with the F3 dataset. This issue is usually caused by the difference in distributions of test data and training data, which is indeed the case in this study. As exhibited in Figure 4, despite a significant overlap between distributions, the test (crossline) and training (inline) data clearly differed from each other.
To confirm that the validation/test accuracy gap is caused by differences in distributions and to better understand how the gap is affected by this difference, the following experiment was performed. In the comparative study, we used an inline to construct the training and validation sets and a crossline to form the test set, which caused the distribution difference. We updated the datasets by exchanging some number of examples coming from inline and crossline to reduce the difference in distributions between them. Instead of using only examples extracted from the inline to form training and validation sets, we substituted some of them with examples extracted from the crossline. The same procedure was applied to the test set: a number of examples from the crossline that were previously used as test data and moved to the training/test set were substituted by the same number of inline examples from those that were no longer used for the training and validation sets. Thus, adding to the training and validation sets, some crossline data and some inline data to the test set, we reduced the difference in validation and test distributions and examined how the results compared to the initial data.

Results and Discussion
This section describes the numerical experiments run with different deep learning architectures on the different datasets and discusses their results.

Experiments with the Fully Convolutional 2D Network with Dilated Convolutional Layers
The architecture was first tested on the synthetic dataset to validate its ability to handle the task. The result of predicting the test samples is shown in Figure 10. The resulting test accuracy was 0.92, which indicates that the architecture was suitable for handling the task. data to the test set, we reduced the difference in validation and test distributions and examined how the results compared to the initial data.

Results and Discussion
This section describes the numerical experiments run with different deep learning architectures on the different datasets and discusses their results.

Experiments with the Fully Convolutional 2D Network with Dilated Convolutional Layers
The architecture was first tested on the synthetic dataset to validate its ability to handle the task. The result of predicting the test samples is shown in Figure 10. The resulting test accuracy was 0.92, which indicates that the architecture was suitable for handling the task. The first real dataset the architecture was applied to was the F3 dataset. Hyper parameter tuning was performed by Monte Carlo sampling [29] parameters which were uniformly distributed over a set of possible distinct values. Fifty different sets of parameters were drawn to initialize and train 50 architectures. Prediction results varied significantly between different initializations. The most and least accurate results of predicting facies distribution of the crossline 500 are shown in Figure 11. The test accuracy of the most accurate prediction was 0.85, and that of the least accurate prediction was 0.16. Attempts were made to improve the stability of the model by regularizing it (using Dropout layers [30] with different rates and adding max pooling layers), but they did not have a noticeable effect. Lack of stability may indicate that this network does not have enough expressive power, which may be solved by adding layers, increasing the number of weights, or adjusting the architecture in some other ways-essentially, changing the model. The first real dataset the architecture was applied to was the F3 dataset. Hyper parameter tuning was performed by Monte Carlo sampling [29] parameters which were uniformly distributed over a set of possible distinct values. Fifty different sets of parameters were drawn to initialize and train 50 architectures. Prediction results varied significantly between different initializations. The most and least accurate results of predicting facies distribution of the crossline 500 are shown in Figure 11. The test accuracy of the most accurate prediction was 0.85, and that of the least accurate prediction was 0.16. Attempts were made to improve the stability of the model by regularizing it (using Dropout layers [30] with different rates and adding max pooling layers), but they did not have a noticeable effect. Lack of stability may indicate that this network does not have enough expressive power, which may be solved by adding layers, increasing the number of weights, or adjusting the architecture in some other ways-essentially, changing the model. The history of validation loss and accuracy of the prediction with the highest test accuracy is shown in Figure 12. From the graphs, we concluded that the model did not overfit, and the validation accuracy achieved was close to 1. However, there is a clear gap between the validation accuracy and the test accuracy achieved during the training; the validation accuracy was close to 1, while the highest test accuracy was 0.85. To investigate how validation and test distribution difference affects this gap, the test described in 2.3 was performed.
The experiment was performed three times by exchanging 10%, 20%, and 30% of patches from the crossline with patches from the inline. The MDS plot comparing the The history of validation loss and accuracy of the prediction with the highest test accuracy is shown in Figure 12. From the graphs, we concluded that the model did not overfit, and the validation accuracy achieved was close to 1. The history of validation loss and accuracy of the prediction with the highest test accuracy is shown in Figure 12. From the graphs, we concluded that the model did not overfit, and the validation accuracy achieved was close to 1. However, there is a clear gap between the validation accuracy and the test accuracy achieved during the training; the validation accuracy was close to 1, while the highest test accuracy was 0.85. To investigate how validation and test distribution difference affects this gap, the test described in 2.3 was performed.
The experiment was performed three times by exchanging 10%, 20%, and 30% of patches from the crossline with patches from the inline. The MDS plot comparing the However, there is a clear gap between the validation accuracy and the test accuracy achieved during the training; the validation accuracy was close to 1, while the highest test The experiment was performed three times by exchanging 10%, 20%, and 30% of patches from the crossline with patches from the inline. The MDS plot comparing the distributions of test and training samples after exchanging 30% of samples is shown in Figure 13. The mean Mahalanobis distance [31] between the samples in the test set and the distribution of the training set was reduced from 2.90 for the baseline case (corresponding to the points in Figure 4 to 2.66 for the data after exchanging 30% of samples. The difference was reduced, and training samples became more representative of the test set distribution. distributions of test and training samples after exchanging 30% of samples is shown in Figure 13. The mean Mahalanobis distance [31] between the samples in the test set and the distribution of the training set was reduced from 2.90 for the baseline case (corresponding to the points in Figure (4) to 2.66 for the data after exchanging 30% of samples. The difference was reduced, and training samples became more representative of the test set distribution. The accuracies obtained over five runs for each scenario are shown in Table 3. Each column in the table corresponds to a particular number of patches from the crossline and the inline mixed to form a training set and a test set (baseline-no mixing). The number of patches extracted from the crossline and placed in the training set was calculated as a fraction (0.1, 0.2, 0.3) of the total number of patches being extracted from the crossline. The same number of patches was taken from the inline and placed in the test set. It was clearly shown that accuracy increases with an increasing amount of mixed data, which provides evidence validating the initial hypothesis. Additionally, for the F3 dataset, sensitivity analysis was performed to identify which parameters affect prediction accuracy. The technique used to perform the analysis was distance-based generalized sensitivity analysis (DGSA) [32], and it was based on unsupervised clustering of all the response factors being considered into N classes (three in this case) and calculating L1-norm distance between the prior cumulative distribution function (CDF) and the CDF of each cluster for each parameter. If a confidence interval for a parameter centered around 1 did not overlap with its Pareto bar, the parameter was marked as insensitive, otherwise it was marked as sensitive [32].
The result of the sensitivity analysis is shown in Figure 14. The sensitive parameters (shown as blue bars) are the number of convolutional filters, number of dilated layers, batch size, and window size. Unexpectedly, kernel size and number of convolutional layers were insensitive, even though they had a direct effect on the overall number of parameters of the model. The number of dilated layers is more important than the number of The accuracies obtained over five runs for each scenario are shown in Table 3. Each column in the table corresponds to a particular number of patches from the crossline and the inline mixed to form a training set and a test set (baseline-no mixing). The number of patches extracted from the crossline and placed in the training set was calculated as a fraction (0.1, 0.2, 0.3) of the total number of patches being extracted from the crossline. The same number of patches was taken from the inline and placed in the test set. It was clearly shown that accuracy increases with an increasing amount of mixed data, which provides evidence validating the initial hypothesis. Additionally, for the F3 dataset, sensitivity analysis was performed to identify which parameters affect prediction accuracy. The technique used to perform the analysis was distance-based generalized sensitivity analysis (DGSA) [32], and it was based on unsupervised clustering of all the response factors being considered into N classes (three in this case) and calculating L1-norm distance between the prior cumulative distribution function (CDF) and the CDF of each cluster for each parameter. If a confidence interval for a parameter centered around 1 did not overlap with its Pareto bar, the parameter was marked as insensitive, otherwise it was marked as sensitive [32].
The result of the sensitivity analysis is shown in Figure 14. The sensitive parameters (shown as blue bars) are the number of convolutional filters, number of dilated layers, batch size, and window size. Unexpectedly, kernel size and number of convolutional layers were insensitive, even though they had a direct effect on the overall number of parameters of the model. The number of dilated layers is more important than the number of regular layers, which may be one reason that the dilation factor in the convolutional kernel allows it to learn more relevant and descriptive features.  Finally, the experiment was performed on the RIPED dataset. As input interpretatio and corresponding labels were sparse, a binary mask was introduced to mask unlabele pixels to prevent the network from learning from pixels with no real labels. The concep and the code were adapted from [33]. The underlying concept was simply multiplying th output from each convolutional layer by 0 where input pixels do not have labels. Dat augmentation (horizontal flipping and addition of Gaussian noise) was utilized to in crease the number of examples and make the network more robust.
Again, 50 different hyper parameter sets were formed to run 50 independent trainin and prediction iterations. Examples of predicting the IL500 are shown in Figure 15. Th results were unsatisfactory as there were no hyper parameter sets that resulted in a satis factory prediction. Finally, the experiment was performed on the RIPED dataset. As input interpretation and corresponding labels were sparse, a binary mask was introduced to mask unlabeled pixels to prevent the network from learning from pixels with no real labels. The concept and the code were adapted from [33]. The underlying concept was simply multiplying the output from each convolutional layer by 0 where input pixels do not have labels. Data augmentation (horizontal flipping and addition of Gaussian noise) was utilized to increase the number of examples and make the network more robust.
Again, 50 different hyper parameter sets were formed to run 50 independent training and prediction iterations. Examples of predicting the IL500 are shown in Figure 15. The results were unsatisfactory as there were no hyper parameter sets that resulted in a satisfactory prediction.
As mentioned previously, the most likely reason for these poor results was that the amount of training data was very limited, and only part of the distribution represented in the input data was labeled. In addition, channels on seismic sections are very hard to identify precisely, even for human interpreters.
When trained on stratigraphic slices from the same RIPED dataset, however, the model shows a significantly better performance. Though it did not capture all of the small details, it did produce a comprehensive result, achieving a 0.82 test accuracy. The result is shown in Figure 16. As mentioned previously, the most likely reason for these poor results was that the amount of training data was very limited, and only part of the distribution represented in the input data was labeled. In addition, channels on seismic sections are very hard to identify precisely, even for human interpreters.
When trained on stratigraphic slices from the same RIPED dataset, however, the model shows a significantly better performance. Though it did not capture all of the small details, it did produce a comprehensive result, achieving a 0.82 test accuracy. The result is shown in Figure 16.

Experiments with the 3D Convolutional Network
The tool was applied to the synthetic data to assess its performance. To make a 3D model applicable to 2D data, each 2D line was repeated to form the third dimension. The test accuracy obtained was 0.74, and an example of the prediction result is shown in Figure   Figure 16. Prediction (b) and an interpreted label (c) obtained with the slice-1029 (a) from the RIPED dataset.

Experiments with the 3D Convolutional Network
The tool was applied to the synthetic data to assess its performance. To make a 3D model applicable to 2D data, each 2D line was repeated to form the third dimension. The test accuracy obtained was 0.74, and an example of the prediction result is shown in Figure 17. This result was not very accurate, but it captured key details. Since the data was 2D in nature, the 3D model was restricted in its capabilities, which is why this result was poor compared to the other predictions. Then, an experiment was performed on the F3 dataset. The goal of the experiment was to find an optimal subcube size, treating it as a hyper parameter, and then relate it to some real geological or spatial characteristics of the data. This would allow for the estimation of the subcube size given a new dataset rather than perform hyper parameter tuning again.
To compare results obtained with different subcube sizes, values from 5 to 61 voxels with a step of four were considered for both horizontal and vertical dimensions. The resulting accuracies varied, but not significantly; almost all of the accuracies were within the 0.75 to 0.9 range. In Figure 18, facies classification of the XL500 is shown for different subcube sizes. Overall, the bigger the input subcube, the smoother the predicted result. If a subcube size is too small, the result becomes very noisy (b). However, if the subcube size is big, the result is smooth, but some details may be lost. The dominating facies tends to spread its influence even further (d). Then, an experiment was performed on the F3 dataset. The goal of the experiment was to find an optimal subcube size, treating it as a hyper parameter, and then relate it to some real geological or spatial characteristics of the data. This would allow for the estimation of the subcube size given a new dataset rather than perform hyper parameter tuning again.
To compare results obtained with different subcube sizes, values from 5 to 61 voxels with a step of four were considered for both horizontal and vertical dimensions. The resulting accuracies varied, but not significantly; almost all of the accuracies were within the 0.75 to 0.9 range. In Figure 18, facies classification of the XL500 is shown for different subcube sizes. Overall, the bigger the input subcube, the smoother the predicted result. If a subcube size is too small, the result becomes very noisy (b). However, if the subcube size is big, the result is smooth, but some details may be lost. The dominating facies tends to spread its influence even further (d).
To make geological sense of the subcube size parameter, an experiment was performed to relate it to the two-point correlation (variogram) ranges obtained from the seismic data. Variogram ranges were estimated from the seismic data as 68 samples in the inline direction, 40 samples in the crossline direction, and four samples in the vertical direction. With a sparse sampling scheme (dropping every other sample), the horizontal subcube size based on the variogram range was 35 × 21; the vertical size was increased to nine samples to avoid a very noisy result. The result is shown in Figure 19. The prediction made with a subcube size based on the variogram ranges was significantly more detailed than the base case, which could be potentially attributed to geological features. The test accuracy obtained with the variogram-sized subcube was 0.785. This shows that the two-point correlation of the data can be a reasonable guide to set the parameter value for the input subcube size. To make geological sense of the subcube size parameter, an experiment was performed to relate it to the two-point correlation (variogram) ranges obtained from the seismic data. Variogram ranges were estimated from the seismic data as 68 samples in the inline direction, 40 samples in the crossline direction, and four samples in the vertical direction. With a sparse sampling scheme (dropping every other sample), the horizontal subcube size based on the variogram range was 35 × 21; the vertical size was increased to nine samples to avoid a very noisy result. The result is shown in Figure 19. The prediction made with a subcube size based on the variogram ranges was significantly more detailed than the base case, which could be potentially attributed to geological features. The test accuracy obtained with the variogram-sized subcube was 0.785. This shows that the twopoint correlation of the data can be a reasonable guide to set the parameter value for the input subcube size. Figure 19. Facies classification for the XL500 with a variogram range sized subcube (a) and a 33 × 33 × 33 subcube (b). Prediction obtained with a variogram-based subcube has more features and details which could be related to actual geological features.  To make geological sense of the subcube size parameter, an experiment was performed to relate it to the two-point correlation (variogram) ranges obtained from the seismic data. Variogram ranges were estimated from the seismic data as 68 samples in the inline direction, 40 samples in the crossline direction, and four samples in the vertical direction. With a sparse sampling scheme (dropping every other sample), the horizontal subcube size based on the variogram range was 35 × 21; the vertical size was increased to nine samples to avoid a very noisy result. The result is shown in Figure 19. The prediction made with a subcube size based on the variogram ranges was significantly more detailed than the base case, which could be potentially attributed to geological features. The test accuracy obtained with the variogram-sized subcube was 0.785. This shows that the twopoint correlation of the data can be a reasonable guide to set the parameter value for the input subcube size.  The next experiment involved using the RIPED dataset. The subcube size, which is the amount of data around each seismic sample taken into consideration, was a hyperparameter, and 15 different sizes were tested. The network was trained for two epochs, and the best result of predicting the IL540 is shown in Figure 20. The next experiment involved using the RIPED dataset. The subcube size, which is the amount of data around each seismic sample taken into consideration, was a hyperparameter, and 15 different sizes were tested. The network was trained for two epochs, and the best result of predicting the IL540 is shown in Figure 20. In this case, there were no fully labeled examples, so the accuracy was not calculated. Blobs classified as channels in the central part of the sections corresponded to the actual channels circled on the seismic section, which were identified as channels by interpreters from RIPED. It can be seen from predicted results that the network misinterpreted most in areas where there were no labels and especially in the faulted zone, which was expected. Overall, channels were correctly but imprecisely predicted by the model; however, the majority of channel bodies were the result of the model being confused by artifacts in the faulted zone.
The model was also applied to stratigraphic slices extracted from the RIPED cube. The result is shown in Figure 21, the best test accuracy obtained was 0.80. In the result, the key details were captured, but it lacked precision. One possible explanation for this is the lack of expressive power of the model. In this case, there were no fully labeled examples, so the accuracy was not calculated. Blobs classified as channels in the central part of the sections corresponded to the actual channels circled on the seismic section, which were identified as channels by interpreters from RIPED. It can be seen from predicted results that the network misinterpreted most in areas where there were no labels and especially in the faulted zone, which was expected. Overall, channels were correctly but imprecisely predicted by the model; however, the majority of channel bodies were the result of the model being confused by artifacts in the faulted zone.
The model was also applied to stratigraphic slices extracted from the RIPED cube. The result is shown in Figure 21, the best test accuracy obtained was 0.80. In the result, the key details were captured, but it lacked precision. One possible explanation for this is the lack of expressive power of the model. In the prediction, the key details are captured, but it is overall not very accurate.

Experiments with the U-Net Architecture
First, the same exercise of testing the architecture on the synthetic dataset was performed. An extreme case was tested, with only 13 examples for training and two examples for testing. The prediction result is shown in Figure 22. Though not very accurate, the predicted facies distribution patterns still captured the true facies distribution. In the prediction, the key details are captured, but it is overall not very accurate.

Experiments with the U-Net Architecture
First, the same exercise of testing the architecture on the synthetic dataset was performed. An extreme case was tested, with only 13 examples for training and two examples for testing. The prediction result is shown in Figure 22. Though not very accurate, the predicted facies distribution patterns still captured the true facies distribution. When the entire synthetic training dataset was used to train the model, a test accuracy of 0.95 was obtained, which is the best among all three models. The result is shown in Figure 23.  When the entire synthetic training dataset was used to train the model, a test accuracy of 0.95 was obtained, which is the best among all three models. The result is shown in Figure 23.
The model was then used to predict facies distribution in the F3 dataset. As there was a single training section available, a patch extraction process was performed to form a training set. The best accuracy obtained was 0.86, and the result is shown in Figure 24. Several factors could have affected the performance negatively: a significant number of classes to learn and lack of data corresponding to each of the classes; overall lack of data for training such a deep model; and the difference in distributions between the inline used for training and the crossline used for testing.
Investigating the validation accuracy graph in Figure 25, the issue described in Section 2.3 can be identified. The model achieved very high validation accuracy but significantly lower test accuracy on the same F3 dataset.
The same experiment was performed by mixing data patches from inline and crossline to balance out the distributions of the training set and the test set, and the results are summarized in Table 4 (structured in the same way as the Table 3). Again, the trend of test accuracy increasing with an increasing amount of mixed data is observed. This confirms that the distribution shift accounted for at least a significant share of the validation-test accuracy gap. The experiment also highlights the fact that the distribution shift is a major problem when applying machine learning algorithms to seismic data: seismic data is usually spatially heterogeneous; therefore, the distribution of data changed throughout the cube or from inline to crossline seismic sections. When the entire synthetic training dataset was used to train the model, a test accuracy of 0.95 was obtained, which is the best among all three models. The result is shown in Figure 23.  The model was then used to predict facies distribution in the F3 dataset. As there was a single training section available, a patch extraction process was performed to form a training set. The best accuracy obtained was 0.86, and the result is shown in Figure 24. Several factors could have affected the performance negatively: a significant number of classes to learn and lack of data corresponding to each of the classes; overall lack of data for training such a deep model; and the difference in distributions between the inline used for training and the crossline used for testing. Investigating the validation accuracy graph in Figure 25, the issue described in Section 2.3 can be identified. The model achieved very high validation accuracy but significantly lower test accuracy on the same F3 dataset. The same experiment was performed by mixing data patches from inline and cross- The model was then used to predict facies distribution in the F3 dataset. As there was a single training section available, a patch extraction process was performed to form a training set. The best accuracy obtained was 0.86, and the result is shown in Figure 24. Several factors could have affected the performance negatively: a significant number of classes to learn and lack of data corresponding to each of the classes; overall lack of data for training such a deep model; and the difference in distributions between the inline used for training and the crossline used for testing. Investigating the validation accuracy graph in Figure 25, the issue described in Section 2.3 can be identified. The model achieved very high validation accuracy but significantly lower test accuracy on the same F3 dataset. The same experiment was performed by mixing data patches from inline and crossline to balance out the distributions of the training set and the test set, and the results are summarized in Table 4 (structured in the same way as the Table 3). Again, the trend of test accuracy increasing with an increasing amount of mixed data is observed. This con-  When applied to the RIPED dataset, with sparsely labeled vertical sections as training data, the model did not yield any meaningful results. The model was then applied to stratigraphic slices from the RIPED dataset. Patch extraction and data augmentation by flipping left-right and adding Gaussian noise were utilized.
In Figure 26, examples of predictions made with different hyperparameters are shown. Overall, in the results, all the main features visible in seismic data were captured. The prediction in (d) was more precise and more detailed than the one shown in (c), which was expected given its deeper network with more weights and longer training. The image in (d) also has some details not present in the interpretation itself. The accuracy achieved is 0.847.  When applied to the RIPED dataset, with sparsely labeled vertical sections as training data, the model did not yield any meaningful results. The model was then applied to stratigraphic slices from the RIPED dataset. Patch extraction and data augmentation by flipping left-right and adding Gaussian noise were utilized.
In Figure 26, examples of predictions made with different hyperparameters are shown. Overall, in the results, all the main features visible in seismic data were captured. The prediction in (d) was more precise and more detailed than the one shown in (c), which was expected given its deeper network with more weights and longer training. The image in (d) also has some details not present in the interpretation itself. The accuracy achieved is 0.847. The prediction in (d) is much more detailed than the one in (c), even some details not present in the manual interpretation were captured. Figure 27 shows results from the interpretation of multiple stratigraphic slices, visualized as isosurfaces delineating the channel geobodies. The interpreted volume was 20 ms in the vertical size. In Section 4 the comparison of training times and prediction times for the different architectures is shown. The good performance of the U-Net architectures came at the cost of a much longer training time. However, once trained, its predictions were faster than the other architectures. The geobody interpretation shown in Figure 27 was obtained in less than a minute. The prediction in (d) is much more detailed than the one in (c), even some details not present in the manual interpretation were captured. Figure 27 shows results from the interpretation of multiple stratigraphic slices, visualized as isosurfaces delineating the channel geobodies. The interpreted volume was 20 ms in the vertical size. In Section 4 the comparison of training times and prediction times for the different architectures is shown. The good performance of the U-Net architectures came at the cost of a much longer training time. However, once trained, its predictions were faster than the other architectures. The geobody interpretation shown in Figure 27 was obtained in less than a minute.

Conclusions
A comparison of architectures in terms of accuracy values on different datasets was performed; its results are shown in Table 5. The performance of the 3D convolutional architecture on the synthetic data, which is 2D in nature, was worse than on the 3D datasets. This, together with the fact that in the synthetic dataset there are 5000 training examples and test data comes from the same distribution as training data, may indicate that the 3D convnet is sensitive to the size and dimensionality of input examples. Fully convolutional architectures showed good performance on all the datasets. The key difference between the dilated FCN and the U-Net was that the former was very sensitive to hyper parameter changes, while the latter was overall robust. The validation/test accuracy gap highlighted in this study emphasizes the problem that is likely to arise when applying ML models to seismic data. Due to geological heterogeneity, it may be impossible to obtain human-level performance on a part of the seismic cube that a model has not seen, even when a significant amount of training data is available for another part of a cube if there are differences in the distributions of the two parts.
Performance comparison in terms of the prediction time is shown in Table 6. The 3D convolutional model is a robust solution, but its predictions take a long time, as it predicts one voxel at a time. Numerical results showed that the 2-point correlation of the data (variogram range) can be a reasonable guide to set the parameter value for the input subcube size around the prediction voxel. The performance of the fully convolutional dilated architecture was similar to that of the U-Net, but it was much more sensitive to changes in hyperparameters. U-Net showed good results (~0.85 accuracy) when trained on stratigraphic slices, and was overall consistent in its predictions. A universal robust solution working with a small amount of training data and sparsely labeled data is yet to be discovered.

Conclusions
A comparison of architectures in terms of accuracy values on different datasets was performed; its results are shown in Table 5. The performance of the 3D convolutional architecture on the synthetic data, which is 2D in nature, was worse than on the 3D datasets. This, together with the fact that in the synthetic dataset there are 5000 training examples and test data comes from the same distribution as training data, may indicate that the 3D convnet is sensitive to the size and dimensionality of input examples. Fully convolutional architectures showed good performance on all the datasets. The key difference between the dilated FCN and the U-Net was that the former was very sensitive to hyper parameter changes, while the latter was overall robust. The validation/test accuracy gap highlighted in this study emphasizes the problem that is likely to arise when applying ML models to seismic data. Due to geological heterogeneity, it may be impossible to obtain human-level performance on a part of the seismic cube that a model has not seen, even when a significant amount of training data is available for another part of a cube if there are differences in the distributions of the two parts.
Performance comparison in terms of the prediction time is shown in Table 6. The 3D convolutional model is a robust solution, but its predictions take a long time, as it predicts one voxel at a time. Numerical results showed that the 2-point correlation of the data (variogram range) can be a reasonable guide to set the parameter value for the input subcube size around the prediction voxel. The performance of the fully convolutional dilated architecture was similar to that of the U-Net, but it was much more sensitive to changes in hyperparameters. U-Net showed good results (~0.85 accuracy) when trained on stratigraphic slices, and was overall consistent in its predictions. A universal robust solution working with a small amount of training data and sparsely labeled data is yet to be discovered.  Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.