1. Introduction
Gas hydrate (GH) in natural reservoirs carries high uncertainty regarding its kinetic behavior, geomechanical stability, and economic feasibility. For these reasons, several countries, including South Korea, Japan, the United States, and India [1,2,3,4,5,6], have encountered difficulties in continuing research and development (R&D) or conducting field test production, despite their earlier R&D efforts. In particular, one location containing GH reserves, the East Sea of South Korea, is a challenging field for GH production owing to its sparse GH distribution, well stability problems, uncertain GH dissociation, and a current energy ecosystem that is incompatible with GH [7].
To address these issues, many studies have investigated GH behavior and sought to reduce the uncertainty in the general characteristics of GH [8,9,10,11,12]. One method, X-ray computerized tomography (CT), scans a target GH sample during experiments to infer how the internal fluids behave in porous media, because what happens inside a GH sample is difficult to observe directly [13,14,15]. In GH experiments, production rates are difficult to measure accurately owing to the dead volume between the measurement equipment and the GH sample and to flow delay in the sample or in the flow line. Because it avoids these limitations, X-ray CT scanning could be an appropriate method for investigating fluid behavior in a GH sample and inferring its approximate trend.
The Korea Institute of Geoscience and Mineral Resources (KIGAM) used X-ray CT images to quantitatively analyze the depressurization velocity and critical GH saturation (SGH,C) during depressurization of a GH sample. In the KIGAM experiments, normalization of CT values was effective for quantitative analysis of approximate GH behavior, but the CT values combined the behaviors of three phases (i.e., water, gas, and GH), so no significant difference between the different depressurization velocities could be determined.
Therefore, each phase must be identified for more accurate analysis in GH experiments. In particular, it is key to distinguish water from GH, which is difficult because their densities are similar. Dependable identification of GH saturation will, in turn, allow an optimal depressurization parameter to be identified with higher reliability than is possible with normalized CT values alone. Our previous study demonstrated that machine learning can be applied reliably to GH saturation identification based on X-ray CT images [16]. In that study, the machine-learning methods used CT images as input and saturation values as output; in particular, random forest (RF) achieved over 95% correlation between the original and predicted data for both the training and test data. However, that study used only 960 training data, without thorough filtering or selection of CT images for each of the phase saturations. Furthermore, this amount of data seems insufficient to cover the full range of GH trends.
In most machine-learning applications in petroleum engineering, the number of training data is several hundred or at most a few thousand [17,18,19,20,21,22,23,24,25,26,27,28], as was also noted in a previous paper [16]. In contrast, the number of training data exceeds ten thousand and can reach one million in computer-science-centered applications [29]. Nevertheless, a large number of training data does not by itself guarantee reliable training performance when inappropriate data are mixed into the data pool. For better machine learning, either unqualified data should be removed or differing types of data should be properly separated based on domain knowledge [28].
SGH,C can be a suitable standard for separating or categorizing given GH data because it strongly differentiates GH behavior. In Gil et al. [30] and KIGAM [31], numerical analyses of GH dissociation behavior were conducted to find SGH,C. Gil et al. [30] suggested an SGH,C of ~50–60% based on simulation results that mimic the environment of the Ulleung Basin, East Sea, South Korea. Although SGH,C has been narrowed down to this range of GH saturation, more accuracy is still required to better describe GH saturation for the Ulleung Basin.
Reliable saturation identification of GH is therefore needed, as it leads to more accurate SGH,C measurements and descriptions of GH behavior. This calls for a proper application of machine learning, owing to its ability to draw meaningful conclusions or lessons from given data. Considering the previous machine-learning applications in petroleum engineering and the necessity of proper data construction, this paper suggests how to separate given X-ray CT images by SGH,C for machine-learning applications. In addition, it analyzes how data quantity and quality (or construction) affect machine-learning performance.
Section 2 explains how the GH experiments were conducted and how X-ray CT scanning was performed during the experiments. In addition, the procedures of data acquisition and data preprocessing are described together with the applied machine-learning method, RF.
Section 3 presents the data of the given CT images and compares them with the previous data [16].
Section 4 reports the saturation identification results for four cases to analyze the effect of data construction in terms of quantity and quality.
Section 5 presents conclusions and identifies guidelines for the next steps of GH saturation identification.
3. Construction of Training Data
Figure 6 shows an example of normalized CT values and their distribution in the five experimental stages. In the DRY and SAT stages, all normalized CT values are either 0 or 1 according to Equation (1) and Figure 3 [16]. In contrast, the IWS, GH, and GTW stages each have their own distribution, shown in histogram form in Figure 6b. The GH sample is marked with yellow dotted circles in these three stages, and each histogram covers only the area inside the circle. From IWS to GTW, the average normalized CT value increases from one stage to the next because the overall density increases through the five stages of the experimental design (Figure 2). The normalized CT values approach 0.9–1, similar to the densities of GH and water (Table 1).
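As a rough illustration of why the DRY and SAT stages collapse to fixed values, the sketch below assumes that Equation (1) is a per-voxel min-max normalization between the DRY and SAT scans, with DRY mapped to 0 and SAT mapped to 1. That form is an assumption made here for illustration and is not confirmed by this section; the function name `normalize_ct` and the CT numbers are hypothetical.

```python
def normalize_ct(ct, ct_dry, ct_sat):
    """Min-max normalize one voxel's CT value between its DRY and SAT values.

    Assumed form (not confirmed by the text): (CT - CT_dry) / (CT_sat - CT_dry).
    """
    if ct_sat == ct_dry:
        raise ValueError("DRY and SAT CT values must differ")
    return (ct - ct_dry) / (ct_sat - ct_dry)


# Hypothetical CT numbers for a single voxel, for illustration only.
ct_dry, ct_sat = 1200.0, 1700.0
stages = {"DRY": 1200.0, "IWS": 1450.0, "GH": 1600.0, "SAT": 1700.0}

normalized = {name: normalize_ct(value, ct_dry, ct_sat)
              for name, value in stages.items()}
print(normalized)
```

Under this assumed form, intermediate stages fall between 0 and 1 in proportion to how far their CT value lies between the two reference scans.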
Figure 7 describes the eight GH samples used in the experiment and their distributions of normalized CT values in the GH stage. The first to third samples are categorized into the low GH saturation group, while the fourth to eighth samples are assigned to the high GH saturation group. Separation into these two groups was straightforward because of the obvious difference in SGH values, which were around 40% and 50%, respectively. Figure 7b presents the distributions of normalized CT values for the first slice of the eight GH samples, with black dots indicating the mean of each distribution. Generally, the low GH saturation group has lower averages of the normalized CT values and the high GH saturation group has higher averages. We divide the SGH values into forty- and fifty-percent groups, which are taken as the conventional and critical values, respectively, because SGH,C was expected to be ~50% for the target GH sample in this study. For values close to SGH,C, the production efficiency of GH can be drastically reduced owing to slow pressure propagation in porous media.
The eight GH samples are randomly indexed after they are categorized into the two groups. In this study, the eight GH samples are arranged into four cases (Cases 1–4) to analyze data construction and machine-learning performance. Case 1 comprises all eight GH samples, and Cases 2 and 3 comprise the low and high SGH groups, respectively. Case 4 consists of the odd-numbered GH samples (1, 3, 5, and 7) as a randomly constructed data collection that draws evenly on low and high SGH. Case 4 is limited to these four GH samples so that its amount of training data is comparable to that of Cases 2 and 3, which gives the fairest comparison.
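The case construction described above can be sketched in a few lines of Python. The sample indices follow the text (samples 1 to 3 in the low SGH group, samples 4 to 8 in the high SGH group); the variable names are illustrative and not taken from the study.

```python
# Sample indices follow the text: samples 1-3 form the low-SGH group,
# samples 4-8 the high-SGH group.
LOW_SGH = [1, 2, 3]
HIGH_SGH = [4, 5, 6, 7, 8]
ALL_SAMPLES = LOW_SGH + HIGH_SGH

cases = {
    "Case 1": ALL_SAMPLES,                             # all eight samples
    "Case 2": LOW_SGH,                                 # low SGH only
    "Case 3": HIGH_SGH,                                # high SGH only
    "Case 4": [s for s in ALL_SAMPLES if s % 2 == 1],  # odd-numbered samples
}

# Count how many samples each case draws from each group.
composition = {
    name: (sum(s in LOW_SGH for s in samples),
           sum(s in HIGH_SGH for s in samples))
    for name, samples in cases.items()
}

for name, samples in cases.items():
    n_low, n_high = composition[name]
    print(f"{name}: samples={samples}, low={n_low}, high={n_high}")
```

With this indexing, Case 4 draws two samples from each group, which matches the stated intent of mixing low and high SGH evenly.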
Table 2 shows the averaged GH saturation and CT values of each GH sample and of Cases 1–4. Orange-colored cells indicate the GH samples used in each of the four cases. The means of Cases 1 and 4 are similar to each other because both are composed of GH samples from the low and high SGH groups. The difference between the averaged SGH of Cases 2 and 3 is ~11%, which could give a significant contrast in GH behavior.
Figure 8 and Table 3 explain how the training and test data are divided and constructed for Cases 1–4. Figure 8 illustrates that four test sets are randomly selected, one from each data pool, following conventional machine-learning procedure. Case 1 uses all eight GH samples, and its test set consists of a random 10% of these data. This test set of Case 1 is set as the universal test set, which serves as the common standard for comparing all four cases with each other. The test sets for Cases 2, 3, and 4 are random 10% sets from each entire pool, represented by green, orange, and blue lines, respectively, in the figure. RF is trained for all four cases, and the number of training data used is presented in Table 3. Each GH sample has 320 data points, the product of 5 stages and 64 slices (Figure 2 and Figure 3). Case 1 therefore has 2560 data points, the product of 8 GH samples and 320 data points per sample. The number of training data is then 2304, which is 2560 minus the 256 test data, and the same procedure is applied to the other three cases. The detailed training conditions of RF are shown in Table 3. The number of properties, 219, is determined by the square root of the total number of CT values, 47,996.
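The data counts above can be reproduced with a short Python sketch. The per-case sample counts (8, 3, 5, and 4) and the 10% test fraction come from the text; the helper name `split_sizes`, the random seed, and the use of `random.sample` to draw test indices are assumptions for illustration, not the authors' procedure.

```python
import math
import random

STAGES, SLICES = 5, 64
POINTS_PER_SAMPLE = STAGES * SLICES  # 320 data points per GH sample
TEST_FRACTION = 0.10
SAMPLES_PER_CASE = {"Case 1": 8, "Case 2": 3, "Case 3": 5, "Case 4": 4}


def split_sizes(n_samples):
    """Return (total, test, train) counts for a pool of n_samples GH samples."""
    total = n_samples * POINTS_PER_SAMPLE
    test = round(total * TEST_FRACTION)
    return total, test, total - test


sizes = {case: split_sizes(n) for case, n in SAMPLES_PER_CASE.items()}

# Draw the random 10% test indices for Case 1 (the universal test set).
rng = random.Random(0)  # seed chosen arbitrarily for reproducibility
total, n_test, _ = sizes["Case 1"]
universal_test_idx = sorted(rng.sample(range(total), n_test))

# "Number of properties" from the text: square root of the 47,996 CT values.
n_properties = round(math.sqrt(47_996))

for case, (total, test, train) in sizes.items():
    print(f"{case}: total={total}, test={test}, train={train}")
print("number of properties:", n_properties)
```

The training counts obtained this way (2304, 864, 1440, and 1152) agree with the figures quoted for Cases 1–4 elsewhere in the paper.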
4. Results and Discussion
Figure 9, Figure 10, Figure 11 and Figure 12 show the RF results for the training and test sets of Cases 1–4, respectively. Figure 9a, Figure 10a, Figure 11a and Figure 12a show the training sets and Figure 9b, Figure 10b, Figure 11b and Figure 12b show the test sets, with the columns listing SW, SGH, and SG in order. In both the (a) and (b) panels, the first row is a scatterplot and the second row displays the same results as a histogram. In the first row, the X-axis indicates the original saturation values and the Y-axis indicates the predicted (modeled) values. Blue dots indicate individual data samples, and a darker color indicates that more data points overlap at that position. Therefore, although some data appear to deviate from the diagonal line, the coefficient of determination can still be high (SW of Figure 9b). Correlation coefficients are calculated and presented at the top of all charts. In the histograms of the second row, the blue dotted boxes represent the predicted saturation values and the red solid-line boxes represent the original saturations.
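To illustrate why a few visibly deviating points need not lower the coefficient of determination much, the sketch below computes R² under its standard definition, 1 − SS_res/SS_tot. The paper does not state which formula or library was used, so this definition is an assumption, and the saturation values are hypothetical.

```python
def r_squared(original, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(original) != len(predicted) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(original) / len(original)
    ss_res = sum((y - p) ** 2 for y, p in zip(original, predicted))
    ss_tot = sum((y - mean) ** 2 for y in original)
    if ss_tot == 0:
        raise ValueError("original values have zero variance")
    return 1.0 - ss_res / ss_tot


# Hypothetical saturations: twenty points on the diagonal, one clear outlier.
original = [0.0] * 10 + [0.5] * 10 + [0.4]
predicted = [0.0] * 10 + [0.5] * 10 + [0.25]

r2 = r_squared(original, predicted)
print(round(r2, 3))
```

Because most points sit exactly on the diagonal, the single outlier contributes little to the residual sum of squares relative to the total variance, so R² stays close to 1.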
Figure 9 covers the largest number and widest range of data because all eight GH samples are included. The coefficient of determination (R²) of the training data is ~0.99 for all of the variables: SW, SGH, and SG. In particular, SG shows a considerably better fit between the original and predicted values than SW and SGH. This is because the densities of water and GH are relatively similar, which leads to little difference in the X-ray CT images and normalized CT values. The density of gas, however, is much lower than that of water or GH, which causes a clear difference between the normalized CT values of gas and those of the other phases. This makes the prediction of gas saturation easier than the prediction of water and GH saturations. This phenomenon was also identified in our previous study [16].
Figure 10 and Figure 11 contrast with each other in terms of SGH. These charts were expected to show different SGH distributions because the two cases are separated by the SGH criterion. In both Cases 2 and 3, the overall machine-learning performance is satisfactory, considering that all R² values are greater than 0.99 and the scattered dots lie along the diagonal line in an orderly fashion. However, in terms of SGH, Case 2 covers a relatively wide range of 0–0.6, whereas Case 3 mostly shows either 0 or ~0.5. In Figure 11, the scattered dots deviate from the diagonal line especially near 0.4, which is a comparatively low SGH. Case 3 was trained on larger SGH values than Case 2, which explains why it behaves this way given its data composition.
Figure 12 shows decent overall performance except for the test set of SGH, whose R² of 0.89 is the lowest value. Case 1 likewise has its smallest R² value, 0.92, in the test set for SGH. In Cases 2 and 3, SGH also shows the smallest R², which suggests that the most challenging part of the process is the identification of SGH for the test sets. The density of water remains at approximately 1 g/cc regardless of the experimental stage. However, the density of GH depends strongly on the pressure and temperature of the three experimental stages IWS, GH, and GTW (i.e., the last column of Table 1). Therefore, SGH affects the normalized CT values differently in each of these experimental stages. This could lead to more complex relationships between SGH and the normalized CT values and, consequently, to greater difficulty in machine-learning training. For this reason, the methodology introduced in this study should continue to be pursued.
In Figure 13 and Figure 14, the four trained RF models, one for each of the four cases, are tested using the universal test set. The test results shown in Figure 13a and Figure 14a are identical to those of Figure 9b. In most machine-learning studies, trained performance is evaluated mainly with errors and correlation coefficients between the original and predicted data in the test data set. In this study, the four trained RF models are tested together on the universal test set for a consistent analysis. According to one previous study, SGH,C was estimated to lie between approximately 50% and 60% [31].
Interestingly, according to the SGH results shown in Figure 13b,c, an obvious boundary appears at ~50% SGH, where the red dotted lines are drawn to check whether any trend is related to SGH,C. In Figure 13b, the SGH values over 50% are poorly matched compared with Figure 13a,c,d. On the other hand, to the left of the red line in Figure 13b, the SGH values below 50% are fitted comparatively well. This contrast is further emphasized by the SGH results to the left of the red line in Figure 13c.
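One way to quantify the boundary seen in the scatterplots is to compute the error separately on each side of the ~50% SGH line. The sketch below does this with the mean square error; the threshold default, the function names, and the SGH values (in percent) are hypothetical and only mimic a model trained on low SGH data.

```python
def mse(original, predicted):
    """Mean square error between paired original and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(original, predicted)) / len(original)


def mse_by_threshold(original, predicted, threshold=50.0):
    """Return (MSE below threshold, MSE at or above threshold).

    A side that contains no data yields None.
    """
    below = [(y, p) for y, p in zip(original, predicted) if y < threshold]
    above = [(y, p) for y, p in zip(original, predicted) if y >= threshold]

    def side_mse(pairs):
        if not pairs:
            return None
        ys, ps = zip(*pairs)
        return mse(ys, ps)

    return side_mse(below), side_mse(above)


# Hypothetical SGH values in percent, for illustration only.
original = [0.0, 20.0, 40.0, 45.0, 52.0, 55.0, 58.0]
predicted = [0.0, 21.0, 39.0, 44.0, 46.0, 47.0, 48.0]

low_mse, high_mse = mse_by_threshold(original, predicted)
print(low_mse, high_mse)
```

A large gap between the two values would correspond to the poor matching on one side of the red line that the figures show qualitatively.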
Furthermore, it should be noted that Case 2 shows deviated results for SGH over 50% (Figure 13b and Figure 14b), even though some of its training data lie near 50% SGH (Figure 10a). This indicates an additional effect other than the range of training data values, namely the critical GH saturation. Thus, we can expect GH samples to behave distinctly according to the SGH,C setting. Although one specific GH sample could contain some SGH values near 40% or 50%, the trend of GH behavior would depend strongly on the SGH,C decided as an experimental condition. Based on that possibility, we can infer that GH behavior changes radically at SGH,C, which is evaluated as ~50% in this study.
Cases 2, 3, and 4 are comparable to each other in terms of the absolute number of training data (864, 1440, and 1152, respectively; Table 3), all of which are close to 1000. On this comparable basis, Case 4 gives less biased SGH results than Cases 2 and 3 (Figure 13d). On both the left and right sides of the red line, the data generally follow the diagonal line.
Table 4 organizes the mean square error (MSE) results of Cases 1–4 for both the training and test sets and the universal test set. The MSEs are computed as follows:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
where $n$ is the number of training or test data, $y_i$ is the original $i$th data, and $\hat{y}_i$ represents the predicted data for the $i$th original data.
Table 4 presents the MSEs corresponding to each data set, fluid phase, and case, and also shows the averaged MSEs for an overall comparison of the training data, the random 10% test data, and the universal test set. Among the three phase saturations, SG has the lowest errors. As shown in Figure 9, Figure 10, Figure 11, Figure 12 and Figure 13, SW and SGH have the larger MSEs. As is typical in machine learning, the MSEs of the training data are lower than those of the two test sets.
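The MSE calculation described above can be sketched as follows, assuming the standard definition (the mean of the squared differences between original and predicted values). Averaging the per-phase MSEs with equal weight is only one plausible reading of the "averaged MSEs" and is an assumption here; the saturation values (in percent) are hypothetical and are not taken from Table 4.

```python
def mse(original, predicted):
    """Mean square error: average of squared original-minus-predicted gaps."""
    if len(original) != len(predicted) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(original)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(original, predicted)) / n


# Hypothetical saturations in percent for the three phases.
original = {"SW": [50.0, 40.0], "SGH": [30.0, 50.0], "SG": [20.0, 10.0]}
predicted = {"SW": [48.0, 43.0], "SGH": [34.0, 47.0], "SG": [20.0, 11.0]}

per_phase = {phase: mse(original[phase], predicted[phase]) for phase in original}

# Assumed averaging: unweighted mean over the three phases.
average_mse = sum(per_phase.values()) / len(per_phase)

print(per_phase, average_mse)
```

In this toy example the gas phase has the smallest error, mirroring the ordering reported in the text, but the numbers themselves carry no experimental meaning.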
Meaningful lessons can be drawn from the comparison of all four cases. First, consider the effect of the absolute number of data on machine-learning performance (Cases 1 and 4). Typically, a larger number of training data would be expected to give better performance, as long as other conditions such as features and algorithms are appropriate [35,36]. However, although Cases 1 and 4 differ by a factor of two in the number of data, their MSEs are on nearly the same scale without a critical difference (39 and 31.5 in the universal test). The results of Cases 1 and 4 therefore suggest that the data distributions of these cases are similar to each other, such that they also produce MSEs of similar scale. This indicates that how the data are constructed is as important as the absolute number of data for machine-learning performance. Second, if only a limited number of data can be obtained, it would be ideal to focus on a specific SGH range. Cases 2 and 3 show lower MSEs than Cases 1 and 4 in the random 10% test with a similar number of training data, which means that specialized, targeted data construction can be strategically advantageous for machine-learning performance. Third, the ratio of the MSEs of Cases 2 and 3 is about 5:3, which is similar to the ratio of the data left out of Cases 2 and 3, respectively: Case 2 uses only three of the eight GH samples, leaving five GH samples outside its data, whereas Case 3 leaves three. The less related the data, the larger the MSE.
5. Conclusions
This paper proposed the saturation identification of the water, GH, and gas phases in GH samples based on a machine-learning method in consideration of SGH,C. Moreover, the effects of training data quantity and quality on RF were analyzed in four cases. Compared with our previous related study [16], this study used five additional GH samples, amounting to 1600 additional data. Owing to these extra data, we could categorize the samples into low and high SGH groups and determine how the number of data and SGH,C affect the overall machine-learning performance.
This study validated the significant influence of SGH,C in cases where the training data consist of low and high SGH groups. The difference in the average MSE of the random 10% test (Table 4) between Cases 1 and 4 (10.9) was larger than that between Cases 2 and 3 (1.2), indicating that SGH can be a highly important standard for saturation identification in GH formation and dissociation experiments. In particular, SGH,C can be an important criterion for dividing training data when any machine-learning technique is applied to given CT images in a GH experiment (refer to Cases 2 and 3). Thus, separating CT images according to SGH,C can be an appropriate option for constructing training data, leading to reliably specialized machine-learning models.
In conclusion, it is important to acquire a sufficiently large number of data for a trustworthy application of machine learning; however, proper data construction should also be considered. We expected that a specific standard for data construction could be identified, based on domain knowledge, from the essential factors of the behaviors of interest, and this study verified that standard to be SGH,C. Therefore, if data acquisition were restricted to a specific type or quantity of data, the first step would be to select which GH experiment to conduct according to its SGH value. After that, GH experiments could be performed intensively to preferentially obtain CT images and saturations based on the SGH,C of a target field condition. Accordingly, SGH,C would be the optimal guideline for constructing training data.
In future studies, additionally acquired GH CT images would be assigned to either the low or high SGH group, with SGH,C as the criterion. Two models could then be trained on those two categorized data sets, respectively, to produce two customized machine-learning models. After reliable machine-learning models have been constructed from qualitatively and quantitatively sufficient data, those models could be used to identify saturation values during the dissociation stage of GH sample experiments with depressurization. Real-time saturation identification of GH samples is expected to be a powerful tool for determining general GH behavior and for conducting a variety of experiments to optimize the parameters of GH production by depressurization.