Comparison of Methods to Segment Variable-Contrast XCT Images of Methane-Bearing Sand Using U-Nets Trained on Single Dataset Sub-Volumes

Abstract: Methane (CH4) hydrate dissociation and CH4 release are potential geohazards currently investigated using X-ray computed tomography (XCT). Image segmentation is an important data processing step for this type of research. However, it is often time consuming, computing resource-intensive, operator-dependent, and tailored for each XCT dataset due to differences in greyscale contrast. In this paper, an investigation is carried out using U-Nets, a class of Convolutional Neural Network, to segment synchrotron XCT images of CH4-bearing sand during hydrate formation, and extract porosity and CH4 gas saturation. Three U-Net deployments previously untried for this task are assessed: (1) a bespoke 3D hierarchical method, (2) a 2D multi-label, multi-axis method and (3) RootPainter, a 2D U-Net application with interactive corrections. U-Nets are trained using small, targeted hand-annotated datasets to reduce operator time. It was found that the segmentation accuracy of all three methods surpasses that of mainstream watershed and thresholding techniques. Accuracy slightly reduces in low-contrast data, which affects volume fraction measurements, but errors are small compared with gravimetric methods. Moreover, U-Net models trained on low-contrast images can be used to segment higher-contrast datasets, without further training. This demonstrates model portability, which can expedite the segmentation of large datasets over short timespans.


Introduction
Deep sea sediments and permafrost host large quantities of methane (CH4), an energy source and potent greenhouse gas that may be a contributor to climate change [1,2]. Much of this CH4 is present as hydrates (clathrates), that is, solid crystalline lattices of water at low temperature and high pressures that enclose CH4 molecules. One cubic metre of hydrate can store 164 m³ of CH4 gas at normal temperature and pressure [3]. However, the extent of the world-wide CH4 hydrate inventory is subject to considerable uncertainty [4,5]. This is in part due to discrepancies between measurements produced by geophysical and electrical resistivity methods [6,7]. These discrepancies are potentially associated with hydrate and CH4 gas distribution heterogeneity in the host soil [8]. Uncertainties regarding the global CH4 hydrate inventory affect resource estimation and CH4 emission prediction models [5,9,10]. CH4 hydrate formation and dissociation have also been associated with changes in the mechanical characteristics of the host sediment. For instance, hydrates may strengthen and stiffen the sediment matrix by creating inter-grain cementation bonds [11,12]. Such mechanical changes are speculated to lead to, for example, underwater slides that may trigger tsunamis or damage seabed infrastructure such as cables and pipelines [13][14][15].
Recently, researchers have shown that X-ray computed tomography (XCT) can be used to successfully detect hydrate and CH4 gas bubble distribution heterogeneity and characterise changes in sediment microstructure associated with hydrate formation and dissociation [8][16][17][18]. This has been possible in great part due to advancements in image segmentation techniques. Segmentation is the process of classifying 2D pixels or 3D voxels into regions, for example, the solids, liquids and gases present in XCT images of geomaterials. Microstructural parameters such as grain, pore and bubble size, shape and orientation can then be derived from the segmented image, as well as volumetric (bulk) quantities such as porosity and CH4 gas saturation ratios.
Some of the most common segmentation techniques used in geomechanics and geoscience are greyscale thresholding and watershed algorithms [19,20]. The former involves the selection of a greyscale range to classify pixels or voxels into regions of interest. Watershed algorithms redefine the image as a map where greyscale intensities form topographical elevations and catchment basins. Pixel/voxel markers within these basins are used to define the materials (or 'labels') present in the image, and the algorithm then morphologically dilates these markers until they fill their catchment basins [21,22]. Greyscale range determination in the case of thresholding techniques and marker grey value and location in the case of watershed techniques are operator and/or method dependent [19,23,24]. The values assigned to these parameters also depend on the recorded greyscale contrast, which is highly reliant on the X-ray imaging instrument and how it is optimised [25]. Sample heterogeneity or density changes during an in situ experiment will further introduce contrast variability in space and time [19,26]. As a result, thresholding and watershed segmentation are typically optimised per XCT scan, and objective comparison is difficult given that the data treatment varies between datasets. These issues often result in segmentation procedures in geomechanics and geoscience that are highly demanding of computing resources and operator time.
Novel alternative approaches have employed machine learning to segment multiple material phases present in XCT images of soil and rock samples [27,28]. For these applications, segmentations are produced via a mathematical model optimised or 'trained' using a series of 'ground truth' example segmentations of XCT images provided by the user. Within the realm of machine learning, convolutional neural networks (CNNs) are a class of deep neural networks that employ convolutional layers where the filters ('kernels') used to separate image features are learned [29]. Researchers have recently begun exploring the application of CNNs to segment XCT images of soil and rock [30][31][32][33].
U-Nets are a class of CNN originally designed to segment biomedical images [34]. The U-Net architecture is composed of downsampling (encoding/contracting) and upsampling (decoding/expanding) paths. The former reduces the spatial dimensions of the data while increasing feature information; the latter recombines spatial and feature data to generate the label image. Both paths are linked by connections that can feed the output from the contracting path directly into the corresponding level of the expanding path, thereby allowing the transfer of spatial information and the preservation of fine-grained details in the output label image. A limitation to the implementation of U-Nets (and CNNs in general) to segment XCT images of soil and rock is the preparation of training and validation datasets, which often require labour-intensive manual segmentation (hand-annotation) of many images.
The suitability of such time- and resource-saving approaches has not been examined before in research on methane-bearing sediments. This paper explores this application using three U-Net implementation strategies to segment a large number of synchrotron radiation XCT (SXCT) volumes of CH4-bearing sand during hydrate formation and dissociation experiments; two of these U-Net implementations are entirely novel to the field of geomechanics and geoscience. These strategies were focused on the segmentation of the CH4 phase, as it exhibited low contrast with regard to the brine-hydrate phase and was less common in the data compared to the other material phases, as shown in Figure 1a. This rendered the use of conventional thresholding or watershed techniques largely unsuitable. The aim of this investigation was thus to determine if U-Nets can accurately segment XCT images of soil samples with varying greyscale contrast between material phases using only a small number of training and validation images, therefore reducing operator and computing time and allowing for objective data comparison.
Section 2 of this paper presents the details of the in situ hydrate formation experiment as well as the methodology used to acquire, reconstruct and post-process the SXCT data. Section 2 also describes the three U-Net approaches used, as well as the conventional segmentation methods and accuracy metrics used for comparison and performance analysis purposes. Results are presented and discussed in Section 3 and outcomes are summarised in Section 4.

Methane Gas Hydrate Formation and Dissociation Experiments
A custom rig designed and manufactured by Sahoo et al. [8] for in situ SXCT imaging of gas hydrate formation and dissociation was used in the present study. The rig is made of polyether ether ketone (PEEK) and consists of a monolithic 2 mm internal diameter by 23 mm tall cylindrical vessel with 0.8 mm thick walls and an enlarged base, as shown in Figure 2. The soil sample is loaded through an inlet at the bottom of the rig, and the pore fluid injection pipe is connected to this inlet, as depicted in Figure 2. The rig features thermocouples at the base of the scan zone, also shown in Figure 2, to measure sample temperature. The SXCT imaging zone in this study corresponds to a vertically centred 1.755 mm-tall region within the 10 mm-tall scan zone.
Leighton Buzzard sand Fraction E (LBE) with a mean grain diameter of 100 µm was used as surrogate marine sediment. LBE is an angular silica sand widely used as a standard laboratory material in geomechanics research. The sand was tamped into the PEEK vessel to a target porosity of 35%. A vacuum pressure of less than 1 Pa was applied through the injection pipe to reduce air presence in the pore space. A calculated volume of brine solution (3.5% NaCl by weight, representative of deep ocean water; [35]) was thereafter injected into the sample, such that approximately 90% of the pore volume became saturated. CH4 gas was then injected at 10 MPa and the valve to the sample closed. The sample was gradually cooled to a target constant temperature of 2 °C using a N2 cryostream. This thermobaric condition enabled hydrate formation in the pore space instead of ice. The target temperature was maintained for 30 h to complete the hydrate formation process [11].

Set-Up and Image Acquisition
Data was collected on beamline I13-2 at Diamond Light Source (DLS). Scans were performed using a polychromatic 'pink beam' at 30 keV peak energy. The detector system used was a scintillator-coupled pco.edge 5.5 camera fitted with a 4× optic magnification lens, resulting in an effective pixel size of 0.8125 µm. The X-ray projection size was 2560 × 2160 pixels (width × height).
Scans were carried out in situ at various time intervals after reaching 2 °C. The number of projections and the exposure time per projection varied amongst scans to reduce acquisition times at specific moments of the CH4 hydrate formation process. Table 1 correlates each scan discussed in this paper with the time after the start of the 30 h sustained 2 °C period, as well as the scan specifications used.
Tomographic reconstruction was carried out using Savu [36][37][38]. Two Savu reconstruction pipelines were used: one with and one without Paganin phase enhancement [39]. These pipelines were labelled 'phase contrast' (Figure 3b) and 'absorption contrast' (Figure 3a), respectively. Both pipelines implemented filtered back-projection reconstruction [40,41] and pre-reconstruction algorithms for speckle and ring artefact suppression [37,42] and the automatic determination of the centre of rotation [43]. Further processing was carried out on the output from both reconstruction pipelines using Fiji [44,45]. This consisted in:
1. The application of a median filter of kernel size 3 to the absorption volume and the halving of the resulting greyscale values;
2. The application of an unsharp mask filter of radius 3 and weight 0.70 to the phase contrast volume;
3. The elementwise averaging of both volumes.
This procedure resulted in a single reconstructed volume with clear edge detail and phase contrast (Figure 3c).
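The three post-processing steps listed above can be sketched in Python as follows. This is only an illustration, not the Fiji workflow used in the study: it assumes SciPy filters as stand-ins for the Fiji ones, assumes the ImageJ-style unsharp mask formula `(image - weight * blurred) / (1 - weight)`, and takes the filter radius to be the Gaussian sigma. The function names are hypothetical.

```python
import numpy as np
from scipy import ndimage as ndi


def unsharp_mask(image, radius=3.0, weight=0.7):
    """ImageJ-style unsharp mask (assumed form): subtract a weighted
    Gaussian-blurred copy and rescale to preserve large-scale contrast."""
    blurred = ndi.gaussian_filter(image, sigma=radius)
    return (image - weight * blurred) / (1.0 - weight)


def blend_reconstructions(absorption, phase):
    """Combine absorption- and phase-contrast reconstructions."""
    absorption = absorption.astype(np.float32)
    phase = phase.astype(np.float32)
    # Step 1: median filter (kernel size 3), then halve the grey values.
    step1 = ndi.median_filter(absorption, size=3) / 2.0
    # Step 2: unsharp mask (radius 3, weight 0.70) on the phase volume.
    step2 = unsharp_mask(phase, radius=3.0, weight=0.7)
    # Step 3: elementwise average of both volumes.
    return (step1 + step2) / 2.0
```

The function accepts 2D slices or 3D volumes; whether the original filters were applied slice-wise or in 3D is not stated in the text, so either use is an assumption.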
Finally, each slice was corrected for the halo-like or 'cupping' artefact. This artefact is caused by beam hardening, that is, the preferential attenuation of lower-energy X-rays close to the specimen surface, as well as by truncation artefacts introduced by attenuation from sample regions outside the field of view [46,47]. To mitigate it, each slice was convolved with two mollifier functions with an inverse shape to that of the cupping artefact. This flattened the horizontal (XY) grey value profile of each slice. A circular mask with a radius of 1100 pixels was then applied to remove voxels at the outer edges of the field of view (FOV), which were resistant to cupping correction. An example output slice is presented in Figure 3d.
Methane 2023, 2
As outlined in Section 1, limited greyscale contrast between the CH4 gas and the brine-hydrate phase persisted after reconstruction and post-processing. Distinction between these two phases became increasingly difficult as the distance between the 3D image histogram peaks for the sand and non-sand phases reduced, as exemplified in Figure 1b. This distance is therefore used in this paper as an overall measure for image contrast, with regard to the ease with which the material phases could be identified and segmented. Considering this, intermediate contrast dataset IC01 89062 (Table 1) was selected initially to investigate the suitability of U-Nets to perform segmentations.
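The histogram-peak distance used here as a contrast measure could be estimated along the following lines. This is a minimal sketch under stated assumptions: the data are 8-bit, the histogram is lightly smoothed with a moving average, and the two most prominent peaks are taken to be the sand and non-sand phases. The function name, bin count and smoothing width are illustrative choices and do not come from the paper.

```python
import numpy as np
from scipy.signal import find_peaks


def peak_distance(volume, bins=256, value_range=(0, 255), smooth=5):
    """Distance (in grey values) between the two most prominent
    histogram peaks of an image or volume."""
    hist, edges = np.histogram(volume, bins=bins, range=value_range)
    # Moving-average smoothing to suppress spurious local maxima.
    kernel = np.ones(smooth) / smooth
    hist_smooth = np.convolve(hist, kernel, mode="same")
    # prominence=0 makes find_peaks report the prominence of every peak.
    peaks, props = find_peaks(hist_smooth, prominence=0)
    if len(peaks) < 2:
        raise ValueError("fewer than two histogram peaks found")
    top_two = peaks[np.argsort(props["prominences"])[-2:]]
    centres = 0.5 * (edges[:-1] + edges[1:])
    low, high = np.sort(centres[top_two])
    return high - low
```

A smaller returned value corresponds to a lower-contrast dataset in the sense used in the text.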

U-Net Segmentation
Three different methodologies were used to create trained U-Net models to segment the three main material phases present in the images: sand, brine-hydrates and CH4 gas. These were:
1. A 3D hierarchical approach where two separate 3D U-Net models were trained to perform binary segmentations: one on the sand phase vs. the others and one on the CH4 gas phase vs. the others;
2. A 2D multi-label and multi-axis approach where a single 2D U-Net was trained to classify the three labels. The encoder section of this U-Net implementation was pretrained on the ImageNet dataset [48], meaning that the network should only require a small amount of 'transfer' training in order to achieve acceptable results on new data;
3. RootPainter software, which uses a graphical user interface (GUI) and human intervention by interactive corrections to train a lightweight binary 2D U-Net model.
The U-Net models produced by each method were used to segment a 1554 × 1554 × 2000 voxel region of the 2560 × 2560 × 2000 voxel reconstructed and post-processed volumes, hereafter termed 'analysis region' and 'total volume', respectively. The analysis region was inscribed within the cylindrical FOV of the total volume and omitted the black pseudo-background generated during reconstruction. Figure 4a shows the analysis region for dataset IC01. All analysis regions discussed in this paper are available in Alvarez-Borges et al. [49]. It is emphasised that the segmentation of the sand phase via U-Nets was done to assess the multi-label segmentation capacity of the algorithm. Due to its uniformly high contrast and well-defined edges, sand could be easily segmented with any 'conventional' method, for example, thresholding.
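As a quick check on the analysis region size, the largest square that fits inside a circle of radius r has side r√2. The sketch below assumes that the relevant circle is the 1100-pixel mask described earlier and that the region is centred in the 2560-pixel-wide slice. Both are assumptions, and the actual region origin may differ.

```python
import math


def max_inscribed_square_side(radius_px):
    """Largest whole-pixel side of a square inscribed in a circle."""
    return math.floor(radius_px * math.sqrt(2))


def centred_crop_origin(full_size, side):
    """Origin (in pixels) of a crop of the given side centred in the slice."""
    return (full_size - side) // 2


side_limit = max_inscribed_square_side(1100)   # upper bound on the side length
origin = centred_crop_origin(2560, 1554)       # hypothetical centred origin
corner_radius = math.hypot(1554 / 2, 1554 / 2) # distance from centre to a corner
```

Under these assumptions, a 1554-pixel square sits just inside the 1100-pixel mask, which is consistent with the region being described as inscribed within the FOV.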

Training and Validation Data
The U-Net training procedures required both greyscale and label datasets. The latter was the 'ground truth' information used during training and validation. Label data was produced by hand-annotating the sand, CH4 gas and brine-hydrate in the greyscale data using Avizo Lite® software. This was carried out on small subregions of the analysis volumes to reduce labelling time. The 3D hierarchical approach initially used a 384 × 384 × 384 voxel training subvolume and a 256 × 256 × 256 voxel validation subvolume, selected from two different regions of the 3D image (Figure 4b). These are hereafter referred to as the 3D training and 3D validation subvolumes, respectively. RootPainter requires 2D label images (slices) of at least 572 × 572 pixels in size for both training and validation, as explained later in Section 2.2.4. Therefore, a 572 × 572 × 572 voxel subvolume was delimited for this purpose (Figure 4c), hereafter referred to as the 2D learning subvolume. For the 2D multi-label models, the same 2D learning subvolume was used for training only, with the 3D validation subvolume used for validation as with the 3D method.
The 2D and 3D training and validation subvolume coordinate origins relative to the global origin of the total volume are listed in Table 2 (note in column three that all subvolumes were 3D arrays, despite their naming and usage). The global coordinate system origin is indicated in Figure 4. All training, validation and segmented data used in this investigation are available in [49].

Table 2. U-Net training and validation subvolume details (shown in Figure 4).
Subvolume Name | Usage

3D Hierarchical Segmentation
The 3D hierarchical U-Net model was implemented in the Python library PyTorch [50] and based upon an existing implementation of a residual 3D U-Net from the literature [51,52]. This model had 35.3 million trainable parameters and five downsampling and upsampling stages. Voxel intensity values were rescaled and clipped, truncating values beyond 2.575 standard deviations of the mean to mitigate the skewing effect of outliers. The ground truth label volumes (3D training subvolume with three labels: sand, brine-hydrates and CH4 gas) were used to create separate binary label volumes, one with sand vs. background and the other with CH4 gas vs. background. These volumes were used as the label data for training the separate binary 3D U-Net models.
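The intensity clipping and the derivation of binary label volumes described above can be illustrated as follows. This is a sketch, not the published code: the integer label codes and the rescaling to the range [0, 1] are assumptions made for the example.

```python
import numpy as np

# Hypothetical label codes; the values used in the study are not stated.
BRINE_HYDRATE, CH4, SAND = 0, 1, 2


def clip_and_rescale(volume, n_std=2.575):
    """Clip intensities beyond n_std standard deviations of the mean,
    then rescale linearly to [0, 1] (the target range is an assumption)."""
    volume = volume.astype(np.float32)
    mean, std = float(volume.mean()), float(volume.std())
    if std == 0.0:
        return np.zeros_like(volume)
    low, high = mean - n_std * std, mean + n_std * std
    clipped = np.clip(volume, low, high)
    return ((clipped - low) / (high - low)).astype(np.float32)


def binary_labels(labels, target):
    """Binary volume: 1 where labels == target, 0 (background) elsewhere."""
    return (labels == target).astype(np.uint8)
```

For example, `binary_labels(labels, SAND)` and `binary_labels(labels, CH4)` would give the two label volumes needed to train the two binary models.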
Unlike the multilabel 2D U-Net implementation described later, this model had not been pre-trained on ImageNet and was therefore likely to require a larger amount of training data to reach a high segmentation accuracy. To overcome this, the TorchIO library [53] was used to sample 128 × 128 × 128 voxels sized regions from the greyscale 3D training subvolume and generate 48 sets with random noise, flips, blurs, affine, and elastic transformations to be used as an extended training data set for each training epoch (i.e., a full training cycle). This procedure is termed 'augmentation'. These 128 × 128 × 128 voxels regions matched the input size of the U-Net.
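The patch sampling and augmentation step can be sketched with plain NumPy. This example deliberately swaps TorchIO for a much simpler stand-in: it draws random patches and applies only random flips and additive Gaussian noise, omitting the blur, affine and elastic transformations mentioned above. The function name, noise level and flip probability are assumptions.

```python
import numpy as np


def sample_augmented_patches(volume, labels, patch=128, n=48,
                             noise_std=0.01, seed=None):
    """Draw n random cubic patches and apply random flips plus noise.

    Returns a list of (image_patch, label_patch) pairs. Spatial
    transforms are applied identically to image and labels; noise is
    added to the image only.
    """
    rng = np.random.default_rng(seed)
    pairs = []
    for _ in range(n):
        origin = [int(rng.integers(0, s - patch + 1)) for s in volume.shape]
        window = tuple(slice(o, o + patch) for o in origin)
        img = volume[window].astype(np.float32)
        lab = labels[window].copy()
        for axis in range(3):
            if rng.random() < 0.5:  # flip each axis with probability 0.5
                img = np.flip(img, axis)
                lab = np.flip(lab, axis)
        img = img + rng.normal(0.0, noise_std, img.shape).astype(np.float32)
        pairs.append((img, lab))
    return pairs
```

With the default arguments and a 384 × 384 × 384 training subvolume, this would yield 48 patches of 128 × 128 × 128 voxels, which could be regenerated for each epoch.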
During training, U-Net model parameter optimisation (i.e., the process of updating the model parameters on each training iteration) was carried out with a method known as AdamW [54]. The learning rate, a parameter that controls the step size of the updates made by the optimiser, was initially determined automatically and then cycled up and down every epoch to reduce the need to tune this parameter and to accelerate the training process [55]. The parameters β1, β2 and weight decay were set at 0.9, 0.999 and 0.1, respectively. Binary cross entropy (BCE), a measure of the dissimilarity between the predicted and ground truth label distributions, was used as the loss function (the function minimised by the optimiser during training). Training progress was monitored using Intersection Over Union (IOU) on the validation set as the evaluation metric. Training was stopped when either no improvement in validation loss had occurred for 40 epochs (passes of the entire training dataset) or 100 epochs had been completed, and the model with the lowest validation loss was saved. This was aimed at preventing overfitting. Software source code for this method is available from King and Alvarez-Borges [56].
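The stopping rule described above (40 epochs without improvement in validation loss, or 100 epochs in total) can be expressed independently of any deep learning framework. In the sketch below, `validate_epoch` is a hypothetical callable that trains for one epoch and returns the validation loss. The optimiser and learning rate schedule are not shown.

```python
def run_training(validate_epoch, patience=40, max_epochs=100):
    """Train until validation loss stalls for `patience` epochs or
    `max_epochs` is reached; report the epoch of the best model."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_run = 0
    for epoch in range(1, max_epochs + 1):
        val_loss = validate_epoch(epoch)
        epochs_run = epoch
        if val_loss < best_loss:
            # A real implementation would save a model checkpoint here.
            best_loss, best_epoch = val_loss, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_loss, epochs_run
```

Saving a checkpoint only when validation loss improves means the retained model is the one with the lowest validation loss, regardless of which condition ends training.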
When predicting the segmentation of the analysis region, two binary predictions were produced for each data set, one for sand vs. background and the other for CH4 gas vs. background. These two label volumes were then combined using a label hierarchy: first, a new volume was created with all voxel labels set to brine-hydrates, then the labels corresponding to CH4 gas were transferred from the CH4 vs. background prediction, and lastly the labels corresponding to sand were transferred from the sand vs. background prediction.
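The label hierarchy described above maps directly onto array assignment. A minimal NumPy sketch follows, in which the integer label codes are hypothetical:

```python
import numpy as np

# Hypothetical label codes; the values used in the study are not stated.
BRINE_HYDRATE, CH4, SAND = 0, 1, 2


def combine_hierarchically(sand_mask, ch4_mask):
    """Merge two binary predictions into one multi-label volume.

    Order matters: sand is written last, so it overrides CH4 wherever
    the two binary predictions overlap.
    """
    combined = np.full(sand_mask.shape, BRINE_HYDRATE, dtype=np.uint8)
    combined[ch4_mask.astype(bool)] = CH4
    combined[sand_mask.astype(bool)] = SAND
    return combined
```

Writing sand last reflects the order given in the text, in which the sand prediction takes precedence over the CH4 prediction.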

2D Multi-Label Segmentation
Training of the 2D U-Net with multiple labels was performed on the 2D learning subvolume using two approaches. The first mimicked that of RootPainter, described later, with the network being trained on horizontal 2D (XY) slices through the image volume. The second, multi-axis approach, utilised slices taken in the XY, XZ and YZ planes (coordinate system shown in Figure 4). A 2D U-Net was used with a ResNet34 encoder [57]. This model had a total of 41.2 million trainable parameters and four downsampling and upsampling stages. This encoder was loaded with pre-trained weights from ImageNet. The model was created with Fastai [58], a Python library which has a high-level interface that utilises PyTorch. Default Fastai image augmentations were used during training. These consisted of random image crops, zooms, rotations, flips, affine transforms and brightness and contrast adjustments. The loss function used was cross entropy (CE) and the evaluation metric used was the number of correctly labelled voxels expressed as a percentage. Training was carried out for 15 epochs.
For the single-axis implementation, the XY greyscale stack and corresponding label stack of 572 images were randomly split into training (80%) and validation (20%) sets. The input image size for the model was also set to 572 × 572 pixels to match that used by RootPainter. When predicting the segmentation for the analysis region, data was fed into the network in the form of 2000 XY slices of size 1554 × 1554 pixels.
For the multi-axis approach, the 2D greyscale learning subvolume and corresponding label data were sliced into 2D images in the XY, XZ and YZ planes, resulting in 1716 (572 × 3) training image and label pairs. These images were also randomly split into a training (80%) and validation (20%) set. The input image size for the model was again set to 572 × 572 pixels. When predicting the segmentations for the analysis region, an averaging approach for data produced from each plane was used as described by Tun et al. [59], but with a modification to take the multiple labels into account. In short, this averaging approach consisted in slicing, segmenting, and rotating the volume across the XY 4-fold symmetry plane and then splitting and hierarchically recombining the 12 resulting segmentation volumes so that two label volumes were obtained, one containing labels for sand vs. background and the other for CH4 vs. background. These two binary label volumes were then combined into a multi-label volume as done for the data output from the 3D hierarchical method (Section 2.2.2).
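The multi-axis idea can be illustrated for a single binary label as follows. This sketch is a simplification and is not the procedure of Tun et al. [59] or the authors' modification of it: it averages per-voxel foreground scores over three slicing axes and four in-plane rotations (12 predictions in total). `predict_slice` is a hypothetical function that returns a score map of the same shape as its 2D input.

```python
import numpy as np


def slices_along_axes(volume):
    """All 2D slices of a 3D volume taken along each of its three axes."""
    return [np.take(volume, i, axis=axis)
            for axis in range(3)
            for i in range(volume.shape[axis])]


def multi_axis_average(volume, predict_slice):
    """Average slice-wise predictions over 3 axes x 4 rotations."""
    accumulated = np.zeros(volume.shape, dtype=np.float32)
    count = 0
    for k in range(4):
        # Rotate by k * 90 degrees in the plane of the last two axes.
        rotated = np.rot90(volume, k, axes=(1, 2))
        for axis in range(3):
            stack = np.moveaxis(rotated, axis, 0)          # slices first
            pred = np.stack([predict_slice(s) for s in stack])
            pred = np.moveaxis(pred, 0, axis)              # restore axis order
            accumulated += np.rot90(pred, -k, axes=(1, 2)) # undo rotation
            count += 1
    return accumulated / count
```

Thresholding the averaged scores (for example at 0.5) would give one binary label volume. For a 572-voxel cube, `slices_along_axes` returns 572 × 3 = 1716 images, matching the number of training pairs quoted above.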
Software source code for this method is available from King and Alvarez-Borges [56].

RootPainter Segmentation
RootPainter [60] is a client-server application originally developed to segment plant root features from photographs of soil profiles [61,62]. The client GUI is employed to annotate 2D images from a dataset, such as a tomography image stack of horizontal (XY) slices, as in the present case. The tomography slices and corresponding annotations are then read by the server and used to train the segmentation model using a U-Net variant with 1.3 million trainable parameters and four downsampling and upsampling stages. This is implemented in PyTorch and described by Smith et al. [61] and Smith et al. [62]. To execute the training routine, the software creates a validation dataset by randomly selecting one annotation image out of every five created. The accuracy of the model produced at the end of each training epoch is evaluated using the F-score parameter described by Smith et al. [62]. At the end of each training epoch, F-score values for the current and previous model are compared and the one with the highest value is saved. Training is stopped if 60 epochs are completed without F-score improvements.
RootPainter uses interactive corrections. These are created by annotating image slices overlaid with the segmentation labels produced by the best model currently available. These corrective annotation slices are added to the training and validation datasets so that the five-to-one ratio is maintained.
RootPainter 0.2.5 can only predict binary segmentations ('foreground' vs. 'background'). Therefore, it was initially used to segment the CH4 gas phase only. The 2D learning subvolume was used for training and validation.
Sparse annotations have been shown to produce better results than dense/intensive annotations when interactively training U-Net models [61,63]. Thus, sparsely annotated images were produced by converting all CH4 gas labels into foreground and enclosing them with background labels that included brine-hydrate and sand pixels, as shown in Figure 5a,b. This was done by morphologically dilating the CH4 label of each slice in the training dataset and re-labelling the added pixels as background. The annotated slices were then copied into annotation and validation directories, maintaining the five-to-one ratio. Training was initiated after copying the first batch of five images. Further batches were added if a training epoch finished without further improvements in F-score and the model could not segment the majority of CH4 pixels, or if the number of erroneously segmented pixels was patently greater than the number of correctly segmented pixels, as shown in Figure 5c. Corrective annotation was started after a training epoch had produced a model that segmented most of the CH4 regions with a roughly equivalent number of erroneously labelled pixels, as presented in Figure 5d,e. Once a model was produced that could segment CH4 without evident erroneously labelled pixels, the software was left to carry on training until the 60-epoch limit was reached. The resulting model was then used to segment the analysis region slice by slice.
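The dilation-based construction of sparse annotations can be sketched with SciPy. The annotation codes and the number of dilation iterations below are assumptions, as the text does not state them, and the output is a plain integer array rather than RootPainter's own annotation file format.

```python
import numpy as np
from scipy import ndimage as ndi

# Hypothetical annotation codes for illustration only.
UNLABELLED, FOREGROUND, BACKGROUND = 0, 1, 2


def sparse_annotation(ch4_mask, dilation_iters=3):
    """Mark CH4 pixels as foreground and a surrounding rim as background.

    The rim is the set of pixels added by morphological dilation of the
    CH4 mask; all remaining pixels stay unlabelled.
    """
    ch4 = ch4_mask.astype(bool)
    dilated = ndi.binary_dilation(ch4, iterations=dilation_iters)
    annotation = np.full(ch4.shape, UNLABELLED, dtype=np.uint8)
    annotation[dilated & ~ch4] = BACKGROUND
    annotation[ch4] = FOREGROUND
    return annotation
```

Because only the rim is labelled as background, most brine-hydrate and sand pixels stay unannotated, which keeps the annotation sparse.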

Thresholding and Watershed Segmentation
To compare the performance of the U-Net methods with conventional segmentation routines, the SXCT data was segmented using manual and automatic thresholding, and the watershed method. Images were downsampled to 8-bit as in the U-Net methods described previously. A bilateral filter was used before segmentation to improve thresholding performance and mitigate over-segmentation (filter parameters were: a 100-pixel spatial kernel, a 50-pixel window size and a grey-value kernel of 30 counts; implemented in Python using the OpenCV library [64]; see, e.g., Paris et al. [65] for a filter description).
Manual thresholding was carried out by selecting a single threshold value for all slices by visual inspection. Automatic thresholding was performed on a slice-by-slice basis using the multi-level Otsu method [66] implemented using the scikit-image Python library [67].
Watershed segmentation was carried out in Fiji using the morphological segmentation tool in the MorphoLibJ library [68]. It consisted of the application of a morphological gradient with a radius of 1 and the automatic determination of markers by finding local minima (with a tolerance of 8 greyscale intensity values), prior to the watershed 'inundation' phase. The output label image contained different labels for all features in the greyscale input image, including sand and brine-hydrate. Labels corresponding to regions in the 8-bit volume with mean greyscale intensity values below 50 to 70, depending on the dataset, and above 130 were classified as CH 4 gas and sand, respectively. The remaining voxels were classified as brine-hydrate.
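The final classification step, in which watershed regions are assigned to a material phase according to their mean grey value, could be reproduced outside Fiji along the following lines. This is a sketch under stated assumptions: the function name, the 0/1/2 phase encoding and the default `gas_max=60` (one value within the 50 to 70 range quoted above) are illustrative choices, not the procedure actually scripted for this study.

```python
import numpy as np

def classify_regions(labels, grey, gas_max=60, sand_min=130):
    """Map watershed region labels to phases using each region's mean grey value.

    labels : integer array of non-negative region labels
    grey   : 8-bit greyscale array of the same shape
    Returns an array with 0 = CH4 gas, 1 = brine-hydrate, 2 = sand.
    """
    flat = labels.ravel()
    counts = np.bincount(flat)
    sums = np.bincount(flat, weights=grey.ravel().astype(np.float64))
    means = np.zeros_like(sums)
    np.divide(sums, counts, out=means, where=counts > 0)
    phase_of_label = np.ones(counts.size, dtype=np.uint8)  # default: brine-hydrate
    phase_of_label[(means < gas_max) & (counts > 0)] = 0   # dark regions: CH4 gas
    phase_of_label[means > sand_min] = 2                   # bright regions: sand
    return phase_of_label[labels]
```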

Quantitative Analysis
The central 40 XY slices of the segmented analysis region were compared with hand-annotated counterparts created in Avizo Lite ® and considered to represent 'ground truth' labels. These slices do not intersect any of the training or validation subvolumes. These ground truth volumes are available in Alvarez-Borges et al. [49]. The previously mentioned IOU metric was used to evaluate segmentation performance. IOU is defined as:
$$\mathrm{IOU} = \frac{TP}{TP + FP + FN}$$
where TP refers to the number of voxels or pixels correctly predicted to correspond to the label of interest ('true positive'), FP is the number of voxels or pixels incorrectly predicted to be part of the label of interest ('false positive'), and FN is the number of voxels or pixels of the label of interest incorrectly predicted to belong to any of the other material phases ('false negative'). A comparable analysis of U-Net accuracy has been done by, e.g., Karabag et al. [69] and Phan et al. [32]. IOU returns a value between 0 and 1, where the latter corresponds to the scenario where the segmentation matches the validation image pixel by pixel (or voxel by voxel). In the following sections, quantitative analyses were carried out on a slice-by-slice basis (i.e., using pixel counts as input).

Segmentation Performance Comparison
Figure 6 compares the original and segmented central slice for dataset IC01, produced using the three U-Net methods (Section 2.2) and three standard methods (Section 2.3). Training and validation in both the 2D multi-label approach and RootPainter were carried out using XY slices only (i.e., single plane). Figure 7a presents accuracy metrics for the segmentation of CH 4 gas in the central 40 XY slices of this dataset. It may be noted that RootPainter delivered slightly higher metrics than the other two U-Net methods, but this difference in performance cannot be readily identified in Figure 6. Figures 6 and 7a also show that, for this dataset, watershed and manual thresholding methods return lower accuracy results than the U-Net approaches, and that Otsu-thresholding performed poorly. In fact, the Otsu approach consistently segmented the brine-hydrate and CH 4 gas as a single label, as evident in Figure 6e. This is chiefly due to the absence of well-defined inter-class variance extrema between these materials and the small relative size of CH 4 bubbles [70,71]. In later comparisons, results from the Otsu method are omitted for this reason.
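For reference, the per-slice IOU profiles of the kind plotted in Figure 7 can be computed from a predicted and a ground truth label volume as sketched below, with IOU taken as TP / (TP + FP + FN) for the label of interest. The function names and the integer label encoding are assumptions made for this example.

```python
import numpy as np

def iou(pred, truth, label):
    """Intersection over union of one label between two label images."""
    p = (pred == label)
    t = (truth == label)
    tp = np.count_nonzero(p & t)    # predicted as label, and is label
    fp = np.count_nonzero(p & ~t)   # predicted as label, but is another phase
    fn = np.count_nonzero(~p & t)   # is label, but predicted as another phase
    denom = tp + fp + fn
    return tp / denom if denom else float("nan")

def iou_per_slice(pred_volume, truth_volume, label):
    """IOU of one label for every slice along the first axis of two volumes."""
    return np.array([iou(p, t, label) for p, t in zip(pred_volume, truth_volume)])
```

The slice-wise form returns NaN for slices in which the label is absent from both images, which avoids reporting a misleading score of 0 or 1 for empty slices.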
Since each model had been trained to convergence with its respective data, the slightly lower performance metrics observed in Figure 7a for the 3D hierarchical output, compared to that of RootPainter, may be attributed to the smaller training subvolume used. To present a more balanced comparison, a further 3D hierarchical model was trained on a subvolume of the same size as the one used for both 2D methods, i.e., the larger 2D learning subvolume. This comparison is presented in Figure 7b, where it is evident that RootPainter still outperformed the 3D hierarchical approach, though the difference between methods reduced. While an even larger training subset may deliver more substantial improvement, the preparation of such data would require much greater operator input, which is contrary to the aim of this study. Figure 7a,b show that pre-training on the ImageNet database for the 2D multi-label method did not result in a significant segmentation performance advantage over the 3D hierarchical method. A similar outcome on the effect of transfer learning has been reported by He et al. [72]. They remarked that, ultimately, pre-training primes the U-Net for feature identification, which leads to fewer training iterations rather than greater segmentation accuracy. Such appears to be the present case, as the 2D multi-label approach produced similar results to the 3D hierarchical method with up to six times fewer training epochs, as shown in Tables 3 and 4.
A disadvantage of the use of 2D U-Net segmentation methods that operate solely with XY slices, such as RootPainter and the single-axis 2D multi-label method, is that horizontal stripe artefacts may appear in the vertical (YZ or XZ) slices of the segmented volume. This occurs because training and segmentation do not account for feature continuity between slices. Such artefacts are absent in the output of the 3D hierarchical implementation, which is reflected in the "smoothness" of the line showing the per-slice metrics for this approach in Figure 7. These artefacts can be mitigated by predicting segmentation of data slices taken along different axes and subsequently recombining them into a single volume, as done for the multi-axis 2D method, described in Section 2.2.3. This also improves the algorithm segmentation performance metrics, as shown in Figure 7b, but at the expense of greater computation times, as presented in Figure 8.

Figure 8. Segmentation time required using an Nvidia Tesla V100 ® GPU for the U-Net methods and HPCC of 40×Intel Gold 6242R ® CPU @ 3.10 GHz for the standard methods. Benchmarking data were extracted from the reconstructed and post-processed scan IC01 and are available from Alvarez-Borges et al. [49]. Excludes time spent on the preparation of training datasets.
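The idea of predicting along several axes and recombining the results can be illustrated with the sketch below. The recombination rule shown here, a per-voxel majority vote over the label volumes predicted along each axis, is an assumption made for illustration and is not necessarily the rule implemented in the multi-axis 2D method; `predict_slice` stands for any hypothetical function that maps a 2D greyscale slice to a 2D label image.

```python
import numpy as np

def predict_along_axis(volume, predict_slice, axis):
    """Apply a 2D slice predictor along one axis and restack into a volume."""
    moved = np.moveaxis(volume, axis, 0)
    out = np.stack([predict_slice(s) for s in moved])
    return np.moveaxis(out, 0, axis)

def majority_vote(label_volumes, n_labels):
    """Per-voxel majority vote over several label volumes of equal shape.

    Ties are resolved in favour of the lowest label index.
    """
    stack = np.stack(label_volumes)
    votes = np.stack([(stack == k).sum(axis=0) for k in range(n_labels)])
    return votes.argmax(axis=0).astype(np.uint8)
```

Each additional axis requires a full pass over the volume, which is consistent with the longer computation times of the multi-axis approach.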

U-Net Performance on Data with Different Greyscale Contrast
The trained U-Net models created with all three methods described above were able to predict high-quality segmentations for the intermediate-contrast dataset IC01 when trained on subsections of the same dataset. To examine if similar results could be obtained on datasets exhibiting lower greyscale contrast, both 3D hierarchical and RootPainter U-Nets were used to segment 'low' contrast dataset LC03 (Table 1, Figure 1b), using 3D training and 2D learning subvolumes of the same data for training, respectively. Figure 7c presents the performance metrics resulting from this approach, as well as those of watershed and manual thresholding methods applied to the same volume. It may be noted that both U-Net methods return lower metrics than those obtained for IC01. The IOU computations show that, on average, 74% and 85% of the voxels predicted to be CH 4 gas were true positives in the 3D hierarchical and RootPainter results, respectively. In comparison, these average values were 92% and 94% for IC01. Figure 9a shows that, for both U-Net methods, the lower performance metrics of the segmentation for LC03 are driven by false positives. However, false positives are more than twice as numerous as false negatives in the results for the 3D hierarchical approach, whereas they only surpass false negatives by about 30% in the RootPainter segmentation. For both methods, most false positives correspond to ground truth brine-hydrate voxels incorrectly labelled as CH 4 gas, as depicted in Figure 9b. This indicates that the reduced grey value differentiation (i.e., contrast) between CH 4 gas and brine-hydrate phases restricted U-Net segmentation accuracy, as anticipated. Despite this, the U-Net methods significantly outperform the standard approaches, as shown in Figure 7c. In fact, performance numbers reveal that watershed and manual thresholding cannot deliver a reliable quantification of the material phases of this low-contrast dataset.
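A breakdown of false positives by their ground truth phase, of the kind summarised in Figure 9, can be obtained from a confusion matrix. The sketch below is a minimal NumPy illustration; the function names and the label encoding (0 = CH 4 gas, 1 = brine-hydrate, 2 = sand) are assumptions made for this example.

```python
import numpy as np

def confusion_matrix(pred, truth, n_labels):
    """Return cm where cm[i, j] counts voxels of true label i predicted as j."""
    idx = truth.ravel().astype(np.int64) * n_labels + pred.ravel().astype(np.int64)
    return np.bincount(idx, minlength=n_labels ** 2).reshape(n_labels, n_labels)

def fp_fn_breakdown(cm, label):
    """False positives and false negatives of one label, split by the other phase."""
    fp_by_true_phase = cm[:, label].copy()
    fp_by_true_phase[label] = 0     # the diagonal holds true positives, not errors
    fn_by_pred_phase = cm[label, :].copy()
    fn_by_pred_phase[label] = 0
    return fp_by_true_phase, fn_by_pred_phase
```

For the gas label, `fp_by_true_phase[1]` then gives the number of ground truth brine-hydrate voxels labelled as CH 4 gas, and the ratio of the two totals gives the FP to FN balance discussed above.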

U-Net Segmentation Model Generalisation Across Datasets (Model Portability)
To examine U-Net model portability, 'low' and 'high' contrast datasets LC03 and HC05 (Table 1, Figure 1b) were segmented using the models produced from training on 'intermediate' contrast dataset IC01. Figure 7d,e presents the performance metrics of the resulting segmentations. It can be observed that segmentation accuracy is lowest in the case where the U-Net models trained on mid-contrast dataset IC01 were applied to the low-contrast dataset LC03. IOU values from this process are comparable to those obtained from the thresholding and watershed methods applied to mid-contrast dataset IC01, and thus, quantification from these segmentations may be unreliable. The U-Net model trained on IC01 produced higher accuracy segmentations of high-contrast dataset HC05, comparable to those for the segmentation of low-contrast dataset LC03 using models trained 'natively' on LC03 subvolumes. Yet, it is evident that U-Net models trained on subvolumes of IC01 perform best when applied to the same 'native' IC01 dataset, as shown by Figure 7a,e.
As segmentation performance appeared to be higher when U-Net models trained on lower contrast data were used to segment higher contrast data, models trained on low-contrast LC03 images were used to segment high-contrast dataset HC05. Performance metrics are presented in Figure 7f. This Figure shows an overall improvement in performance metrics compared with segmentations produced with the U-Net models trained on IC01 (Figure 7e). However, an instance of localised poor performance for RootPainter can be observed in the profiles of Figure 7f, which resulted from a cluster of FP pixels on a single slice. This emphasises the limitations of the slice-by-slice (2D) segmentation described in Section 3.1, and denotes a broadly similar pattern of FP-driven model inaccuracy as for the results discussed previously in Section 3.2 (Figure 9).

Applications and Implications
The segmentation of XCT or SXCT images of soil and rock samples is often carried out to determine parameters such as porosity or liquid/gas saturation, as discussed in Section 1. The varying performances of the U-Net methods used in the present investigation result in differences in the parameters calculated from the segmented images. This is exemplified in Figure 10, which compares porosity and CH 4 gas saturation (by volume) ratios derived on a slice-by-slice basis from the segmented volumes produced with the 3D hierarchical approach (3D training sub-volume) and RootPainter, which were the procedures that seemed to provide the best results with the least user time. Calculation details are provided in Appendix A. Results presented in Figure 10 correspond to two application scenarios, that is:
1. U-Nets trained on sub-volumes of the dataset of interest are then used to segment the entire dataset, shown in Figure 10a,d. As discussed in Section 3.2, differences in greyscale contrast affect the performance of the resulting segmentation. A training sub-volume needs to be created for each scan;
2. U-Nets trained on sub-volumes of a low-greyscale contrast dataset are then used to segment other 'unknown' datasets of higher greyscale contrast (model portability). This is presented in Figure 10e,f, corresponding to parameters derived for high-contrast dataset HC05 using segmentations produced from U-Nets trained on sub-volumes of low-contrast dataset LC03. Thus, only one training sub-volume is needed to segment multiple scans.
Porosity and CH 4 saturation calculations derived from manual thresholding and watershed methods are also included in Figure 10. This Figure suggests that, while U-Net models trained on a sub-volume of the same data delivered high segmentation performance metrics for the CH 4 gas phase, the derived parameters deviated from ground truth values to some extent, this being more acute for porosity inferences. In fact, in most cases watershed or thresholding methods delivered more accurate porosity profiles. A comparison between the mean absolute error (MAE) for the porosity and CH 4 saturation calculations along with the mean IOU values for the combined CH 4 gas and sand labels from the three volumes used to generate Figure 10 is shown in Figure 11. This Figure reveals that, while there is a general trend of lower MAE for derived material parameters with higher segmentation accuracy, the correlation exhibits some scatter. Considering that both CH 4 gas saturation and porosity are in part derived using the number of sand voxels and that these are significantly more numerous than pore voxels (CH 4 gas and brine-hydrates), it may be proposed that errors in porosity/CH 4 -saturation estimation originate from inaccuracies in the segmentation of the sand phase. This is evidenced in Figure 12 for dataset IC01, which presents (a) IOU metrics for the segmentation of the sand phase and (b) the number of FP and FN voxels. Figure 12a reveals that the inaccuracies in the segmentation of the sand phase are relatively small in terms of metrics, which are in fact higher than those of the CH 4 gas phase presented in Figure 7a. However, Figure 12b shows that the number of FP and FN voxels is large compared to the size of the CH 4 gas and brine-hydrate phases, which amount to roughly 3.0 × 10⁴ and 8.75 × 10⁵ voxels per slice, respectively. This, in turn, affects parameters calculated from voxel counts.
This denotes that the estimation of soil parameters based on ratios between material phases from segmented images is particularly sensitive to the relative size of said phases. It should be noted, however, that the maximum absolute errors presented in Figure 11 for U-Net-derived parameters (1.40% and 0.26% for porosity and CH 4 gas saturation, respectively) are smaller than those commonly reported for laboratory methods [73][74][75].

Figure 12a also shows that thresholding and watershed methods are very effective at segmenting abundant, high-contrast, well-defined features like sand, as stated in Section 2.3. Indeed, the use of U-Nets may not be necessary or recommended if only such segmentations are required, as mentioned in Section 2.2.2. However, mainstream methods return unsatisfactory CH 4 gas saturation measurements, particularly for the low-contrast 89069 volume, as shown in Figure 10. This is due to their inability to detect scarce, low-contrast features like CH 4 bubbles, as demonstrated in Section 3.1.
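The sensitivity of ratio-based parameters to errors in the largest phase can be illustrated with a short worked example. It uses the approximate per-slice phase sizes quoted for IC01 (about 3.0 × 10⁴ CH 4 gas voxels and 8.75 × 10⁵ brine-hydrate voxels). Two assumptions are introduced purely for illustration: that a slice comprises 1554 × 1554 voxels, all belonging to one of the three phases, and that a hypothetical 2.0 × 10⁴ brine-hydrate voxels are mislabelled as sand.

```python
def porosity_pct(gas, brine_hydrate, total):
    """Porosity in percent: pore voxels (gas + brine-hydrate) over all voxels."""
    return 100.0 * (gas + brine_hydrate) / total

def saturation_pct(gas, brine_hydrate):
    """CH4 gas saturation in percent: gas voxels over pore voxels."""
    return 100.0 * gas / (gas + brine_hydrate)

total = 1554 * 1554                  # assumed voxels per slice
gas, brine = 30_000, 875_000         # approximate phase sizes quoted in the text
sand = total - gas - brine           # remainder assumed to be sand

shift = 20_000                       # hypothetical brine-hydrate voxels labelled as sand

# Sand IOU when the only error is `shift` false positives (no false negatives)
sand_iou = sand / (sand + shift)

# Change in derived parameters caused by the same error (percentage points)
d_porosity = porosity_pct(gas, brine - shift, total) - porosity_pct(gas, brine, total)
d_saturation = saturation_pct(gas, brine - shift) - saturation_pct(gas, brine)

print(f"sand IOU = {sand_iou:.4f}")
print(f"porosity change = {d_porosity:+.3f} points")
print(f"saturation change = {d_saturation:+.3f} points")
```

Under these assumptions the sand IOU stays close to 0.99 while porosity shifts by roughly 0.8 percentage points, which is the kind of decoupling between segmentation metrics and derived parameters described above.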
A further application for U-Net segmentations of XCT/SXCT images of soil and rock is 3D data visualisation, which can then be used to investigate, for instance, CH 4 gas distribution within the pore matrix. Such an application can greatly benefit from model portability. To exemplify this, Figure 13 compares 3D views of the CH 4 gas phase produced by segmenting datasets obtained at different stages of hydrate formation (Table 1) using the RootPainter model trained on the low-contrast LC03 sub-volume. Despite the presence of a modest number of segmentation errors in the form of small islands on some of the images (Figure 13b-d), the U-Net model produces sensible 3D representations of the data, and changes in CH 4 gas distribution as it is consumed for hydrate formation can be clearly distinguished. In a further example, a 2D multi-label U-Net, trained using the single-axis approach on the 2D learning subvolume from scan IC01, has been used to segment a higher-contrast SXCT scan from a similar experiment carried out at the Swiss Light Source (SLS) originally reported by Sahoo et al. [8]. The post-processing steps described in Section 2.1.2, except cupping correction, were applied to the reconstructed data and a 1554 × 1554 × 2000 voxel region was extracted from the centre of the 3D image (data are available from Alvarez-Borges et al. [49]). Results are shown in Figure 14, where it is seen that the model delivers qualitatively accurate 3D views of the distribution of all three material phases, without any additional training, corrections or user input.
Both examples demonstrate the capability of U-Net models to segment multiple SXCT images of CH 4 -bearing soil, despite being obtained using different instruments and setups. The U-Net models used only a single 572³ voxel sub-volume for training and did not require any additional training, corrections or user input to segment new images.
A key implication is that a single U-Net model trained on a low greyscale contrast dataset could be used to deliver insight on variations in sediment morphology in other datasets. This has valuable applications. For example, segmentations are often required during a short period of time with limited operator input, such as during data acquisition at a synchrotron or other X-ray facility. The availability of pre-trained U-Net models would allow segmentations and sediment morphology/microstructure information to be produced within a short time after acquisition and reconstruction. Pre-trained models could also be used to segment numerous and/or large datasets over shorter timespans with reduced user effort and bias. However, accuracy will remain lower than what could be obtained with a model trained 'natively' on a subset of the target SXCT volume.

Future work could encompass, for instance, an analysis of the effect of U-Net segmentation accuracy and training strategy on 'second order' metrics such as particle, pore and bubble morphometry, as well as the potential for improving a model's ability to generalise and accurately segment new data by including training data from several SXCT volumes with varying contrast characteristics.

Conclusions
The application of U-Nets to segment SXCT images of CH 4 -bearing sand has been investigated. The general aim was to determine if these convolutional deep learning networks, trained on a small set of images (≤572³ voxels), were capable of accurately segmenting large SXCT datasets (2000 × 1554² voxels) of different greyscale contrast, with focus on the CH 4 gas phase. Training images were obtained from a hand-annotated subset of the reconstructed SXCT data. Three U-Net deployment methods were used: 3D hierarchical, 2D multi-label and the RootPainter application. Quantitative comparisons amongst U-Net segmentation outputs, along with mainstream thresholding and watershed methods, were carried out using the IOU metric. Major outcomes of this investigation are presented below.
1. For a given SXCT dataset, the three U-Net deployment methodologies produced models capable of delivering segmented images of the CH 4 gas phase with average IOU metrics of at least 0.74 and up to 0.93. This demonstrated that the U-Net methods used were capable of accurately identifying the CH 4 gas phase using a small number of training images. RootPainter delivered marginally higher IOU metrics than the other methods but suffered from minor horizontal striping artefacts and required more human intervention and proportionally higher computing time;
2. Greyscale contrast between material phases in the different SXCT datasets was a significant factor affecting U-Net segmentation accuracy. The lowest segmentation performance metrics corresponded to SXCT datasets exhibiting the lowest greyscale contrast, while greater segmentation accuracy resulted from the use of higher-contrast data;
3. All U-Net segmentations of CH 4 gas outperformed thresholding and watershed methods. However, mainstream methods proved to be more accurate at segmenting abundant, well-defined, and high-contrast features, like sand. U-Net methods are, thus, not recommended for this task;
4. The ability of a U-Net model trained on a subset of one dataset to generalise and produce an accurate segmentation of a different dataset was explored. It was found that models trained on lower-contrast images were able to produce accurate segmentations of higher-contrast data without additional training. In comparison, U-Net models trained on higher-contrast images were found to deliver poor results when used to segment lower-contrast data. 'Portability' was further demonstrated by accurately segmenting independent data from a different synchrotron facility without additional training. This suggests that targeted training on small amounts of 'ground truth' data can produce U-Net segmentation models that can be used for rapid segmentation of a large number of different datasets without additional user input or training. However, segmentation accuracy will be lower than that of a model 'natively' trained on subsets of the target dataset;
5. The effect of segmentation accuracy on image-derived material parameters was investigated by calculating porosity and CH 4 gas saturation profiles using U-Net segmentations. A general trend of lower mean absolute error of the derived parameter with greater segmentation accuracy was found, but the correlation exhibited some scatter. Considering that porosity, fluid saturation and other parameters are ratios between material phases, it was proposed that errors in derived parameters are not only linked to segmentation accuracy metrics but to the number of false positive and negative voxel labels of the largest phase relative to the other phases.

Acknowledgments:
The authors gratefully acknowledge Diamond Light Source for the provision of beamtime (proposal MT16205-1) and are thankful for the support provided by beamline I13 staff.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A
Porosity was calculated as:
$$\text{Porosity (\%)} = \frac{\text{volume of pores}}{\text{total volume}} \times 100,$$
and CH 4 gas saturation was determined as:
$$\text{CH}_4\ \text{gas saturation (\%)} = \frac{\text{volume of CH}_4\ \text{gas}}{\text{volume of pores}} \times 100,$$
where the volume of CH 4 gas is given by the total number of CH 4 gas voxels, the volume of pores by the sum of CH 4 gas and brine-hydrate voxels, and the total volume by the total number of voxels in the image, each multiplied by the voxel volume (0.8125 × 0.8125 × 0.8125 µm³).
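A slice-by-slice evaluation of porosity and CH 4 gas saturation, together with the mean absolute error against a ground truth profile, can be sketched as follows. Since the voxel volume cancels in both ratios, voxel counts are used directly. The label encoding and function names are assumptions made for this example.

```python
import numpy as np

GAS, BRINE_HYDRATE, SAND = 0, 1, 2   # assumed label encoding

def slice_parameters(label_slice):
    """Return (porosity %, CH4 gas saturation %) for one labelled slice."""
    gas = np.count_nonzero(label_slice == GAS)
    brine = np.count_nonzero(label_slice == BRINE_HYDRATE)
    pores = gas + brine
    porosity = 100.0 * pores / label_slice.size
    saturation = 100.0 * gas / pores if pores else float("nan")
    return porosity, saturation

def profiles(label_volume):
    """Porosity and saturation profiles along the first axis of a label volume."""
    values = np.array([slice_parameters(s) for s in label_volume])
    return values[:, 0], values[:, 1]

def mean_absolute_error(profile, reference):
    """Mean absolute difference between a derived profile and its reference."""
    diff = np.asarray(profile, dtype=float) - np.asarray(reference, dtype=float)
    return float(np.mean(np.abs(diff)))
```

Applying `profiles` to a segmented volume and to its hand-annotated counterpart, and passing both results to `mean_absolute_error`, yields one error value per parameter.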
For the RootPainter method, the sand phase has been segmented using the same approach used for CH 4 described in Section 2.2.4, but using sand labels and only one quadrant of each annotation slice to produce sparsely annotated training and validation images.