Adapting SAM for Volumetric X-Ray Data-sets of Arbitrary Sizes

Objective: We propose a new approach for volumetric instance segmentation in X-ray Computed Tomography (CT) data for Non-Destructive Testing (NDT) by combining the Segment Anything Model (SAM) with tile-based Flood Filling Networks (FFN). Our work evaluates the performance of SAM on volumetric NDT data-sets and demonstrates its effectiveness in segmenting instances in challenging imaging scenarios. Methods: We implemented and evaluated techniques to extend the image-based SAM algorithm for use with volumetric data-sets, enabling the segmentation of three-dimensional objects using FFN's spatial adaptability. The tile-based approach for SAM leverages FFN's capabilities to segment objects of any size. We also explore the use of dense prompts to guide SAM in combining segmented tiles for improved segmentation accuracy. Results: Our research indicates the potential of combining SAM with FFN for volumetric instance segmentation tasks, particularly in NDT scenarios and when segmenting large entities and objects. Conclusion: While acknowledging remaining limitations, our study provides insights and establishes a foundation for advancements in instance segmentation in NDT scenarios.


Introduction
In the field of Non-Destructive Testing (NDT), large scale components and assemblies, such as cars [20], shipping containers [12,11], or even airplanes [4,5,6,3], are often captured using large scale 3D X-ray computed tomography (CT) and are subsequently subjected to automated analysis and evaluation. In this context, an important step of the analysis process is instance segmentation, where an attempt is made to assign a unique semantic identifier or label to each entity in a data-set. For example, all voxels belonging to a specific screw are assigned the same unique identifier, while voxels belonging to another component are assigned a different unique identifier.
The process shown in Figure 1 exemplifies this attempt using an XXL-CT data-set of a historic aircraft [5]. It begins with the acquisition of data from the specimen, in this case the airplane, and proceeds with the reconstruction of a volumetric voxel data-set (Figure 1a). Figure 1b and Figure 1c show a sub-volume of size 512 × 512 × 512 voxels of the reconstruction and its instance segmentation. In Figure 1c each semantic entity within the sub-volume has been assigned a unique identifier. The classes of these entities (primarily screws and metal plates) are not considered, as the classification of the entities is not performed here and is the focus of future work.
image processing and data exploration in NDT and medical [7] applications. By segmenting a large scale volumetric image data-set into its semantic instances, it becomes easier to extract valuable information and to analyze complex component geometries. This is particularly important in cases where the data-set contains various acquisition and reconstruction artefacts that can make interpretation difficult for both experts and non-experts.
Instance segmentation is a critical task in computer vision, leading to the proposal and development of numerous methods that leverage both classical image processing and neural networks. These approaches, however, are not without their limitations. Some methods necessitate manual intervention and corrections [21,24], and others are tailored specifically to predefined component classes [18]. Challenges associated with data quality, particularly in data-sets with a high incidence of artefacts, can significantly hinder the effectiveness of segmentation algorithms [3].

Segment Anything Model
The Segment Anything Model (SAM) [10] is an instance segmentation model based on the vision transformer architecture [2]. It is an advanced model for segmenting arbitrary entities out of photographs. It stands out primarily for its high quality, robustness, and minimal required user input. One of its notable features is the ability to be queried using a variety of prompts, allowing it to segment an RGB input image with a spatial resolution of up to 1024 × 1024 pixels into multiple segments in one inference call. SAM supports prompts in various forms, such as seed points (point prompts), bounding boxes, brush masks (dense prompts) and text prompts.
Furthermore, SAM allows the generation of multiple output masks for each input prompt, hence enabling image segmentations at varying hierarchical levels of granularity. Another advancement presented by SAM is the extensive training data-set SA-1B, which has been iteratively collected and refined through prior versions of SAM during its own training process.

Combination with Tile-based FFN
This work aims to evaluate the applicability of SAM for segmenting volumetric NDT data-sets and to examine its potential enhancement through the integration of Flood Filling Networks (FFN), initially proposed by Januszewski et al. [9]. FFNs are instance segmentation methods originally based on convolutional networks [13,14], which are able to segment arbitrarily large data-sets based on tiles. FFNs were originally developed for the segmentation of organic objects, but have since been extended to other applications, including the delineation of large scale XXL-CT data [4].
The FFN approach maintains the current state of segmentation within an accumulator volume, which is sized to match the dimensions of the input volume. During each segmentation step, a sub-volume or tile of the input volume and the corresponding partially computed tile of the accumulator are passed to the model (in our case, a volumetric variant of SAM). The segmentation proposal of the tile is then updated and written back to the corresponding tile position within the accumulator.
Candidates for neighbouring tile positions with significant overlap, which could extend the current segment, are determined using the updated accumulator state and added to a queue of tiles pending processing. In the subsequent iteration, the next unprocessed tile is removed from the front of the queue for processing. Starting from a seed point, the FFN thus processes all tiles which potentially belong to the current segment. The processing of the current segment is completed when the queue of potentially belonging tiles is depleted. The algorithm then proceeds with the next segment starting from another seed point.
The seed points of the segments can be specified manually or computed automatically by a suitable algorithm.
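The queue-driven traversal described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function name `flood_fill_tiles`, the callback `segment_tile`, and the tile and step sizes are hypothetical, and the model is stood in for by an arbitrary callable that receives the input tile and the matching accumulator tile.

```python
from collections import deque

import numpy as np


def flood_fill_tiles(volume, seed, segment_tile, tile=8, step=3):
    """Segment one object tile by tile, starting from `seed` (hypothetical sketch).

    `segment_tile(input_tile, accumulator_tile)` stands in for the model and
    must return a boolean proposal with the same shape as the tile.
    `step` should be smaller than `tile // 2` so that neighbouring tiles overlap.
    """
    accumulator = np.zeros(volume.shape, dtype=bool)
    queue = deque([tuple(seed)])
    visited = set()
    while queue:
        centre = queue.popleft()  # next pending tile from the front of the queue
        if centre in visited:
            continue
        visited.add(centre)
        # Tile bounds, clipped to the volume.
        lo = [max(c - tile // 2, 0) for c in centre]
        hi = [min(c + tile // 2, s) for c, s in zip(centre, volume.shape)]
        region = tuple(slice(a, b) for a, b in zip(lo, hi))
        # The model sees the input tile and the partially filled accumulator tile.
        proposal = segment_tile(volume[region], accumulator[region])
        accumulator[region] |= proposal
        # Queue neighbouring tile centres that the segment has reached.
        for axis in range(3):
            for direction in (-1, 1):
                nxt = list(centre)
                nxt[axis] += direction * step
                nxt = tuple(nxt)
                inside = all(0 <= n < s for n, s in zip(nxt, volume.shape))
                if inside and accumulator[nxt] and nxt not in visited:
                    queue.append(nxt)
    return accumulator
```

With a simple threshold as a stand-in model, `flood_fill_tiles(volume, seed, lambda t, a: t > 0.5)` follows a bright object across tile boundaries and leaves objects that are never reached from the seed untouched.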

Contributions
In this work, we propose a novel approach for volumetric instance segmentation in NDT by combining SAM with FFN. Our contributions include:

Evaluation of SAM on NDT data-sets:
We assess the performance of SAM on nondestructive testing data-sets and demonstrate its effectiveness in accurately segmenting instances in challenging CT imaging scenarios.

Implementation and evaluation of various methods to combine image-based SAM for the application with volumetric data-sets:
We implement and evaluate different techniques to integrate the output of the image-based SAM approach for application to volumetric data-sets, enabling the segmentation of three-dimensional objects using FFN's spatially adaptive capabilities.

Extending SAM for objects of arbitrary size through tile-based approaches:
We propose a tile-based approach that leverages FFN's capabilities to segment objects of arbitrary size. By initially dividing the input volumes into tiles and then applying SAM on each tile individually, we achieve accurate and efficient segmentation results for objects of any size.

Utilizing dense prompts for SAM to combine tiles in an accumulator:
To further improve the accuracy of the proposed tile-based approach of SAM, we use dense prompts to guide SAM in combining the segmented tiles into a cohesive instance segmentation result. By leveraging the accumulated information from neighbouring tiles, we try to achieve more robust and accurate instance segmentation results.

Methods
This section presents the methodology and the experimental setup used, including the introduction of the data-sets (Section 2.1) used for the evaluation of the proposed methods. We furthermore describe a technique to improve the image segmentation performance of SAM with respect to the Me 163 airplane XXL-CT data-set by fine-tuning it specifically for this task (Section 2.2). Additionally, we detail our inference workflow in Section 2.3, which adapts the top-performing SAM model for volumetric data-sets. This process includes tile-based segmentation, accumulator-based dense prompts, and postprocessing. The workflow aims to integrate the best model into a cohesive volumetric inference approach.

Data-sets and Data Processing
To demonstrate, exemplify, and evaluate our achievements, we make use of three distinct data-sets: A specific sub-volume of the Me 163 data-set of a Second World War fighter airplane [5], as well as two bulk material data-sets depicting entities of glass marbles and corn kernels [4].Figure 2 shows a photograph of each specimen, along with one typical slice from the reconstructed volume and a corresponding reference segmentation.
The Me 163 data-set utilized in this study consists of a volumetric subset of an XXL-CT reconstruction of a historic airplane [6], together with a manually obtained reference segmentation. The reference segmentation sub-volumes of the Me 163 data-set were manually annotated and underwent morphological postprocessing to clean up the edges. The acquisition process involved addressing challenging aspects such as noisy data, low contrast, and limited spatial resolution. A detailed description of the data-set creation, including the annotation and postprocessing process, can be found in [5].
The data-set consists of eight sub-volume pairs, each sub-volume having spatial dimensions of 512×512×512 voxels. For training, six sub-volume pairs of the data-set are used, while one sub-volume pair is used for validation and one for testing. Each sub-volume pair consists of a reconstructed sub-volume (see Figure 2b) and its corresponding reference segmentation sub-volume (see Figure 2c).
The reconstruction sub-volume is a small volumetric region which is extracted from the reconstructed Me 163 XXL-CT data. To ensure compatibility with SAM, both the reconstruction (input) sub-volumes and the corresponding reference segmentation sub-volumes are zero-padded by 512 voxels in every direction. This results in an embedded version of the sub-volumes with working dimensions of 1536 × 1536 × 1536 voxels. This arrangement allows for the extraction of a slice, centred on any arbitrary voxel within the original sub-volume, with a resolution of 1024 × 1024 × 1 voxels, matching the native input dimensions required by SAM.
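The embedding arithmetic above (512 + 512 + 512 = 1536, and a 1024-wide window around any original voxel) can be illustrated with a small NumPy sketch. The helper names `embed` and `centred_slice` are hypothetical; with `pad=512` and a 512³ sub-volume the sketch reproduces the dimensions stated in the text, while smaller values keep the example cheap to run.

```python
import numpy as np


def embed(sub_volume, pad):
    """Zero-pad a sub-volume by `pad` voxels on every side."""
    return np.pad(sub_volume, pad, mode="constant", constant_values=0)


def centred_slice(embedded, centre, pad, axis=0):
    """Extract a (2*pad x 2*pad) slice centred on `centre`.

    `centre` is given in coordinates of the original (un-padded) sub-volume;
    `axis` is the direction perpendicular to the extracted slice.
    """
    c = [i + pad for i in centre]                  # shift into padded coordinates
    window = [slice(ci - pad, ci + pad) for ci in c]
    window[axis] = c[axis]                         # a single slice along `axis`
    return embedded[tuple(window)]


# Scaled-down example: an 8^3 sub-volume padded by 8 gives 24^3 and 16x16 slices.
sub = np.arange(8 ** 3).reshape(8, 8, 8) + 1
emb = embed(sub, pad=8)
sl = centred_slice(emb, centre=(3, 5, 2), pad=8, axis=0)
```

Because the padding equals the sub-volume size, the window never leaves the embedded volume, even for voxels in the corners of the original sub-volume.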
The first row of Figure 3 illustrates the described enframing process for the Me 163 data-set. The green rectangles in the first two columns indicate the unembedded region with 512×512×512 voxels and their manually annotated references. Because the input sub-volumes of this data-set are located directly at the edge of the XXL-CT volume, it was not possible to fill the border of the sub-volumes with actual reconstruction values. Instead, we decided to use a border with a constant value of zero in all directions. The last two columns of Figure 3 display the prepared input and reference slices used in the subsequent processing.
The other two data-sets, which consist of CT scans of jars filled with marbles and corn, each also contain two sub-volumes: one for the input CT reconstruction sub-volume and one for its reference segmentation sub-volume. The segmentation process to yield the reference volumes of the bulk material data-sets involved semi-automatic segmentation using threshold binarization with a threshold obtained from Otsu's method [19], followed by a distance transform, watershed transform, and label-wise morphological closing, as described in more detail in [4]. As this traditional computer-vision process resulted in some erroneous segmentations in the contact regions between the jar and the bulk material, we only used a correctly segmented sub-volume in the centre of the jar, having a spatial dimension of 256 × 256 × 256 voxels (denoted by the green rectangle in Figure 3). The sub-volumes of the bulk material were also enframed by a border of 512 voxels thickness with a constant value of zero.

Fine-Tuning on NDT Data-set
The SA-1B training data-set published by the authors of SAM [10] contains predominantly coloured natural photographs, such as street scenes or still life compositions of semantically well-known objects from daily life. In contrast, volumetric data-sets obtained from the NDT field, and particularly the slices extracted from the volumes, are frequently of a rather abstract nature and do not depict recognizable objects. Hence, these NDT images deviate considerably from the familiar photographic data-set used by SAM, and this deviation poses several challenges in achieving sufficient segmentation quality (see Section 3.1). Thus, within the CT imaging domain, even familiar objects can be difficult to recognize for non-experts, as they exhibit unusual structures or non-orthogonal sections due to the specimen's imaging geometry, or they may contain strong imaging and reconstruction artefacts.
Ma et al. [16] showcased a potential improvement in segmentation quality by fine-tuning SAM on the problem domain, which inspired us to adopt a similar fine-tuning approach.
In this study, we opted to perform fine-tuning on a specific part of SAM, namely the Mask Decoder. For this purpose, we extracted and pre-processed slices from the Me 163 training data-set. Our approach adhered to the guidelines outlined in [16], which have previously been employed for fine-tuning on medical volume CT data-sets.
The Me 163 data-set was chosen due to its distinct level of complexity, setting it apart from the bulk material data-sets also being investigated. In contrast, the marble and corn data-sets can be segmented relatively easily using conventional image processing techniques.
For the fine-tuning process we randomly selected voxel positions from the Me 163 training data-set. If the chosen voxel was a foreground voxel belonging to a known labelled entity, three orthogonal slices centred around its position were extracted. These slices were used as training examples, with the data range of the input slice normalized to [0.0, 255.0]. For the target slice, all voxels of the entity to which the centre voxel belongs were one-hot encoded.
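A minimal sketch of this example extraction is given below. It makes several assumptions that the text does not fix: the normalization is taken to be a per-slice min-max scaling, the slices span the full volume rather than a 1024 × 1024 window, and the helper names are hypothetical.

```python
import numpy as np


def orthogonal_slices(volume, centre):
    """Return the three axis-aligned slices passing through `centre`."""
    z, y, x = centre
    return volume[z, :, :], volume[:, y, :], volume[:, :, x]


def normalise_to_255(slice_2d):
    """Scale a slice to the range [0.0, 255.0] (assumed per-slice min-max)."""
    s = slice_2d.astype(np.float32)
    lo, hi = s.min(), s.max()
    if hi == lo:                      # constant slice: avoid division by zero
        return np.zeros_like(s)
    return (s - lo) / (hi - lo) * 255.0


def target_mask(labels_2d, centre_label):
    """Binary target: 1 where the slice carries the label of the centre voxel."""
    return (labels_2d == centre_label).astype(np.uint8)
```

For a labelled voxel at `centre`, each of the three input slices would be paired with `target_mask` computed on the matching slice of the reference segmentation.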
SAM operates on images, while our intended input is a single slice from a volumetric data-set. To ensure that a three-dimensional connected object was represented by a single segment in the two-dimensional slices, a connected component analysis (CCA) was performed on the one-hot encoded target slice. Only the segment connected to the centre of the target slice was used as the target for training (see Figure 4). The surrounding image does not provide sufficient information to determine whether neighbouring non-touching regions belong to the same segment. Thus, we performed a CCA and treated the parts of segments not connected in the current slice as separate segments.
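The restriction of the target to the component connected to the slice centre can be sketched with a small breadth-first search. The connectivity is not specified in the text, so 4-connectivity is assumed here, and the function name `component_at` is hypothetical.

```python
from collections import deque

import numpy as np


def component_at(mask, start):
    """Return the 4-connected component of `mask` that contains `start`.

    The result is empty if `start` is not set in `mask`.
    """
    mask = np.asarray(mask, dtype=bool)
    out = np.zeros_like(mask)
    if not mask[start]:
        return out
    out[start] = True
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            in_bounds = 0 <= nr < mask.shape[0] and 0 <= nc < mask.shape[1]
            if in_bounds and mask[nr, nc] and not out[nr, nc]:
                out[nr, nc] = True
                queue.append((nr, nc))
    return out
```

For an object such as a U-profile whose two legs appear as separate regions in a slice, only the leg under the centre point is kept as the training target.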
If the voxel at the centre of a slice represented the background, we generated three orthogonal background examples, each containing a normalized input slice and a target slice. We evaluated three versions: ForegroundOnly, which included only foreground input slices; ConstantValueBackground, where we provided both background and foreground input and target slices for training, but expected SAM to produce a completely empty response for background slices; and ConnectedComponentBackground, where we identified all background voxels connected to the centre voxel of the slice as the target segment. This was achieved through CCA on the data-set's background, formed by also enframing the reference segmentation with a zero-padded boundary. Consequently, the network was prompted to consider all voxels connected to the air space in the slice's centre as part of that segment. Figure 5d provides an illustrative example of the different target versions.
Due to the significantly lower count of foreground voxels (0.1-9.4%) compared to background voxels in the Me 163 data-set, we included all foreground examples while randomly selecting a subset of background examples of the same size. This approach ensured a balanced representation of both classes. To prevent batches from containing closely located examples, the selected examples were shuffled and grouped into batches, with each batch containing 16 foreground examples and 16 background examples. Additionally, to further diversify the examples within each batch, we employed a relatively large stride during the example extraction process. This ensured that the examples originated from different regions of the volume.
We chose a single point prompt in the exact centre of each slice as the input for SAM during training. This choice aligns with the input for our validation application as well as the tile-based SAM integration for volume data-sets (see Section 2.3).
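The class balancing and batching described above could look roughly as follows. This is a sketch only: the function name, the fixed random seed, and the decision to drop an incomplete final batch are assumptions, and it presumes that at least as many background as foreground examples are available.

```python
import random


def balanced_batches(foreground, background, per_class=16, seed=0):
    """Group examples into shuffled batches with equal class counts.

    All foreground examples are used; the background is randomly subsampled
    to the same size. An incomplete final batch is dropped (assumption).
    """
    rng = random.Random(seed)
    fg = list(foreground)
    rng.shuffle(fg)
    bg = rng.sample(list(background), k=len(fg))  # random subset, random order
    batches = []
    for i in range(0, len(fg) - per_class + 1, per_class):
        batch = fg[i:i + per_class] + bg[i:i + per_class]
        rng.shuffle(batch)                        # mix the classes within the batch
        batches.append(batch)
    return batches
```

Shuffling before grouping means that examples extracted from neighbouring voxel positions rarely end up in the same batch.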
The batch size was set to 64. We initiated the training with a learning rate of 8e-4, which was linearly increased over the first 250 iterations. For optimization, we utilized the AdamW optimizer [15] with β1 = 0.9 and β2 = 0.999, along with a weight decay of 0.1. Our loss function consisted of a combination of Dice Loss (sigmoid=true, squared-pred=true, and mean reduction) and binary cross-entropy loss (mean reduction). We let the training run until overfitting set in, which took 10 to 25 days. We selected the model with the lowest validation loss, determined at moving window intervals of 128 iterations.
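To make the loss concrete, the following NumPy sketch combines a Dice loss with sigmoid activation and squared predictions in the denominator with a binary cross-entropy on logits, both with mean reduction. The unweighted sum of the two terms and the smoothing constant `eps` are assumptions; the text only states that the two losses were combined.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def dice_loss(logits, target, eps=1e-5):
    """Dice loss with sigmoid and squared terms in the denominator.

    Inputs have shape (batch, H, W); the loss is averaged over the batch.
    """
    p = sigmoid(logits)
    axes = tuple(range(1, logits.ndim))
    inter = np.sum(p * target, axis=axes)
    denom = np.sum(p ** 2, axis=axes) + np.sum(target ** 2, axis=axes)
    return float(np.mean(1.0 - (2.0 * inter + eps) / (denom + eps)))


def bce_loss(logits, target):
    """Numerically stable binary cross-entropy on logits, mean over all pixels."""
    per_pixel = (np.maximum(logits, 0) - logits * target
                 + np.log1p(np.exp(-np.abs(logits))))
    return float(np.mean(per_pixel))


def combined_loss(logits, target):
    """Unweighted sum of Dice and BCE (the weighting is an assumption)."""
    return dice_loss(logits, target) + bce_loss(logits, target)
```

A confident, correct prediction drives both terms towards zero, while a confident, inverted prediction is penalized by both.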

Inference Workflow for Volumetric Data-sets
Since SAM works only on RGB image data-sets but we wanted to segment volumetric data-sets, we had to incorporate an adequate workflow to translate between these two spatial domains. Since our goal was to evaluate SAM for volumetric data-sets and not necessarily to implement a completely new volumetric version, we resorted to simple operators. Figure 6 shows an overview of the approximate workflow for a volumetric data inference of SAM. In short, we extract a sub-volume tile from the input volume and pass it to the volumetric SAM adaption, which transforms it into three orthogonal slice stacks.
For each slice stack, we perform slice preparations (such as normalization and zero-padding), a forward pass through SAM, selection of the corresponding outputs, and slice postprocessing. The output slice stacks are then merged and undergo further volumetric postprocessing to generate segmentation proposals, which are returned from the volumetric SAM adaption to the inference algorithm. The evaluated algorithms are listed and compared in Table 1.

Adapting SAM for Volumetric Data-sets
Adapting SAM, which was originally designed for segmenting image data-sets, to our volumetric CT data-sets required certain modifications and the implementation of appropriate postprocessing steps. In this section, we explore various possibilities for this transition and subsequently outline the approach we finally selected. Several 2D to 3D techniques can be utilized to facilitate this transformation [23]. For example, in [22] a Volumetric Fusion Net (VFN) was employed to merge multiple 2D segmentation predictions into a comprehensive 3D prediction volume. In a related work, [25] adopted a similar methodology for pancreas segmentation, albeit utilizing a different VFN.
According to [23], other approaches involve incorporating neighbouring 2D slices as additional channel information or utilizing specialized topologies to extract and merge features in both the 2D and 3D domains. However, the effectiveness of these methods for improving segmentation results heavily depends on the specific data-sets at hand.
Due to reports on the segmentation performance of SAM on volumetric medical data-sets, such as those in [8], and our own preliminary experiments, which suggested that the segmentation quality of SAM was likely to be mixed, we opted for a simple majority voting approach to merge the 2D predictions into 3D volumes.
During the slice merging process, we experimented with different rules to determine when to terminate the slice-wise merging. We either combined all slices within the current field of view regardless of their content or stopped at the first empty slice, i.e., a slice without foreground voxels. We also tested various rules based on different thresholds of overlap or Intersection over Union (IoU) between the proposed segmentation of the current slice and either the preceding slice or a foreground volume obtained through global Otsu thresholding followed by a morphological closing step.
As an optimization strategy, the slice-wise prediction was performed in an alternating manner, starting from the centre of the current sub-volume and moving outward slice-wise in both directions. This approach was implemented to save computational time and prevent the segmentation of unconnected segments, ensuring that only cohesive regions were accurately identified.
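The alternating, centre-outward traversal with an early-termination rule can be sketched as below. For illustration the sketch combines two of the rules mentioned above (stop at the first empty slice, or when the IoU with the preceding slice falls below a threshold); the threshold value, the function names, and the `predict_slice` callback standing in for a per-slice SAM call are assumptions.

```python
import numpy as np


def iou(a, b):
    """Intersection over Union of two boolean masks (0.0 if both are empty)."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0


def propagate_from_centre(predict_slice, depth, min_iou=0.5):
    """Predict slices from the centre outwards, alternating between directions.

    A direction is abandoned at the first empty slice or when the IoU with
    the previously accepted slice in that direction drops below `min_iou`.
    Returns a dict mapping slice index to the accepted mask.
    """
    centre = depth // 2
    result = {centre: predict_slice(centre)}
    previous = {+1: result[centre], -1: result[centre]}
    active = {+1: True, -1: True}
    for offset in range(1, depth):
        for direction in (+1, -1):
            index = centre + direction * offset
            if not active[direction] or not 0 <= index < depth:
                active[direction] = False
                continue
            mask = predict_slice(index)
            if not mask.any() or iou(mask, previous[direction]) < min_iou:
                active[direction] = False   # empty slice or weak overlap
                continue
            result[index] = mask
            previous[direction] = mask
        if not any(active.values()):
            break                            # both directions terminated
    return result
```

Slices beyond a terminated direction are never passed to the model, which is where the saving in computation time comes from.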
In situations where the segmentation results in an identification of unconnected segments, the algorithm may inadvertently continue segmenting entire regions composed of non-cohesive segments. This phenomenon occurs when the segmentation quality is significantly compromised. During the subsequent hyperparameter search, we also permitted segmentations without applying these rules. However, it appears that these deviations have only minimal impact on the output quality.
Subsequently, a new target volume is constructed. Voxels are included in the output volume if they are segmented as foreground in at least one and, depending on the configuration, up to three slice-wise predictions.
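The voting rule can be written compactly. In this sketch, `pred_z`, `pred_y`, and `pred_x` are assumed to be boolean volumes of identical shape, each assembled from the slice-wise predictions along one axis, and `min_votes` is a hypothetical name for the configurable threshold between one and three.

```python
import numpy as np


def merge_votes(pred_z, pred_y, pred_x, min_votes=2):
    """Keep a voxel if at least `min_votes` of the three predictions agree."""
    votes = (pred_z.astype(np.uint8)
             + pred_y.astype(np.uint8)
             + pred_x.astype(np.uint8))
    return votes >= min_votes
```

Setting `min_votes=1` yields the union of the three predictions, `min_votes=2` a majority vote, and `min_votes=3` their intersection.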
Additionally, we employed postprocessing techniques such as slice-wise and volume-based median filtering and CCA before and after merging the slices into volumes to smooth scattered and mis-segmented voxels.
Table 1: Overview of algorithm choices and options for different stages of the volumetric SAM adaption seen in Figure 6.
We also conducted experiments with different variants of SAM's outputs. Since SAM has the ability to generate multiple outputs per prompt, e.g. separating a backpack from a person wearing it, we investigated whether selecting any of these outputs could improve the segmentation quality. Specifically, we examined whether it is better for volumetric segmentation to use the segmentation proposal provided by SAM with the highest predicted IoU or the one with the maximum IoU with the approximated foreground volume. Additionally, as SAM often tends to under-segment and include background or neighbouring segments as part of the foreground, we investigated whether selecting the output with the smallest count of voxels among the multiple outputs would improve the segmentation quality.
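The three selection rules can be summarized in a short sketch. The function and rule names are hypothetical; `masks` is assumed to hold the candidate masks returned for one prompt, `predicted_iou` the quality scores reported alongside them, and `foreground` the approximated foreground of the same slice.

```python
import numpy as np


def iou(a, b):
    """Intersection over Union of two boolean masks (0.0 if both are empty)."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0


def select_mask(masks, predicted_iou, foreground=None, rule="predicted_iou"):
    """Pick one of several candidate masks according to `rule`."""
    masks = np.asarray(masks, dtype=bool)
    if rule == "predicted_iou":          # highest score reported by the model
        best = int(np.argmax(predicted_iou))
    elif rule == "foreground_iou":       # best agreement with the foreground estimate
        best = int(np.argmax([iou(m, foreground) for m in masks]))
    elif rule == "smallest":             # fewest foreground pixels
        best = int(np.argmin(masks.reshape(len(masks), -1).sum(axis=1)))
    else:
        raise ValueError(f"unknown rule: {rule}")
    return masks[best]
```

The `smallest` rule is the simplest counter-measure against candidates that spill over into the background or into neighbouring entities.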
In this context, experiments were conducted using both the binarized output of SAM and the raw probability values, which are available at a lower resolution than the binary mask. After upscaling, different threshold values can be applied to the probability outputs for further processing and experimentation.

Tile-Based Segmentation for Data-sets of Arbitrary Size
Due to SAM's image-based nature, we encounter segmentation challenges when dealing with topologically complex objects depicted by volumetric CT NDT data-sets. These objects may contain holes or inclusions, form complex folds, be spatially sparse, or extend beyond the boundaries of the currently processed tile. Figure 7 displays several schematic examples of different complexity. In volumetric data-sets, such complex segments are easier to understand, but when segmenting them slice by slice there is a risk of mistakenly delineating them as multiple segments. This effect also occurs when the tile is smaller than the entity's size.
Figure 7: Schematic views of multiple simple volumetric objects (bolt, U-profile, pipe, and spiral spring) and cross-sectional slices along their central axes in three orthogonal directions, marked by three respective colours. The separation of simple objects into multiple components when processed slice-wise poses a challenge, as there are no straightforward rules for merging them without a step-by-step traversal of the object.
To overcome these challenges, we utilize the volume-based SAM inference (see Section 2.3.1) within the FFN framework (see Section 1.2). The inference process starts with a single seed point and is applied to a small sub-volume tile. The resulting segmentation proposal is then stored in a result buffer, the accumulator volume. If a segment intersects the outer boundaries of a tile, the intersection position is added to a queue. In subsequent iterations, correspondingly shifted tiles aimed at these intersection points are processed by the volume-based SAM inference. This iterative process generates segmentation proposals which are incorporated into the accumulator. This process repeats until the intersection points queue is empty and the segmentation proposal in the accumulator is no longer constrained by the boundary of the processed tiles.
As an optimization step, the proposed additional intersection positions are filtered based on the approximated foreground volume. They are added to the intersection points queue only if the corresponding voxels have a high probability of being foreground voxels.
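The detection and filtering of intersection positions might be sketched as follows. Several details are assumptions: one candidate is proposed per touched tile face (the touched voxel closest to the centroid of the contact area), the approximated foreground is represented as a boolean volume rather than a probability, and the function name is hypothetical.

```python
import numpy as np


def boundary_intersections(proposal, origin, foreground, min_voxels=1):
    """Return global positions where a tile proposal touches a tile face.

    `proposal` is the boolean segmentation of one tile, `origin` the global
    index of its first voxel, and `foreground` a boolean volume approximating
    the foreground of the whole input. Candidates outside the foreground
    are discarded.
    """
    candidates = []
    for axis in range(3):
        for side in (0, proposal.shape[axis] - 1):
            face = np.take(proposal, side, axis=axis)
            if face.sum() < min_voxels:
                continue
            touched = np.argwhere(face)
            centroid = touched.mean(axis=0)
            # Touched voxel nearest to the centroid, so it lies in the segment.
            nearest = touched[np.argmin(((touched - centroid) ** 2).sum(axis=1))]
            local = list(nearest)
            local.insert(axis, side)      # restore the coordinate along `axis`
            position = tuple(int(o + l) for o, l in zip(origin, local))
            if foreground[position]:
                candidates.append(position)
    return candidates
```

Each returned position would become the centre of a shifted tile in a later iteration; an empty list means the segment is fully contained in the current tile.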
The proposed combination of SAM and FFN makes it possible to process segments and input volumes of arbitrary size by combining multiple overlapping tiles using a temporary accumulator volume. Nevertheless, this approach also increases the runtime due to the recomputation of the overlapping tiles.
The choice of using 48 voxels per tile side was made heuristically based on the original FFN algorithm, which also uses this tile size. However, the algorithm can be adjusted by changing the tile size up to 1024 voxels in each dimension, the maximum dimension SAM can handle without resizing the input. When the tile size is below this threshold, no resizing of tiles is required as we add a constant value border around the tile. Additionally, the step width between tiles and the overlap of the tiles can be adjusted to mitigate artefacts caused by the tile-based algorithm. Tile-based algorithms are capable of assembling entities with complex topologies. These algorithms can follow or trace the segment itself over multiple tiles and steps, even if it forms highly complex shapes. However, tile-based algorithms may introduce additional artefacts. The segmentation result of the combined algorithms is heavily dependent on the performance of the SAM segmentation.

Prompt Selection and Accumulator Integration
As mentioned above, SAM allows queries using various prompts such as point prompts (seed points, bounding boxes) and dense prompts (masks, brushes). Multiple studies [16,17] have shown that, depending on the input data, higher segmentation quality can be achieved by using multiple prompts, such as point prompts distributed evenly over the segment region or negative point prompts, which mark locations that are not considered part of the segment. Additionally, the use of rectangular prompts consisting of two anchor points often leads to adequate segmentation results. Given that the main objective of this study is to evaluate the applicability of SAM in the automated NDT domain, we have opted to solely assess single point prompts and dense prompts, as they can be easily automated.
We placed a single point prompt at the exact centre of the tile. The centre point of a tile was either given by a seed point or deemed highly likely to belong to the current segment, due to the iterative processing of the tiles.
For dense prompts, we utilized the SAM output stored in the accumulator, which was shifted by the relative position of the current point prompt. This requires SAM to complete the segmentation proposal at the edge of the current tile. Since our tile step size was in the range of [1, 20] voxels, the overlap between the tiles and the dense prompt with the expected segmentation proposal was high, so that SAM only had to predict a relatively slim border of new voxels. Figure 8 illustrates an idealized schematic of such an operation. In the case of dense prompts, we also include a corresponding point prompt at the centre of the tile, as more prompts tend to increase segmentation performance [16].
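The construction of a dense prompt from the accumulator amounts to cropping the accumulator at the position of the new tile. The sketch below shows only this cropping step; the function name is hypothetical, and any conversion of the crop into the mask format expected by SAM is omitted.

```python
import numpy as np


def dense_prompt(accumulator, centre, tile):
    """Crop a `tile`-sized cube of the accumulator centred on `centre`.

    Regions outside the accumulator are filled with zeros. Padding the whole
    accumulator on every call is wasteful and done here only for brevity.
    """
    half = tile // 2
    padded = np.pad(accumulator, half, mode="constant", constant_values=0)
    # In padded coordinates the window starts at (c + half) - half = c.
    window = tuple(slice(c, c + tile) for c in centre)
    return padded[window]
```

If step n segmented a tile centred at `(6, 6, 6)` and step n + 1 moves two voxels along the first axis, `dense_prompt(accumulator, (8, 6, 6), tile)` already contains most of the segment, and only a two-voxel-wide border remains to be predicted.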

Figure 8: Schematic view of two subsequent inference steps, denoted as n and n + 1, which use the modified accumulator volume from the previous step to create a dense SAM prompt. In step n, the content of the accumulator volume of the previous step n − 1 is used to generate a dense SAM prompt n. This prompt, along with the point prompt n and the extracted input volume tile n, is used by SAM to compute prediction n. Subsequently, the accumulator volume is updated to the state n based on this prediction. In the subsequent step n + 1, the accumulator volume n is used to determine the movement n + 1 to the tile n + 1. Tile n + 1 significantly overlaps with tile n. SAM is parametrized with the extracted input volume tile n + 1, point prompt n + 1, and dense prompt n + 1 to compute prediction n + 1, which is used to update the accumulator volume to the state n + 1.

Evaluation of SAM Segmentation Quality in NDT Slice Data-sets
In an initial test of SAM's segmentation quality for CT NDT data, we applied SAM to segment individual slices from NDT volumetric data-sets. For each of the three data-sets introduced in Section 2.1, slices were randomly selected and segmented, accounting for approximately 0.5% of all available validation data. Each example underwent the preparation steps outlined in Section 2.2 before being processed by SAM. SAM then tried to segment the entity located at the exact centre of each slice using point prompts. Examples of typical segments can be seen in Figure 9. Notably, SAM demonstrated good segmentation performance for the marbles and corn kernels data-sets, while the segmentation quality was significantly inferior for the individual segments of the Me 163 data-set. Table 2 provides a summary of the statistics obtained from the conducted experiments, categorized by data-set and the model used.
Figures 13a, 13b, and 13c show the segmentation behaviour of the individual models on the different data-sets. These plots represent the loss of the segmentation proposals generated by SAM for the entities at the centre of each slice of the corresponding validation data-set. The loss values are determined with respect to the reference data-set. From left to right the loss values are sorted in ascending order, so that the nearly correctly segmented segments are on the left side of the graph, while the difficult and often incorrectly segmented segments are on the right side. The seed points of the segments were chosen in such a way that each of them corresponds to a foreground voxel, so the networks are not tasked with segmenting the background. The different colours in the plots correspond to different networks.
It can be observed that the unchanged SAM networks perform very well in segmenting the marble and corn data-sets. The few entities which exhibit lower segmentation quality in these data-sets, and are located on the right edge, are often due to insufficient quality in the reference segmentation data-set, as illustrated in Figures 10 and 11. A slightly lower segmentation quality can be observed for the corn data-set, which consists of a higher count of entities that are also not as homogeneous in colour compared to the marble data-set.
Figure 13c demonstrates that the segmentation quality for the Me 163 data-set is notably lower compared to the previously mentioned data-sets. Figure 12 displays some typical error patterns of the originally trained SAM models. Both under-segmentation and over-segmentation occur, and segments are sometimes recognized only partially or not at all. Among the different SAM models without fine-tuning, the smallest model vit_b showed the most promising results. While it was sometimes outperformed by the other two original SAM models, vit_l and vit_h, in the well-segmented slices, it still had a higher segmentation quality in the moderately segmented slices. Therefore, we decided to use vit_b as the base model for fine-tuning and volumetric segmentation experiments.
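The sorted loss curves discussed above can be reproduced in principle with a few lines. The text does not state which loss is plotted, so a Dice loss on binary masks is assumed here purely for illustration, and the function names are hypothetical.

```python
import numpy as np


def dice_loss(pred, ref):
    """Dice loss between two boolean masks (0.0 if both are empty)."""
    denom = pred.sum() + ref.sum()
    if denom == 0:
        return 0.0
    return 1.0 - 2.0 * np.logical_and(pred, ref).sum() / denom


def sorted_losses(predictions, references):
    """Per-slice losses in ascending order: easy cases first, hard cases last."""
    return np.sort([dice_loss(p, r) for p, r in zip(predictions, references)])
```

Plotting the returned array against its index for each model gives one curve per network; a curve that stays low for longer indicates a larger share of well-segmented slices.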

Among the subsequently trained networks, vit b CVB exhibits the highest quality in Figure 13c. It is based on vit b and uses ConstantValueBackground (CVB) (see Section 2.2) for background examples. In simple cases, it matches the segmentation quality of the non-fine-tuned SAM variants. Training considerably improved the segmentation quality on the challenging entities, although not to a satisfactory level. This model was chosen as the representative fine-tuned model for further tests on our data.

Tile-Based Algorithms and Artefact Mitigation
Figures 14 and 15 showcase the segmentation results of a volumetric inference run using the proposed SAM algorithm on a small subset of the corn and marble data-sets for the two tile sizes 48 × 48 × 48 voxels and 1024 × 1024 × 1024 voxels. These results exhibit segmentation errors in the form of erroneously segmented edges as well as tiling artefacts, resulting in a textured appearance of the segment with noticeable gaps.
Notably, for a tile size of 48 × 48 × 48 voxels, the marble example in Figure 15b demonstrates tiling artefacts. Since the volumetric inference algorithm with the small tile size cannot segment the entire marble in a single step, it must combine multiple steps, which can introduce and propagate errors. These artefacts can be cleaned up using a morphological closing operation as a postprocessing step.
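The morphological closing mentioned above can be sketched as a dilation followed by an erosion. The following is a minimal, dependency-free illustration on a set of voxel coordinates; the 6-connected structuring element and the function names are assumptions for illustration, and a practical pipeline would more likely rely on an array library.

```python
from itertools import product

# Assumption: 6-connected structuring element (centre plus face neighbours).
OFFSETS = [(0, 0, 0), (1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def dilate(voxels):
    """Add every voxel that touches the segment through a face."""
    return {(z + dz, y + dy, x + dx) for (z, y, x) in voxels for (dz, dy, dx) in OFFSETS}


def erode(voxels):
    """Keep only voxels whose complete neighbourhood lies inside the segment."""
    return {
        (z, y, x)
        for (z, y, x) in voxels
        if all((z + dz, y + dy, x + dx) in voxels for (dz, dy, dx) in OFFSETS)
    }


def closing(voxels):
    """Morphological closing: dilation followed by erosion."""
    return erode(dilate(voxels))


# A 3 x 3 x 3 cube with a one-voxel gap in its centre, as a toy tiling artefact.
full_cube = set(product(range(3), repeat=3))
cube_with_gap = full_cube - {(1, 1, 1)}
closed = closing(cube_with_gap)
print(len(cube_with_gap), len(closed))  # 26 27
```

The gap is filled while the outer shape of the segment is preserved. Larger gaps would require a larger structuring element, at the risk of merging neighbouring segments.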
In contrast, segmentations using a larger tile size of 1024 × 1024 × 1024 voxels exhibit fewer of these textured artefacts. However, segmentations may extend beyond the actual segment due to segmentation errors, as illustrated in Figure 14c, where thin segments protrude vertically and horizontally beyond the intended boundaries. These protrusions often occur within the initially segmented slices that include the seed point of the current segment. In the green upper right marble of the example in Figure 15c, the adjacent slices directly connected to the seed point were misclassified as not belonging to the marble, resulting in an early termination of the slice-wise segmentation process.
The inference algorithm with a tile size of 1024 × 1024 × 1024 voxels gets only one attempt per segment: owing to its large field of view, it performs a single volumetric step per seed point. In contrast, the inference algorithm with a tile size of 48 × 48 × 48 voxels iterates over the volume in multiple steps, which allows it to compensate for weak and erroneous segmentations in subsequent steps. However, this approach tends to under-segment when a neighbouring segment has already been partially segmented in a previous step.
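The multi-step behaviour of the small tile size can be illustrated with a strongly simplified two-dimensional sketch. It is not our implementation: `segment_tile` is a hypothetical stand-in for a SAM call (a flood fill restricted to one tile), the dense prompt is omitted, and the movement rule (step towards every direction in which the prediction continues) is an assumed simplification of the FFN movement policy.

```python
from collections import deque


def segment_tile(foreground, seed, origin, size):
    """Stand-in for one SAM call: flood fill from `seed` inside one tile."""
    y0, x0 = origin
    found, todo = set(), [seed]
    while todo:
        y, x = todo.pop()
        in_tile = y0 <= y < y0 + size and x0 <= x < x0 + size
        if (y, x) in found or (y, x) not in foreground or not in_tile:
            continue
        found.add((y, x))
        todo.extend([(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)])
    return found


def tiled_inference(foreground, seed, size, step):
    """Grow one segment by moving an overlapping tile over the data."""
    accumulator = set()  # voxels accepted so far
    visited = set()      # tile centres that were already processed
    queue = deque([seed])
    while queue:
        centre = queue.popleft()
        if centre in visited:
            continue
        visited.add(centre)
        origin = (centre[0] - size // 2, centre[1] - size // 2)
        prediction = segment_tile(foreground, centre, origin, size)
        accumulator |= prediction
        # Move in every direction in which the current prediction continues.
        for dy, dx in [(step, 0), (-step, 0), (0, step), (0, -step)]:
            nxt = (centre[0] + dy, centre[1] + dx)
            if nxt in prediction and nxt not in visited:
                queue.append(nxt)
    return accumulator


bar = {(0, x) for x in range(12)}   # elongated segment, longer than one tile
other = {(6, x) for x in range(3)}  # unrelated neighbouring segment
single_step = segment_tile(bar | other, (0, 0), (-2, -2), 5)
segment = tiled_inference(bar | other, (0, 0), size=5, step=2)
print(len(single_step), len(segment))  # 3 12
```

A single tile covers only part of the elongated segment, whereas the overlapping steps recover it completely. With an imperfect predictor, each additional step is also a further opportunity to introduce or propagate errors.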
Figure 16 shows the correlation matrices for the results of four inference runs on the Me 163 testing data-set. Two of the inference runs were performed using the default SAM model vit b, while the other two were performed using the fine-tuned model vit b CVB. Two of the four experiments used a tile size of 48 × 48 × 48 voxels, and the other two used a tile size of 1024 × 1024 × 1024 voxels. The parameters of each experiment were tuned on the validation data-set; the individual parameters of the four inference runs can be found in Table 3. Figure 17 displays the correlation matrices from Figure 16, restricted to the detected segments with the highest IoU.
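A correlation matrix of this kind, together with the restriction to the best-matching detected segment per reference segment, can be sketched as follows. The sketch works on flat label sequences; the function names, the background label 0, and the ordering of the detected segments by voxel count are assumptions for illustration.

```python
from collections import Counter


def iou_matrix(reference, predicted, background=0):
    """IoU of every reference segment against every detected segment.

    `reference` and `predicted` are flat sequences with one label per voxel.
    Rows are reference segments, sorted by descending voxel count.
    """
    ref_size = Counter(r for r in reference if r != background)
    pred_size = Counter(p for p in predicted if p != background)
    overlap = Counter(
        (r, p) for r, p in zip(reference, predicted)
        if r != background and p != background
    )
    ref_labels = sorted(ref_size, key=lambda label: (-ref_size[label], label))
    # Assumption: detected segments are ordered by voxel count as well.
    pred_labels = sorted(pred_size, key=lambda label: (-pred_size[label], label))
    matrix = [
        [overlap[(r, p)] / (ref_size[r] + pred_size[p] - overlap[(r, p)])
         for p in pred_labels]
        for r in ref_labels
    ]
    return ref_labels, pred_labels, matrix


def best_matches(ref_labels, pred_labels, matrix):
    """For each reference segment, the detected segment with the highest IoU."""
    result = {}
    for r, row in zip(ref_labels, matrix):
        if not row:
            result[r] = (None, 0.0)
            continue
        j = max(range(len(row)), key=row.__getitem__)
        result[r] = (pred_labels[j], row[j])
    return result


reference = [1, 1, 1, 1, 2, 2, 0, 0]
predicted = [7, 7, 7, 8, 8, 8, 0, 9]
ref_labels, pred_labels, matrix = iou_matrix(reference, predicted)
matches = best_matches(ref_labels, pred_labels, matrix)
print(ref_labels, pred_labels)
print(matches[1])  # (7, 0.75)
```

In this toy example, detected segment 8 overlaps both reference segments, which shows up as an off-diagonal entry and corresponds to an under-segmentation.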
As can be seen, the vit b CVB model tends to generate more noise outside the main diagonal. Figure 17d in particular depicts many over- and under-segmented segments. This can also be observed in the corresponding segmentation volume slices shown in Figures 18e and 18j. The correlation matrix of the fine-tuned vit b CVB model with tile size 48 × 48 × 48 voxels in Figure 17c seems to perform best with respect to diagonal segments. However, the corresponding segmentation volume slice in Figure 18h shows that this combination of model, tile size, and parameters tends to miss most of the foreground segments. It seems that the default vit b model with tile size 1024 × 1024 × 1024 voxels produces the visually best results, followed by the fine-tuned vit b CVB model with tile size 48 × 48 × 48 voxels shown in Figure 18c.
Figure 19 presents multiple renderings of the seven largest reference segments in the Me 163 testing data-set, along with predictions generated by different SAM snapshots using the volumetric algorithm and fine-tuned parameters. The true positive voxels are coloured green, while the reference segments are coloured blue and the false positive voxels are coloured orange. It is evident that the volumetric segmentation of the data-sets using tiles of size 1024 × 1024 × 1024 voxels yields visually more appealing segments compared to using a tile size of 48 × 48 × 48 voxels.
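The colour coding of such renderings can be expressed as simple set operations on voxel coordinates. The sketch below assumes that the blue reference voxels are those not covered by the prediction; the function name `classify_voxels` is hypothetical.

```python
def classify_voxels(reference_voxels, predicted_voxels):
    """Split the voxels of one reference/prediction pair into colour groups."""
    reference, predicted = set(reference_voxels), set(predicted_voxels)
    return {
        "green": reference & predicted,   # true positives
        "blue": reference - predicted,    # reference voxels missed by the prediction
        "orange": predicted - reference,  # false positives
    }


reference = {(0, 0, 0), (0, 0, 1), (0, 0, 2)}
predicted = {(0, 0, 1), (0, 0, 2), (0, 0, 3)}
groups = classify_voxels(reference, predicted)
print({colour: len(voxels) for colour, voxels in groups.items()})
```

A rendering that is mostly blue therefore indicates a segment that was largely missed, while large orange regions indicate a prediction that leaks into neighbouring structures.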
The predicted segmentation using the tile size of 48 × 48 × 48 voxels often appears empty, as only a small number of voxels have been segmented correctly. This is because the segmentation quality of the algorithm is too poor to generate connected tiles, so often only a limited number of steps (see Section 2.3.2) is iterated for each segment. The segments are interrupted and only found in pieces. However, using a tile size of 48 × 48 × 48 voxels also often leads to under-segmentation, as demonstrated in Figure 20.
Here, three adjacent segments were mistakenly connected by a single predicted segment.
But even the segmentation with a tile size of 1024 × 1024 × 1024 voxels is often insufficient, as both large-scale under-segmentations and over-segmentations occur, as can be seen from the correlation matrices in Figure 17 and the cross-sectional images in Figure 18j.

Discussion
The transferability of the SAM model to instance segmentation of volumetric XXL-CT data-sets requires careful consideration. The presented results indicate that its two-dimensional image-based segmentation quality is insufficient for this specific problem domain. This limitation becomes particularly evident when dealing with the concatenation of numerous intertwined cross-sectional images in the volumetric case. The low contrast and high noise in these images pose challenges in accurately delineating individual segments. Additionally, domain-specific fine-tuning and the resulting improvement of the slice-wise predictions did not yield substantial improvements for the volumetric predictions.
One potential source of error in the presented method might be the limited computational resources allocated for both fine-tuning and the subsequent hyperparameter search. A more thorough optimization process could potentially improve the results. Furthermore, the availability of [...] included in SAM. Specifically, the absence of neighbouring voxels when adding the 512-voxel-wide border around the Me 163 data-set may have contributed to a decrease in segmentation quality.
Additionally, considering improved algorithms for merging the slice-wise predictions could be an initial step in the further development process. Previous studies [22,25,23] have demonstrated ample opportunities for the development of more sophisticated algorithms in this area. Implementing and embedding such algorithms into the processing pipeline has the potential to significantly enhance the segmentation quality.

Conclusion
The primary objective of this study was to explore the applicability of SAM, an algorithm for general image delineation, to instance segmentation in XXL-CT volumetric data-sets.
In conclusion, our study highlights the potential of SAM for instance segmentation in XXL-CT volumetric data-sets, while acknowledging that there is still significant room for improvement. Furthermore, our research has contributed in the following areas: (1) the evaluation of SAM on volumetric NDT data-sets, (2) the exploration of various methods for integrating image-based SAM with volumetric data-sets, (3) the introduction of a tile-based approach for segmenting objects of arbitrary size, and (4) the utilization of dense prompts for tile combination using an accumulator. These contributions provide insights and establish a foundation for further advancements in this field.

Figure 3: Zero-padding preparation steps were performed on the input and reference slices of the different data-sets to create slices of size 1024 × 1024 pixels centred around each possible seed point. The white border regions in the available input and reference slices were filled with constant values of zero.

Figure 4: Processing of an example foreground slice used for fine-tuning SAM, consisting of the reconstruction slice (Figure 4a), the reference slice (Figure 4b), the one-hot encoding (Figure 4c), and the connected component training target (Figure 4d). The green cross marks the centre of the slice.

Figure 5: Processing of an example background slice used for fine-tuning SAM. The green cross marks the centre of the slice, which is located in the background of the reconstruction. The green border around the reconstruction slice in Figure 5a depicts the original volume size, which was then enframed with a constant-value border. The other sub-figures show the tested possibilities for target slices for the fine-tuning: ForegroundOnly (Figure 5b), ConstantValueBackground (Figure 5c), and ConnectedComponentBackground (Figure 5d).

Figure 9: Segmented examples of the corn and marbles data-sets. The green crosses mark the position of the currently used point prompt. The last column depicts the result of the vit b model which was fine-tuned on the Me 163 data-set.

Figure 10: Error cases for the marble data-set. Here, the reference segmentation, which was generated by a connected component analysis, is erroneous. In Figure 10b the point prompt (marked with a green cross) lies on the boundary of two marbles, and vit b segments the upper marble instead of the lower marble. In Figure 10f the point prompt lies inside an artefact region.

Figure 11: Error cases of the corn data-set. In the first case, in Figure 11b, two kernels have been erroneously segmented together in the reference segmentation. In contrast, in Figure 11f the reference segmentation only appears erroneous, as the current slice depicts only one voxel of the kernel; the next slice in the input volume contains the kernel this voxel belongs to. The green crosses mark the position of the currently used point prompt.

Figure 12: Poorly performing cases for SAM vit b segmenting thin metal sheets in the Me 163 data-set, as well as the better but still not optimal segmentation results achieved by the model fine-tuned on the Me 163 data-set.

Figure 13: Graphs depicting the slice segmentation performance of the six evaluated SAM models on the three different testing data-sets. The horizontal axis shows, from left to right, the index of each segmented slice, sorted by loss value. In an ideal case, only a horizontal line close to the loss value of 0 would be visible.
[1]. The correlation matrices show the IoU of each reference segment in relation to each detected segment. The reference segments are sorted from top to bottom based on their voxel count, with the segment having the largest voxel count at the top. Similarly,

Figure 14: Slices from a volumetric inference run on three corn kernels of the corn data-set. The input volume (Figure 14a), reference volume (Figure 14d), and the segmentations generated by the proposed algorithm using the two tile sizes: 48 × 48 × 48 voxels (Figure 14b) and 1024 × 1024 × 1024 voxels (Figure 14c). Additionally, the postprocessed volumes are depicted in Figures 14e and 14f.

Figure 15: Slices from a volumetric inference run on three marbles of the marbles data-set. The input volume (Figure 15a), reference volume (Figure 15d), and the segmentations generated by the proposed algorithm using the two tile sizes: 48 × 48 × 48 voxels (Figure 15b) and 1024 × 1024 × 1024 voxels (Figure 15c). Additionally, the postprocessed volumes are depicted in Figures 15e and 15f.

Figure 16: Correlation matrices of default and fine-tuned volumetric SAM with multiple tiles of size 48 × 48 × 48 voxels or a single tile of size 1024 × 1024 × 1024 voxels on the Me 163 testing data-set.

Figure 17: Correlation matrices of default and fine-tuned volumetric SAM with multiple tiles of size 48 × 48 × 48 voxels or a single tile of size 1024 × 1024 × 1024 voxels on the Me 163 testing data-set. Detected segments have been limited to the best matches for each reference segment.

Figure 18: Exemplary slices of the proposed volumetric inference output performed by default and fine-tuned SAM models on the Me 163 reference data-set (Figures 18a and 18f). For the remaining figures, the top row (Figures 18b to 18e) shows all segments depicted in Figure 16, while the bottom row (Figures 18g to 18j) only shows the segments corresponding to the main diagonal in Figure 17.

Figure 6: Schematic workflow of the volumetric data inference segmentation using SAM. Algorithm options and steps for the configurable stages (grey boxes) are listed in Table 1.

[...] values in each slice to the minimum and maximum range of the slice.
Outlier and Empty Slice Detection: Identification and handling of outlier and empty slices.
RGB Conversion: Conversion of grey values to RGB colour in order to comply with SAM interface requirements.
Enframing: Adds a zero-padded border to each slice to centre the seed point to comply with SAM interface requirements.
Estimated Foreground Volume: Utilizes different binarization strategies and thresholds to estimate the foreground volume.

Table 2: Mean loss value (and standard deviation) over all slice-wise predictions on the validation data-sets by multiple models for the graphs in Figure 13.

Table 3: Parameters optimized on the Me 163 validation data-set for the default vit b and fine-tuned vit b CVB SAM model for the tile sizes of 48 × 48 × 48 voxels and 1024 × 1024 × 1024 voxels. (FG = foreground; - = not applicable; options marked with * indicate volumetric SAM parameters seen in Table 1; options marked with × indicate FFN-related parameters)