Improved Computer Vision Framework for Mesoscale Simulation of Xiyu Conglomerate Using the Discrete Element Method

Abstract: The complex mechanical characteristics of the Xiyu conglomerate significantly influence the resistance and deformation behavior of the surrounding rock of its caverns, thereby constraining the construction of related water diversion tunnels. This paper introduces an improved SegFormer framework developed for the detection of mesoscale geomaterial structures. Computerized tomography (CT) scan images of the Xiyu conglomerate were employed to establish a high-precision numerical model. In segmentation, the proposed algorithm outperformed UNet, HRNet, and the original SegFormer neural network. The segmentation results were used to calculate the porosity, and biaxial compression numerical simulation experiments based on the real structure were carried out using the particle flow code (PFC). We observed the failure process of the model and obtained the shear strength of the Xiyu conglomerate. We explored the causes and influencing factors of the anisotropy of the Xiyu conglomerate from the microstructure perspective and provide a micro-observation basis for establishing an anisotropic mechanical model.


Introduction
The Xiyu Formation, one of the most important and widely cited late Cenozoic strata in western China, refers to a set of dark gray or deep gray, massively thick, blocky conglomerate deposits that are widely distributed in the foothill zones of the Tianshan and Kunlun Mountains [1]. The thickness of the Xiyu conglomerate varies from tens of meters to several kilometers, with an average thickness of about 2000 m [2]. Due to its complex material composition and special structure, the Xiyu conglomerate has characteristics such as poor cementation, low mechanical strength, and softening when encountering water [3]. The resistance and deformation characteristics of the surrounding rock in the karst caves have become one of the key issues restricting the scientific design of related water diversion tunnel projects in this area [4]. In addition, rock slopes formed by the Xiyu conglomerate are prone to erosion at the slope toe, posing severe challenges to water conservancy and hydropower engineering construction in this area [5].
Currently, the existing failure models inadequately explain the mechanical properties of the Xiyu conglomerate, restricting the effectiveness of conventional evaluation methods in analyzing the stability of the Xiyu conglomerate's slopes and tunnels. Qin [6] used a finite difference method to conduct numerical simulation experiments on the softening failure mechanism of the high slopes and slope toes of the Xiyu conglomerate and found a significant difference between the simulation results and the observed slope failure phenomena. Wang et al. [7] analyzed the elastic resistance coefficient of the tunnels' surrounding rock using an inversion method and found a gap between the calculated elastic resistance coefficient and the actual deformation resistance characteristics of the surrounding rock. Li et al. [8] studied the softening failure mechanism of the high slope toes of the Xiyu conglomerate using a range analysis method and analyzed the influence of the slope angle and other sensitivity factors on the stability of the high slopes, but this method cannot fully consider the influence of other factors, such as the mechanical properties of the Xiyu conglomerate, on stability.
In the field of characterizing the microstructure of heterogeneous materials, the widespread application and development of computer science and digital imaging have made it possible to analyze micromechanical behavior by coupling digital image processing with numerical simulation. This method is of great significance for studying the mechanical properties of heterogeneous materials [9]. Yang et al. [10] reconstructed non-homogeneous models, such as a silicate cement and asphalt mixture, based on real images and analyzed the advantages and disadvantages of image-based models and computer-generated models. Ali Abdulqader et al. [11] developed the StereoDIC system for detecting internal defects. This system was successfully applied in uniaxial compression tests of rock cores, suggesting its potential use in estimating material properties. Woodman et al. [12] utilized coupled discrete element method (DEM) and grain-based modeling (GBM) simulations to accurately represent the microstructure of laboratory specimens. This approach allowed them to study the effects of thermo-mechanical loading on fine-grained sandstone in a detailed and precise manner. Yu et al. [13] successfully estimated the porosity of sedimentary rocks by using digital image analysis methods and pore and crack analysis systems (PCAS). Qian et al. [14] modified an automatic deep learning-based segmentation algorithm for scanning electron microscopy (SEM) images of cement paste. This enhanced approach has enabled the precise examination of the irregularity and roundness of unhydrated cement particles, as dictated by the segmentation results.
With the advancement of numerical simulation theory, the micromechanical properties of rocks can be studied using this technology [15,16]. Numerical simulations encompass three types of methods: continuous, discontinuous, and hybrid continuum-discontinuum [17]. Discontinuous methods, such as the DEM, are often favored in numerical simulations of heterogeneous materials due to their ability to explicitly consider microstructures and simulate complex fracture behaviors [18][19][20]. Based on the contact type, DEMs can be divided into two types. The first type is represented by the UDEC (Universal Distinct Element Code). It handles various contact types, including face-face, face-vertex, and vertex-vertex, necessitating complex algorithms [21,22]; this complex contact detection also makes it computationally slower. The other type is the PFC, where the contact is simpler and only one contact appears between two particles [18,23,24]. Notably, the PFC offers higher computational efficiency, which is why it was chosen for this study, although it has certain limitations in dealing with continuum analysis [25].
In this study, a computer vision framework was modified to establish a DEM model with real geomaterial structures, and a biaxial compression numerical simulation experiment was conducted with the spherical DEM software PFC to investigate the mechanical properties of the Xiyu conglomerate. This study demonstrates that the approach can effectively segment the CT images of the rock and precisely simulate its mechanical behavior under various loading conditions. The deep learning-based segmentation model employed in this research outperforms previous edge detection techniques in segmenting the mesoscale structures of geomaterials. This advancement enables more accurate mesoscale simulation and facilitates further investigation of the anisotropy of heterogeneous materials.

Method
The collected Xiyu conglomerate samples were subjected to CT scanning. After three-dimensional reconstruction of the obtained CT images, slicing was performed. Subsequently, the different components within the rock were labeled, and a dataset for training and validating the deep learning model was generated. Afterwards, instance segmentation of the remaining regions was performed using the trained model. The porosity of the rock was calculated, and the boundaries of each component were extracted. A high-precision numerical simulation model was generated from the segmented images, and two-dimensional biaxial compression tests were conducted. This section discusses these steps in detail.


Data Preparation
The samples used in this study were extracted from the Xiyu conglomerate in the Xinjiang region, with sample dimensions of 50 mm × 100 mm. Because the data were limited and training a neural network on large images is impractical, image cropping was necessary. To preserve the original aspect ratio, the initial images were interpolated to a size of 4096 × 3584 pixels. Non-overlapping cropping was then applied with a cropping stride of 512 × 512 pixels, reducing the size to 512 × 512 pixels and yielding 56 (8 × 7) cropped images from each original image.
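The tiling step above can be sketched as follows; the helper is illustrative (the function name and the use of NumPy are our own), and the interpolation to 4096 × 3584 pixels is assumed to have already been performed:

```python
import numpy as np

def crop_tiles(image: np.ndarray, tile: int = 512) -> list:
    """Split an image into non-overlapping tile x tile patches (stride = tile)."""
    h, w = image.shape[:2]
    tiles = []
    for top in range(0, h - tile + 1, tile):
        for left in range(0, w - tile + 1, tile):
            tiles.append(image[top:top + tile, left:left + tile])
    return tiles

# A 4096 x 3584 image (rows x cols = 3584 x 4096) yields 7 x 8 = 56 tiles.
image = np.zeros((3584, 4096), dtype=np.uint8)
tiles = crop_tiles(image)
print(len(tiles))  # 56
```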
The gravel, pores, and matrix of the rocks were identified and labeled as regions of interest (ROIs) using LabelMe, a graphical tool for manual image annotation [26]. These regions, depicted in Figure 1, are also referred to as instances. In this study, the ROIs were labeled manually, a common approach in supervised learning. Manual annotation, while highly accurate, has the drawback of being time-consuming and labor-intensive. The distinct colors in Figure 1 denote the different ROI instances.


Constructing Networks
The network construction framework for this study is illustrated in Figure 2. Semantic segmentation, a highly popular and crucial aspect of deep learning in computer vision, involves labeling each pixel in an image with its corresponding semantic class, achieving fine-grained segmentation and comprehension of the image. Standard datasets such as COCO (Common Objects in Context) and Cityscapes are extensively employed for semantic segmentation tasks [27,28]. These datasets furnish numerous images with pixel-level labels for training and evaluating semantic segmentation models, and many deep learning algorithms are designed to address segmentation tasks akin to these standard datasets. However, everyday photos exhibit notable differences from CT images:
1. CT images encapsulate high-resolution internal structural information, enabling applications in rock classification and mineral composition analysis. They excel at capturing subtle structures and details at the microscale, which contributes to their larger dimensions.
2. Owing to domain-specific constraints, CT images typically have a smaller overall data volume than existing large-scale datasets. Moreover, data diversity may be relatively low, and the distinctions between labels may be less apparent. Traditional algorithms may struggle to predict effectively in this context.
3. The research requires outputting the porosity of specified regions based on CT images. This necessitates more complete segmentation of the boundaries of these regions and the capability to model long-range dependencies.
Considering points 2 and 3, along with additional considerations related to image texture, the Transformer architecture was selected. It is crucial to account for the impact of features at different scales and to determine how to combine these diverse features; the decoder of SegFormer incorporates this functionality. The fundamental concept of this algorithm is to integrate the self-attention mechanism of the Transformer into the semantic segmentation task, empowering the model to establish relationships between different pixels and to comprehend the global and local structure of the image [29,30]. In contrast to traditional convolutional neural network (CNN)-based segmentation methods, SegFormer harnesses the Transformer's ability to capture long-range dependencies between pixels, proving highly effective for intricate segmentation challenges. The network architecture of SegFormer, comprising the hierarchical Transformer encoder and lightweight ALL-MLP decoder modules, is shown in Figure 3.


• The hierarchical Transformer encoder functions as the core network in the SegFormer model. It is tasked with extracting features from the input image and transmitting these features to the segmentation decoder. Its purpose is to capture features at various scales within the image and integrate global contextual information, thereby improving segmentation performance.

• The objective of this module is to generate CNN-like multi-level features from an input image. These features offer high-resolution coarse features and low-resolution fine-grained features, which typically enhance semantic segmentation performance. More precisely, for an input image with a resolution of H × W × 3, patch merging is performed to yield a hierarchical feature map F_i with a resolution of H/2^(i+1) × W/2^(i+1) × C_i, where i ∈ {1, 2, 3, 4}. Overlapped patch merging is employed to reduce the size of the feature maps while simultaneously increasing the number of channels.
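For concreteness, the stage resolutions implied by this formula can be computed as follows; the channel widths are illustrative defaults (those of the MiT-B0 encoder variant), not values reported in this study:

```python
# Hierarchical feature-map shapes produced by the SegFormer encoder for an
# H x W x 3 input: stage i has resolution H / 2**(i+1) x W / 2**(i+1) x C_i.
# The channel counts are illustrative (MiT-B0 uses 32, 64, 160, 256).
def encoder_shapes(h: int, w: int, channels=(32, 64, 160, 256)):
    return [(h // 2 ** (i + 1), w // 2 ** (i + 1), c)
            for i, c in enumerate(channels, start=1)]

print(encoder_shapes(512, 512))
# stage 1: 128 x 128, stage 2: 64 x 64, stage 3: 32 x 32, stage 4: 16 x 16
```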

• The self-attention layer in the encoder represents the most computationally demanding component. To mitigate computational complexity, a reduction ratio R is applied to decrease the length of the sequence:

K_hat = Reshape(N/R, C · R)(K)
K = Linear(C · R, C)(K_hat)

where K is the sequence to be reduced, Reshape(N/R, C · R)(K) reshapes K to a shape of N/R × (C · R), and Linear(C_in; C_out)(·) refers to a linear layer taking a C_in-dimensional tensor as input and generating a C_out-dimensional tensor as output [31].
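A minimal NumPy sketch of this sequence reduction, with a random matrix standing in for the learned linear projection and N assumed divisible by R:

```python
import numpy as np

rng = np.random.default_rng(0)

N, C, R = 1024, 64, 4
K = rng.standard_normal((N, C))

# Reshape(N/R, C*R): fold R consecutive tokens into the channel dimension.
K_folded = K.reshape(N // R, C * R)

# Linear(C*R, C): project back to C channels, shortening the sequence by R.
W = rng.standard_normal((C * R, C)) / np.sqrt(C * R)
K_reduced = K_folded @ W

print(K.shape, "->", K_reduced.shape)  # (1024, 64) -> (256, 64)
```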

• Mix-FFN: enhancing Transformer layers with positional information. The Mix-FFN structure enhances Transformer layers by integrating positional information. This is accomplished by incorporating a 3 × 3 convolutional layer and an MLP layer into the FFN:

x_out = MLP(GELU(Conv_3×3(MLP(x_in)))) + x_in

where x_in is the feature from the self-attention module.
• Lightweight ALL-MLP decoder: The lightweight ALL-MLP decoder serves as the segmentation decoder in the SegFormer model. Its primary objective is to transform the feature maps extracted by the encoder into pixel-level semantic segmentation predictions. This is accomplished through multi-scale feature fusion and upsampling to produce the final semantic segmentation mask. An essential rationale for SegFormer's use of MLPs in the decoder lies in the larger receptive field of Transformers compared to CNNs. The decoding process can be summarized as follows:
1. Multiple hierarchical features are fed into an MLP layer to standardize the channel dimension;
2. Feature maps are upsampled to 1/4 of the original image size and subsequently concatenated;
3. A single MLP layer aggregates the feature channels;
4. The predicted segmentation mask is output at a resolution of H/4 × W/4 × N_cls, where N_cls is the number of categories.
The decoder is formulated as follows:

F_hat_i = Linear(C_i, C)(F_i), ∀i
F_hat_i = Upsample(H/4 × W/4)(F_hat_i), ∀i
F = Linear(4C, C)(Concat(F_hat_i)), ∀i
M = Linear(C, N_cls)(F)

where M refers to the predicted mask and Linear(C_in; C_out)(·) refers to a linear layer with C_in and C_out as the input and output vector dimensions, respectively [31].
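The four decoding steps can be sketched end to end as follows; random weights stand in for the learned MLPs, nearest-neighbour upsampling replaces the actual upsampling operator, and the encoder shapes and class count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 512
C = 256                      # unified decoder channel width
N_cls = 3                    # e.g. gravel, matrix, pore

# Illustrative hierarchical encoder outputs (channels-last).
feats = [rng.standard_normal((H // 2 ** (i + 1), W // 2 ** (i + 1), c))
         for i, c in enumerate((32, 64, 160, 256), start=1)]

def linear(x, c_out, rng):
    w = rng.standard_normal((x.shape[-1], c_out)) / np.sqrt(x.shape[-1])
    return x @ w

def upsample(x, h, w):       # nearest-neighbour upsampling
    ry, rx = h // x.shape[0], w // x.shape[1]
    return np.repeat(np.repeat(x, ry, axis=0), rx, axis=1)

# 1. unify channels; 2. upsample to H/4 x W/4 and concatenate
unified = [upsample(linear(f, C, rng), H // 4, W // 4) for f in feats]
fused = linear(np.concatenate(unified, axis=-1), C, rng)   # 3. aggregate
mask = linear(fused, N_cls, rng)                           # 4. predict

print(mask.shape)  # (128, 128, 3) = H/4 x W/4 x N_cls
```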
In this study, enhancements were made to the decoder of the SegFormer model. The original decoder concatenates features from different scales, overlooking the inherent differences between these features across scales: lower-level features typically capture structural and textural characteristics, while higher-level features focus more on semantic representation. In contrast to other methods, the improved algorithm prioritizes the complementarity between these two types of features. UNet and HRNet, network architectures widely applied in image segmentation tasks, aim to label each pixel in an image with a specific semantic category for precise segmentation [32,33]. These models underscore the significance of multi-scale features for capturing information at different sizes and levels, thereby enhancing segmentation performance. The segmentation results of UNet, HRNet, the original SegFormer, and the modified SegFormer were compared.
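This excerpt does not state which metric the comparison used; mean intersection-over-union (mIoU) is a standard choice for semantic segmentation and is sketched here for reference:

```python
import numpy as np

def mean_iou(pred: np.ndarray, truth: np.ndarray, n_cls: int) -> float:
    """Mean per-class intersection-over-union between two label masks."""
    ious = []
    for c in range(n_cls):
        inter = np.logical_and(pred == c, truth == c).sum()
        union = np.logical_or(pred == c, truth == c).sum()
        if union:                      # skip classes absent from both masks
            ious.append(inter / union)
    return float(np.mean(ious))

truth = np.array([[0, 0, 1], [1, 2, 2], [2, 2, 2]])
pred  = np.array([[0, 0, 1], [1, 1, 2], [2, 2, 2]])
print(round(mean_iou(pred, truth, 3), 3))  # 0.822
```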
UNet is a deep learning network architecture employed for image segmentation tasks; its name derives from its U-shaped structure. A key feature is the combination of an encoder and a decoder for pixel-level image segmentation, which facilitates assigning each pixel in the input image to a specific semantic category, achieving semantic segmentation at the pixel level. The structure of UNet, encompassing the encoding-decoding structure and the use of skip connections, has inspired many other network architectures, such as SegNet and PSPNet. Initially applied in medical image segmentation, such as segmenting tissue structures in MRI and CT images, UNet has evolved into a widely employed algorithm in practical applications, including geotechnical engineering [34].
HRNet aims to maintain feature propagation at multiple resolutions for the improved capture of details and contextual information in images, thus enhancing segmentation performance. Its core idea involves preserving feature maps at multiple resolutions, addressing the resolution loss issue in traditional convolutional neural networks. HRNet is especially adept at handling fine-grained objects, retaining high-resolution information for better discrimination and localization of small objects or intricate structures. The model's multi-branch structure facilitates information capture at different scales, augmenting the model's comprehension of object context and overall structure. Detailed network configurations for HRNet can be found in the literature [35].

Improvements to SegFormer
While the SegFormer model performs feature fusion across different scales, the fusion process might not adequately account for the differences between these features in certain scenarios, leading to suboptimal performance. The SegFormer model relies on the hierarchical Transformer encoder and the lightweight ALL-MLP decoder to fuse features at different scales. Although these modules theoretically capture the relationships between features at different scales, in practice the model's design and parameter settings may not fully address the distinctions between these features.
The hierarchical Transformer module models features at different scales, but it may not effectively handle spatial differences in some cases. Additionally, the lightweight ALL-MLP decoder may not adequately distinguish the importance of features from different scales during fusion. In the instance segmentation of the rock CT images, although the original model incorporates self-attention processing and feature concatenation, it still falls short of fully considering the differences between features of different scales. For instance, the self-attention mechanism may average some spatial information instead of weighting it according to scale differences, and the feature concatenation process may not accurately discern the contributions of features from different scales.
To address the issue of overlooking feature differences when directly concatenating features of different scales, and to achieve bidirectional fusion between these two types of features, we designed the module illustrated in Figure 4.

Firstly, two hierarchical features are input, as shown in Figure 5. The symbols ①, ②, ③, and ④ represent the feature maps of different scales illustrated in Figure 5. When ① is used as the high input, ② becomes the low input. Both features pass through a conv module to adjust their channel numbers. The low feature undergoes a sigmoid activation and is then multiplied elementwise with the high feature to enhance edge information; the result is then combined with the low feature to obtain richer semantic information. The high feature is likewise passed through a sigmoid activation and multiplied elementwise with the previous feature to enhance attention towards the region of interest; finally, the result is added to the high feature to obtain the output. This achieves bidirectional fusion. The output is then used as the high feature, with ③ as the low feature, and the process is repeated. Ultimately, the features at each level are combined, fully utilizing features of different scales. Essentially, we integrate features level by level: ① and ② are fused first, the outcome is combined with ③, and the resulting output is further integrated with ④, completing the entire feature-splicing process.
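A minimal sketch of one bidirectional fusion step, assuming 1 × 1 convolutions (plain channel projections with random weights) for the conv modules, elementwise addition for "combined", and equal spatial sizes (any resizing between scales is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def conv1x1(x, c_out, rng):
    w = rng.standard_normal((x.shape[-1], c_out)) / np.sqrt(x.shape[-1])
    return x @ w

def fuse(high, low, c, rng):
    high = conv1x1(high, c, rng)          # align channel numbers
    low = conv1x1(low, c, rng)
    edge = sigmoid(low) * high + low      # low gates high: edges + semantics
    out = sigmoid(high) * edge + high     # high gates the result: attention
    return out

# Two feature maps of the same spatial size but different channel widths.
f1 = rng.standard_normal((32, 32, 64))    # "high" input (e.g. feature 1)
f2 = rng.standard_normal((32, 32, 32))    # "low" input (e.g. feature 2)
fused = fuse(f1, f2, 64, rng)
print(fused.shape)  # (32, 32, 64)
```

The output would then serve as the high input for the next level, mirroring the cascaded fusion described above.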
The learning rate, a hyperparameter controlling the step size of the model parameter updates, was then chosen. A smaller learning rate usually results in more stable training but requires more training time; the learning rate of the network was set to 0.001 in this project. Additionally, a weight decay of 0.01 was applied as a regularization term to prevent the model from overfitting to the training data.
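A generic (decoupled) weight-decay update illustrates how these two hyperparameters enter training; the optimizer actually used is not named in the text, so this is a sketch rather than the training configuration:

```python
import numpy as np

lr, weight_decay = 0.001, 0.01   # values reported for this project

def step(weights: np.ndarray, grad: np.ndarray) -> np.ndarray:
    # gradient step plus a decay term that shrinks weights toward zero
    return weights - lr * grad - lr * weight_decay * weights

w = np.array([1.0, -2.0])
g = np.array([0.5, 0.5])
w = step(w, g)
print(w)  # each weight moves against its gradient and shrinks slightly
```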

Modelling
The binary image obtained in the previous steps needs to be vectorized before it can be used for numerical simulation. Previous studies have proposed a method for this: Meng et al. developed an open-source software package for the mechanical behavior analysis of heterogeneous materials, which includes geometrical and analytical routines for vectorizing bitmap images [36]. The vector map provides geometric statistics of the model, enabling the construction of a high-accuracy DEM model.



Bitmap Vectorization
The transformation from a binary image to a vector image can be achieved using the following method. Figure 6 illustrates its four steps: connected-component labeling, boundary extraction, boundary refinement, and geometry scaling.
In the connected-component labeling implementation, a digital image's microstructure is stored as discrete pixels that are grouped by the following procedure. A binary image of n × m pixels is represented by f(i, j) in the Cartesian coordinate system, where f(i, j) = 1 if pixel (i, j) belongs to the microstructure and f(i, j) = 0 otherwise [37]. To obtain the geometry of the microstructure, such as pores or gravel, each pixel's group is determined. A fast two-pass algorithm segments the geometry into regions: a zero label matrix of the same size as f, a label variable l, and a label set variable li are initialized; the neighborhood Ω, shown in Figure 7, uses a 4-neighbor domain; the first pass labels each pixel and its domain, and a subsequent loop merges domains with the same label. For boundary extraction, we utilized a method proposed for extracting microstructure boundaries from CT or NMR images [38]. The method labels the connected pixels using a region-growing algorithm, then divides each pixel into four edges and stores them in a list. The algorithm then removes edges that appear twice in the list and connects the remaining edges to form a closed curve.
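The two-pass procedure can be sketched as follows, assuming a union-find structure for merging provisional labels (the paper's exact bookkeeping with l and li may differ):

```python
import numpy as np

def two_pass_label(f: np.ndarray) -> np.ndarray:
    """Two-pass 4-neighbor connected-component labeling of a binary image."""
    labels = np.zeros_like(f, dtype=int)
    parent = {}                                # union-find forest

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]      # path compression
            a = parent[a]
        return a

    next_label = 1
    # First pass: assign provisional labels, record label equivalences.
    for i in range(f.shape[0]):
        for j in range(f.shape[1]):
            if not f[i, j]:
                continue
            neighbors = []
            if i > 0 and labels[i - 1, j]:
                neighbors.append(labels[i - 1, j])
            if j > 0 and labels[i, j - 1]:
                neighbors.append(labels[i, j - 1])
            if not neighbors:
                parent[next_label] = next_label
                labels[i, j] = next_label
                next_label += 1
            else:
                labels[i, j] = min(neighbors)
                for n in neighbors:            # merge equivalent labels
                    parent[find(n)] = find(labels[i, j])
    # Second pass: replace each provisional label by its root label.
    for i in range(f.shape[0]):
        for j in range(f.shape[1]):
            if labels[i, j]:
                labels[i, j] = find(labels[i, j])
    return labels

f = np.array([[1, 1, 0, 0],
              [0, 1, 0, 1],
              [0, 0, 0, 1]])
lab = two_pass_label(f)
print(len(np.unique(lab)) - 1)  # 2 connected components
```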
After determining the edges of the microstructure, in the boundary refinement step, the microstructure edges are connected into a closed polyline using the Douglas-Peucker algorithm (see Figure 8a). This polyline is jagged and has too many points, which are undesirable for numerical modeling. To smooth the polyline and reduce the number of points, we employed the boundary refinement procedure described in Algorithm 1 [39]. This algorithm divides the vertices into two sections using the longest line (see Figure 8b). Subsequently, it calculates the line equation and the maximum perpendicular distance from a vertex to the line, d_p = |ax_k + by_k + c| / sqrt(a^2 + b^2), where ax + by + c = 0 is the line equation and (x_k, y_k) are the vertex coordinates. It repeats this for each section until the maximum distance falls below a threshold (see Figure 8c). Finally, it obtains the refined boundary by connecting the edges (see Figure 8d). Algorithm 1 Boundary refinement using the Douglas-Peucker algorithm [39]: 1: set d_t as the threshold distance and d_p as the perpendicular distance; 2: find a separation line l_m with the maximum length; 3: separate E_l into two polylines, E_l(1) and E_l(2), with the line l_m. The microstructure geometry in digital images is measured in pixels rather than physical distance (mm or µm), so we employed a geometry scaling procedure to convert the geometry to its real size. Typically, the digital image has a proportional scale indicating the resolution, from which the real distance represented by one pixel can be estimated.
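A minimal sketch of the Douglas-Peucker refinement in Algorithm 1 might look like the following. The function name and the open-polyline assumption are ours (a closed boundary is split into two open sections before refinement, as in the algorithm's step 3):

```python
import math

def douglas_peucker(points, dt):
    """Recursive Douglas-Peucker refinement of an open polyline of
    (x, y) tuples. A vertex is kept only if its perpendicular
    distance to the chord exceeds the threshold dt."""
    if len(points) < 3:
        return list(points)
    (x1, y1), (x2, y2) = points[0], points[-1]
    # Line through the endpoints: a*x + b*y + c = 0
    a, b = y2 - y1, x1 - x2
    c = x2 * y1 - x1 * y2
    norm = math.hypot(a, b) or 1.0
    # Maximum perpendicular distance of the interior vertices.
    dmax, idx = 0.0, 0
    for k in range(1, len(points) - 1):
        dp = abs(a * points[k][0] + b * points[k][1] + c) / norm
        if dp > dmax:
            dmax, idx = dp, k
    if dmax <= dt:
        return [points[0], points[-1]]
    # Split at the farthest vertex and refine each section.
    left = douglas_peucker(points[: idx + 1], dt)
    right = douglas_peucker(points[idx:], dt)
    return left[:-1] + right
```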

DEM Model Generation
The procedures in Section 2.2.1 transform a digital image into a vector graph consisting of numerous polygons. For heterogeneous materials with pores or aggregates, computing geometric statistics for the microstructure is crucial [40]. Three parameters, namely, the area, the minimum bounding rectangle (MBR), and the size and direction, are analyzed in the program.

•
The area of a polygon with n vertices and coordinates (x_k, y_k) for the kth vertex can be calculated with the shoelace formula, A = (1/2)|Σ_{k=1}^{n} (x_k y_{k+1} − x_{k+1} y_k)|, where (x_{n+1}, y_{n+1}) = (x_1, y_1). By looping over all the polygons, we can obtain the statistical information of the area.

•
The MBR is a rectangle that encloses all the points in a given set and has the smallest area of all the enclosing rectangles. It is frequently used as an indicator of geographic features [41]. For each boundary polygon, a convex hull is computed. For each edge in the hull, the hull is rotated to obtain a candidate rectangle. This process is performed for all the edges, and the smallest rectangle is selected; it is then rotated back to obtain the MBR.

•
The particle size and direction are computed from the MBR of each particle. The short and long axes of the MBR correspond to the width and length of the particle, respectively. The width can serve as an approximation of the particle radius for composite materials. The direction is defined as the angle between the long axis and the horizontal line.
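The three geometric statistics above (area, MBR, and size/direction) can be sketched as follows. This is an illustrative implementation, not the paper's routine, assuming each boundary polygon arrives as an ordered list of (x, y) vertices:

```python
import math

def polygon_area(vertices):
    """Shoelace formula: area of a simple polygon; the absolute value
    makes the result independent of winding direction."""
    n = len(vertices)
    s = 0.0
    for k in range(n):
        x1, y1 = vertices[k]
        x2, y2 = vertices[(k + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

def convex_hull(points):
    """Andrew's monotone chain convex hull, counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def min_bounding_rect(points):
    """Minimum bounding rectangle: rotate the hull so each hull edge
    in turn is axis-aligned, take the smallest axis-aligned box.
    Returns (area, width, length, angle), where width/length are the
    short/long axes and angle is the long-axis direction in radians."""
    hull = convex_hull(points)
    best = None
    for i in range(len(hull)):
        x1, y1 = hull[i]
        x2, y2 = hull[(i + 1) % len(hull)]
        theta = math.atan2(y2 - y1, x2 - x1)
        c, s = math.cos(-theta), math.sin(-theta)
        xs = [c * x - s * y for x, y in hull]
        ys = [s * x + c * y for x, y in hull]
        w, h = max(xs) - min(xs), max(ys) - min(ys)
        area = w * h
        if best is None or area < best[0]:
            width, length = min(w, h), max(w, h)
            angle = theta if w >= h else theta + math.pi / 2
            best = (area, width, length, angle % math.pi)
    return best
```

The width returned by `min_bounding_rect` is what the text uses as the approximate particle radius, and the angle gives the particle direction.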
Statistical geometric information is crucial for creating high-quality numerical models. This information can be represented using vector graphics and converted to a drawing exchange format (DXF) file. The conversion process maps the geometric shapes and paths from the source vector format to the corresponding entities in the DXF file.

Performance of the Instance Segmentation
The size distribution of the Xiyu conglomerate was analyzed by segmenting the CT images. After iterative adjustments, the best algorithms were selected based on the validation results from UNet, HRNet, SegFormer, and the modified SegFormer. As depicted in Figure 9, the improved SegFormer generates binary and grayscale segmented images.
Three metrics, namely, the intersection over union (IoU), dice coefficient, and F1 score, are employed to evaluate the predictive performance of the different models. The IoU measures the extent of overlap between the predicted positive samples and the actual positive samples.
IoU = TP/(TP + FP + FN), where TP represents the true positive count, FP the false positive count, and FN the false negative count. The IoU ranges from 0 to 1, with higher values indicating a greater overlap between the predicted results and the actual labels; an IoU of 1 indicates a perfect match between the two.
The dice coefficient measures the similarity between the predicted positive samples and the actual positive samples, similar to the IoU: Dice = 2TP/(2TP + FP + FN). The dice coefficient is more robust in handling segmentation tasks with class imbalance.
The F1 score combines the precision and recall of a model: F1 = 2 × Precision × Recall/(Precision + Recall). Precision is the ratio of true positive predictions to all the positive predictions, while recall is the ratio of true positive predictions to all the actual positive samples.
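The three metrics can be computed directly from a pair of binary masks, for example as below. For binary masks the dice coefficient and the F1 score coincide; both are computed for clarity:

```python
import numpy as np

def segmentation_metrics(pred, truth):
    """IoU, dice coefficient, and F1 score for binary masks.
    TP/FP/FN are pixel counts over the whole mask."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(pred & truth)      # predicted positive, actually positive
    fp = np.sum(pred & ~truth)     # predicted positive, actually negative
    fn = np.sum(~pred & truth)     # predicted negative, actually positive
    iou = tp / (tp + fp + fn)
    dice = 2 * tp / (2 * tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return iou, dice, f1
```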
The evaluation results of the three metrics for the four methods are presented in Table 1. It is evident that the SegFormer model outperforms UNet and HRNet in terms of accuracy by a significant margin; additionally, HRNet exhibits slightly better accuracy than UNet. UNet performs exceptionally well in low-resolution images or small-sample segmentation tasks. However, for high-resolution CT image instance segmentation, the decoder of UNet may not effectively handle long-range dependencies within high-resolution images. For certain instance segmentation tasks, instances may exhibit extensive feature connections, and the local feature propagation of UNet may be insufficient to capture such global dependencies. Moreover, in terms of multi-scale processing, the images may contain subtle structures and regions at different scales. HRNet is capable of handling these subtle differences without sacrificing resolution; in comparison, UNet primarily relies on the encoder-decoder architecture for information propagation and lacks the advantage of HRNet's multi-branch approach for handling multi-scale information. The SegFormer model demonstrated significant improvements over HRNet in all the evaluation metrics, particularly in accurately identifying pores. SegFormer incorporates a hierarchical Transformer encoder that efficiently handles multi-scale information, which is crucial for high-resolution CT image instance segmentation, where images may contain details and structures at different scales. By effectively capturing these subtle differences, SegFormer enhances the accuracy of instance segmentation.
Through progressively integrating features at different scales, the improved SegFormer model pays more attention to the distinctions between them, further enhancing segmentation accuracy, especially in recognizing the matrix. The dice coefficient improved by 6.57, and the IoU improved by 8.79. The recognition results of the improved SegFormer are displayed in Figure 10.

Numerical Simulation Model Generation
To study the mechanical properties of the Xiyu conglomerate, a numerical model was generated based on the porosity statistics and the real CT reconstruction images in the previous section. The size of the model was the same as the actual size of the sample. To generate an accurate numerical simulation model, the first step is to import the previously generated DXF file into PFC. Next, according to the geometry, particles with different properties are filled into the model. Finally, the model is servoed. The specific steps to generate the model are as follows:
1. Based on the porosity statistics in the previous section, we set the porosity to 0.2 and the maximum-to-minimum particle size ratio Rmax/Rmin to 1.66. The file containing the coordinate points of the polygons generated earlier was imported into PFC. The balls were automatically filled by approximating the medial surface of the geometry outlines.
2. To simulate the different properties of the various parts of the Xiyu conglomerate, we classified the contacts of the model into three categories. By judging the group to which the two ends of a contact belong, it can be classified as a gravel contact, a matrix contact, or a gravel-matrix contact.
3. The generated model is then subjected to servo control to achieve an ideal state of contact between the particle systems as quickly as possible, and the loading analysis is performed from this state. This process is essential when using the discrete element method for mechanical analysis. For biaxial compression tests on the Xiyu conglomerate, the loading plates around the rock sample need to be adjusted to complete the servo control and stabilize its internal structure. This is done by continuously adjusting the servo speed in the horizontal and vertical directions. Meanwhile, the number of contacts between the particles and the boundary walls is counted, and the servo parameter G for the next time step is calculated. The wall normal velocity u_n^w is then obtained, and servo control for the next time step is started. This cycle is repeated until the average contact stress on all the boundary walls reaches the specified requirements. Let N_c be the number of particles in contact with the boundaries, k_n the average contact stiffness, δt the time step length, A the wall area, α the stress release factor, and σ_measured and σ_required the measured stress and target stress on the loading plates, respectively. The servo adjustment coefficient and loading speed can then be expressed, respectively, as G = αA/(N_c k_n δt) and u_n^w = G(σ_measured − σ_required).
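A single servo update following these relations might be sketched as below. The function and parameter names are ours; in practice this update runs inside the simulation loop on the model's current state:

```python
def servo_velocity(sigma_measured, sigma_required,
                   n_c, k_n, dt, area, alpha=0.5):
    """One servo update for a loading wall, following the relations
    in the text: G = alpha * A / (N_c * k_n * dt) and
    u_n^w = G * (sigma_measured - sigma_required).
    alpha is the stress release factor; the other arguments come
    from the current model state."""
    g = alpha * area / (n_c * k_n * dt)          # servo adjustment coefficient
    return g * (sigma_measured - sigma_required)  # wall normal velocity
```

When the measured stress exceeds the target, the returned velocity is positive, i.e., the wall retreats to release stress; the cycle repeats until the boundary stresses settle at the specified values.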

Parameter Calibration
In this study, a linear parallel bond model was used to characterize the macroscopic mechanical properties of the Xiyu conglomerate. The parallel-bond component acts in parallel with the linear component and establishes an elastic interaction between the pieces. The existence of a parallel bond does not preclude the possibility of slipping. Parallel bonds can transmit both force and moment between the pieces [19,42,43]. This model can provide two interfaces with distinct mechanical behavior by simulating a finite-sized piece of cement-like material deposited between the two contacting pieces [44,45]. Owing to these properties, the model is well suited to simulating the Xiyu conglomerate.
The discrete element numerical simulation of particles requires the input of micro-parameters that characterize the material properties. A common method for determining these micro-parameters is to start from the macro-mechanical parameters of the material and use numerical tests to calibrate the micro-parameters [46].
Several parameters were calibrated in this study. The effective modulus (E*) and the parallel bond effective modulus (Ē*) are the effective moduli of the frictional interface and the bonded interface, respectively, and they are roughly equal. The k ratio (κ*) and the parallel bond k ratio (κ̄*) are the ratios between the normal and shear stiffness of each interface. The parallel bond tensile strength (σ̄c) and the parallel bond cohesion (c̄) are the micro-mechanical parameters that allow the bonded surface to withstand tensile and shear forces, respectively. The friction angle (ϕ) is analogous to the internal friction angle among the macro-mechanical parameters, and the friction coefficient (µ) is the ratio between the friction force and the normal force. The micro-parameters of the Xiyu conglomerate are presented in Table 2, where the corresponding mechanical parameter ratio of the matrix to the gravel is 0.45 and that of the gravel-matrix interface to the matrix is 0.2.
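The contact classification from step 2 and the parameter ratios just quoted can be illustrated together as follows. The parameter name and gravel value below are placeholders; only the 0.45 and 0.2 ratios come from the text:

```python
def classify_contact(group_a, group_b):
    """Category of a contact from the groups of its two ends, as in
    the text: gravel contact, matrix contact, or gravel-matrix contact."""
    if group_a == group_b:
        return f"{group_a} contact"
    return "gravel-matrix contact"

def scaled_parameters(gravel_params, matrix_ratio=0.45, interface_ratio=0.2):
    """Derive matrix and interface micro-parameters from the gravel
    values using the reported ratios: matrix/gravel = 0.45 and
    gravel-matrix interface/matrix = 0.2.  `gravel_params` maps
    parameter names (hypothetical here) to gravel values."""
    matrix = {k: v * matrix_ratio for k, v in gravel_params.items()}
    interface = {k: v * interface_ratio for k, v in matrix.items()}
    return {"gravel": dict(gravel_params), "matrix": matrix,
            "gravel-matrix": interface}
```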

Destruction Process and Mechanical Characteristics
The internal friction angle and cohesion of rock are important indicators for evaluating the mechanical strength of the Xiyu conglomerate; therefore, based on the results of the parameter calibration, a biaxial compression test was conducted on a Xiyu conglomerate model with a simulated porosity of 20%. The lateral pressure plates applied a constant stress σ3 to the rock sample, and the loading cover plate moved at a set speed, which was equivalent to applying σ3 + Δσ = σ1 to the rock sample. The confining pressures of the tests were 1 MPa, 2 MPa, and 4 MPa, respectively.

Destruction Phase
During the pre-peak phase (point A), the stress-strain curve forms an approximately straight line, signifying that the model is in the elastic phase. The formation of cracks is relatively limited, and those that do appear tend to align with the direction of maximum compression. These cracks are distributed across the specimen and appear to be non-interacting. A non-linear segment (point B) emerges as the strain increases toward the peak, signifying the model's partial entry into the plastic phase with rapidly developing cracks. One or more macroscopic failure planes develop, cutting across the specimen. As the stress-strain curve transitions into the post-peak phase (point C), the cracks continue evolving and join to form through-going cracks, leading to a rapid drop in the compressive stress and damage to the model. Figure 11 presents the resultant stress-strain curve, along with the crack development process across the different stages.

Comparison of Mechanical Parameters
To comprehensively investigate and validate the mechanical properties of the Xiyu conglomerate, biaxial tests were conducted on samples subjected to varying confining pressures. Subsequently, the Mohr-Coulomb criterion was applied: τ_f = c + σ tan ϕ, where τ is the shear stress on any section of the rock, τ_f is the critical shear stress at the onset of failure, c is the rock cohesion, and ϕ is the rock internal friction angle. The major and minor principal stresses σ1 and σ3 were obtained by monitoring the stress on the loading plates. Because the applied confining pressure was less than 10 MPa, the rock envelope line could be simplified as a straight line [47]. The internal friction angle of the rock sample was calculated to be 34.70°, and the rock cohesion was 0.64 MPa. The resultant stress-strain curves obtained through biaxial compression tests at different confinements are shown in Figure 12.

Fan [48] conducted a shear test on the Wuyi Reservoir in Xinjiang, from which the shear strength used in this paper was derived. The experiment determined that the natural internal friction angle of the Xiyu conglomerate is 39.0° and the rock cohesion is 0.56 MPa. The comparison between the numerical experiment and the physical experiment is shown in Table 3. The results show that the numerical results are 14.29% higher than the laboratory results in cohesion and 11.29% lower in the internal friction angle ϕ.
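Under the linear-envelope simplification, the conversion from peak principal stresses at several confinements to ϕ and c can be sketched as follows. The stress values used in any call are up to the user; the relations sin ϕ = (k − 1)/(k + 1) and c = b(1 − sin ϕ)/(2 cos ϕ) follow from fitting σ1 = kσ3 + b to the Mohr-Coulomb criterion:

```python
import math

def mohr_coulomb_fit(sigma3, sigma1):
    """Estimate the internal friction angle (degrees) and cohesion from
    peak principal stresses at several confinements, assuming a linear
    strength envelope sigma1 = k*sigma3 + b (valid here since the
    confining pressure is below 10 MPa)."""
    n = len(sigma3)
    mx = sum(sigma3) / n
    my = sum(sigma1) / n
    # Least-squares line through the (sigma3, sigma1) points.
    k = sum((x - mx) * (y - my) for x, y in zip(sigma3, sigma1)) \
        / sum((x - mx) ** 2 for x in sigma3)
    b = my - k * mx
    sin_phi = (k - 1) / (k + 1)
    phi = math.degrees(math.asin(sin_phi))
    c = b * (1 - sin_phi) / (2 * math.cos(math.radians(phi)))
    return phi, c
```

With the peak stresses monitored on the loading plates at 1, 2, and 4 MPa confinement, this procedure yields the ϕ and c values reported above.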

Conclusions
Based on the CT images of the Xiyu conglomerate, we used various image processing techniques to conduct biaxial compression tests on a high-precision numerical model using the discrete element method. By analyzing the experimental data and the complete data processing workflow, we obtained the following conclusions:
1. Based on the CT scan images, we proposed an improved SegFormer segmentation algorithm, which achieved high-accuracy instance segmentation by training the network after manually annotating the pores, matrix, and gravel. The performance of our method was compared with different algorithms under various evaluation metrics, and the results show that our method was more accurate than the original one.

2. The rock cohesion c of the Xiyu conglomerate is about 0.64 MPa, and the rock internal friction angle ϕ is about 34.70°. The conglomerate has complex mechanical properties, manifested by various failure modes. The factors affecting the failure modes include the loading speed of the sample, the composition and distribution of the heterogeneous material, etc.
This study helps us to understand the causes and influencing factors of the anisotropy of the Xiyu conglomerate from the microstructure perspective and provides a micro-observation basis for establishing an anisotropic mechanical model of the Xiyu conglomerate. At the same time, the deep-learning-based segmentation algorithm showed its power in this study, which will help us to obtain relevant information about rocks more conveniently in future work.

Figure 1 .
Figure 1. An example of labeling in LabelMe. (a) Original image, (b) image after labeling.


Figure 2 .
Figure 2. An overview of the image segmentation process from data collection to prediction and statistics.

Appl. Sci. 2023, 13, x FOR PEER REVIEW
The network architecture of SegFormer, comprising the hierarchical Transformer encoder and lightweight ALL-MLP decoder modules, is shown in Figure 3.

Figure 4 .
Figure 4. The implementation method of feature fusion.


Figure 5 .
Figure 5. The process of multi-scale feature fusion.

2.1.4. Network Implementation
All the code was developed using the PyTorch deep learning framework for constructing and training the model. The experiments were performed on a single graphics processing unit (GPU), an RTX 3080. The initial parameters for the models were obtained from a model pretrained on the COCO dataset. A transfer learning approach was used, leveraging a model pretrained on a large-scale dataset as a starting point and fine-tuning it for the specific task by adjusting the model parameters. During training, the AdamW optimizer, a commonly used deep learning optimizer, was employed to optimize the model parameters and minimize the loss function. The data were batched into sizes of 6 for each training iteration. A total of 200 training epochs were completed, meaning the entire training dataset was iterated over 200 times to enable the model to learn and adapt to the given task. The learning rate, a hyperparameter controlling the step size of the model parameter updates, was set to 0.001; a smaller learning rate usually results in more stable training but requires more training time. Additionally, a weight decay of 0.01 was applied as a regularization term to prevent overfitting to the training data.


Figure 6 .
Figure 6. Flow chart depicting the steps of bitmap vectorization.


Figure 9 .
Figure 9. Comparison of the segmentation results with different networks. (a) Original image, (b) gray-scale image, (c) improved SegFormer-generated image, (d) UNet-generated image, (e) HRNet-generated image, (f) SegFormer-generated image.


Figure 11 .
Figure 11. Comparative analysis of the stress-strain curve and crack development process. Cracks are shown as green and blue lines, indicating tensile and shear failure, respectively.


Figure 12 .
Figure 12. Stress-strain curves obtained through biaxial compression tests at confinements of 1, 2, and 4 MPa for a 50 mm × 100 mm specimen of the PFC Xiyu conglomerate model.


Table 1 .
Segmentation accuracy of the four methods.


Table 2 .
Micro-mechanical DEM parameters of different parts of Xiyu conglomerate.

Table 3 .
Comparative analysis of parameters from numerical simulation and physical experiments.
