Deep Learning-Based Intelligent Analysis of Rock Thin Sections: From Cross-Scale Lithology Classification to Grain Segmentation for Quantitative Fabric Characterization

Yang, Wenhao; Li, Ang; Zhang, Liyan; Qin, Xiaoyao

doi:10.3390/electronics15071509

Petroleum Institute, China University of Petroleum-Beijing at Karamay, Karamay 834000, China

Electronics 2026, 15(7), 1509; https://doi.org/10.3390/electronics15071509

Submission received: 2 March 2026 / Revised: 30 March 2026 / Accepted: 1 April 2026 / Published: 3 April 2026

Abstract

Quantitative microstructure evaluation of sedimentary rock thin sections is essential for revealing reservoir flow mechanisms and assessing reservoir quality. However, traditional manual identification is inefficient and prone to subjectivity. Although current deep learning approaches have improved efficiency, most remain confined to single tasks and lack a pathway to translate image recognition into quantifiable geological parameters. Moreover, these methods struggle with cross-scale feature extraction and accurate grain boundary localization in complex textures. To overcome these limitations, this study proposes a three-stage automated analysis framework integrating intelligent lithology identification, sandstone grain segmentation, and quantitative analysis of fabric parameters. To address scale discrepancies in lithology discrimination, Rock-PLionNet integrates a Partial-to-Whole Context Fusion (PWC-Fusion) module and the Lion optimizer, which mitigates cross-scale feature inconsistencies and enables accurate screening of target sandstone samples. Subsequently, to correct boundary deviations caused by low contrast and grain adhesion, the PetroSAM-CRF strategy integrates polarization-aware enhancement with dense conditional random field (DenseCRF)-based probabilistic refinement to extract precise grain contours. Based on these outputs, the framework automatically calculates key fabric parameters, including grain size and roundness. Experiments on 3290 original multi-source thin-section images show that Rock-PLionNet achieves a classification accuracy of 96.57% on the test set. Furthermore, PetroSAM-CRF reduces segmentation bias observed in general-purpose models under complex texture conditions, enabling accurate parameter estimation with a roundness error of 2.83%. 
Overall, this study presents an intelligent workflow linking microscopic image recognition with quantitative analysis of geological fabric parameters, providing a practical pathway for digital petrographic evaluation in hydrocarbon exploration.

Keywords: rock thin sections; deep learning; lithology classification; grain segmentation; quantitative fabric analysis

1. Introduction

Variations in the microstructural parameters of sedimentary reservoirs (especially sandstones) alter grain packing and pore structure, thereby controlling petrophysical properties such as porosity and permeability and ultimately influencing reservoir storage and flow capacity [1,2,3,4]. Thin-section analysis provides critical microscopic mineralogical and structural data, enabling the identification of mineral compositions and diagenetic features. Thus, it serves as a cornerstone for elucidating the relationship between depositional environments and reservoir quality [5,6]. However, as hydrocarbon exploration extends into unconventional and highly heterogeneous reservoirs, geological analysis and modeling require the integration of multi-source, heterogeneous, and cross-scale large datasets. Consequently, traditional manual methods are ill-equipped to meet the rigorous demands of modern geological modeling for multidimensional quantitative data, particularly in processing complex data structures and performing multidimensional correlation analysis [7,8]. Manual identification is not only labor-intensive and time-consuming but is also prone to subjective bias from operator experience, hindering the consistency and reliability of quantitative analysis [9,10,11]. Therefore, developing efficient, objective, and automated petrographic analysis workflows has become a key focus in digital rock physics. Recent studies in petroleum, subsurface, and broader georesource engineering have increasingly highlighted the value of data-driven screening, integrated quantitative analysis, and mechanism-oriented optimization for more intelligent exploration and development workflows [12,13,14,15].

Although computer vision techniques have demonstrated potential in geological image analysis, achieving fully automated and accurate characterization remains challenging. First, accurate lithology classification of sedimentary thin sections requires the coordinated extraction of petrographic information across different observational scales, ranging from local textural and morphological details to field-of-view-scale fabric characteristics [16,17,18]. This cross-scale heterogeneity places higher demands on the representation capability of classification models. Second, under low-contrast conditions, grain boundaries in thin-section images often become blurred or even merge, which complicates automated segmentation and biases statistics such as grain size [19]. Third, existing AI-based studies predominantly focus on qualitative identification of image content, with limited attention given to the quantitative characterization of geological fabric parameters, thereby providing insufficient data support for reservoir quality evaluation.

To address these critical issues, extensive research has been conducted in recent years. For lithology classification, Polat et al. [20] achieved high accuracy in 2021 by applying transfer learning with ResNet50 and DenseNet121 to volcanic rock thin-section classification, validating the effectiveness of convolutional neural networks (CNNs) in rock image recognition. In the same year, Xu et al. [21] proposed a Faster R-CNN-based method, further highlighting the potential of deep learning frameworks for intelligent lithology identification. However, traditional CNNs primarily focus on local features and often struggle to capture the global context and long-range dependencies required to characterize complex sedimentary structures. This restricts their recognition performance in scenarios with intricate stratigraphic relationships [22]. To overcome this limitation, Koeshidayatullah et al. [23] introduced the transformer-based FaciesViT in 2022, leveraging self-attention to enhance core image classification. Subsequently, in 2024, Cao et al. [24] proposed CoreViT, which combines parallel transformer encoders with a class encoder, further showing the promise of Vision Transformers (ViT) for lithofacies identification. Transformers excel at capturing global image relationships, offering advantages for classifying thin-section images with complex textures [25]. However, pure ViTs are less effective at capturing low-level local features, which may limit generalization [26]. Additionally, thin-section image analysis often lacks high-quality annotated datasets [27,28]. In such data-scarce scenarios, ViTs may be unstable or prone to overfitting, typically incurring higher training costs and longer training times [29]. To combine the strengths of both CNNs and ViTs, subsequent research has explored hybrid architectures. For instance, Appiah-Twum et al. [30] proposed DenseViT, demonstrating that hybrid designs can improve classification performance. 
However, existing hybrid models often rely on fixed feature selection and fusion rules, which limit their ability to adapt to texture variations in complex thin sections and leave room for improvement in feature utilization efficiency [31,32]. To address these issues, this paper proposes the Rock-PLionNet classification network. To balance training efficiency and generalization under limited data, the network adopts the lightweight EfficientNetV2-S [33] as the backbone, thereby avoiding the reliance of pure transformer architectures on massive training data. To compensate for the limited long-range modeling of traditional CNNs, a PWC-Fusion module is integrated into the network to mitigate representation inconsistency between micro-scale textures and macro-scale fabrics. Using a lightweight attention mechanism, PWC-Fusion adaptively reweights shallow detail features and deep structural features, enabling cross-scale complementarity within a unified feature space. However, rock thin-section images often exhibit substantial intra-class variation in texture and scale (such as sandstones with different grain sizes), whereas inter-class differences (such as those between micritic limestone and dolomite) can be extremely subtle. In preliminary experiments, this property, coupled with the dynamic reweighting in PWC-Fusion, led to gradient oscillations when using the conventional AdamW optimizer. To address this optimization challenge, inspired by the sign-based momentum update mechanism reported by Chen et al. [34], we introduce the Lion optimizer. Owing to its insensitivity to gradient magnitude, Lion smooths the training trajectory and improves convergence stability when processing thin sections with complex feature distributions.

In grain segmentation, Saxena et al. [35] employed a CNN in 2021 for pixel-level segmentation of sandstone microscopic images, establishing a methodological basis for automatic extraction of sandstone constituents and pore structures. Subsequently, in 2022, Das et al. [36] used the proposed DSGSN network to perform binary segmentation of sandstone thin sections under plane-polarized and cross-polarized light, demonstrating its superior accuracy. However, such data-driven methods rely heavily on large-scale, high-quality annotations, which limit their applicability in digital rock physics [37,38]. To alleviate this data bottleneck, the Segment Anything Model (SAM) proposed by Kirillov et al. [39] has attracted increasing attention because of its strong zero-shot generalization capability. In 2025, Sylvester et al. [40] proposed Segmenteverygrain, which combines SAM with U-Net for thin-section image segmentation. Their results showed that SAM-generated masks are generally more robust and more accurate than those produced by U-Net, although errors persist in cases with complex backgrounds or insufficient contrast. In the same year, Zhang et al. [41] quantitatively demonstrated that SAM’s segmentation accuracy degrades markedly as target-to-background contrast decreases and that SAM is more likely to fail when boundaries are difficult to discern. To address inaccurate boundary delineation in low-contrast sandstone thin section images, this study proposes the PetroSAM-CRF segmentation strategy. The method introduces an adaptive preprocessing workflow to enhance input quality by increasing grain-to-background contrast. It further incorporates lithological discrimination information from the classification stage as a prior to filter and constrain candidate masks generated by SAM, thereby producing initial segmentation results that are more geologically plausible. 
However, SAM may still yield uncertain predictions and jagged pixel-level boundaries; such errors can directly compromise subsequent parameter calculations. Therefore, DenseCRF [42] is integrated to exploit color and spatial consistency at the pixel level and correct local errors. This step mitigates grain adhesion and internal fragmentation, enabling refined delineation of grain boundaries. Rather than serving as a generic post-processing chain for a foundation model, PetroSAM-CRF is designed as a sandstone-oriented segmentation strategy in which preprocessing, lithology-constrained mask screening, and boundary refinement are jointly organized to improve the reliability of downstream quantitative fabric characterization.

To achieve quantitative evaluation, recent efforts have attempted to develop automated workflows that integrate segmentation with downstream evaluation. For example, Azzam et al. [43] developed GrainSight to compute grain morphological statistics, and Barbosa et al. [44] further derived grain-size parameters after segmentation. However, these studies typically assume that the lithology of the input images is known and uniform. In practical geological exploration involving large, unlabeled thin-section datasets, rock types are often unknown. Because segmentation strategies and parameter settings often vary significantly with lithological characteristics, applying fixed-parameter segmentation models directly to mixed lithology data without accurate classification priors can degrade generalization in complex scenarios. In the absence of classification priors, boundary adhesion may occur because cement and grain margins can exhibit similar colors in thin-section images, which further affects segmentation results. To address this issue, this study proposes a three-stage cascaded automated framework that integrates lithology classification, grain segmentation, and quantitative evaluation. The framework utilizes classification results as priors to constrain segmentation and parameter computation. By precisely selecting target lithologies, it optimizes subsequent analysis and provides a feasible pathway toward robust, fully automated quantitative evaluation.

Overall, the three-stage cascaded framework proposed in this study enables a transition from manual qualitative identification to intelligent quantitative evaluation. By integrating the PWC-Fusion module with the Lion optimizer in Rock-PLionNet, we effectively resolve the challenges in existing methods, including multi-scale lithological feature extraction and training convergence. This integration improves identification accuracy for three sedimentary rock types. Meanwhile, PetroSAM-CRF alleviates the limited adaptability of general foundation models to complex geological textures, thereby ensuring the accuracy of subsequent quantitative metric calculations. The technical workflow proposed in this paper enhances the efficiency of geological thin-section analysis and provides a practical pathway for intelligent quantitative evaluation of sedimentary rock thin sections.

2. Methods

2.1. Automated Analysis Workflow

Petrographic thin-section analysis is fundamental to sedimentology, reservoir geology, and mineralogy research. However, traditional manual identification methods are inefficient and highly subjective, making them inadequate for large-scale quantitative analysis. To address this limitation, this study constructs a three-stage cascaded automated analysis workflow that enables automated processing from image input to quantitative evaluation.

Figure 1 illustrates the overall framework of the proposed workflow. The workflow begins with lithology classification. Raw thin-section images are preprocessed and then fed into the Rock-PLionNet network, which outputs identification results for three categories: sandstone, limestone, and dolomite. Because sandstone is a crucial hydrocarbon reservoir rock and its reservoir quality is closely related to its microscopic fabric characteristics, this study selects sandstone as the target for subsequent detailed analysis. The screened sandstone images then enter the second stage, PetroSAM-CRF. This lithological screening step defines the activation condition of the second-stage branch: only the samples identified as sandstone are further processed by PetroSAM-CRF, whereas limestone and dolomite samples are excluded from grain-scale segmentation. After domain-adaptive preprocessing, binary masks are extracted by the segmentation model, followed by refinement to provide high-quality inputs for quantitative evaluation. In the third stage, a series of key parameters is automatically calculated from the segmentation results, generating an evaluation report that includes statistical data and visualization charts.

Compared with traditional manual methods and single-task deep learning models, this framework achieves a closed loop through its cascaded design. Although the proposed workflow forms a closed automated pipeline from image input to quantitative evaluation, the three stages are implemented sequentially rather than jointly optimized within a single end-to-end training process. The classification stage ensures the target lithology for subsequent processing, the segmentation stage provides precise pixel-level inputs for parameter calculation, and the evaluation stage transforms image information into geologically meaningful quantitative metrics. The entire workflow operates without human intervention and supports the automated processing of large-scale rock thin sections, providing a feasible technical pathway for digital petrographic analysis.
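The gating logic of the cascade can be illustrated with a minimal sketch. The function name `run_cascade` and the three callables `classify`, `segment`, and `quantify` are hypothetical placeholders for Rock-PLionNet, PetroSAM-CRF, and the fabric evaluation stage; the toy stand-ins at the bottom exist only to make the example executable and do not reflect the actual models.

```python
from typing import Any, Callable, Dict, List

TARGET_LITHOLOGY = "sandstone"  # only this class activates stage 2 and stage 3


def run_cascade(
    images: Dict[str, Any],
    classify: Callable[[Any], str],
    segment: Callable[[Any], Any],
    quantify: Callable[[Any], Dict[str, float]],
) -> List[Dict[str, Any]]:
    """Run the three stages sequentially; non-sandstone samples stop after stage 1."""
    report = []
    for name, image in images.items():
        label = classify(image)  # stage 1: lithology classification
        entry = {"image": name, "lithology": label, "fabric": None}
        if label == TARGET_LITHOLOGY:
            mask = segment(image)            # stage 2: grain segmentation
            entry["fabric"] = quantify(mask)  # stage 3: fabric parameters
        report.append(entry)
    return report


# Toy stand-ins (assumptions for illustration only): each "image" is just its label.
toy_images = {"a": "sandstone", "b": "limestone", "c": "dolomite"}
report = run_cascade(
    toy_images,
    classify=lambda img: img,
    segment=lambda img: [[1, 0], [1, 1]],
    quantify=lambda mask: {"grain_pixels": sum(map(sum, mask))},
)
```

Because the stages are chained by plain function calls, each one can be trained or replaced independently, which matches the sequential (not jointly optimized) design described above.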

2.2. Rock-PLionNet for Lithology Classification

2.2.1. Overall Framework

To address the complexity of rock thin-section images in terms of texture morphology and mineral assemblages while balancing feature representation capability and computational efficiency, this study adopts EfficientNetV2-S as the backbone network. Through components such as Fused-MBConv, it reduces memory access overhead when processing intricate textures [33], providing practical deployment advantages for high-resolution microscopic image analysis. Lithology identification in thin sections often requires attention to texture and boundary information at the grain scale as well as fabric and structural features at the field-of-view scale. However, the layer-wise progressive representation of standard CNNs lacks explicit cross-scale fusion, which may hinder the full utilization of complementary relationships between microscopic details and macroscopic fabrics within the same discriminative pathway.

To address these issues, this paper proposes Rock-PLionNet, whose overall architecture is illustrated in Figure 2. Input images are first normalized and then fed into EfficientNetV2-S for hierarchical feature extraction. Stages 0 to 3 constitute the low-level feature extraction phase, where texture and boundary details are progressively extracted via 3 × 3 convolutions and Fused-MBConv modules, yielding a feature map F_{low} with dimensions of 28 × 28 × 64. Stages 4 to 7 further aggregate semantic and fabric information through MBConv modules, producing a high-level feature map F_{high} with dimensions of 7 × 7 × 1280. Subsequently, F_{low} and F_{high} are fed into the PWC-Fusion module for multi-scale context aggregation and adaptive fusion to obtain the fused feature F_{fused}, which is then fed into a lightweight classification head for lithology identification. Here, the low-level pathway mainly preserves grain-scale local details, whereas the high-level pathway aggregates broader fabric information at the field-of-view scale. Their adaptive fusion enables lithology identification to benefit from both detail-sensitive and structure-aware representations. During training, the Lion optimizer and cross-entropy loss, together with data augmentation, are used to enhance convergence stability and generalization.

2.2.2. PWC-Fusion Module

The proposed PWC-Fusion module is designed to facilitate the complementarity between low-level and high-level features before classification. By generating a coherent fused representation, it improves feature stability and discriminative capability. The architecture and forward computation process of the module are illustrated in Figure 3. To enable direct fusion, a dimensional alignment transformation is first applied to the low-level features:

F_{low}^{'} = A (F_{low}),

(1)

where A (\cdot) performs spatial and channel alignment to match F_{low}^{'} with F_{high} in spatial scale and channel dimension. After alignment, fusion weights are adaptively allocated based on image content differences. A lightweight weight-generation subnetwork takes F_{high} as input and applies global average pooling followed by two 1 × 1 convolution layers. A Softmax normalization then produces a two-dimensional weight vector:

α = [α_{low}, α_{high}] = Softmax (W_{2} \cdot ReLU (W_{1} \cdot GAP (F_{high}))),

(2)

where W_{1} \in \mathbb{R}^{(C_{h} / 8) \times C_{h}} and W_{2} \in \mathbb{R}^{2 \times (C_{h} / 8)} are the weight parameters of the two convolutional layers, C_{h} is the channel dimension of F_{high}, and GAP denotes global average pooling. Based on α, the module adaptively adjusts the relative contributions of low-level and high-level features according to the feature distribution of the input. The final fused feature is computed by weighted summation:

F_{fused} = α_{low} \cdot F_{low}^{'} + α_{high} \cdot F_{high}

(3)

When classification depends more on fine-grained features, such as grain boundary clarity or crystal morphology, the module assigns a larger weight to the low-level path to emphasize texture and edge details. Conversely, when recognition relies more on global organizational features, the high-level semantic representation receives a higher weight to strengthen global structural information. The difference heatmap |F_{fused} - F_{high}| shown in Figure 3d illustrates the impact of low-level features on the fusion result. High-response areas correspond to regions that undergo significant changes relative to F_{high}, reflecting the module’s introduction of stronger texture and edge details at these locations. In terms of complexity, PWC-Fusion introduces approximately 0.29 M additional parameters, mainly from the channel alignment layer and the weight-generation network. The overall parameter increase is limited to 1.5%, which facilitates modular integration and reuse across different backbones. Beyond performance and efficiency, PWC-Fusion provides a computable fusion strategy that reflects the interpretation logic commonly used by geologists in thin-section identification. Traditional fusion methods often rely on simple concatenation or summation, which may fail to capture the intrinsic correlations involved in linking microscopic details to macroscopic fabrics. In contrast, PWC-Fusion offers a clearer formal expression of this cognitive path, enabling domain knowledge to be integrated into the network as differentiable operations that function jointly within the discrimination process.
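The forward computation in Equations (1) to (3) can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' implementation: the alignment operator A is assumed here to be average pooling (28 × 28 to 7 × 7) followed by a bias-free 1 × 1 convolution (64 to 1280 channels), the names `pwc_fusion`, `w_align`, `w1`, and `w2` are hypothetical, and the weights are random rather than trained.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(z - z.max())
    return e / e.sum()


def pwc_fusion(f_low, f_high, w_align, w1, w2):
    """Fuse f_low (H_l, W_l, C_l) with f_high (H_h, W_h, C_h) as in Eqs. (1)-(3).

    Assumes H_l and W_l are integer multiples of H_h and W_h with the same ratio.
    """
    h_h, w_h, _ = f_high.shape
    h_l, _, c_l = f_low.shape
    s = h_l // h_h  # pooling factor, 4 for 28 -> 7

    # Eq. (1), assumed form of A: average pooling, then a 1x1 conv (per-pixel matmul).
    pooled = f_low.reshape(h_h, s, w_h, s, c_l).mean(axis=(1, 3))
    f_low_aligned = pooled @ w_align  # (H_h, W_h, C_h)

    # Eq. (2): GAP -> 1x1 conv -> ReLU -> 1x1 conv -> Softmax.
    gap = f_high.mean(axis=(0, 1))        # (C_h,)
    hidden = np.maximum(w1 @ gap, 0.0)    # (C_h / 8,)
    alpha = softmax(w2 @ hidden)          # [alpha_low, alpha_high]

    # Eq. (3): weighted summation of the two aligned paths.
    fused = alpha[0] * f_low_aligned + alpha[1] * f_high
    return fused, alpha


rng = np.random.default_rng(0)
f_low = rng.standard_normal((28, 28, 64))
f_high = rng.standard_normal((7, 7, 1280))
w_align = 0.01 * rng.standard_normal((64, 1280))
w1 = 0.01 * rng.standard_normal((1280 // 8, 1280))
w2 = 0.01 * rng.standard_normal((2, 1280 // 8))

fused, alpha = pwc_fusion(f_low, f_high, w_align, w1, w2)
n_params = w_align.size + w1.size + w2.size  # bias-free parameter count
```

Under these assumptions the bias-free parameter count is 64 × 1280 + 160 × 1280 + 2 × 160 = 287,040, which is consistent with the roughly 0.29 M reported above, although the actual layer configuration may differ.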

2.2.3. Lion Optimizer

In thin-section lithology classification, samples within the same category may exhibit substantial variations in texture and scale, whereas differences between categories can be subtle and often concentrate in fabric and grain morphology. These data characteristics make gradient magnitudes and update directions more sensitive to batch composition, thereby imposing stricter requirements on convergence stability. AdamW, which is widely used in vision tasks, updates parameters using adaptive estimates of the first and second moments and applies decoupled weight decay to enhance training stability, typically showing reliable performance across diverse scenarios. However, for thin-section images, gradient fluctuations induced by inter-sample texture and scale variations can be amplified during training and lead to pronounced update instability. This issue becomes more prominent when the network must simultaneously balance fine-grained texture cues and higher-level semantic information within each iteration, making it difficult to maintain consistent update directions. To address these issues, this paper adopts Lion as the parameter update strategy. Lion performs updates primarily based on the momentum, with the update rule defined as follows [34]:

θ_{t + 1} = θ_{t} - η \cdot sign (m_{t})

(4)

Here θ_{t} denotes the current model parameter, η is the learning rate, and m_{t} is the exponential moving average of the gradient (the first-moment estimate). This directional update mechanism, driven by the sign function sign (\cdot), mitigates the influence of instantaneous gradient magnitude fluctuations on the update direction. As a result, it yields a smoother optimization trajectory, even under complex texture distributions. In the proposed Rock-PLionNet, Lion also works effectively with the cross-layer adaptive fusion introduced by the PWC-Fusion module before classification. The fusion mechanism dynamically adjusts gradient contributions from different hierarchical paths based on sample content, while Lion promotes directionally stable updates. Together, they enable the backbone network and fusion module to maintain consistent convergence behavior during joint training, thereby improving training stability and generalization.
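A single Lion step can be sketched as below. Equation (4) is a simplified form; the sketch follows the rule published by Chen et al. [34], which takes the sign of an interpolation between the momentum and the current gradient and adds decoupled weight decay. The function name `lion_step` and the hyperparameter values are illustrative assumptions, not the settings used in this study.

```python
import numpy as np


def lion_step(theta, m, grad, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion update; returns the new parameters and the new momentum."""
    # Update direction: sign of an interpolation between momentum and gradient.
    direction = np.sign(beta1 * m + (1.0 - beta1) * grad)
    # Every coordinate moves by exactly lr (plus decoupled weight decay).
    theta_new = theta - lr * (direction + weight_decay * theta)
    # Momentum is an exponential moving average of the gradient.
    m_new = beta2 * m + (1.0 - beta2) * grad
    return theta_new, m_new


theta = np.array([1.0, -2.0, 0.5])
m = np.zeros_like(theta)
small_grad = np.array([0.01, -0.02, 0.03])
large_grad = 1000.0 * small_grad  # same direction, much larger magnitude

step_small, _ = lion_step(theta, m, small_grad)
step_large, _ = lion_step(theta, m, large_grad)
```

The two gradients differ by three orders of magnitude yet produce the same parameter update, which is the magnitude insensitivity that the text credits for the smoother training trajectory.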

2.3. Sandstone Grain Segmentation Method

Building on the first-stage lithology classification and the localization of sandstone samples, this study performs refined grain-scale segmentation only for the sandstone subset retained after lithological screening. The segmentation results provide critical inputs for subsequent quantitative characterization, because boundary accuracy and connectivity directly affect statistics such as grain size, count, and area proportion. If grain boundaries are misaligned or connectivity is misinterpreted, the resulting statistical analysis can be biased, which in turn affects the interpretation of sedimentary structures and diagenetic features. Therefore, compared with intelligent identification that only outputs lithology types, high-quality grain-oriented segmentation is a critical step for quantitative evaluation of thin sections. However, under polarized light, mineral interference colors and brightness can vary significantly with crystal orientation and illumination. As a result, the same mineral may exhibit different hues and intensities at different locations, which increases the difficulty of appearance-based segmentation. In addition, in areas with dense cementation and tight grain contact, local boundary blurring and color texture aliasing between the matrix and grain edges are common. These effects make general segmentation models more prone to undersegmentation or oversegmentation, and it becomes difficult to satisfy the boundary closure and morphological consistency required for subsequent quantitative calculations. To address these challenges, this study proposes PetroSAM-CRF, a domain-tailored segmentation strategy for sandstone thin-section scenarios, as illustrated in Figure 4. 
For this lithologically screened sandstone subset, the framework first applies adaptive preprocessing to stabilize polarized-light-induced color and illumination variations and to improve the separability of grain boundaries and interstitial regions, thereby generating enhanced inputs for subsequent grain-oriented segmentation. Next, without pixel-level supervised training, the framework uses the SAM prompt mechanism to drive mask generation via regular grid point prompts. The first-stage classification result is not used to alter the prompt type or grid configuration in SAM; rather, it restricts the execution of this branch to the sandstone subset, under which the adopted preprocessing and brightness-guided filtering assumptions are defined. Because grain regions and the background are typically separable in brightness statistics under the imaging and enhancement settings used in this study, we introduce a weak prior filtering strategy guided by brightness. This weak prior is implemented in an image-adaptive manner, because the threshold is derived from the brightness statistics of each preprocessed image and is used together with the SAM stability score, rather than being defined as a fixed absolute value across all samples. Specifically, global mean brightness is used as an adaptive threshold and combined with stability scores to suppress unreliable predictions. Finally, the initial segmentation is refined with CRF using the enhanced RGB context to enforce boundary consistency, and morphological opening and closing are applied to remove isolated noise. The resulting grain masks better support subsequent quantitative parameter calculation and evaluation in terms of boundary adherence and morphological coherence.
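The brightness-guided weak prior can be sketched as a simple mask filter. Several details are assumptions made only for illustration: grains are taken to be brighter than the background after preprocessing, candidate masks are boolean arrays paired with a stability score, and the stability cutoff of 0.9 and the function name `filter_masks` are hypothetical rather than values reported in this study.

```python
import numpy as np


def filter_masks(gray, masks, stability_scores, min_stability=0.9):
    """Keep masks that are stable and brighter than the image's global mean.

    gray: 2-D brightness image; masks: list of boolean arrays of the same shape.
    """
    threshold = float(gray.mean())  # image-adaptive threshold, recomputed per image
    kept = []
    for mask, score in zip(masks, stability_scores):
        if not mask.any():
            continue  # skip empty candidates
        mean_brightness = float(gray[mask].mean())
        if score >= min_stability and mean_brightness > threshold:
            kept.append(mask)
    return kept


# Toy image: bright "grain" on the left half, dark background on the right half.
gray = np.full((4, 4), 50.0)
gray[:, :2] = 200.0
bright_mask = np.zeros((4, 4), dtype=bool)
bright_mask[:, :2] = True
dark_mask = ~bright_mask

kept = filter_masks(
    gray,
    masks=[bright_mask, dark_mask, bright_mask],
    stability_scores=[0.95, 0.95, 0.50],
)
```

Only the first candidate survives: the second is darker than the global mean, and the third fails the stability check even though it covers a bright region.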

2.3.1. Image Preprocessing

Under polarized imaging conditions for sandstone thin sections, color and illumination effects are often superimposed, which reduces the separability between grains and the background. To provide more stable inputs for the subsequent segmentation module, we design a two-stage preprocessing workflow (Figure 5) that transforms the raw image I_{bgr} into an enhanced input image I_{final}. In Stage I, the input BGR image is converted to the HSV color space. Only the saturation channel (S) is enhanced, while the hue (H) and value (V) channels are kept unchanged to avoid color distortion and shifts in the overall brightness distribution. Specifically, min-max stretching is applied to S:

S_{stretched}^{'} = \frac{(S - S_{\min}) \times 255}{S_{\max} - S_{\min}}

(5)

This operation normalizes the saturation range to [0, 255] and highlights differences in color purity between mineral grains and the background. To prevent the enhanced result from deviating excessively from the original appearance, weighted blending is then performed:

I_{blend} = 0.4 \times I_{raw} + 0.6 \times I_{enhanced}

(6)

Here I_{raw} is the original image and I_{enhanced} is the saturation-enhanced result. The larger enhancement weight improves boundary contrast, whereas the original image component preserves realistic transitions near boundaries and reduces noise amplification caused by over-enhancement. In Stage II, I_{blend} is converted to the LAB color space. Contrast enhancement is applied only to the luminance channel L, while the chromaticity channels A and B are kept unchanged to preserve mineral color properties. To mitigate uneven illumination and enhance weak edges, CLAHE (clip limit = 2.0) is applied to the L channel to increase local contrast. In addition, gamma correction (γ = 1.2) is used to improve the visibility of grain contours in low-saturation regions:

L_{corrected}^{'} (i) = 255 \times \left(\frac{i}{255}\right)^{1 / γ}

(7)

The final image produced by this workflow is better adapted to the subsequent mask generation by SAM in terms of brightness balance, local contrast, and edge clarity. It provides more reliable boundary information for initial segmentation and reduces grain fragmentation and missed detections caused by unstable input distributions. Accordingly, this preprocessing workflow is intended not merely to improve visual contrast but to enhance the separability of grain boundaries while maintaining realistic boundary transitions under polarized-light conditions, thereby supporting subsequent quantitative characterization.
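The three arithmetic steps in Equations (5) to (7) can be sketched on plain arrays as follows. The color-space conversions (BGR to HSV, BGR to LAB) and CLAHE are deliberately omitted because they require an image-processing library; the helper names are hypothetical, and rounding to 8-bit integers is an assumption about how intermediate results are stored.

```python
import numpy as np


def stretch_saturation(s):
    """Eq. (5): min-max stretch of the saturation channel to [0, 255]."""
    s = s.astype(np.float32)
    s_min, s_max = s.min(), s.max()
    if s_max == s_min:
        return s.astype(np.uint8)  # flat channel: nothing to stretch
    return np.round((s - s_min) * 255.0 / (s_max - s_min)).astype(np.uint8)


def blend(raw, enhanced, w_raw=0.4, w_enhanced=0.6):
    """Eq. (6): weighted blend of the raw and the saturation-enhanced image."""
    out = w_raw * raw.astype(np.float32) + w_enhanced * enhanced.astype(np.float32)
    return np.clip(np.round(out), 0, 255).astype(np.uint8)


def gamma_lut(gamma=1.2):
    """Eq. (7): 256-entry lookup table, 255 * (i / 255) ** (1 / gamma)."""
    i = np.arange(256, dtype=np.float32)
    table = 255.0 * (i / 255.0) ** (1.0 / gamma)
    return np.clip(np.round(table), 0, 255).astype(np.uint8)


s = np.array([[50, 100], [150, 200]], dtype=np.uint8)
s_stretched = stretch_saturation(s)

blended = blend(np.array([100], dtype=np.uint8), np.array([200], dtype=np.uint8))

lut = gamma_lut()
l_channel = np.array([[0, 64], [128, 255]], dtype=np.uint8)
l_corrected = lut[l_channel]  # apply the table by integer indexing
```

Because 1/γ is below 1 for γ = 1.2, the lookup table brightens mid-tones while leaving 0 and 255 fixed, which is how the correction lifts faint grain contours without clipping highlights.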

2.3.2. Segmentation Result Refinement

Although preprocessing enhances the separability between grains and interstitial materials, the binary masks produced by SAM can still exhibit defects such as jagged boundaries, internal discontinuities, and holes. These issues are particularly pronounced in regions with tight grain contacts or where cement and grains share similar colors, which directly affects subsequent grain-size statistics and morphological parameter calculations. To address these problems, global consistency refinement and morphological cleaning are introduced following the SAM output to enhance boundary adherence and regional coherence. Specifically, the SAM output mask

M (i) ϵ {0, 2 55}

is normalized to generate foreground and background probability maps. The foreground probability is defined as:

P_{fg} (i) = \frac{M (i)}{255}

(8)

The background probability is:

P_{bg} (i) = 1 - P_{fg} (i)

(9)

Subsequently, the probability values undergo linear scaling and clipping to the range [0.05, 0.95] to prevent numerical instability. Based on this initialization, we formulate refinement as an energy minimization problem in DenseCRF [42], with the energy function:

E (x) = \sum_{i} ψ_{u} (x_{i}) + \sum_{i < j} ψ_{p} (x_{i}, x_{j}),

(10)

where

x

denotes the pixel label vector. The unary potential function

ψ_{u} (x_{i}) = - \log P (x_{i})

is constructed from the foreground and background probabilities, providing pixel-level priors. The pairwise potential function

ψ_{p} (x_{i}, x_{j})

includes two Gaussian kernels. The spatial kernel depends only on pixel positions to suppress isolated noise, while the appearance bilateral kernel incorporates information from the enhanced BGR image and jointly measures spatial distance and color differences between pixels. This appearance constraint encourages label consistency in contiguous regions with similar colors while preserving genuine grain boundaries where abrupt color changes occur. To reduce over-smoothing, the number of inference iterations is limited to no more than three, and an area protection strategy is introduced. Specifically, if the number of foreground pixels in the CRF output is less than 50% of that in the input mask, the refinement is discarded and the original mask is retained to improve stability and robustness.

Although CRF primarily improves boundary adherence, small holes and fragmented noise may remain. Therefore, lightweight morphological cleaning is applied to further enhance connectivity. An elliptical structuring element with a kernel size of 3 is used. One opening operation is first performed to remove small, isolated points, followed by two closing operations to fill tiny holes. Finally, connected components with an area smaller than 50 pixels are removed. This procedure produces grain masks with more regular shapes that are directly usable for subsequent quantitative calculations.
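The probability initialization of Equations (8) and (9), the unary term of Equation (10), and the area protection rule can be sketched in plain Python as follows. The affine map used for "linear scaling" is only one possible reading of the text, and all function names are hypothetical; the DenseCRF inference itself is not reproduced here.

```python
import math

P_MIN, P_MAX = 0.05, 0.95  # clipping range stated in the text


def foreground_probability(mask):
    """Eq. (8), then scaling into [P_MIN, P_MAX].

    Assumption: "linear scaling" is read as the affine map
    p -> P_MIN + (P_MAX - P_MIN) * p, followed by clipping.
    """
    out = []
    for row in mask:
        scaled = [P_MIN + (P_MAX - P_MIN) * (m / 255.0) for m in row]
        out.append([min(max(p, P_MIN), P_MAX) for p in scaled])
    return out


def unary_potentials(p_fg):
    """Per-pixel (background, foreground) costs, psi_u = -log P, with P_bg = 1 - P_fg."""
    return [[(-math.log(1.0 - p), -math.log(p)) for p in row] for row in p_fg]


def area_protected(input_mask, refined_mask, min_ratio=0.5):
    """Keep the refined mask unless it retains less than min_ratio of the input foreground."""
    before = sum(v > 0 for row in input_mask for v in row)
    after = sum(v > 0 for row in refined_mask for v in row)
    return input_mask if after < min_ratio * before else refined_mask
```

Clipping keeps both probabilities away from 0 and 1, so the logarithm in the unary term stays finite for every pixel.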

2.4. Quantitative Analysis of Sandstone Fabric

Taking the segmentation results from the previous stage as input, the geometric and area information of grains in thin-section images is converted into computable fabric parameters and reported in a standardized format. In implementation, the outer contour of each grain region is extracted from the binary mask (Figure 6a,b) to calculate its area

S_{i}

and perimeter

P_{i}

. The major and minor axis lengths are then estimated using the minimum-area bounding rectangle (MABR) (Figure 6c,d). In this study, the length of the MABR major axis is used as the grain-size descriptor

d_{i}

, rather than as a strict equivalent grain diameter.

We adopted this descriptor for two reasons. First, in conventional petrographic thin-section analysis, grain size is commonly measured manually along the longest visible dimension of a grain in a 2D section view. The major axis of the minimum-area bounding rectangle can therefore be regarded as the automated image-analysis counterpart of this long-axis measurement convention. Second, for a given segmented contour, the minimum-area bounding rectangle provides a geometrically unique and fully reproducible measurement, thereby reducing subjectivity introduced by manual endpoint selection and orientation judgment.

Previous thin-section image-analysis studies have also used long-axis measurements as grain-size descriptors [19,43,44,45], although such quantities should not be interpreted as strict equivalents of the equivalent circular diameter (ECD). Accordingly, in the present study, the MABR major axis is treated as a long-axis-based image grain-size descriptor. For approximately equant grains, the difference from ECD or conventional grain-size intuition is usually limited; however, for elongated grains, the measured value may be larger. Therefore, the grain-size values reported here are intended primarily for automated grain-size statistics and relative comparison across samples.
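To make the difference between the MABR major axis and the ECD concrete, the following sketch computes both for a set of contour points. It is a pure-Python illustration that scans the edges of the convex hull for the smallest enclosing rectangle; the function names are hypothetical, and the original implementation may rely on a library routine instead.

```python
import math


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def min_area_rect_axes(points):
    """Return (major, minor) side lengths of the minimum-area bounding rectangle."""
    hull = convex_hull(points)
    best = None
    n = len(hull)
    for k in range(n):
        (x0, y0), (x1, y1) = hull[k], hull[(k + 1) % n]
        length = math.hypot(x1 - x0, y1 - y0)
        if length == 0:
            continue
        ux, uy = (x1 - x0) / length, (y1 - y0) / length  # unit vector along the edge
        along = [x * ux + y * uy for x, y in hull]
        across = [-x * uy + y * ux for x, y in hull]
        w = max(along) - min(along)
        h = max(across) - min(across)
        if best is None or w * h < best[0]:
            best = (w * h, max(w, h), min(w, h))
    return (0.0, 0.0) if best is None else (best[1], best[2])


def equivalent_circular_diameter(area):
    """Diameter of the circle with the same area as the grain."""
    return 2.0 * math.sqrt(area / math.pi)


# An elongated 4 x 2 rectangle: the long axis exceeds the ECD.
major, minor = min_area_rect_axes([(0, 0), (4, 0), (4, 2), (0, 2)])
ecd = equivalent_circular_diameter(8.0)
```

For this 4 × 2 rectangle the major axis is 4 while the ECD is about 3.19, which illustrates why long-axis values tend to be larger for elongated grains.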

Before grain-size and grain-morphology statistics were computed, grains intersecting the image boundary were excluded from particle-level quantitative analysis. Specifically, a segmented grain was treated as an edge grain if its connected region touched the image border. Because such grains do not preserve complete two-dimensional geometry in the field of view, they were not included in measurements or statistics based on complete particle contours and bounding rectangles, including grain-size distribution, cumulative grain-size curves, aspect ratio, roundness, and the corresponding sample-level summary values. In contrast, the areal proportion of interstitial materials was still calculated from the full image, because this index describes phase proportion at the image scale rather than the complete geometry of individual grains.
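The edge-grain rule can be sketched on a labeled mask as follows. The label-map representation, with 0 for interstitial material and positive integers for individual grains, is an assumption made for illustration, as are the function names.

```python
def edge_grain_labels(label_map):
    """Labels (> 0) of grains whose connected region touches the image border."""
    h, w = len(label_map), len(label_map[0])
    border = set()
    for x in range(w):
        border.add(label_map[0][x])
        border.add(label_map[h - 1][x])
    for y in range(h):
        border.add(label_map[y][0])
        border.add(label_map[y][w - 1])
    border.discard(0)  # 0 is assumed to mark interstitial material
    return border


def interior_grain_labels(label_map):
    """Grains retained for particle-level statistics."""
    all_labels = {v for row in label_map for v in row if v != 0}
    return all_labels - edge_grain_labels(label_map)


def interstitial_fraction(label_map):
    """Areal proportion of interstitial material over the full image."""
    total = sum(len(row) for row in label_map)
    background = sum(v == 0 for row in label_map for v in row)
    return background / total


labels = [
    [1, 1, 0, 0, 0],
    [1, 0, 2, 2, 0],
    [0, 0, 2, 2, 0],
    [0, 0, 0, 0, 3],
]
```

Here grains 1 and 3 touch the border and are excluded from particle-level statistics, whereas the interstitial fraction is still computed over all 20 pixels.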

According to the standards in Identification for thin section of rocks [46], grain-size grades are classified into fine gravel, coarse sand, medium sand, fine sand, very fine sand, coarse silt, and fine silt. In this study, these class boundaries are used as a grading reference for the measured long-axis values. The standard provides the grain-size grade boundaries, whereas the dominant grain-size interval used below is defined here as the grain-size range corresponding to 25–75% on the cumulative area percentage curve. This interval is introduced as an operational descriptor to characterize the concentration range of the dominant grain size and to facilitate cross-sample comparison. Concurrently, the maximum area proportion of a single grain size grade serves as a sorting indicator. Based on defined thresholds, sorting is classified as good when the area proportion is at least 75%, moderate when it is 50% to 75%, and poor when it is less than 50%.
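A minimal sketch of the two operational descriptors defined above is given below. The sorting thresholds follow the text; reading the 25% and 75% points off the cumulative curve without interpolation is an assumption, and the function names are hypothetical.

```python
def sorting_grade(max_class_share):
    """Sorting grade from the largest area share (0..1) held by one grain-size grade."""
    if max_class_share >= 0.75:
        return "good"
    if max_class_share >= 0.50:
        return "moderate"
    return "poor"


def dominant_interval(sizes, areas, lo=0.25, hi=0.75):
    """Grain sizes at which the cumulative area fraction first reaches lo and hi.

    Assumption: no interpolation between grains is applied.
    """
    total = sum(areas)
    cumulative = 0.0
    d_lo = d_hi = None
    for size, area in sorted(zip(sizes, areas)):
        cumulative += area
        fraction = cumulative / total
        if d_lo is None and fraction >= lo:
            d_lo = size
        if d_hi is None and fraction >= hi:
            d_hi = size
    return d_lo, d_hi
```

Because the curve is area-weighted, a few large grains can shift the interval toward the coarse end even when small grains are more numerous.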

Grain morphological parameters characterize equiaxiality, elongation, and the degree of abrasion modification from a geometric perspective, and they provide a link between grain-size statistics and the interpretation of sedimentary and diagenetic processes. Based on the contour and minimum-area bounding rectangle, two complementary indices are introduced: the grain aspect ratio to quantify elongation and the roundness index to assess boundary regularity. The aspect ratio is defined as the ratio of the minor axis to the major axis of the minimum-area bounding rectangle:

A R_{i} = \frac{W_{i}}{L_{i}}

(11)

where

W_{i}

and

L_{i}

denote the minor and major-axis lengths of the

i

-th grain. Roundness is measured using the classic circularity metric:

R_{i} = \frac{4 π S_{i}}{{P_{i}}^{2}}

(12)

This ratio is invariant to uniform scaling, which makes it suitable for comparisons across samples and magnification settings. The mean of all

R_{i}

values is computed as the sample roundness index R, which is then assigned a corresponding grade. Following Identification for thin section of rocks [46] and Ren et al. [45], the classification boundaries are set to 0.69, 0.76, 0.79, and 0.85. Specifically, R ≤ 0.69 indicates angular, 0.69 < R ≤ 0.76 sub-angular, 0.76 < R ≤ 0.79 sub-rounded, 0.79 < R ≤ 0.85 rounded, and 0.85 < R ≤ 1 well-rounded.

Beyond geometric features, the relative content of interstitial materials directly affects pore structure and reservoir physical properties. Based on the segmentation mask, the proportion of interstitial materials is calculated to automatically determine the cementation type. Following Ren et al. [45], basal cementation corresponds to 25% to 100% interstitial content, basal-pore cementation to 20% to 25%, pore cementation to 10% to 20%, contact-pore cementation to 5% to 10%, and contact cementation to 0% to 5%.

Finally, the indices obtained from the above calculations are automatically summarized into a standardized quantitative evaluation report. This structured output reduces variability introduced by manual recording, supports lateral comparison across samples, and enables large-scale statistical analysis, thereby forming a closed loop from image segmentation to the interpretation of geological fabric parameters.
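Equations (11) and (12) and the two grading schemes can be expressed compactly as below. The roundness thresholds are those stated in the text; assigning a shared cementation boundary (for example exactly 20%) to the higher class is an assumption, since the text does not specify it, and the function names are hypothetical.

```python
import math


def aspect_ratio(minor_axis, major_axis):
    """Eq. (11): minor-to-major axis ratio of the minimum-area bounding rectangle."""
    return minor_axis / major_axis


def circularity(area, perimeter):
    """Eq. (12): R = 4 * pi * S / P^2, equal to 1 for an ideal circle."""
    return 4.0 * math.pi * area / perimeter ** 2


def roundness_grade(r):
    """Grade for the sample roundness index using the stated boundaries."""
    if r <= 0.69:
        return "angular"
    if r <= 0.76:
        return "sub-angular"
    if r <= 0.79:
        return "sub-rounded"
    if r <= 0.85:
        return "rounded"
    return "well-rounded"


def cementation_type(interstitial_pct):
    """Cementation type from interstitial content in percent.

    Assumption: a value on a shared boundary goes to the higher class.
    """
    if interstitial_pct >= 25:
        return "basal"
    if interstitial_pct >= 20:
        return "basal-pore"
    if interstitial_pct >= 10:
        return "pore"
    if interstitial_pct >= 5:
        return "contact-pore"
    return "contact"
```

As a check on the metric, a unit square has R = π/4 ≈ 0.785, which falls in the sub-rounded interval even though its corners are sharp; this is a reminder that circularity measures overall boundary regularity rather than corner curvature.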

3. Experimental Setup

3.1. Dataset

The data used in this study were collected from open-access microscopic image resources provided by China Scientific Data. To cover diverse sedimentary environments and diagenetic features, several representative regional datasets were screened and integrated, including (1) a photomicrograph dataset of rocks for petrology teaching at Nanjing University [47], (2) a photomicrograph dataset of Upper Paleozoic tight sandstone from Linxing block, eastern margin of Ordos Basin [48], and (3) a polarized-light micrograph dataset of Late Cretaceous-Eocene rock thin sections from Western Tarim Basin, Xinjiang [49], as well as other publicly available datasets. These samples span a range of geological ages and tectonic settings, supporting model robustness under variations in mineral compositions, cementation types, and imaging conditions. The resulting dataset contains 3290 images, including 1510 limestone images, 795 dolomite images, and 985 sandstone images. Figure 7 presents representative samples of the three rock types in the dataset.

To reduce information leakage and ensure objective source-wise evaluation, the dataset was partitioned by grouping images from the same specimen or acquisition source rather than splitting at the individual-image level. The target proportions for the training, validation, and test subsets were approximately 70%, 10%, and 20%, respectively. Because the partitioning was performed on grouped specimens or sources, the final image counts did not exactly match the nominal proportions. The resulting subsets contained 2490 training images, 275 validation images, and 525 test images. The validation subset was used only for model selection, whereas all final performance metrics reported in this study were computed on the held-out test set, so that the final classification results were evaluated on sources unseen during training rather than on image-level mixtures of pooled sources. Data augmentation was applied only after dataset partitioning and only to the training subset.
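One way to realize such a grouped partition is sketched below. The greedy assignment rule, the data structure mapping each image to a specimen or source identifier, and the function name are assumptions; the text states only that whole groups were assigned to a single subset with approximate 70/10/20 targets.

```python
import random
from collections import defaultdict


def grouped_split(image_groups, targets=(0.7, 0.1, 0.2), seed=0):
    """Split images so that every group lands in exactly one subset.

    image_groups maps image id -> group id (specimen or acquisition source).
    """
    groups = defaultdict(list)
    for image_id, group_id in image_groups.items():
        groups[group_id].append(image_id)

    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)

    names = ("train", "val", "test")
    total = len(image_groups)
    splits = {name: [] for name in names}
    for gid in group_ids:
        # Greedy rule (assumption): fill the subset furthest below its target size.
        deficits = [targets[k] * total - len(splits[names[k]]) for k in range(3)]
        k = deficits.index(max(deficits))
        splits[names[k]].extend(groups[gid])
    return splits


# Hypothetical toy data: 100 images from 20 specimens of 5 images each.
image_groups = {f"img{i}": f"spec{i // 5}" for i in range(100)}
splits = grouped_split(image_groups)
```

Because whole groups are moved at once, the realized subset sizes generally deviate from the nominal proportions, which matches the behavior described above.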

For grain segmentation evaluation, a manually annotated reference subset of 10 sandstone samples was used, containing a total of 1634 grain instances. The annotations were produced by authors involved in this study, all of whom have geological backgrounds, and were subsequently reviewed by geological professors. For ambiguous grain boundaries and tight grain-contact regions, a consensus-based quality-control procedure was adopted, including initial annotation, secondary review, discussion of discrepant regions, and generation of a final agreed-upon version. The selected images cover representative difficult cases encountered in sandstone thin sections, including blurred boundaries, different degrees of grain contact, and cementation-related interference, and the resulting masks were used as the manual reference annotations for segmentation evaluation.

3.2. Implementation Details

All experiments were conducted on an NVIDIA RTX 4060 GPU using PyTorch 2.0.1 and Python 3.10. All classification models used a unified basic configuration with 50 training epochs and a batch size of 16. For the main comparative experiments, all classifiers were trained with AdamW to provide a better-controlled benchmark across architectures. The Lion optimizer was further evaluated only within the ablation study of the proposed framework. Both AdamW and Lion were used with the ReduceLROnPlateau scheduler. If validation performance did not improve for five consecutive epochs, the learning rate was multiplied by

γ = 0.5

, with the minimum learning rate set to

η_{\min} = 1 \times 10^{- 7}

. For classification, we used cross-entropy loss with label smoothing (

α = 0.2

).
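The two training-time rules described above can be illustrated without a deep learning framework. The sketch below assumes the common label-smoothing convention in which the target is mixed with a uniform distribution over the K classes, and it follows the wording of the text for the plateau rule; library schedulers may count the patience window slightly differently. Function names are hypothetical.

```python
import math


def smoothed_cross_entropy(probs, target, alpha=0.2):
    """Cross-entropy against the target (1 - alpha) * one_hot + alpha / K."""
    k = len(probs)
    loss = 0.0
    for c, p in enumerate(probs):
        q = alpha / k + ((1.0 - alpha) if c == target else 0.0)
        loss -= q * math.log(p)
    return loss


def plateau_schedule(val_losses, lr, factor=0.5, patience=5, min_lr=1e-7):
    """Learning rate per epoch when lr is multiplied by `factor` after
    `patience` consecutive epochs without improvement."""
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                lr = max(lr * factor, min_lr)
                bad_epochs = 0
        history.append(lr)
    return history
```

With three classes and α = 0.2, the smoothed target assigns about 0.867 to the true class and about 0.067 to each of the others, which penalizes overconfident predictions.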

The input preprocessing and data augmentation strategies are shown in Figure 8. During training, all images were first resized to 256 × 256 pixels. A 224 × 224 region was then randomly cropped. In Figure 8a, the red frame indicates the cropping window on the resized image, and Figure 8b shows the cropped patch. We further applied data augmentation to enhance robustness to spatial orientation changes in thin-section images, including horizontal flipping (Figure 8c), vertical flipping (Figure 8d), and random rotation (Figure 8e). In addition, random erasing was applied with probability

P_{erase} = 0.3

(Figure 8f), where the black rectangle denotes the occluded region. This prevents the model from over-relying on texture or grain features at specific locations. For validation and testing, images were only resized to 224 × 224 pixels without augmentation to ensure objective evaluation. All inputs were normalized using the ImageNet mean (

μ = [0.485, 0.456, 0.406]

) and standard deviation (

σ = [0.229, 0.224, 0.225]

) before being fed into the network.

3.3. Evaluation Metrics

To comprehensively evaluate the proposed method, this study defines separate quantitative metrics for the two key tasks: lithology classification and grain segmentation.

3.3.1. Evaluation Metrics for Lithology Classification

For multi-class lithology classification, this study reports Accuracy, Precision, Recall, and F1-Score on the test set as the primary performance metrics. The mathematical definitions and calculation formulas of these evaluation metrics are provided in Appendix A.1. Specifically, Accuracy is computed as the sample-level proportion of correctly classified images over the entire test subset, whereas Precision, Recall, and F1-Score are calculated from the confusion matrix of the full test subset and then macro-averaged across the three lithology classes. In addition, to assess computational efficiency and deployment feasibility, we report the number of parameters and floating-point operations (FLOPs) as measures of model complexity.
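The computation described above can be sketched from a confusion matrix as follows. Averaging the per-class F1 values is one common macro-averaging convention and is assumed here; Appendix A.1 remains the authoritative definition. The function name and the example matrix are hypothetical and do not reproduce results from this study.

```python
def macro_metrics(cm):
    """Accuracy and macro-averaged precision, recall, and F1.

    cm[i][j] is the number of class-i samples predicted as class j.
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(k)) / total

    precisions, recalls, f1s = [], [], []
    for c in range(k):
        tp = cm[c][c]
        predicted = sum(cm[r][c] for r in range(k))  # column sum
        actual = sum(cm[c])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / k,
        "recall": sum(recalls) / k,
        "f1": sum(f1s) / k,
    }


# Illustrative 3-class matrix only (not data from the paper).
example = macro_metrics([[8, 2, 0], [1, 9, 0], [0, 0, 10]])
```

Macro-averaging weights the three lithology classes equally, so a minority class such as dolomite influences the reported score as much as the more numerous limestone class.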

3.3.2. Evaluation Metrics for Grain Segmentation

For regional overlap, Pixel Accuracy (PA), Intersection over Union (IoU), and Dice Coefficient are used to measure overall agreement between predicted regions and ground truth. However, these metrics alone are insufficient to precisely evaluate grain boundary delineation, which is critical for subsequent quantitative analysis of grain morphology. Therefore, we further report the average symmetric surface distance (ASSD) and 95% Hausdorff Distance (HD95). By combining regional overlap and boundary-distance metrics, the segmentation performance can be evaluated more comprehensively, from macroscopic region agreement to microscopic boundary details. ASSD measures the average positional difference between the predicted contour and the reference contour, and therefore reflects the overall accuracy of grain-boundary localization. HD95 is the 95th percentile of the boundary distances and therefore describes the large deviations that remain once the most extreme 5% of contour points are set aside; lower HD95 values indicate that obvious local misplacements along particle edges are better controlled, so the extracted contour is less likely to deviate markedly from the true particle outline at grain contacts or blurred boundaries. This is important for subsequent particle-size measurement, because such local contour misplacement can directly enlarge or shrink the measured particle outline, or affect the separation of adjacent grains, thereby introducing error into contour-based size descriptors. All segmentation metrics reported here were computed on complete predicted and reference masks over the full image extent; exclusion of edge grains was applied only in the subsequent particle-level fabric quantification described in Section 2.4. The mathematical definitions and calculation formulas of all evaluation metrics are provided in Appendix A.2.
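The following sketch shows how the two metric families can be computed on small inputs. It assumes binary masks given as nested lists and contours given as lists of (x, y) points, pools the two directed distance sets for both ASSD and HD95, and uses a nearest-rank percentile; these conventions are assumptions, and Appendix A.2 remains the authoritative definition.

```python
import math


def overlap_metrics(pred, ref):
    """Return (PA, IoU, Dice) for two binary masks of equal size."""
    tp = fp = fn = tn = 0
    for pred_row, ref_row in zip(pred, ref):
        for p, r in zip(pred_row, ref_row):
            if p and r:
                tp += 1
            elif p:
                fp += 1
            elif r:
                fn += 1
            else:
                tn += 1
    pa = (tp + tn) / (tp + tn + fp + fn)
    union = tp + fp + fn
    iou = tp / union if union else 1.0
    dice = 2 * tp / (2 * tp + fp + fn) if union else 1.0
    return pa, iou, dice


def _nearest_distances(points, others):
    """Distance from each point to its closest point in `others` (brute force)."""
    return [min(math.hypot(x - u, y - v) for u, v in others) for x, y in points]


def boundary_distances(pred_contour, ref_contour):
    """Return (ASSD, HD95) from two contour point lists."""
    d = _nearest_distances(pred_contour, ref_contour)
    d += _nearest_distances(ref_contour, pred_contour)
    assd = sum(d) / len(d)
    d.sort()
    rank = max(0, math.ceil(0.95 * len(d)) - 1)  # nearest-rank 95th percentile
    return assd, d[rank]
```

The brute-force nearest-neighbor search is quadratic in the number of contour points, which is acceptable for a sketch but would normally be replaced by a distance transform or a spatial index on full-resolution masks.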

4. Experimental Results and Analysis

4.1. Performance Evaluation of Lithology Classification

As the first stage of the proposed three-stage framework, lithology classification directly affects the accuracy of subsequent sandstone grain segmentation and quantitative evaluation. To validate the effectiveness of Rock-PLionNet, we compared it with five widely used deep learning models, including traditional convolutional networks (ResNet50, VGG19), modern efficient architectures (ConvNeXt-Base), and transformer-based models (Swin-T, ViT-Base). The validation subset was used only for model selection, while all results reported in this section were evaluated on the held-out test set. The results show that Rock-PLionNet outperforms these baselines on key metrics, including Accuracy, Precision, Recall, and F1-Score, while also achieving advantages in parameter count and computational cost (FLOPs). These results indicate a favorable balance between predictive performance and resource efficiency. Furthermore, ablation studies quantify the respective contributions of the PWC-Fusion module and the Lion optimizer, confirming the effectiveness of both components in improving classification accuracy.

4.1.1. Comparative Experiments

As shown in Table 1, under the controlled benchmark in which all classifiers were trained with AdamW and evaluated on the held-out test set, the proposed architecture outperforms the comparison models on the core metrics of Accuracy, Precision, Recall, and F1-Score. These results indicate that the proposed cross-scale fusion design itself contributes to improved lithology classification performance, rather than the gain arising solely from optimizer choice. Additionally, the proposed architecture achieves the lowest loss value, suggesting better optimization behavior under the same optimizer setting. Moreover, it maintains a relatively low parameter count and computational cost (FLOPs), which are significantly lower than those of high-complexity models such as ViT-Base and VGG19. These results demonstrate a favorable balance between performance and efficiency, making the proposed framework more suitable for deployment in resource-constrained scenarios.

Furthermore, the confusion matrices in Figure 9, evaluated on the test set, show that Rock-PLionNet mitigates the mutual misclassification between limestone and dolomite. In the thin-section dataset used in this study, these two lithologies are prone to confusion because of similarities in hue and textural appearance. Compared with the baseline models, Rock-PLionNet produces fewer misclassifications between the two classes, which indicates a stronger ability to distinguish lithologies with similar visual characteristics.

Visualization analysis further helps explain the performance differences among models. The Grad-CAM results in Figure 10 show that ConvNeXt tends to respond more strongly near image boundaries, indicating a relatively greater sensitivity to boundary cues. In contrast, VGG19 exhibits a more diffuse response over broad regions, suggesting a greater reliance on global appearance features and less attention to local microstructural details. By comparison, Rock-PLionNet highlights multiple grain-related regions across the image, with stronger responses focused on representative grain boundaries, local texture variations, and structurally informative areas, rather than being dominated by peripheral edges or broad diffuse regions. This pattern suggests that the cross-scale fusion enabled by PWC-Fusion supports the joint use of local details and global structural information, thereby providing a more reliable basis for classification.

4.1.2. Ablation Experiments

To better distinguish the contribution of the PWC-Fusion module from that of the optimizer, Table 2 evaluates the effects of architectural modification and optimizer replacement in separate comparisons. Comparing the first two configurations shows that introducing the PWC-Fusion module to the EfficientNetV2-S baseline under AdamW leads to clear improvements in metrics such as Accuracy and F1-Score, confirming the effectiveness of this multi-scale feature fusion module. Relative to the baseline configuration, the PWC-Fusion variant trained with AdamW also achieves a lower loss value, indicating that the proposed fusion strategy improves both discriminative performance and optimization quality. To further assess the contribution of the optimizer, we compared AdamW and Lion under identical learning rate and weight decay settings (

lr = 1 \times 10^{- 5}

,

weight decay = 0.1

) while keeping the PWC-Fusion architecture unchanged. Under this controlled setting, replacing AdamW with Lion further improves performance metrics and yields a lower loss value, suggesting that Lion provides an additional gain beyond the architectural improvement brought by PWC-Fusion. In contrast, adjusting the learning rate and weight decay within AdamW alone does not improve classification performance; under the setting of

lr = 1 \times 10^{- 5}

and

weight decay = 0.1

, the AdamW-based configuration yields lower metrics than both the stronger AdamW setting with

lr = 1 \times 10^{- 4}

and

weight decay = 0.01

and the baseline model. Overall, the results in Table 2 show that PWC-Fusion contributes the main architectural improvement, while Lion provides an additional optimization benefit within the proposed framework.

Differences in optimization behavior across configurations are further illustrated by the loss curves in Figure 11. The two AdamW settings in Figure 11a,b exhibit noticeable fluctuations in validation loss, indicating less stable optimization under these configurations. Figure 11c shows a smoother downward trend after adjusting the AdamW hyperparameters; however, this apparent improvement in training stability does not translate into better final performance. Instead, the corresponding configuration yields the weakest quantitative results in Table 2, indicating that modifying the AdamW hyperparameters alone is insufficient to improve generalization on this task. In contrast, the configuration using Lion (Figure 11d) shows both a smoother convergence pattern and a lower final validation loss. This result is consistent with the quantitative results in Table 2 and further supports the advantage of Lion in improving convergence behavior for this task.

4.2. Performance Evaluation of Grain Segmentation

4.2.1. Comparative Experiments

To evaluate the effectiveness of the proposed segmentation framework, we compared it on the manually annotated reference subset with several baseline segmentation methods, including classic image processing techniques such as K-means clustering, Otsu, and the Watershed algorithm, as well as an unsupervised CNN-based method [44] and plain SAM.

Figure 12 presents qualitative results on representative sandstone thin-section samples. Overall, K-means, Otsu, and Watershed methods are sensitive to grayscale and color variations, which often lead to fragmented noise and mis-segmentation within grains. The unsupervised CNN suppresses noise to some extent, but boundary discontinuities and local jagged artifacts frequently remain at grain-contact boundaries. In contrast, the proposed PetroSAM-CRF produces more coherent grain contours with less noise and higher boundary adherence in grain-contact regions, yielding results that are visually closer to manual annotations. The manually annotated reference masks used in this evaluation are described in Section 3.1.

Table 3 summarizes the quantitative results of the six segmentation methods and enables comparison from two perspectives: regional overlap and boundary fidelity. For regional overlap metrics, PetroSAM-CRF achieves the best performance in PA, IoU, and Dice. Among the traditional methods, Watershed performs best in PA and Dice, while plain SAM achieves the highest IoU among the baselines. For boundary-distance metrics, PetroSAM-CRF again performs best, and plain SAM ranks second in both ASSD and HD95. Compared with plain SAM, PetroSAM-CRF further reduces ASSD and HD95 by approximately 22.8% and 22.3%, respectively. This result indicates that the CRF-based refinement effectively suppresses boundary jaggedness and local discontinuities. At the same time, the quantitative pattern is consistent with the qualitative observations in Figure 13: although plain SAM already shows strong boundary fidelity, adjacent grains are still prone to merging at contact boundaries, which limits further improvement in regional consistency. By contrast, the conventional methods remain more sensitive to grayscale and color variations, while the Unsup-CNN shows the largest ASSD and a markedly high HD95 value, which is consistent with the boundary blurring and loss of local structures observed in Figure 12. Overall, Table 3 demonstrates that PetroSAM-CRF provides superior performance in both regional consistency and boundary precision.

Figure 13 illustrates the output differences across the stages of PetroSAM-CRF on multiple samples. Column (b) shows the segmentation produced by SAM alone. Although most grain regions are detected, grain adhesion is common at contact boundaries, causing adjacent grains to be merged into a single connected component. For the red-boxed region, the boundaries between neighboring grains in (b) are ambiguous, which leads to insufficient grain separation. This qualitative behavior is also consistent with the quantitative results in Table 3, where plain SAM already shows strong boundary fidelity but still remains inferior to PetroSAM-CRF in both regional overlap and boundary-distance metrics. After introducing preprocessing, column (c) shows that enhanced contrast and clearer edges help distinguish grains that were previously adhered, thereby producing more distinct boundaries. However, contrast enhancement can also amplify local brightness variations and texture discrepancies within thin-section images, introducing small amounts of noise in the masks. Finally, column (d) refines the mask boundaries, suppresses internal mis-segmentation, and improves boundary continuity, producing contours that are smoother and more closed. Furthermore, because morphological indices such as roundness are sensitive to perimeter estimation, reducing boundary jaggedness and local discontinuities mitigates perimeter overestimation, thereby ensuring that subsequent morphological parameter calculations more closely reflect the true grain contours.

4.2.2. Geological Parameter-Based Evaluation of Segmentation

Figure 14 compares the errors of different segmentation methods on two geological parameters: the mean absolute deviation of grain-size distribution and the relative deviation of the mean roundness index. As shown in Figure 14a, PetroSAM-CRF yields the lowest mean absolute deviation across six grain-size bins, indicating that its segmentation results are more consistent with manual annotations when used for grain-size statistics. Figure 14b further shows that the relative deviation of the mean roundness index for PetroSAM-CRF is only 2.83%, which is significantly lower than that of other methods. Specifically, the deviation for Unsup-CNN is 6.73%, whereas the relative deviations for K-means, Otsu, and Watershed all exceed 17%. The substantial deviations in roundness estimates produced by traditional segmentation methods are mainly attributable to irregular and jagged boundary artifacts. Such irregularities inflate the measured perimeter while having a limited influence on area, thereby biasing roundness indices derived from these measurements. In contrast, PetroSAM-CRF better suppresses boundary-level irregularities and preserves contour integrity, ensuring that morphological parameters such as roundness more closely match those computed from manual annotations. Overall, these results indicate that improved segmentation quality can enhance the consistency of downstream geological parameter estimation, providing more reliable inputs for quantitative analysis of sandstone thin sections.
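The sensitivity of roundness to perimeter inflation can be illustrated with a short worked example. The two error measures below are plausible readings of the quantities plotted in Figure 14, not definitions taken from the paper, and the numerical values are purely illustrative.

```python
import math


def circularity(area, perimeter):
    """R = 4 * pi * S / P^2, as in Eq. (12)."""
    return 4.0 * math.pi * area / perimeter ** 2


def relative_deviation(estimate, reference):
    """Relative deviation in percent (assumed definition)."""
    return abs(estimate - reference) / abs(reference) * 100.0


def mean_absolute_deviation(pred_shares, ref_shares):
    """Mean absolute difference between per-bin shares (assumed definition)."""
    return sum(abs(p - r) for p, r in zip(pred_shares, ref_shares)) / len(ref_shares)


# Same area, but a jagged boundary lengthens the perimeter by 25%.
r_smooth = circularity(1.0, 4.0)
r_jagged = circularity(1.0, 5.0)
roundness_error = relative_deviation(r_jagged, r_smooth)
```

Because the perimeter enters Equation (12) squared, a 25% perimeter overestimate at constant area lowers R by a factor of 1/1.25² = 0.64, that is, a 36% relative deviation.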

4.3. Sample-Based Workflow Demonstration

This section selects a specific sandstone sample to demonstrate the complete workflow from segmentation to quantitative visualization, followed by an assessment of geological plausibility. Given that PetroSAM-CRF has been shown in the preceding experiments to reduce bias in parameters such as grain size and roundness, this section serves as a workflow demonstration. It illustrates how image data can be transformed into interpretable quantitative characterizations for sedimentology and petrological analysis. This sample-based analysis is intended to demonstrate workflow feasibility and the geological interpretability of the derived parameters, rather than to establish broad geological applicability across more diverse sandstone types or depositional settings.

4.3.1. Automatic Measurement of Grain Size

Figure 15 illustrates discrepancies in grain-size extraction between manual measurement and the proposed automated method. In Figure 15a, measurements of the same grain vary because operators may select different long-axis orientations and endpoints, reflecting uncertainty introduced by subjectivity in traditional manual measurement. Figure 15b shows the proposed automated scheme, which fits a minimum-area bounding rectangle (blue box) to the segmented grain contour and uses the major axis as the major-axis grain size

d_{i}

. Given a fixed segmented contour and uniform measurement rules, this procedure yields a unique output value, thereby reducing discrepancies caused by human selection and enhancing the reproducibility and consistency of grain-size statistics across large batches of samples.

4.3.2. Grain-Size Distribution

Figure 16 summarizes the grain-size class composition and cumulative grain-size distribution of the retained fully enclosed grains in the sample after exclusion of edge grains. As shown in Figure 16a, the sample is dominated by coarse sand-grade grains, whereas medium sand makes a secondary contribution and fine sand accounts for only a small proportion. Very fine sand is present only in trace amounts, and gravel- or silt-grade components are negligible. The area proportions, therefore, indicate an overall coarse-grained texture in terms of areal contribution. According to the sorting criteria defined in this paper, the area proportion of the dominant grain-size class meets the 75% threshold for good sorting. The grain-size frequency distribution in Figure 16b shows that, in terms of number-based statistics, grain sizes are mainly concentrated between 0.10 and 0.40 mm, with frequency decreasing progressively toward the coarser end, although a few larger grains extend to above 1.0 mm. The main grain-size interval derived from the cumulative curve ranges from 0.52 to 1.00 mm, corresponding predominantly to coarse sand. In summary, the sample is characterized by a coarse-sand-dominated grain-size composition and a relatively concentrated main interval, providing quantitative support for subsequent interpretation of depositional processes and sedimentary texture.

4.3.3. Grain Shape Analysis

After exclusion of edge grains, Figure 17a shows the cumulative distribution of grain aspect ratios for the retained fully enclosed grains. The curve increases relatively smoothly, indicating a dispersed distribution without a distinct dominant interval. The mean aspect ratio is 0.669, suggesting that grains are moderately elongated. The distribution is neither dominated by near-equant grains (approaching 1) nor characterized by a substantial proportion of extremely slender grains (approaching 0). This morphological dispersion indicates that variation in grain elongation remains evident within the sample, even though the grain size in Figure 16 is relatively concentrated and dominated by coarse sand. Figure 17b presents the frequency distribution of the roundness index and its mean value for the same set of retained fully enclosed grains. The roundness index spans a wide range, from approximately 0.1 to about 0.9, indicating appreciable variability in boundary smoothness among grains. The mean roundness index is 0.571. According to the classification thresholds used in this study, the overall roundness grade is classified as angular. Although the histogram shows relatively higher frequencies in the 0.65 to 0.85 interval, low-roundness grains still account for a non-negligible proportion, indicating a mixed population dominated by moderate to relatively high roundness, but with a persistent contribution from poorly rounded grains. These characteristics may be related to mixed provenance, variations in transport conditions, and post-depositional crushing or modification.

Furthermore, the cementation type of this sample is identified as basal cementation, indicating a high content of interstitial material and poor preservation of intergranular pores. This assessment is consistent with the quantitatively derived fabric characteristics, including the coarse-sand-dominated grain-size framework, the relatively concentrated main grain-size interval, and the moderate grain-shape variability shown in Figure 16 and Figure 17. Although the sample exhibits good sorting in terms of dominant grain-size proportion, the high proportion of interstitial material still tends to occupy intergranular spaces and obstruct pore throats, thereby reducing pore connectivity and producing a tighter pore structure. Consequently, this example not only demonstrates the workflow’s ability to consistently derive multidimensional fabric parameters from segmentation results but also shows that these parameters agree with geological interpretations of sedimentation and diagenesis. This establishes a closed loop from image segmentation to fabric quantification and geological interpretation, serving as a reusable data basis for future cross-sample comparisons and reservoir quality prediction.

5. Discussion

5.1. Geological Implications

The geological grounding of the present framework is introduced mainly at two levels, namely lithology discrimination and sandstone fabric characterization. In the classification stage, the model is designed to integrate information from different observational scales in thin-section images, so that grain-scale local details and field-of-view-scale fabric information can jointly contribute to lithology identification. In the segmentation stage, the analysis is further constrained to the sandstone images screened by the preceding classifier, and the resulting grain-scale masks are intended to support subsequent quantitative characterization of sandstone fabrics. Therefore, the current framework should be understood not simply as a visual recognition pipeline, but as a lithology-conditioned workflow that links thin-section image analysis with downstream grain-boundary delineation and quantitative fabric evaluation.

5.2. Grain-Size Interpretation

Using the major axis of the minimum-area bounding rectangle as the grain-size descriptor also has implications for sedimentological interpretation. For approximately equant grains, the resulting size values are close to those given by conventional grain-size measures, so the choice of descriptor has limited influence on the overall interpretation. However, for elongated grains, a long-axis-based measurement may yield larger values than the equivalent circular diameter (ECD) or the sieve-based intermediate-axis grain size. As a result, the proportion of grains assigned to coarser classes may be slightly increased, and sorting may appear somewhat poorer than it would under sieve-based classification. Direct comparison with traditional sieve data or with ECD-based studies should therefore be made with caution.
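To give a sense of the size of this effect, the following sketch compares the long-axis descriptor with the equivalent circular diameter for an idealized elliptical grain. The elliptical outline is an assumption chosen because it admits a closed form: the area is π·L·W/4 for full axis lengths L and W, so the diameter of the equal-area circle is √(L·W). Real grain outlines will deviate from this, and the function name is hypothetical.

```python
import math


def ecd_of_ellipse(long_axis_mm, short_axis_mm):
    """Equivalent circular diameter of an ellipse with the given full axis lengths.

    Area = pi * L * W / 4, and a circle of diameter D has area pi * D**2 / 4,
    so D = sqrt(L * W).
    """
    return math.sqrt(long_axis_mm * short_axis_mm)


# Illustrative grain: 1.0 mm long axis with the mean aspect ratio (0.669)
# reported in Section 4.3.3; treating the grain as an ellipse is an assumption.
long_axis = 1.0
short_axis = 0.669 * long_axis
ecd = ecd_of_ellipse(long_axis, short_axis)
ratio = long_axis / ecd  # how much larger the long-axis size is than the ECD
```

Under this idealization the ECD is roughly 0.82 of the long axis, so the long-axis descriptor is on the order of one fifth larger for a grain of average elongation. For an equant grain (aspect ratio 1) the two measures coincide.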

In terms of grain-size classification, the present scheme still follows the same main hierarchical structure as the established Wentworth grain-size scale [50], namely the gravel–sand–silt framework and the standard sand subclasses of coarse, medium, fine, and very fine sand. The differences mainly concern the measurement basis and the degree of subdivision at some boundaries. In this study, grain size is derived from the major axis of the minimum-area bounding rectangle in two-dimensional thin-section view rather than from sieve diameter, and the grain-size grades adopted from Identification for Thin Section of Rocks are more condensed than the fully subdivided Wentworth scale near the sand–gravel transition and within the silt fraction. For the present purpose, however, this difference does not materially affect the main conclusions, because the analysis is based on a uniform image-derived descriptor applied consistently to all grains and is intended to characterize dominant interval, sorting tendency, and relative variation within and between samples, rather than to reproduce bulk sieve-size results point by point. In addition, the 25–75% interval is introduced here only as an auxiliary descriptor of the central concentration range on the cumulative area curve, rather than as part of the grain-size classification system itself. Similar thin-section grain-size statistics have also been reported in recent sandstone structure evaluation studies [45].

5.3. Scope and Validation Boundaries

At the same time, the scope of the present study remains limited by the granularity of the available annotations. Although the proposed framework achieves strong performance in the current three-class setting, its supervision is still defined at the lithology level, namely sandstone, limestone, and dolomite. The present dataset does not provide explicit annotations for finer sedimentological subclasses or component-level petrographic categories, such as oolitic or pisolitic fabrics, bioclastic constituents, or mineral-level classes in very fine-grained rocks. Accordingly, the results of this study should be interpreted as evidence of effective lithology-level discrimination and sandstone fabric-oriented quantitative analysis, rather than as a complete implementation of standardized sedimentological classification. In addition, the quantitative case study presented in Section 4.3 is based on a single sandstone sample and is intended as a workflow demonstration, rather than a full validation of broad geological applicability. Extending the framework toward finer geological subdivision and wider coverage of more heterogeneous lithologies, such as organic-rich shales or metamorphic rocks, will further require dedicated hierarchical annotations, additional datasets, and task-specific evaluation.

At the classification stage, the present evaluation adopts a source-wise partition rather than a random image-level split, so that images from the same specimen or acquisition source are confined to a single subset. This design reduces source leakage and provides a stricter estimate of model performance within the pooled multi-source dataset. However, this setting is still not equivalent to a fully external validation. Publicly available thin-section datasets typically vary simultaneously in source provenance, lithological composition, sample preparation, imaging conditions, and annotation criteria. Consequently, performance differences on an external dataset cannot be attributed solely to the model itself, because they may also result from differences in geological sampling, visual characteristics, and annotation conventions across datasets. Future work should therefore focus on more rigorous external validation using independently collected test datasets with better-controlled imaging conditions, sample-preparation records, and annotation criteria, so that differences in external-test performance can be interpreted more reliably.
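A source-wise partition of the kind described above can be sketched as follows. This is an illustration rather than the authors' split procedure: the function name, the 70/15/15 ratios, the fixed seed, and the `(image_id, source_id)` input format are all assumptions. The ratios are applied to sources, not to images, so image-level proportions will only match approximately.

```python
import random
from collections import defaultdict


def source_wise_split(items, ratios=(0.7, 0.15, 0.15), seed=0):
    """Split images so that every source lands in exactly one subset.

    items: iterable of (image_id, source_id) pairs.
    Returns (subsets, groups): image ids per subset and source ids per subset.
    """
    by_source = defaultdict(list)
    for image_id, source_id in items:
        by_source[source_id].append(image_id)

    # Shuffle sources (not images) so no source is shared between subsets.
    sources = sorted(by_source)
    random.Random(seed).shuffle(sources)

    n_train = int(round(ratios[0] * len(sources)))
    n_val = int(round(ratios[1] * len(sources)))
    groups = {
        "train": sources[:n_train],
        "val": sources[n_train:n_train + n_val],
        "test": sources[n_train + n_val:],
    }
    subsets = {
        name: [img for src in srcs for img in by_source[src]]
        for name, srcs in groups.items()
    }
    return subsets, groups
```

Because the shuffle operates on source identifiers, all images from one specimen or acquisition source move together, which is the property that prevents source leakage between training and test data.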

5.4. Limitations of Segmentation and Quantitative Evaluation

A direct comparison with recent strong supervised or transformer-based segmentation networks was not included in the present study, because the currently available manually annotated sandstone subset was constructed as an evaluation reference rather than as a dedicated training corpus. At its present scale, this subset is sufficient for controlled metric-based evaluation of grain-mask quality, but it does not provide a reliable basis for a fair benchmark that would require model training, validation, model selection, and hyperparameter tuning across modern supervised architectures. This limitation is consistent with recent petrographic segmentation studies in which supervised thin-section mineral identification was developed on dedicated annotated datasets with expert-reviewed masks rather than on lightweight evaluation-only subsets [51]. Recent transformer-based thin-section segmentation studies further suggest that such models are typically developed under substantially larger annotated datasets and more specialized input settings, including paired polarized-light modalities and multi-class grain labels, while data insufficiency and class imbalance can still remain important constraints even under those conditions [52]. Under the current data conditions, a forced comparison of this kind would therefore be more likely to introduce a new source of unfairness than to provide a rigorous basis for evaluation. Future work will require a substantially expanded sandstone dataset with dense manual annotations that is built specifically for supervised segmentation. Based on such a benchmark, specimen-level training, validation, and test splits can be established, enabling fair comparisons with both a strong supervised convolutional model and a transformer-based model under the same input conditions, label definitions, and evaluation metrics.

The applicability of the present workflow is also related to acquisition-induced appearance variation. In PetroSAM-CRF, the brightness-guided weak prior is not implemented as a fixed absolute threshold shared across all images; instead, it is derived from the brightness statistics of each preprocessed image and used together with the SAM stability score during filtering. Combined with the two-stage preprocessing workflow, this design helps mitigate the influence of moderate fluctuations in illumination and contrast under the imaging conditions represented in the present dataset. At the same time, the current study was not designed to isolate microscope-setting variation, inter-laboratory illumination differences, or staining-related appearance shifts as controlled variables. These factors cannot be rigorously evaluated by simply pooling more images from different sources, because acquisition-related differences may be confounded with lithological heterogeneity, texture complexity, and sample-specific variation. A more rigorous assessment of robustness under such conditions will require cross-source petrographic datasets with controlled acquisition metadata and comparable imaging or staining protocols, which also defines an important direction for future work.

In addition, the current quantitative fabric characterization stage handles edge grains using a conservative exclusion strategy rather than geometric reconstruction. Future work may explore stereological correction or mosaic-based reconstruction to further improve robustness to field-of-view truncation.
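The conservative exclusion strategy can be illustrated with a short sketch. The criterion used here, dropping any grain whose mask has at least one pixel on the outermost row or column of the image, is an assumption. The paper's exact rule may differ, and the function names and the label-image format (0 for background, positive integers for grains) are hypothetical.

```python
def border_labels(label_img):
    """Labels of grains that touch the image border (background label 0 ignored)."""
    height, width = len(label_img), len(label_img[0])
    touching = set()
    for x in range(width):
        touching.add(label_img[0][x])
        touching.add(label_img[height - 1][x])
    for y in range(height):
        touching.add(label_img[y][0])
        touching.add(label_img[y][width - 1])
    touching.discard(0)
    return touching


def retained_labels(label_img):
    """Labels of fully enclosed grains, i.e. those not truncated by the field of view."""
    all_labels = {v for row in label_img for v in row if v != 0}
    return all_labels - border_labels(label_img)


# Toy label image: grain 1 touches the left edge, grain 3 the right and
# bottom edges, and grain 2 lies fully inside the field of view.
example = [
    [0, 0, 0, 0, 0],
    [1, 0, 2, 2, 0],
    [1, 0, 2, 2, 0],
    [0, 0, 0, 0, 3],
    [0, 0, 0, 0, 3],
]
kept = retained_labels(example)
```

Only the retained labels would then be passed on to grain-size and shape statistics, which avoids biasing those statistics with truncated outlines at the cost of discarding some grains.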

5.5. Methodological Contribution and Computational Efficiency

The present study is implemented as a three-stage cascaded framework consisting of lithology classification, sandstone grain segmentation, and quantitative parameter evaluation. Although these stages together form a closed automated workflow from image input to geological parameter extraction, they are not jointly optimized within a single end-to-end training process. Within this methodological scope, the contribution of PetroSAM-CRF lies not in modifying the foundation model itself, but in organizing polarization-aware preprocessing, lithology-constrained candidate-mask screening, CRF-based refinement, and morphology cleaning into a sandstone-oriented segmentation strategy for quantitative fabric characterization. This design improves grain-boundary adherence and morphological coherence in complex thin-section scenes, and its effectiveness is reflected not only in conventional segmentation metrics but also in the consistency of downstream geological parameters such as grain-size distribution and roundness.

We also evaluated the practical runtime efficiency of the Rock-PLionNet lithology-classification stage as a standalone benchmark on single high-resolution thin-section images. In this benchmark, each original image was first resized to the standard network input size using the same preprocessing pipeline as in testing, and then processed by a single forward pass. On an NVIDIA GeForce RTX 4060 Laptop GPU, the mean total processing time was 93.85 ms per image. This value reflects the latency of the classification model itself under standalone inference conditions and provides a practical basis for applying the classifier to on-site assisted lithology screening and batch processing of large-scale exploration thin-section image databases.

In addition to the standalone runtime of the classification stage, we further evaluated the computational efficiency of the complete three-stage workflow on single thin-section images under the practical processing logic used in this study. In this workflow, all images first pass through lithology classification, and only those predicted as sandstone enter the subsequent segmentation branch. The SAM predictor is initialized only when the first sandstone image is encountered and is then reused for subsequent sandstone images. On the same NVIDIA GeForce RTX 4060 Laptop GPU, the mean classification time over all tested images was 0.147 s. For sandstone images, the image-enhancement stage required 0.415 s, the one-time SAM initialization required 8.793 s, the SAM-based segmentation stage required 307.934 s, the DenseCRF refinement stage required 8.420 s, the morphological cleanup stage required 0.436 s, and the quantitative evaluation stage required 0.054 s. Under this workflow, non-sandstone images required only 0.127 s on average because they were filtered out before segmentation. Across the mixed image stream used in this analysis, the overall average processing time was 106.881 s per image, which is approximately 1.78 min. These results show that the computational cost of the current implementation is dominated by the SAM-based segmentation stage, whereas DenseCRF contributes an additional but clearly smaller overhead. Therefore, the present workflow is more suitable for offline batch analysis and assisted petrographic evaluation than for real-time high-throughput deployment. Future acceleration should mainly focus on the segmentation stage, for example, by adopting a more efficient SAM variant and using a lighter refinement strategy.
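The relative weight of each stage can be checked with simple arithmetic on the timings reported above. The sketch below makes one assumption that the text does not state: that the mean classification time over all images (0.147 s) also applies to sandstone images. The one-time SAM initialization is kept separate because it is paid once per run rather than once per image.

```python
# Per-image stage times in seconds, as reported for sandstone images.
# Using the all-image mean classification time here is an assumption.
STAGE_SECONDS = {
    "classification": 0.147,
    "enhancement": 0.415,
    "sam_segmentation": 307.934,
    "densecrf": 8.420,
    "morphology": 0.436,
    "quantification": 0.054,
}
SAM_INIT_SECONDS = 8.793  # one-time cost, not included in the per-image total


def sandstone_total(stages=STAGE_SECONDS):
    """Total per-image time for a sandstone image, excluding SAM initialization."""
    return sum(stages.values())


def stage_shares(stages=STAGE_SECONDS):
    """Fraction of the per-image time spent in each stage."""
    total = sandstone_total(stages)
    return {name: seconds / total for name, seconds in stages.items()}


total_seconds = sandstone_total()
shares = stage_shares()
mixed_stream_minutes = 106.881 / 60.0  # reported overall mean, converted to minutes
```

Under this assumption a sandstone image takes about 317.4 s, of which SAM-based segmentation accounts for roughly 97% and DenseCRF for under 3%, consistent with the statement that segmentation dominates the computational cost.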

6. Conclusions

This study developed and evaluated a three-stage cascaded analysis framework that integrates lithology classification, grain segmentation, and quantitative evaluation for sedimentary rock thin-section analysis. The proposed workflow overcomes common technical bottlenecks of existing studies, in which these tasks are often optimized independently and lack a systematic linkage between classification and segmentation. Such separation prevents downstream processing from fully leveraging prior lithological information and makes it difficult for generic models to align analysis strategies with lithology-specific characteristics when dealing with rock samples from diverse genetic origins.

To address the complex mineral compositions and variable microtextures in thin-section images, this study designed Rock-PLionNet. By integrating the PWC-Fusion module and the Lion optimizer, the model improves microfeature extraction and convergence stability, thereby enhancing discrimination among lithologies with similar visual characteristics. Building on this foundation, we further developed the PetroSAM-CRF segmentation strategy to specifically address blurred boundaries and grain adhesion caused by dense packing of sandstone grains. The contribution of this strategy lies not in redefining the foundation model itself, but in organizing sandstone-oriented preprocessing, lithology-constrained mask screening, and CRF-based refinement into a domain-tailored segmentation workflow for quantitative fabric characterization. This design improves the localization accuracy of contact boundaries between adhering grains and enhances the precision of geometric and morphological characterization.

Overall, the proposed cascaded pipeline provides a standardized route for converting microscopic image pixels into quantitative geological parameters. It improves the objectivity and reproducibility of key indices, such as grain-size distribution and sorting, and supports more reliable characterization of reservoir microscopic heterogeneity. Future work will extend the framework to larger cross-source thin-section datasets with denser manual annotations and more tightly controlled imaging, acquisition, and sample-preparation protocols, thereby improving the reliability of geological interpretation and enabling its application to more heterogeneous lithologies. In parallel, the segmentation branch will be further optimized through more efficient SAM-based inference, lighter refinement strategies, and improved treatment of edge grains, so as to strengthen robustness and practical deployability.

Author Contributions

Conceptualization, W.Y. and A.L.; methodology, W.Y. and A.L.; software, W.Y., A.L. and L.Z.; validation, W.Y. and X.Q.; formal analysis, W.Y., A.L. and L.Z.; investigation, W.Y.; resources, A.L.; data curation, W.Y. and X.Q.; writing—original draft preparation, W.Y.; writing—review and editing, A.L. and L.Z.; visualization, W.Y. and X.Q.; supervision, A.L.; project administration, A.L.; funding acquisition, A.L. and L.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research is jointly supported by the following projects: the Scientific Research Startup Fund Project of China University of Petroleum (Beijing) at Karamay Campus, “Evaluation of Shale Formation Fracturability Based on Seismic Rock Physics (XQZX20240015)”; the Distinguished Young Scientists Fund of the Natural Science Foundation of Xinjiang Uygur Autonomous Region, “Mechanism of Seismic Imaging and Omnidirectional Velocity Modeling Methods for Ultra-Deep Layers in the Central Tarim Basin (2024D01E08)”; the Scientific Research Startup Fund Project of China University of Petroleum (Beijing) at Karamay Campus, “Anisotropy Analysis and Correction Methods for Wide-Azimuth Seismic Data in Shale Oil Exploration (XQZX20240029)”; the Provincial Key Research and Development Plan of Xinjiang Uygur Autonomous Region (2024B01016, 2024B01016-2, 2024B01016-3); and the “Tianchi Talent” Innovation Leadership Program of Xinjiang Uygur Autonomous Region.

Data Availability Statement

The data used in this study were obtained from multiple open-access microscopic image datasets available through China Scientific Data. Representative dataset sources are described and cited in Section 3.1 of the manuscript. No proprietary dataset was used in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

PWC-Fusion	Partial-to-Whole Context Fusion
DenseCRF	Dense Conditional Random Field
CNN	Convolutional Neural Network
ViT	Vision Transformer
SAM	Segment Anything Model
AI	Artificial Intelligence
Fused-MBConv	Fused Mobile Inverted Bottleneck Convolution
GAP	Global Average Pooling
ReLU	Rectified Linear Unit
HSV	Hue, Saturation, Value
CLAHE	Contrast Limited Adaptive Histogram Equalization
FLOPs	Floating Point Operations
PA	Pixel Accuracy
IoU	Intersection over Union
ASSD	Average Symmetric Surface Distance
HD95	95% Hausdorff Distance
GT	Ground Truth

Appendix A. Evaluation Metrics

Appendix A.1. Evaluation Metrics for Lithology Classification

Accuracy: Measures the overall proportion of correctly classified samples across all lithology categories:

$\mathrm{Accuracy} = \frac{1}{N} \sum_{n = 1}^{N} \mathbb{1} (\hat{y}_{n} = y_{n})$

(A1)

where N is the total number of samples, $y_{n}$ is the ground-truth label of the n-th sample, $\hat{y}_{n}$ is the predicted label, and $\mathbb{1} (\cdot)$ is the indicator function.
Precision: Quantifies the proportion of correctly predicted samples among all samples predicted as a specific lithology. For multi-class classification, the reported Precision is the macro-averaged Precision across all lithology classes:

$\mathrm{Precision} = \frac{1}{K} \sum_{k = 1}^{K} \frac{TP_{k}}{TP_{k} + FP_{k}}$

(A2)

where K is the number of lithology classes, $T P_{k}$ denotes the number of true positives for the k-th class, and $F P_{k}$ denotes the number of false positives for the k-th class.
Recall: Indicates the proportion of actual samples of a specific lithology that are correctly identified by the network. For multi-class classification, the reported Recall is the macro-averaged Recall across all lithology classes:

$\mathrm{Recall} = \frac{1}{K} \sum_{k = 1}^{K} \frac{TP_{k}}{TP_{k} + FN_{k}}$

(A3)

where $F N_{k}$ denotes the number of false negatives for the k-th class.
F1-Score: The harmonic mean of Precision and Recall, serving as a comprehensive metric that balances the trade-off between the two. For multi-class classification, the reported F1-Score is the macro-averaged F1 value across all lithology classes:

$\text{F1-Score} = \frac{1}{K} \sum_{k = 1}^{K} \left( 2 \cdot \frac{\mathrm{Precision}_{k} \cdot \mathrm{Recall}_{k}}{\mathrm{Precision}_{k} + \mathrm{Recall}_{k}} \right)$

(A4)

where

$\mathrm{Precision}_{k} = \frac{TP_{k}}{TP_{k} + FP_{k}}, \quad \mathrm{Recall}_{k} = \frac{TP_{k}}{TP_{k} + FN_{k}}$

(A5)

Here, $TP_{k}$, $FP_{k}$, and $FN_{k}$ represent the numbers of true positives, false positives, and false negatives for the k-th lithology class, respectively. For multi-class lithology classification, Precision, Recall, and F1-Score are first computed for each class and then macro-averaged to evaluate the overall model performance. When the denominator of any class-specific metric equals zero, the corresponding metric is set to 0.
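The classification metrics in Equations (A1)–(A5), including the zero-denominator convention, can be sketched in plain Python. The function name and the toy labels are illustrative assumptions. The computation follows the definitions given in this appendix, with F1 averaged over per-class F1 values rather than derived from the macro-averaged Precision and Recall.

```python
def macro_metrics(y_true, y_pred, classes):
    """Accuracy plus macro-averaged Precision, Recall, and F1-Score."""
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

    precisions, recalls, f1s = [], [], []
    for k in classes:
        tp = sum(t == k and p == k for t, p in zip(y_true, y_pred))
        fp = sum(t != k and p == k for t, p in zip(y_true, y_pred))
        fn = sum(t == k and p != k for t, p in zip(y_true, y_pred))
        # Any class-specific metric with a zero denominator is set to 0.
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n_classes = len(classes)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n_classes,
        "recall": sum(recalls) / n_classes,
        "f1": sum(f1s) / n_classes,
    }


# Toy example with the three lithology classes used in this study.
classes = ["sandstone", "limestone", "dolomite"]
y_true = ["sandstone", "sandstone", "limestone", "limestone", "dolomite", "dolomite"]
y_pred = ["sandstone", "limestone", "limestone", "limestone", "dolomite", "sandstone"]
result = macro_metrics(y_true, y_pred, classes)
```

Macro-averaging gives each lithology class equal weight regardless of how many images it contains, which matters when the class distribution is uneven.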

Appendix A.2. Evaluation Metrics for Grain Segmentation

Pixel Accuracy (PA): Measures the overall proportion of correctly classified pixels in the segmentation result:

$\mathrm{PA} = \frac{TP + TN}{TP + TN + FP + FN}$

(A6)

where TP, TN, FP, and FN represent the numbers of true positive, true negative, false positive, and false negative pixels, respectively. In this study, the grain region is treated as the foreground class and the non-grain region is treated as the background class.
Intersection over Union (IoU): Quantifies the overlap between the predicted grain region and the ground-truth grain region:

$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$

(A7)

where TP, FP, and FN represent the numbers of true positive, false positive, and false negative pixels for the foreground grain class, respectively.
Dice Coefficient: Measures the similarity between the predicted grain mask and the ground-truth mask, serving as a robust overlap metric for binary segmentation:

$\mathrm{Dice} = \frac{2 TP}{2 TP + FP + FN}$

(A8)

where TP, FP, and FN denote the numbers of true positive, false positive, and false negative pixels for the foreground grain class, respectively. A larger Dice value indicates better agreement between the predicted mask and the ground-truth mask.
Average Symmetric Surface Distance (ASSD): Measures the average bidirectional boundary distance between the predicted mask and the ground-truth mask:

$\mathrm{ASSD} = \frac{1}{2} \left( \frac{1}{|B_{g}|} \sum_{x \in B_{g}} d (x, B_{p}) + \frac{1}{|B_{p}|} \sum_{y \in B_{p}} d (y, B_{g}) \right)$

(A9)

where $B_{g}$ and $B_{p}$ denote the boundary-point sets of the ground-truth mask and the predicted mask, respectively; $|B_{g}|$ and $|B_{p}|$ denote the number of boundary points in the two sets; and $d (x, B_{p})$ is the shortest Euclidean distance from boundary point x to the boundary set $B_{p}$ , defined as:

$d (x, B_{p}) = \min_{z \in B_{p}} \| x - z \|_{2}$

(A10)

The term $d (y, B_{g})$ is defined analogously. A smaller ASSD value indicates better boundary alignment.
95% Hausdorff Distance (HD95): Evaluates boundary discrepancy by computing the 95th percentile of all bidirectional nearest-boundary distances:

$\mathrm{HD95} = P_{95} \left( \{ d (x, B_{p}) \mid x \in B_{g} \} \cup \{ d (y, B_{g}) \mid y \in B_{p} \} \right),$

(A11)

where $P_{95} (\cdot)$ denotes the 95th percentile operator, $B_{g}$ and $B_{p}$ denote the boundary-point sets of the ground-truth mask and the predicted mask, respectively, and $d (x, B_{p})$ and $d (y, B_{g})$ denote the shortest Euclidean distances from a boundary point to the opposite boundary set. A smaller HD95 value indicates better boundary precision.
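The segmentation metrics in Equations (A6)–(A11) can be sketched in plain Python as follows. Several choices here are assumptions rather than details confirmed by the paper: masks are nested lists of 0/1 values, boundary points are supplied directly as `(row, col)` tuples instead of being extracted from the masks, IoU and Dice return 0 when their denominator is zero, and the 95th percentile uses linear interpolation between order statistics.

```python
import math


def overlap_metrics(pred, gt):
    """Pixel Accuracy, IoU, and Dice for binary masks (grain = 1, background = 0)."""
    tp = tn = fp = fn = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                tp += 1
            elif p:
                fp += 1
            elif g:
                fn += 1
            else:
                tn += 1
    pa = (tp + tn) / (tp + tn + fp + fn)
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0  # assumed convention
    dice = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0
    return pa, iou, dice


def nearest_distances(src, dst):
    """For each point in src, the shortest Euclidean distance to the set dst."""
    return [min(math.dist(x, z) for z in dst) for x in src]


def percentile(values, q):
    """q-th percentile with linear interpolation (an assumed convention)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower, upper = math.floor(pos), math.ceil(pos)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def assd_hd95(boundary_gt, boundary_pred):
    """ASSD (mean of the two directional means) and HD95 (pooled 95th percentile)."""
    d_gp = nearest_distances(boundary_gt, boundary_pred)
    d_pg = nearest_distances(boundary_pred, boundary_gt)
    assd = 0.5 * (sum(d_gp) / len(d_gp) + sum(d_pg) / len(d_pg))
    hd95 = percentile(d_gp + d_pg, 95)
    return assd, hd95
```

Distances are returned in pixel units. Converting them to physical units would require the image scale, which is not part of this sketch.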

References

Avseth, P.; Mukerji, T.; Mavko, G.; Dvorkin, J. Rock-physics diagnostics of depositional texture, diagenetic alterations, and reservoir heterogeneity in high-porosity siliciclastic sediments and rocks—A review of selected models and suggested work flows. Geophysics 2010, 75, 75A31–75A47.
Worden, R.H.; Utley, J.E. Automated mineralogy (SEM-EDS) approach to sandstone reservoir quality and diagenesis. Front. Earth Sci. 2022, 10, 794266.
Payton, R.L.; Chiarella, D.; Kingdon, A. The influence of grain shape and size on the relationship between porosity and permeability in sandstone: A digital approach. Sci. Rep. 2022, 12, 7531.
Torskaya, T.; Shabro, V.; Torres-Verdín, C.; Salazar-Tio, R.; Revil, A. Grain shape effects on permeability, formation factor, and capillary pressure from pore-scale modeling. Transp. Porous Media 2014, 102, 71–90.
Liu, H.; Ren, Y.-L.; Li, X.; Hu, Y.-X.; Wu, J.-P.; Li, B.; Luo, L.; Tao, Z.; Liu, X.; Liang, J. Rock thin-section analysis and identification based on artificial intelligent technique. Pet. Sci. 2022, 19, 1605–1621.
Niegel, S.; Franz, M. Depositional and diagenetic controls on porosity evolution in sandstone reservoirs of the Stuttgart Formation (North German Basin). Mar. Pet. Geol. 2023, 151, 106157.
Zhang, K.; Wang, C.; Tan, F.; Sun, M. The research progress on shale oil geological analysis driven by big data: Multisource integration methods, key applications, and technical challenges. Adv. Resour. Res. 2025, 5, 2702–2742.
Umoren, N.; Odum, M.I. Exploring the Role of Big Data in Petroleum Exploration: Using Advanced Analytics for More Efficient Decision-Making in Exploration Projects. Int. J. Multidiscip. Res. Growth Eval. 2020, 1, 173–179.
Fan, J.; Yu, X.; Di, Y.; Lv, T.; Zhang, R.; Bao, J.; Liu, Y.; Li, L.; Pan, X. A foundation model for rock thin-section images analysis. Commun. Eng. 2025, 5, 9.
Rubo, R.A.; de Carvalho Carneiro, C.; Michelon, M.F.; dos Santos Gioria, R. Digital petrography: Mineralogy and porosity identification using machine learning algorithms in petrographic thin section images. J. Pet. Sci. Eng. 2019, 183, 106382.
Külekçi, G. Geological thin sections and mineral analysis using light microscopy a comprehensive study. Bull. Miner. Res. Explor. 2025, 177, 1–2.
Ali, J.; Ansari, U.; Ali, F.; Javed, T.; Hullio, I.A. Application of Machine Learning for Effective Screening of Enhanced Oil Recovery Methods. Reserv. Sci. 2026, 2, 65–80.
Hu, Y.; Yang, Y. A comparative study on drag reduction methods for continental shale drilling in the Fuxing Block, southeastern Sichuan Basin. Reserv. Sci. 2026, 2, 81–96.
Yang, Y.; Huang, F.; Kang, S. Mechanism of Penetration Rate Improvement in Hot Dry Rock Under the Coupling of Impact Load and Confining Pressure Release. Reserv. Sci. 2026, 2, 52–64.
Külekçi, G. Madencilik Operasyonlarında Segmentasyon Teknolojileri: Uydu ve Dron Verilerinden Bilgi Çıkarmada Derin Öğrenme Yaklaşımları. Int. J. Adv. Soc. Sci. Educ. (IJASSE) 2024, 8, 732–740.
Wang, C.; Li, P.; Long, Q.; Chen, H.; Wang, P.; Meng, Z.; Wang, X.; Zhou, Y. Deep learning for refined lithology identification of sandstone microscopic images. Minerals 2024, 14, 275.
Guo, X.; Chen, Y.; He, S.; Zhang, X.; Zhou, J.; Bao, X. Multi-scale channel enhanced transformer for rock thin sections identification and sequence consistency optimization. Comput. Geosci. 2025, 29, 19.
Lv, P.; Chen, W.; Zou, X. Precision recognition of rock thin section images with multi-head self-attention convolutional neural networks. J. Geophys. Res. Mach. Learn. Comput. 2025, 2, e2025JH000617.
Van den Berg, E.; Meesters, A.; Kenter, J.; Schlager, W. Automated separation of touching grains in digital images of thin sections. Comput. Geosci. 2002, 28, 179–190.
Polat, Ö.; Polat, A.; Ekici, T. Automatic classification of volcanic rocks from thin section images using transfer learning networks. Neural Comput. Appl. 2021, 33, 11531–11540.
Xu, Z.; Ma, W.; Lin, P.; Shi, H.; Pan, D.; Liu, T. Deep learning of rock images for intelligent lithology identification. Comput. Geosci. 2021, 154, 104799.
Wu, H.; Dai, Y.-J.; Liu, X.-Y. Efficient Sedimentary Facies Recognition Using Vision Transformer and Weakly Supervised Deep Multi-View Clustering. IEEE Access 2025, 13, 77522–77538.
Koeshidayatullah, A.; Al-Azani, S.; Baraboshkin, E.E.; Alfarraj, M. FaciesViT: Vision transformer for an improved core lithofacies prediction. Front. Earth Sci. 2022, 10, 992442.
Cao, Z.; Ma, C.; Tang, W.; Zhou, Y.; Zhong, H.; Ye, S.; Wu, K.; Chen, X.; Zheng, D.; Hou, L. CoreViT: A new vision transformer model for lithofacies identification in cores. Geoenergy Sci. Eng. 2024, 240, 213012.
Aydın, İ.; Kılıç, A.D.; Şener, T.K. Improving Rock Type Identification Through Advanced Deep Learning-Based Segmentation Models: A Comparative Study. Appl. Sci. 2025, 15, 1630.
Khan, A.; Rauf, Z.; Sohail, A.; Khan, A.R.; Asif, H.; Asif, A.; Farooq, U. A survey of the vision transformers and their CNN-transformer based variants. Artif. Intell. Rev. 2023, 56, 2917–2970.
Wang, M.; Guo, W.; Yang, F.; Yan, B.; Xu, Y.; Jiang, J.; Huang, J. Rock thin section image classification in low data scenarios using few-shot learning. Comput. Geosci. 2025, 203, 105962.
Liu, T.; Liu, Z.; Zhang, K.; Li, C.; Zhang, Y.; Mu, Z.; Mu, M.; Xu, M.; Zhang, Y.; Li, X. Research on the generation and annotation method of thin section images of tight oil reservoir based on deep learning. Sci. Rep. 2024, 14, 12805.
Lu, K.; Xu, Y.; Yang, Y. Comparison of the potential between transformer and CNN in image classification. In Proceedings of the ICMLCA 2021, 2nd International Conference on Machine Learning and Computer Application, Shenyang, China, 17–19 December 2021; pp. 1–6.
Appiah-Twum, M.; Xu, W.; Acheampong, E.M. DenseViT: A Hybrid CNN-Vision Transformer Model for an Improved Multisensor Lithological Classification. In Proceedings of the IGARSS 2024—2024 IEEE International Geoscience and Remote Sensing Symposium, Athens, Greece, 7–12 July 2024; pp. 3418–3422.
Zhang, D.; Qian, X.; Shi, C.; Zhang, Y.; Qian, Y.; Zhou, S. Iron Ore Image Recognition Through Multi-View Evolutionary Deep Fusion Method. Future Internet 2025, 17, 553.
Singh, N.; Singh, T.; Tiwary, A.; Sarkar, K.M. Textural identification of basaltic rock mass using image processing and neural network. Comput. Geosci. 2010, 14, 301–310.
Tan, M.; Le, Q. EfficientNetV2: Smaller Models and Faster Training. In Proceedings of the 38th International Conference on Machine Learning, Online, 18–24 July 2021; pp. 10096–10106.
Chen, X.; Liang, C.; Huang, D.; Real, E.; Wang, K.; Pham, H.; Dong, X.; Luong, T.; Hsieh, C.-J.; Lu, Y. Symbolic discovery of optimization algorithms. Adv. Neural Inf. Process. Syst. 2023, 36, 49205–49233.
Saxena, N.; Day-Stirrat, R.J.; Hows, A.; Hofmann, R. Application of deep learning for semantic segmentation of sandstone thin sections. Comput. Geosci. 2021, 152, 104778.
Das, R.; Mondal, A.; Chakraborty, T.; Ghosh, K. Deep neural networks for automatic grain-matrix segmentation in plane and cross-polarized sandstone photomicrographs. Appl. Intell. 2022, 52, 2332–2345.
Karimpouli, S.; Tahmasebi, P. Segmentation of digital rock images using deep convolutional autoencoder networks. Comput. Geosci. 2019, 126, 142–150.
Yu, J.; Wellmann, F.; Virgo, S.; von Domarus, M.; Jiang, M.; Schmatz, J.; Leibe, B. Superpixel segmentations for thin sections: Evaluation of methods to enable the generation of machine learning training data sets. Comput. Geosci. 2023, 170, 105232.
Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 4015–4026.
Sylvester, Z.; Stockli, D.F.; Howes, N.; Roberts, K.; Malkowski, M.A.; Poros, Z.; Martindale, R.C.; Bai, W. Segmenteverygrain: A Python module for segmentation of grains in images. J. Open Source Softw. 2025, 10, 7953.
Zhang, Y.; Konz, N.; Kramer, K.; Mazurowski, M.A. Quantifying the Limits of Segmentation Foundation Models: Modeling Challenges in Segmenting Tree-Like and Low-Contrast Objects. arXiv 2024, arXiv:2412.04243.
Krähenbühl, P.; Koltun, V. Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Granada, Spain, 12–14 December 2011; pp. 109–117.
Azzam, F.; Blaise, T.; Brigaud, B. Automated petrographic image analysis by supervised and unsupervised machine learning methods. Sedimentologika 2024, 2, 1594.
Barbosa, R.T.; Faria, E.; Klatt, M.; Silva, T.C.; Coelho, J.M.; Matos, T.F.; Santos, B.C.; Gonzalez, J.; Bom, C.R.; de Albuquerque, M.P. Unsupervised segmentation for sandstone thin section image analysis. Comput. Geosci. 2024, 28, 1049–1057.
Ren, Y.; Zeng, C.; Li, X.; Liu, X.; Hu, Y.; Su, Q.; Wang, X.; Lin, Z.; Zhou, Y.; Hu, H. Intelligent evaluation of sandstone rock structure based on a visual large model. Pet. Explor. Dev. 2025, 52, 548–558.
SY/T 5368-2016; Identification for Thin Section of Rocks. Petroleum Industry Press: Beijing, China, 2016.
Lai, W.; Jiang, J.; Qiu, J.; Yu, J.; Hu, X. A photomicrograph dataset of rocks for petrology teaching at Nanjing University. China Sci. Data 2020, 5, 26–38. [Google Scholar] [CrossRef]
Li, P.; Li, Y.; Chen, X.; Wang, Y.; Li, C.; Liu, Z. A photomicrograph dataset of Upper Paleozoic tight sandstone from Linxing block, eastern margin of Ordos Basin. China Sci. Data 2020, 5, 163–169. [Google Scholar] [CrossRef]
Zhang, S.; Hu, X. Polarized light micrograph dataset of Late Cretaceous-Eocene rock thin sections from western Tarim Basin, Xinjiang. China Sci. Data 2020, 5, 59–69. [Google Scholar] [CrossRef]
Wentworth, C.K. A Scale of Grade and Class Terms for Clastic Sediments. J. Geol. 1922, 30, 377–392. [Google Scholar] [CrossRef]
Külekçi, G.; Hacıefendioğlu, K.; Başağa, H.B. Enhancing mineral processing with deep learning: Automated quartz identification using thin section images. Int. J. Miner. Metall. Mater. 2025, 32, 802–816. [Google Scholar] [CrossRef]
Zheng, D.; Hou, L.; Hu, X.; Hou, M.; Dong, K.; Hu, S.; Teng, R.; Ma, C. Sediment grain segmentation in thin-section images using dual-modal Vision Transformer. Comput. Geosci. 2024, 191, 105664. [Google Scholar] [CrossRef]

Figure 1. Overall workflow of the proposed three-stage automated framework for petrographic thin-section analysis.

Figure 2. Overall architecture of the proposed Rock-PLionNet for thin-section lithology classification. The stage-wise Fused-MBConv and MBConv settings follow the default EfficientNetV2-S configuration, and the dimensions shown in the figure correspond to the output feature-map size of each stage. $F_{low}$ and $F_{high}$ denote the low-level feature extracted from Stage 3 (28 × 28 × 64) and the high-level feature extracted from Stage 7 (7 × 7 × 1280), respectively.

Figure 3. Schematic of the proposed PWC-Fusion module for cross-scale feature integration. (a) Input low-level feature map $F_{low}$. (b) Input high-level feature map $F_{high}$. (c) Fused feature map $F_{fused}$. (d) Difference map $|F_{fused} - F_{high}|$.

Figure 4. Overall workflow of PetroSAM-CRF for thin-section grain segmentation.

Figure 5. Two-stage image enhancement workflow used in PetroSAM-CRF.

Figure 6. Illustration of geometric measurements for an individual grain extracted from the segmentation mask: (a) isolated grain region, (b) boundary contour, (c) minimum-area bounding rectangle, and (d) major and minor axes.
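
The measurements in panels (c) and (d) can be illustrated with a minimal sketch. It assumes a grain contour is available as a list of (x, y) pixel coordinates and finds the minimum-area bounding rectangle by testing every convex-hull edge as a candidate side. The function names `convex_hull` and `min_area_rect_axes` are hypothetical and are not taken from the authors' code; in practice a library routine such as OpenCV's `cv2.minAreaRect` would likely serve the same purpose.

```python
import math


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def min_area_rect_axes(contour):
    """Return (major, minor) side lengths of the minimum-area bounding rectangle.

    One side of the minimum-area rectangle is collinear with an edge of the
    convex hull, so each hull edge is tried in turn.
    """
    hull = convex_hull(contour)
    if len(hull) < 3:
        raise ValueError("contour needs at least three non-collinear points")
    best = None
    for i in range(len(hull)):
        (x0, y0), (x1, y1) = hull[i], hull[(i + 1) % len(hull)]
        length = math.hypot(x1 - x0, y1 - y0)
        ux, uy = (x1 - x0) / length, (y1 - y0) / length  # unit vector along the edge
        along = [x * ux + y * uy for x, y in hull]       # extent along the edge
        across = [-x * uy + y * ux for x, y in hull]     # extent along its normal
        w = max(along) - min(along)
        h = max(across) - min(across)
        if best is None or w * h < best[0]:
            best = (w * h, max(w, h), min(w, h))
    return best[1], best[2]


# Hypothetical contour: a 4 x 2 pixel rectangle with one interior point.
major, minor = min_area_rect_axes([(0, 0), (4, 0), (4, 2), (0, 2), (2, 1)])
aspect_ratio = major / minor
```

Multiplying `major` by the image scale (micrometres per pixel, which depends on the microscope setup) would convert the pixel length into a physical grain size.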

Figure 7. Representative thin-section images of the three lithology classes used in this study (limestone, dolomite, and sandstone).

Figure 8. Training data preprocessing and augmentation strategy: (a) random crop window on the resized image, (b) cropped patch, (c) horizontal flip, (d) vertical flip, (e) random rotation, and (f) random erasing.

Figure 9. Confusion matrices of different models on the test set for the thin-section lithology classification task: (a) ResNet50, (b) VGG19, (c) ConvNeXt-Base, (d) Swin-T, (e) ViT-Base, and (f) Rock-PLionNet.

Figure 10. Grad-CAM visualizations comparing discriminative regions highlighted by ConvNeXt, VGG19, and Rock-PLionNet for representative thin-section images.

Figure 11. Training and validation loss curves for the four ablation configurations of Rock-PLionNet; (a–d) correspond to the settings in Table 2 (from top to bottom).

Figure 12. Visual comparison of the ground truth (GT) with segmentation results on sandstone thin-section images produced by K-means, Otsu, watershed, an unsupervised CNN, and the proposed PetroSAM-CRF.

Figure 13. Qualitative comparison of intermediate outputs in the PetroSAM-CRF pipeline on multiple thin-section images. (a) Original images with regions of interest highlighted in red; (b) segmentation produced by SAM alone; (c) SAM results after the proposed pre-processing; (d) final results of PetroSAM-CRF with segmentation refinement.

Figure 14. Error comparison of geological-parameter estimation derived from different segmentation methods. (a) Mean absolute deviation of grain-size distribution across size bins. (b) Relative deviation of the mean roundness index from the ground truth. Lower values indicate better agreement.

Figure 15. Illustration of major-axis grain-size measurement. (a) Manual long-axis measurement by two operators (colored line segments). (b) Automated measurement based on the major axis of the minimum-area bounding rectangle (blue).

Figure 16. Grain-size statistics of the sandstone thin-section sample. (a) Area-weighted proportions of grain-size classes. (b) Number-based grain-size histogram with the cumulative percentage curve.
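
The difference between the two panels can be made concrete with a short sketch. It assumes each grain is described by a major-axis size in millimetres and an area in square millimetres, and it bins grains with the Wentworth (1922) class boundaries; the exact class scheme and boundary handling used for the figure are not restated here, so the bins and the names `grain_class` and `class_proportions` are illustrative assumptions.

```python
from bisect import bisect_right

# Wentworth (1922) class boundaries in millimetres (lower bound inclusive here,
# which is an arbitrary choice for this sketch).
BOUNDS_MM = [0.0625, 0.125, 0.25, 0.5, 1.0, 2.0]
CLASSES = [
    "silt or finer",
    "very fine sand",
    "fine sand",
    "medium sand",
    "coarse sand",
    "very coarse sand",
    "gravel",
]


def grain_class(size_mm):
    """Map a grain size in mm to its Wentworth class name."""
    return CLASSES[bisect_right(BOUNDS_MM, size_mm)]


def class_proportions(grains):
    """Return (number-based, area-weighted) class proportions.

    `grains` is a list of (size_mm, area_mm2) tuples.
    """
    if not grains:
        raise ValueError("no grains supplied")
    counts = dict.fromkeys(CLASSES, 0.0)
    areas = dict.fromkeys(CLASSES, 0.0)
    for size_mm, area_mm2 in grains:
        name = grain_class(size_mm)
        counts[name] += 1
        areas[name] += area_mm2
    n = len(grains)
    total_area = sum(area for _, area in grains)
    by_number = {name: counts[name] / n for name in CLASSES}
    by_area = {name: areas[name] / total_area for name in CLASSES}
    return by_number, by_area


# Made-up grains: one very fine, two medium, one coarse.
by_number, by_area = class_proportions(
    [(0.1, 0.01), (0.3, 0.09), (0.3, 0.09), (0.6, 0.36)]
)
```

In this made-up sample the single coarse grain is a quarter of the grains by number but about two thirds of the total grain area, which is why the two views of the same sample can look quite different.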

Figure 17. Grain-shape statistics for the analyzed thin-section sample. (a) Cumulative distribution of aspect ratio; (b) histogram of roundness index. The red dashed line indicates the mean value.

Table 1. Controlled comparison of the proposed architecture and several widely used deep learning classifiers for thin-section lithology classification on the test set. All models were trained with AdamW under the same training setting.

Method	Accuracy	Precision	Recall	F1-Score	Loss	Parameters (M)	FLOPs (G)
ResNet50	0.9467	0.9473	0.9457	0.9464	0.6364	23.51	4.13
VGG19	0.9219	0.9280	0.9136	0.9189	0.6602	139.58	19.63
ConvNeXt-Base	0.9086	0.9088	0.9111	0.9086	0.6785	87.55	15.37
Swin-T	0.9314	0.9296	0.9350	0.9319	0.6593	27.52	2.98
ViT-Base	0.8533	0.8532	0.8627	0.8537	0.7426	85.61	16.85
EfficientNetV2-S + PWC-Fusion (AdamW)	0.9638	0.9638	0.9645	0.9641	0.6235	20.47	2.90

Table 2. Ablation study of the proposed framework. The first two configurations evaluate the contribution of PWC-Fusion under AdamW, whereas the last two configurations compare AdamW and Lion under the same learning rate and weight decay settings while keeping the architecture unchanged. For each configuration, the best checkpoint was selected on the validation set, and the final metrics were reported on the test set.

Method	Optimizer	Learning Rate	Weight Decay	Accuracy	Precision	Recall	F1-Score	Loss
EfficientNet (Baseline)	AdamW	1 × 10⁻⁴	0.01	0.9543	0.9540	0.9540	0.9540	0.6425
EfficientNet + PWC-Fusion	AdamW	1 × 10⁻⁴	0.01	0.9638	0.9638	0.9645	0.9641	0.6235
EfficientNet + PWC-Fusion	AdamW	1 × 10⁻⁵	0.1	0.9314	0.9301	0.9322	0.9310	0.6467
EfficientNet + PWC-Fusion (Rock-PLionNet)	Lion	1 × 10⁻⁵	0.1	0.9657	0.9691	0.9625	0.9655	0.6098
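
For readers comparing the last two rows, the following is a minimal sketch of a single Lion update following the rule published by Chen et al. (2023), written for a flat list of scalar parameters. The learning rate and weight decay mirror the Lion row of Table 2; the momentum coefficients `beta1=0.9` and `beta2=0.99` are the defaults from that paper and are an assumption here, since this table does not report them. The name `lion_step` is hypothetical.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def lion_step(params, grads, momentum, lr=1e-5, beta1=0.9, beta2=0.99,
              weight_decay=0.1):
    """Apply one Lion update and return (new_params, new_momentum).

    The step direction is the sign of an interpolation between the stored
    momentum and the current gradient, so every parameter moves by the same
    magnitude lr (plus decoupled weight decay).
    """
    new_params, new_momentum = [], []
    for theta, g, m in zip(params, grads, momentum):
        c = beta1 * m + (1.0 - beta1) * g           # interpolated direction
        theta = theta - lr * (sign(c) + weight_decay * theta)
        m = beta2 * m + (1.0 - beta2) * g           # momentum uses a slower average
        new_params.append(theta)
        new_momentum.append(m)
    return new_params, new_momentum


# Made-up values for two parameters, starting from zero momentum.
params, momentum = lion_step([1.0, -2.0], [0.5, -0.3], [0.0, 0.0])
```

Because the update uses only the sign of the direction, its per-parameter step size is set directly by the learning rate. This is one commonly cited reason why Lion is run with a smaller learning rate and larger weight decay than AdamW, consistent with the settings in the last row.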

Table 3. Quantitative comparison of segmentation methods. Best results are highlighted in bold. ↑/↓ indicate higher/lower is better.

Method	PA ↑	IoU ↑	Dice ↑	ASSD ↓	HD95 ↓
K-means	0.8547	0.7665	0.8657	11.41	58.89
Otsu	0.8572	0.7683	0.8668	10.76	56.70
Watershed	0.8637	0.7725	0.8694	9.19	41.60
Unsup-CNN	0.7623	0.6620	0.7915	13.32	58.74
SAM	0.8620	0.7777	0.8670	7.42	29.84
PetroSAM-CRF	0.9113	0.8531	0.9185	5.73	23.19
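
As an illustration of the three overlap metrics in the table, the sketch below computes pixel accuracy (PA), IoU, and Dice for a pair of binary masks using their standard confusion-count definitions. It assumes masks are nested lists of 0/1 values of equal shape; the function name `segmentation_scores` is hypothetical, and the boundary-distance metrics ASSD and HD95 are not covered.

```python
def segmentation_scores(pred, truth):
    """Return PA, IoU and Dice for two binary masks (nested lists of 0/1)."""
    tp = fp = fn = tn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1      # grain pixel correctly predicted
            elif p:
                fp += 1      # predicted grain, actually background
            elif t:
                fn += 1      # missed grain pixel
            else:
                tn += 1      # background correctly predicted
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("masks are empty")
    union = tp + fp + fn
    return {
        "PA": (tp + tn) / total,
        # Two empty masks agree perfectly, so score them as 1.0.
        "IoU": tp / union if union else 1.0,
        "Dice": 2 * tp / (2 * tp + fp + fn) if union else 1.0,
    }


# Made-up 2 x 3 masks: two true positives, one false positive, one false negative.
scores = segmentation_scores(
    [[1, 1, 0], [0, 1, 0]],
    [[1, 0, 0], [0, 1, 1]],
)
```

For a single mask pair, Dice and IoU are linked by Dice = 2·IoU / (1 + IoU). Values averaged over many images, as dataset-level scores typically are, need not satisfy this identity exactly.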

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yang, W.; Li, A.; Zhang, L.; Qin, X. Deep Learning-Based Intelligent Analysis of Rock Thin Sections: From Cross-Scale Lithology Classification to Grain Segmentation for Quantitative Fabric Characterization. Electronics 2026, 15, 1509. https://doi.org/10.3390/electronics15071509
