Intelligent Recognition of Slope Discontinuities via Cross-Modal Fusion of Object Detection and Point Cloud Segmentation

Liu, Hongwei; Xiao, Ke; Lin, Hang

doi:10.3390/app16115460

by Hongwei Liu ¹, Ke Xiao ² and Hang Lin ¹,*

¹ School of Resources and Safety Engineering, Central South University, Changsha 410083, China
² Changde Construction Engineering Quality and Safety Supervision Station, Changde 415000, China
* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(11), 5460; https://doi.org/10.3390/app16115460 (registering DOI)

Submission received: 27 April 2026 / Revised: 28 May 2026 / Accepted: 28 May 2026 / Published: 31 May 2026


Abstract

Structural planes widely developed in slope rock masses are key geological elements governing deformation, failure modes and engineering stability. Traditional manual logging suffers from low efficiency, high safety risks and inadequate data integrity, failing to meet large-scale and refined survey needs. This paper proposes a cross-modal collaborative recognition system for slope discontinuities. The principal methodological contribution is the cross-modal ROI-guidance mechanism itself: 2D detection bounding boxes are back-projected through pixel-to-point-cloud registration to construct region-of-interest constraints in 3D space, transforming intractable global blind-search segmentation into localized oriented analysis within bounded volumes—to the best of the authors’ knowledge, the first systematic establishment of such a “visual detection → ROI-guided 3D analysis” framework for slope discontinuity characterization. Within this paradigm, established modules are adapted to the discontinuity recognition task rather than newly invented: channel attention, bidirectional multi-scale fusion and angle-aware regression are integrated into the detection backbone to address the weak texture contrast, large-scale span and extreme aspect-ratio morphology of discontinuity targets, while a PCA–DBSCAN–RANSAC cascade operating within the ROI volumes extracts dip direction, dip angle, spacing and trace length. Validated on two typical slopes in Hunan Province, the improved network achieves a mAP@0.5 of 89.4%, the average IoU of point cloud segmentation is 82.6–86.3%, the dip angle RMSE is 2.46° and the spacing average relative error is 6.8%. The full workflow takes about 86 min, a 19.5-fold efficiency gain over manual methods, and provides an automated pipeline from heterogeneous remote sensing data to engineering-usable structural parameters. 
The resulting outputs are organized in a tabular schema compatible with mainstream discrete-element software such as 3DEC and UDEC, where they serve as geometric inputs to downstream stability modelling once site-specific mechanical calibration is performed. The two-site validation reported here should accordingly be read as a proof of operational feasibility within the limestone and sandstone–mudstone envelope examined, with broader deployment to other lithologies identified as the natural next phase of evaluation.

Keywords:

slope discontinuities; cross-modal fusion; object detection; point cloud segmentation; intelligent recognition

1. Introduction

Slope engineering is one of the most common forms of geological engineering in infrastructure construction such as transportation, water conservancy and mining [1,2]. Due to the long-term effects of geological processes such as tectonic movements, weathering and erosion, and stress unloading, a large number of discontinuous structural planes, including joints, fissures, bedding planes and faults, often develop inside rock slopes [3,4]. The spatial distribution and geometric characteristics of these structural planes fundamentally control the mechanical behavior and deformation failure modes of rock masses and are key factors in inducing geological disasters such as landslides and collapses. The traditional structural plane investigation mainly relies on manual compass measurement and on-site recording, which not only has low operational efficiency but also poses a potential threat to the safety of investigators. At the same time, it is limited by site accessibility and subjective judgment bias, making it difficult to obtain complete and objective structural plane data. Therefore, how to efficiently and accurately acquire large-scale slope discontinuity information while ensuring safety has become the core bottleneck constraining the advancement of digital slope survey.

Non-contact remote sensing technology has provided new means of addressing this challenge. Terrestrial laser scanning (TLS) can acquire dense point clouds of rock surfaces with millimeter-level accuracy and has accumulated substantial engineering experience in structural plane identification and tunnel rock block stability analysis [5]. Unmanned aerial vehicle (UAV) photogrammetry, with its flexible flight path planning and low cost, has gradually become the mainstream data acquisition scheme for high and steep slopes [6]. The joint use of UAV images and TLS point clouds achieves effective complementarity between data coverage and geometric accuracy [7], and semi-georeferenced dense point clouds have shown promising application prospects, with much higher collection efficiency than traditional total station solutions [8]. In summary, 3D data acquisition technologies have matured to the point of providing millimeter-level geometric accuracy; the remaining challenge has shifted from data quality to data utilization, namely how to efficiently extract engineering-usable structural parameters from increasingly massive point cloud datasets.

Regarding point cloud-based discontinuity extraction, the first persistent bottleneck is the parameter sensitivity and poor adaptability of traditional geometric algorithms. Methods such as optimized fuzzy K-means clustering [9], DBSCAN density clustering combined with principal component analysis [10], RANSAC plane fitting [11], region growing [12] and hybrid cascading strategies combining fast density peak search with DBSCAN [13] have each demonstrated utility in specific settings, but they share a common weakness: their performance is strongly dependent on manually tuned parameters and degrades substantially when confronted with uneven point cloud density or severe rock surface undulations [14]. The second bottleneck emerges when deep learning is introduced to improve robustness. The PointNet++ architecture has shown superior feature learning capability in large-scale rock discontinuity recognition but exhibits insufficient fine feature expression for non-uniform point clouds [15,16]; RL-JointNet achieved high global recognition accuracy through explicit relative position encoding and multi-path feature fusion [17]; frameworks coupling fuzzy C-means clustering with convolutional networks have combined unsupervised pre-grouping with supervised fine recognition [18,19]; GoogLeNet has been applied to fast discontinuity classification [20]; and autoencoder-based unsupervised methods have opened new avenues for structural plane characterization under noise [19]. Despite these advances, all existing point cloud methods, whether geometric or learning-based, operate in a “blind-search” mode over the entire point cloud without high-level semantic guidance. This forces a fundamental trade-off between computational cost and segmentation granularity that becomes untenable when the point cloud scale reaches tens of millions of points. Moreover, the absence of physical constraints on rock mass discontinuity topology causes generalization to degrade in highly fractured or complex geological scenarios.

In parallel, convolutional neural networks have achieved notable progress in rock fracture detection from 2D imagery, but the existing improvements are predominantly tailored to landslide, rockfall and crack targets rather than discontinuity planes. The improved U-Net architecture has been shown to extract outcrop fracture information even under extremely scarce annotated samples [21]. CRFSegNet achieved high-accuracy segmentation through multi-scale feature fusion and can automatically calculate fracture geometric parameters [22], and several YOLO-series adaptations have been proposed for geological hazard monitoring, including InSAR-YOLOv8 with a dedicated small-target detection head for landslide identification [23], YOLOv8-CSM for open-pit mine slope crack recognition [24] and attention-enhanced variants for rockfall detection [25,26]. However, slope discontinuities differ fundamentally from these targets in exhibiting extreme aspect ratios, weak texture contrast against intact rock surfaces and multi-scale coexistence within single images—characteristics that demand dedicated network adaptations rather than direct transfer of existing architectures. More critically, all 2D image detection methods share an inherent limitation: they cannot directly provide three-dimensional orientation and geometric parameters, leaving a persistent gap between visual detection outputs and the engineering parameters actually required for stability analysis.

This complementarity between the two data modalities—2D imagery offering efficient semantic localization over large fields of view but lacking 3D geometric output, and 3D point clouds providing precise spatial coordinates but suffering prohibitive computational cost in global blind-search mode—points to a natural integration strategy. If the rapid detection capability of 2D imagery can be used to constrain the spatial scope of 3D point cloud analysis through cross-modal information transfer, the global blind-search problem can be transformed into a localized oriented analysis problem, simultaneously addressing the dual bottlenecks of computational efficiency and segmentation accuracy. Yet such a cross-modal collaborative paradigm has not been established in the existing literature on rock mass discontinuity characterization, and the specific challenges of adapting detection networks to the unique morphological characteristics of discontinuity targets remain unaddressed.

In view of this, this paper proposes an intelligent recognition system that integrates object detection with point cloud segmentation under a cross-modal collaborative paradigm. The principal methodological contribution lies in the cross-modal ROI-guidance mechanism itself: 2D detection bounding boxes are back-projected through pixel-to-point-cloud registration to delineate spatially bounded regions of interest in 3D point cloud space, thereby transforming the global blind-search segmentation problem into a localized oriented analysis problem within constrained volumes. To the best of the authors’ knowledge, this is the first systematic establishment of a “visual detection → ROI-guided 3D analysis” cross-modal framework for slope discontinuity characterization, and the contribution is independent of the specific choice of detection or segmentation algorithms employed within it. Within this paradigm, two families of well-established components are adapted to the discontinuity recognition task rather than newly invented. On the detection side, three existing modules—Efficient Channel Attention (ECA), Bidirectional Feature Pyramid Network (BiFPN), and SIoU loss—are integrated into the YOLOv8 backbone to address, respectively, the weak texture contrast, large-scale span, and extreme aspect-ratio morphology that characterize slope discontinuity targets but are not the primary scenarios these modules were originally designed for. On the point cloud side, the mature PCA–DBSCAN–RANSAC cascade is retained as the geometric backbone for parameter extraction but is reorganized to operate within ROI volumes rather than over the full point cloud. The integrated framework provides an automated pipeline from heterogeneous remote sensing data to engineering-usable structural parameters. 
The resulting outputs are organized in a schema compatible with stereonet-based kinematic tools and discrete-element software such as 3DEC and UDEC, where they serve as geometric inputs to downstream stability evaluation rather than as a substitute for site-specific mechanical calibration.

2. Design of an Intelligent Recognition System Integrating Improved YOLOv8 and Point Cloud Segmentation

2.1. Overall System Framework and Engineering Closed-Loop Design

The intelligent recognition system for slope discontinuities constructed in this paper follows the engineering closed-loop concept of “perception–cognition–decision”. It integrates multi-source data acquisition, improved object detection, fine point cloud segmentation, and automated parameter extraction into a complete processing pipeline. The overall system architecture is shown in Figure 1, comprising four core functional layers. The data acquisition layer collaboratively obtains high-resolution imagery and dense point clouds through UAV close-range photogrammetry and terrestrial laser scanning. The detection analysis layer uses the improved YOLOv8 network to rapidly locate discontinuity targets in two-dimensional imagery and output detection bounding boxes. The point cloud processing layer uses detection results as guidance to complete oriented segmentation and geometric parameter calculation of regions of interest. The information output layer integrates key parameters such as orientation, spacing and trace length into structured geological information reports and generates visualized three-dimensional models. The data flow between the four levels is bridged through precise pixel-to-point-cloud registration, ensuring that visual detection results can effectively guide fine analysis in three-dimensional space. To intuitively illustrate the data transformation at each stage, Figure 1 embeds representative data thumbnails along the processing pipeline. The data acquisition layer displays a real UAV slope image and a raw point cloud rendering of the study site. The detection analysis layer presents an actual detection result with bounding boxes overlaid on the slope photograph. The point cloud processing layer shows a segmented and color-coded point cloud with individual discontinuity instances differentiated. 
The information output layer includes a stereonet projection and a structured parameter table excerpt, visually depicting the progressive transformation from raw heterogeneous data to engineering-usable information throughout the entire workflow.

2.2. Improved YOLOv8 Detection Algorithm Oriented Toward Discontinuity Characteristics

Discontinuity targets in slope imagery are characterized by large-scale variation, irregular geometric morphology and severe background interference, making it difficult for the original YOLOv8 network to meet engineering requirements for detection sensitivity and localization accuracy in such tasks. To address these bottlenecks, this paper makes targeted improvements to YOLOv8 from three dimensions: feature enhancement, multi-scale fusion and loss function. The improved network structure is shown in Figure 2.

2.2.1. Embedding of Lightweight Channel Attention Mechanism

Discontinuities in imagery often appear as linear traces or banded regions, with the differential information between their texture features and surrounding intact rock surfaces primarily distributed in specific channel dimensions. Slope discontinuities typically exhibit low contrast against intact rock, and their diagnostic visual cues—such as subtle shadow lines and micro-textural roughness differences—are sparse and unevenly distributed across feature channels, causing standard convolutional extraction to dilute discriminative channels carrying weak geological texture information. To enhance the selective attention capability of the backbone network for such discriminative channel features, this paper embeds an Efficient Channel Attention (ECA) module after the C2f module. ECA was selected over the widely used SE module specifically because the SE module’s dimensionality reduction operation compresses inter-channel dependencies, which risks discarding the already sparse discriminative information critical for distinguishing discontinuity traces from intact rock surfaces. ECA-Net preserves complete channel information by avoiding this dimensionality reduction operation, employing adaptive one-dimensional convolution to achieve local cross-channel interaction [27]. Its core computation process can be expressed as:

$$\omega = \sigma\left(\mathrm{Conv1D}_{k}\left(\mathrm{GAP}(F)\right)\right) \tag{1}$$

where $F$ denotes the input feature map, $\mathrm{GAP}(\cdot)$ is the global average pooling operation, $\mathrm{Conv1D}_{k}(\cdot)$ is a one-dimensional convolution with kernel size $k$, $\sigma(\cdot)$ is the Sigmoid activation function and $\omega$ is the attention weight vector for each channel. The convolution kernel size $k$ is adaptively determined by the channel count $C$ via $k = \psi(C) = \left| \frac{\log_{2} C}{\gamma} + \frac{b}{\gamma} \right|_{\mathrm{odd}}$, where $\gamma = 2$ and $b = 1$ are tuning parameters and $|\cdot|_{\mathrm{odd}}$ denotes rounding to the nearest odd number. This module introduces only $k$ additional parameters, with negligible computational overhead, yet can significantly enhance the network’s response intensity to weak texture features of discontinuities. The structure of the ECA module is shown in Figure 3.
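The kernel-size rule and the attention weighting of Equation (1) can be illustrated with a minimal pure-Python sketch. The function names `eca_kernel_size` and `eca_weights` are hypothetical, and the odd-rounding convention (truncate, then move even values up to the next odd number) is an assumption borrowed from common ECA implementations rather than a detail stated in this paper.

```python
import math


def eca_kernel_size(channels: int, gamma: int = 2, b: int = 1) -> int:
    """Adaptive 1D kernel size k = |log2(C)/gamma + b/gamma|_odd."""
    t = int(abs(math.log2(channels) / gamma + b / gamma))
    # Assumed convention: truncate, then bump even values up to the next odd.
    return t if t % 2 == 1 else t + 1


def eca_weights(channel_means, kernel):
    """Channel attention weights sigmoid(Conv1D_k(GAP(F))).

    channel_means: output of global average pooling, one value per channel.
    kernel: the k learnable 1D-convolution weights (zero padding keeps length C).
    """
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(channel_means) + [0.0] * pad
    weights = []
    for c in range(len(channel_means)):
        # Local cross-channel interaction over k neighbouring channels.
        z = sum(kernel[j] * padded[c + j] for j in range(k))
        weights.append(1.0 / (1.0 + math.exp(-z)))
    return weights


if __name__ == "__main__":
    for c in (64, 128, 256, 512):
        print(c, eca_kernel_size(c))
```

Under this convention, a 64-channel stage uses a kernel of size 3 and a 256-channel stage a kernel of size 5, so the parameter count added per ECA block stays in the single digits.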

2.2.2. Bidirectional Feature Pyramid Network Multi-Scale Fusion

The PANet path aggregation structure used in the original YOLOv8 inevitably causes the loss of shallow high-resolution information during top-down feature transmission, which adversely affects the detection of small-scale discontinuity targets. Given that discontinuities in slope imagery span from large fault planes occupying hundreds of pixels to hairline fractures of only a few pixels in width, the feature pyramid network must simultaneously preserve high-resolution spatial details and transmit deep semantic information with minimal degradation—a requirement that the unidirectional information flow of PANet fails to adequately fulfill. Drawing on the design concept of the BiFPN bidirectional feature pyramid network [28], this paper adds same-level cross-node direct connection paths to the original top-down and bottom-up dual paths, allowing the original features extracted by the backbone network to directly participate in the fusion process of detection feature maps. BiFPN employs a weighted feature fusion strategy, and its fusion process can be expressed as:

$$O = \sum_{i} \frac{w_{i}}{\epsilon + \sum_{j} w_{j}} \cdot I_{i} \tag{2}$$

where $I_{i}$ denotes the $i$-th input feature map, $w_{i}$ is the corresponding learnable fusion weight ($w_{i} \geq 0$), $\epsilon$ is a small constant that prevents numerical instability (set to $10^{-4}$) and $O$ is the output feature map after fusion. The value $\epsilon = 10^{-4}$ is fixed across all training and inference stages of every experiment reported in this paper, and the learnable fusion weights $w_{i}$ are uniformly initialized to 1.0 and jointly optimized with the rest of the network parameters via standard backpropagation under the same training schedule as the detection backbone. Compared to traditional simple addition or concatenation, this fast normalized fusion mechanism enables the network to adaptively learn the relative importance of features at different scales: it can allocate greater weight to fine-grained shallow features when detecting small joints while emphasizing deep semantic features for large fault planes, all within a single forward pass. The improved neck network structure is shown in Figure 4.
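As a minimal sketch of the fast normalized fusion in Equation (2), the following pure-Python function fuses same-length feature vectors. The name `fast_normalized_fusion` and the flat-list representation of feature maps are illustrative assumptions; clamping the raw weights at zero is one common way to enforce the non-negativity constraint and is not claimed to be the exact mechanism used in this paper.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """O = sum_i w_i / (eps + sum_j w_j) * I_i for same-shaped feature vectors."""
    if len(features) != len(weights):
        raise ValueError("one weight per input feature map is required")
    # Clamp so that w_i >= 0 (equivalent to a ReLU on the raw learnable weights).
    w = [max(0.0, wi) for wi in weights]
    denom = eps + sum(w)
    size = len(features[0])
    return [sum(wi * f[n] for wi, f in zip(w, features)) / denom
            for n in range(size)]


if __name__ == "__main__":
    shallow = [2.0, 0.0]   # stand-in for a high-resolution feature map
    deep = [4.0, 8.0]      # stand-in for a semantically rich feature map
    print(fast_normalized_fusion([shallow, deep], [1.0, 1.0]))
    print(fast_normalized_fusion([shallow, deep], [3.0, 1.0]))
```

With both weights at their initial value of 1.0 the output is close to a plain average; raising the first weight pulls the fused result toward the shallow features, which is the behaviour the text describes for small joints.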

2.2.3. Loss Function Optimization Design

Slope discontinuities present a distinctive morphological challenge for bounding box regression: they frequently manifest as elongated, band-like targets with extreme aspect ratios (width-to-height ratios reaching 1:10 or beyond), and their boundaries are inherently diffuse due to gradual transitions between the discontinuity trace and the surrounding weathered rock surface. Under these conditions, the original CIoU loss function—which primarily penalizes center distance and aspect ratio deviation—tends to generate oscillatory gradients during training, as small angular misalignments between the predicted and ground truth boxes produce disproportionately large regression errors for high-aspect-ratio targets. To address this issue, this paper replaces the bounding box regression loss with the SIoU loss function [29]. SIoU introduces an angular cost between the ground truth box and the predicted box on the basis of IoU, aligning the regression direction with the true direction of the target and effectively reducing the generation of invalid gradients during training. This angular awareness is particularly beneficial for discontinuity targets, as it encourages the regression to first align the predicted box orientation with the predominant strike direction of the discontinuity before refining the box dimensions. The overall loss function is composed of a weighted combination of three components:

$$L = \lambda_{1} L_{\mathrm{SIoU}} + \lambda_{2} L_{\mathrm{cls}} + \lambda_{3} L_{\mathrm{dfl}} \tag{3}$$

where $L_{\mathrm{SIoU}}$ is the SIoU bounding box regression loss; $L_{\mathrm{cls}}$ is the classification loss based on binary cross-entropy; $L_{\mathrm{dfl}}$ is the Distribution Focal Loss; and $\lambda_{1}$, $\lambda_{2}$ and $\lambda_{3}$ are the weight coefficients for each loss term.

2.3. Detection-Guided Point Cloud Segmentation and Parameter Extraction Algorithm

The core rationale of the proposed system lies in leveraging visual detection outputs as semantic priors to constrain the three-dimensional point cloud analysis space, thereby transforming the computationally intractable global blind-search segmentation problem into a localized oriented analysis problem within bounded regions of interest. This section presents the complete technical pathway from detection bounding box back-projection to final engineering parameter output as a unified framework.

2.3.1. Image-to-Point-Cloud Registration and Accuracy Quantification

Before describing how detection results constrain the point cloud analysis space, the registration accuracy between the 2D image domain and the 3D point cloud domain is quantified, since any ROI back-projection inherits the registration residual as a fundamental error floor. The registration pipeline consists of two stages. In the first stage, SIFT features are extracted from UAV images and matched against the rendered intensity image of the TLS point cloud, providing an initial set of 2D–3D correspondences for camera pose estimation. In the second stage, the point cloud reconstructed from UAV photogrammetry is fine-aligned to the TLS reference point cloud via ICP iterations. The iterations terminate when the mean point-to-plane residual stabilizes below the convergence threshold of 0.5 mm change between successive iterations.

The registration accuracy was evaluated on both study sites using independent checkpoint correspondences and per-point residual statistics. For Scene A, the SIFT-based 2D–3D matching yielded 1420 ± 230 inlier correspondences (mean ± standard deviation across 20 randomized RANSAC trials), with an inlier ratio of 71.4% after MAGSAC++ geometric verification; the final ICP convergence delivered a global root-mean-square point-to-plane residual of 4.8 mm, with 92.1% of point residuals falling below the 8 mm threshold corresponding to one-half of the average TLS point spacing. For Scene B, where the more regular sandstone–mudstone bedding planes provide cleaner geometric primitives for matching, 1680 ± 210 inlier correspondences were obtained with an inlier ratio of 78.9%; the ICP residual was 3.6 mm and 95.4% of residuals were below 8 mm. The complete registration accuracy statistics for both scenes are summarized in Table 1.

The spatial distribution of registration residuals is shown in Figure 5, where residual magnitudes are encoded as a color map overlaid on the slope surface mesh. In both scenes, the residuals exhibit a clear spatial pattern: the lowest residuals (below 3 mm) are concentrated on planar, well-exposed slope faces with abundant texture features, while the maximum residuals (10–15 mm) occur at the slope crests and along high-curvature edges where occlusion-induced point cloud sparsity reduces local matching support. Notably, no systematic directional bias is observed in the residual field, confirming that the ICP convergence has reached a geometrically isotropic optimum rather than being trapped in a one-sided local minimum.

These registration accuracy values are propagated into the downstream ROI analysis. For the data acquisition geometries used in this study (shooting distances of 8–15 m, equivalent focal length 35 mm), a 4.8 mm 3D registration residual translates to approximately 5–15 mm of spatial offset at the ROI boundary in the point cloud space. This is consistent with the empirical detection box localization offsets reported in Section 4.3 and provides the quantitative basis for the error propagation analysis in that section.

2.3.2. ROI-Constrained Segmentation and Parameter Extraction

Transmitting visual detection results into three-dimensional point cloud space presupposes precise registration between the image and the point cloud, for which this paper adopts a fusion registration strategy based on SIFT feature point matching and the iterative closest point (ICP) algorithm [30]. The two-dimensional detection box output by YOLOv8 is back-projected into the three-dimensional point cloud coordinate system through the camera’s intrinsic and extrinsic parameter matrices, thereby delineating the point cloud region of interest (ROI) to be analyzed. This ROI-constrained approach confines the subsequent segmentation computation to the bounded three-dimensional volumes defined by the back-projected detection boxes rather than processing the entire point cloud of tens of millions of points, typically reducing the computational scale by one to two orders of magnitude.
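One way to realize the ROI delineation is to project every point into the image with the registered camera model and keep the points that land inside the detection box, which selects the same viewing frustum as back-projecting the box. The sketch below assumes an ideal pinhole camera without lens distortion or occlusion handling; the function names, the `max_depth` cut-off and the parameter layout are hypothetical and not taken from the paper.

```python
def project_point(point, R, t, fx, fy, cx, cy):
    """Project a world point to pixel coordinates with a pinhole model.

    R (3x3 nested lists) and t (length 3) map world to camera coordinates.
    Returns (u, v, depth); u and v are None when the point is behind the camera.
    """
    xc, yc, zc = (sum(R[r][c] * point[c] for c in range(3)) + t[r]
                  for r in range(3))
    if zc <= 0:
        return None, None, zc
    return fx * xc / zc + cx, fy * yc / zc + cy, zc


def select_roi_points(points, box, R, t, fx, fy, cx, cy, max_depth=float("inf")):
    """Indices of points projecting inside box = (u_min, v_min, u_max, v_max)."""
    u_min, v_min, u_max, v_max = box
    kept = []
    for idx, p in enumerate(points):
        u, v, depth = project_point(p, R, t, fx, fy, cx, cy)
        if u is None or depth > max_depth:
            continue  # behind the camera or beyond the assumed depth limit
        if u_min <= u <= u_max and v_min <= v <= v_max:
            kept.append(idx)
    return kept


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    cloud = [(0.0, 0.0, 10.0), (1.0, 0.0, 10.0), (0.0, 0.0, -5.0), (0.2, 0.3, 10.0)]
    roi = select_roi_points(cloud, (900, 500, 1000, 600), identity, [0.0, 0.0, 0.0],
                            fx=1000.0, fy=1000.0, cx=960.0, cy=540.0)
    print(roi)
```

Because the test is a per-point projection followed by a box comparison, the cost grows linearly with the number of points, and only the retained subset is passed on to normal estimation and clustering.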

Within each established ROI, principal component analysis (PCA) is used to estimate the local normal vector of each point [31]. For point $p_{i}$ and its $k$-nearest neighbor point set $N(p_{i})$, the covariance matrix is constructed and its eigenvalue decomposition solved; the eigenvector corresponding to the minimum eigenvalue is the normal vector estimate $\hat{n}_{i}$. Following normal vector estimation, the DBSCAN density clustering algorithm [32] is used to group the point cloud in the normal vector space, assigning points with similar normal vector directions to the same structural plane candidate set. Within each candidate set, the RANSAC algorithm is used for plane fitting to eliminate non-planar points and obtain the best fitting plane equation for each structural surface [33]. In this two-stage cascade, DBSCAN provides topologically coherent initial partitions without requiring a priori knowledge of the number of discontinuity sets, while RANSAC refines each partition by enforcing geometric planarity constraints. The detailed analysis of why this ROI-constrained architecture simultaneously improves both efficiency and accuracy is presented in Section 4.1. The point cloud segmentation process is shown in Figure 6.
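The final stage of the cascade, RANSAC plane fitting within one candidate set, can be sketched in pure Python as follows. The PCA and DBSCAN stages are omitted here, and the distance threshold, iteration count and fixed random seed are illustrative assumptions rather than the settings used in the experiments.

```python
import math
import random


def plane_from_points(p1, p2, p3):
    """Unit normal n and offset d of the plane n.x + d = 0 through three points.

    Returns None when the points are (numerically) collinear.
    """
    u = [p2[i] - p1[i] for i in range(3)]
    v = [p3[i] - p1[i] for i in range(3)]
    n = [u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0]]
    norm = math.sqrt(sum(c * c for c in n))
    if norm < 1e-12:
        return None
    n = [c / norm for c in n]
    d = -sum(n[i] * p1[i] for i in range(3))
    return n, d


def ransac_plane(points, threshold, iterations=200, seed=0):
    """Fit one plane by RANSAC; returns (normal, d, inlier_indices)."""
    rng = random.Random(seed)
    best = (None, None, [])
    for _ in range(iterations):
        model = plane_from_points(*rng.sample(points, 3))
        if model is None:
            continue  # degenerate sample, draw again
        n, d = model
        # Inliers are points whose distance to the candidate plane is small.
        inliers = [i for i, p in enumerate(points)
                   if abs(n[0] * p[0] + n[1] * p[1] + n[2] * p[2] + d) <= threshold]
        if len(inliers) > len(best[2]):
            best = (n, d, inliers)
    return best


if __name__ == "__main__":
    grid = [(float(x), float(y), 0.0) for x in range(5) for y in range(5)]
    noise = [(0.5, 0.5, 1.0), (1.5, 2.5, -2.0), (3.1, 0.2, 0.7)]
    normal, offset, inliers = ransac_plane(grid + noise, threshold=0.01)
    print(normal, offset, len(inliers))
```

The points rejected as outliers correspond to the non-planar points that the text describes as being eliminated before the plane equation is reported.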

For each fitted plane, the structural plane orientation, the most fundamental geometric parameter in rock engineering stability analysis, is derived from the plane normal vector $\hat{n} = (n_{x}, n_{y}, n_{z})$ obtained through RANSAC fitting. The dip direction $\alpha$ and dip angle $\beta$ are calculated according to the following formulas [34]:

$$\alpha = \arctan\left(\frac{n_{x}}{n_{y}}\right), \quad \beta = \arccos\left(|n_{z}|\right) \tag{4}$$

where $n_{x}$, $n_{y}$, $n_{z}$ are the components of the normal vector in the east, north and vertical directions, respectively. When $n_{y} < 0$, 180° must be added to $\alpha$ for quadrant correction; when $n_{y} > 0$ and $n_{x} < 0$, the arctangent is negative and 360° is added so that $\alpha$ lies in [0°, 360°). Spacing is obtained by calculating the normal distance between adjacent parallel discontinuities within the same group, and trace length is determined based on the maximum extension range of the projected discontinuity point cloud onto the fitting plane. The calculation strategies for each parameter are summarized in Table 2.
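The orientation calculation of Equation (4) can be sketched with a two-argument arctangent, which resolves the quadrant directly instead of applying manual corrections. The function name `normal_to_orientation` is hypothetical, and flipping the normal to point upward before computing the dip direction is an assumption added here so that the horizontal projection of the normal points down-dip regardless of the sign returned by plane fitting.

```python
import math


def normal_to_orientation(nx, ny, nz):
    """Return (dip_direction_deg, dip_angle_deg) from a normal (east, north, up)."""
    norm = math.sqrt(nx * nx + ny * ny + nz * nz)
    if norm == 0.0:
        raise ValueError("zero-length normal")
    nx, ny, nz = nx / norm, ny / norm, nz / norm
    # Assumption: use the upward-pointing normal so that its horizontal
    # projection points in the dip direction.
    if nz < 0:
        nx, ny, nz = -nx, -ny, -nz
    dip_angle = math.degrees(math.acos(min(1.0, abs(nz))))
    # atan2(east, north) is the azimuth clockwise from north in all quadrants.
    dip_direction = math.degrees(math.atan2(nx, ny)) % 360.0
    return dip_direction, dip_angle


if __name__ == "__main__":
    for normal in [(1.0, 0.0, 1.0), (0.0, -1.0, 1.0), (-1.0, 0.0, 1.0)]:
        print(normal, normal_to_orientation(*normal))
```

For a horizontal plane the dip direction is undefined; this sketch returns 0° in that case, which downstream grouping should treat as a placeholder rather than a measured azimuth.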

2.4. Multi-Source Data Fusion and Rock Mass Structure Analysis

All discontinuity orientations and geometric parameters obtained through the above workflow are uniformly organized into a structured database. The system performs discontinuity grouping statistical analysis on the basis of this database, presenting the dominant orientation directions of each discontinuity group in the form of stereonet projections to provide input for kinematic analysis [35]. On this basis, combined with block theory, key removable blocks formed by the intersection of multiple discontinuity sets are identified, and their stability is preliminarily assessed using the limit equilibrium method [36]. Furthermore, the structured parameter database—containing the dip direction, dip angle, spacing, trace length, aperture and roughness for each discontinuity instance—is formatted in a standardized tabular schema compatible with mainstream geotechnical numerical simulation software such as 3DEC and UDEC. This schema reduces the manual parameter-entry burden in the transition from automated recognition to numerical stability modelling, although project-specific mechanical calibration of the imported parameters remains a prerequisite before they can support load-bearing stability decisions. The final output of the system includes: an accurate geometric parameter table for each structural plane instance, a three-dimensional visualization point cloud model with orientation annotation, stereonet projections with kinematic admissibility assessment and a preliminary stability evaluation report for key blocks. The key links and data flow relationship of the entire processing flow are shown in Figure 7.

The system described in this section organically integrates the two technical routes of visual object detection and point cloud spatial analysis through a registration mapping mechanism, establishing a cross-modal collaborative process from fast positioning of two-dimensional images to fine parameter extraction of three-dimensional point clouds. The ECA attention module and BiFPN multi-scale fusion path embedded in the improved YOLOv8 have been specifically adapted to the geological characteristics of structural plane targets, while the detection box guided ROI directional segmentation strategy effectively reduces the computational scale of point cloud processing and improves segmentation accuracy. All aspects of the entire pipeline are ultimately guided by the output of engineering-usable parameters, ensuring the applicability and practicality of the system in actual slope investigation scenarios.

3. Experimental Design and Results Analysis

3.1. Experimental Data and Environment Configuration

3.1.1. Study Sites, Data Acquisition, and Dataset Construction

This paper selects the waste disposal site slope of the Majitang–Anhua Expressway in Hunan Province (referred to as Scene A) and the CP2 road cut slope in the Chuiquan Service Area (referred to as Scene B) as the study sites. Scene A is a fragmented limestone slope about 45 m high in which three dominant sets of structural planes are developed; exposure is good overall, but the slope is locally covered by weathered debris. Scene B is a layered slope of interbedded sandstone and mudstone about 32 m high; its structural surface morphology is relatively regular but is significantly obscured by vegetation. Representative field photographs of both study sites are presented in Figure 8, showing the overall slope morphology, exposed discontinuity traces, and typical structural surface distribution patterns. Imagery was collected with a DJI Matrice 300 RTK (DJI, Shenzhen, China) equipped with a Zenmuse P1 camera (DJI, Shenzhen, China) for multi-angle close-range photogrammetry, with the ground resolution controlled within 2 mm. In parallel, a Leica RTC360 terrestrial laser scanner (Leica Geosystems, Heerbrugg, Switzerland) was used to obtain reference point cloud data, with a scanning point spacing of about 3 mm.

For the image dataset, a total of 3340 images containing structural plane targets were initially extracted from the two scenes. To ensure annotation reliability, three engineers with more than five years of geological logging experience independently annotated the entire dataset. Inter-annotator agreement was first quantified on the bounding box presence task using Cohen’s kappa, which yielded an averaged pairwise value of 0.84 (interpreted as “almost perfect agreement” under the standard Landis–Koch convention), and on the bounding box localization task using pairwise mean IoU, which reached 0.82. For each image, the three sets of annotations were merged through a structured consensus protocol. Bounding boxes with three-way IoU exceeding 0.7 were directly accepted, constituting 83.2% of all annotations. Boxes with two-way agreement and IoU exceeding 0.6 between any two annotators were retained after coordinate averaging (14.6%). The remaining 2.7% of images, for which no two annotators reached the agreement threshold, were excluded from the dataset to avoid introducing labeling noise into the training process. After this resolution step, the final dataset consists of 3240 images, partitioned into training, validation and test subsets in a 7:2:1 ratio with no image shared between the partitions. For the point cloud data, Scene A contains approximately 12.8 million points, while Scene B contains approximately 8.6 million points. The experimental platform is a workstation equipped with an NVIDIA RTX 4090 GPU (24 GB of video memory), running Ubuntu 20.04 with PyTorch 2.1.0 and CUDA 12.1. 
It should be noted at the outset that all in-distribution accuracy values reported in Section 3.2, Section 3.3 and Section 3.4 reflect performance under the two-site acquisition envelope examined in this study; the cross-site behavior relevant to broader engineering deployment is examined separately in Section 3.1.2 below, and the resulting applicability boundary is summarized again at the end of Section 3.4 and in the Conclusions.
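The consensus protocol described above can be sketched in a few lines. This is a minimal illustration rather than the authors' annotation tooling: it assumes axis-aligned boxes in `(x1, y1, x2, y2)` form, it reads "three-way IoU" as every annotator pair exceeding the threshold, and it averages coordinates in the three-way case as well. The function names `iou`, `mean_box` and `consensus` are hypothetical.

```python
from itertools import combinations

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def mean_box(boxes):
    """Coordinate-wise average of several boxes."""
    n = len(boxes)
    return tuple(sum(b[i] for b in boxes) / n for i in range(4))

def consensus(boxes, t3=0.7, t2=0.6):
    """Merge three annotators' boxes for one target. Returns (status, box)."""
    pairs = list(combinations(range(3), 2))
    ious = {p: iou(boxes[p[0]], boxes[p[1]]) for p in pairs}
    if all(v > t3 for v in ious.values()):
        # Assumption: the accepted box is the average of all three.
        return "three_way", mean_box(boxes)
    best = max(pairs, key=lambda p: ious[p])
    if ious[best] > t2:
        return "two_way", mean_box([boxes[best[0]], boxes[best[1]]])
    return "rejected", None
```

Targets returned as `"rejected"` correspond to the cases that the protocol excludes from the dataset.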

3.1.2. Cross-Site Generalization Test Setup

Although the random 7:2:1 partition described above ensures that no individual image is shared between training and testing, all three partitions still draw from the same two physical scenes, which limits the strict generalization evaluation. To complement this in-distribution evaluation and explicitly address the cross-site generalization concern raised in the peer review, two cross-site experiments were additionally conducted on top of the standard mixed-scene benchmark. In Experiment A → B, the detection network was trained exclusively on images from Scene A (limestone, fragmented morphology) and tested on the held-out images from Scene B (sandstone–mudstone, layered morphology). In Experiment B → A, the roles were reversed. Within-site control experiments were also performed, in which each scene’s training and test partitions came from the same scene with a 7:1 split. These four configurations together with the original mixed-scene baseline are summarized in Table 3.

Within-site performance is essentially indistinguishable from the mixed-scene baseline (from −0.5 to +0.3 percentage points), confirming that the random 7:2:1 partition does not artificially inflate accuracy via memorization of scene-specific patterns. In contrast, both cross-site configurations show a substantial accuracy drop: 8.7 percentage points for A → B and 12.9 percentage points for B → A. The asymmetry between the two cross-site directions—with B → A producing a larger drop—is consistent with the geological heterogeneity of the two scenes: Scene A contains three intersecting discontinuity sets producing more visually complex foreground patterns, so a network trained only on the more regular Scene B fails to capture the fragmented texture variability of Scene A. The cross-site drops of 8–13 percentage points fall within the range typically reported for CNN cross-domain transfer in geological imagery. They explicitly delineate the applicable boundary of the proposed method: in-distribution deployment to scenes with similar lithology and weathering style maintains the headline 89.4% mAP, whereas cross-lithology deployment without fine-tuning would degrade to the 76–81% range. A small-sample fine-tuning step on a few representative images from the target site is therefore recommended when transferring the method to a new lithology family in routine practice.

3.2. Improved YOLOv8 Detection Performance Evaluation

3.2.1. Detection Accuracy Comparative Experiment

To comprehensively evaluate the detection performance of the improved YOLOv8, this paper conducted comparative experiments on the same test set with four representative models: original YOLOv8, YOLOv5s, Faster R-CNN and RT-DETR. All models were trained on the same dataset with the same training strategy: an initial learning rate of 0.01, a batch size of 16 and 300 training epochs. The comparison results are shown in Table 4. The proposed method achieves a mAP@0.5 of 89.4%, an improvement of 3.8 percentage points over the original YOLOv8, and 11.1 and 6.7 percentage points higher than Faster R-CNN and YOLOv5s, respectively. On the more stringent mAP@0.5:0.95 metric, the advantage of the proposed method is even more pronounced, reaching 67.2%, an improvement of 4.9 percentage points over the original YOLOv8. The model parameter count increases by only approximately 0.86 M, and the inference speed is 71.8 FPS, still meeting real-time requirements. Faster R-CNN shows a relatively weak balance between accuracy and speed due to its two-stage architecture. RT-DETR employs a Transformer-based global self-attention mechanism with 32.01 M parameters, nearly three times that of the proposed method, yet achieves a mAP@0.5 of only 84.1%, which is 5.3 percentage points lower. This performance gap can be attributed to the nature of the discontinuity detection task. Slope discontinuity targets are characterized by localized weak texture cues and elongated morphology. The Transformer architecture excels at modelling global contextual dependencies yet provides limited advantage when the critical discriminative information resides in fine-grained local channel responses. By contrast, the proposed CNN-based architecture with targeted ECA channel attention is more effective at amplifying these sparse, localized texture signals. It therefore achieves a better accuracy–efficiency balance on the relatively modest-scale geological dataset used in this study.

To move beyond aggregate metric comparisons and understand the conditions under which each model succeeds or fails, a diagnostic analysis was conducted on three geologically challenging scenarios: shadow-occluded zones, vegetation-edge interference zones, and heavily weathered debris-covered zones. The Precision–Recall curve comparison for each model is shown in Figure 9, and the representative detection results in these challenging scenarios are presented in Figure 10.
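For readers less familiar with how a Precision–Recall curve is condensed into the average precision that underlies mAP@0.5, the following sketch may help. It assumes the all-point interpolation convention; the source does not state which AP convention was used, so this is an illustration rather than the evaluation code behind Table 4. The function name `average_precision` is hypothetical and `n_gt` is assumed to be positive.

```python
def average_precision(scores, is_tp, n_gt):
    """All-point interpolated AP for one class at a fixed IoU threshold.

    scores: detection confidences.
    is_tp:  whether each detection matched a ground-truth box (e.g. IoU >= 0.5).
    n_gt:   number of ground-truth boxes (assumed > 0).
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / n_gt)
        precisions.append(tp / (tp + fp))
    # Precision envelope: make precision non-increasing as recall grows.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Integrate the envelope over recall.
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_r) * p
        prev_r = r
    return ap
```

mAP@0.5 is then the mean of this quantity over classes, with matches decided at an IoU threshold of 0.5.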

In shadow-occluded zones, the original YOLOv8 produced 2 missed detections and 1 low-confidence false detection. The reduced illumination suppresses the already weak texture contrast between discontinuity traces and surrounding rock. The proposed method benefits from ECA’s channel-wise re-weighting, which selectively amplifies the residual texture signals in shadow-affected channels while suppressing background noise channels. As a result, it recovers the two missed detections and elevates the confidence of all detections above 0.85. In vegetation-edge zones, the original YOLOv8 misidentified tree root shadows as discontinuity traces—a failure attributable to the similar linear morphology of root shadows and genuine geological traces at the P3 feature scale. The BiFPN’s cross-scale direct connections enable the network to simultaneously access both the high-resolution spatial context from P3, which reveals the irregular branching pattern of roots, and the semantic information from P5, which distinguishes geological from biological features. This dual access resolves the ambiguity. In weathered debris-covered zones, both models face performance degradation. Even so, the proposed method still achieves a 0.78 confidence detection where the original YOLOv8 fails entirely, suggesting that the angle-aware SIoU regression enables more robust localization of partially obscured elongated targets. These observations collectively confirm that each improvement module addresses a specific geological challenge rather than providing redundant general-purpose enhancement.
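The channel-wise re-weighting attributed to ECA above follows a simple pattern: global average pooling per channel, a small 1D convolution across neighbouring channels, and a sigmoid gate. The sketch below shows that pattern in plain Python on nested lists. It is an assumption-laden illustration, not the network code used in this study: the kernel weights are learned in practice and are passed in here by hand, and the function name `eca` is hypothetical.

```python
import math

def eca(feature, kernel):
    """Efficient-channel-attention style gating.

    feature: list of C channels, each an H x W nested list of floats.
    kernel:  odd-length list of 1D convolution weights (learned in practice).
    Returns (gates, re-weighted feature).
    """
    c = len(feature)
    # 1) Squeeze: global average pooling gives one descriptor per channel.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature]
    # 2) Local cross-channel interaction: 1D convolution with zero padding.
    half = len(kernel) // 2
    mixed = []
    for i in range(c):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < c:
                acc += w * pooled[idx]
        mixed.append(acc)
    # 3) Sigmoid gate, then channel-wise re-weighting of the feature map.
    gates = [1.0 / (1.0 + math.exp(-m)) for m in mixed]
    out = [[[g * v for v in row] for row in ch] for g, ch in zip(gates, feature)]
    return gates, out
```

Channels whose pooled response is strong receive gates closer to 1, which is the mechanism by which residual texture signals can be amplified relative to background channels.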

3.2.2. Ablation Study

To quantify the independent contribution of each improvement module, this paper designs an ablation study using the original YOLOv8s as the baseline, progressively adding three improvements: the ECA attention module, the BiFPN multi-scale fusion structure, and the SIoU loss function. The experimental results are shown in Table 5. As can be seen from Table 5, the BiFPN module contributes the largest improvement to mAP@0.5 (+2.2 percentage points), indicating that the optimization of the multi-scale feature fusion path plays a critical role in capturing discontinuity targets of different scales. The ECA attention module contributes a 1.6 percentage point improvement while introducing only approximately 0.08 M additional parameters, showing that a lightweight module can still deliver a meaningful gain. The SIoU loss function contributes 0.8 percentage points when used independently and still provides additional gains when combined with the first two improvements, demonstrating the complementary value of angle-aware regression constraints for the localization of irregularly shaped discontinuity targets. The combined effect of the three improvements (+3.8 percentage points) is lower than the sum of their individual contributions (1.6 + 2.2 + 0.8 = 4.6 percentage points), indicating that the modules partly overlap in the errors they correct; the overall combined gain is nevertheless significant.

To present the progressive accuracy improvement and parameter cost of each module addition more intuitively, the ablation results are visualized as a waterfall chart with dual axes, as shown in Figure 11. The left vertical axis represents mAP@0.5 (%), displayed as incremental bars showing the stepwise accuracy gain from each module (Baseline 85.6% → ECA, +1.6 → BiFPN, +2.2 → SIoU, +0.8 → Final 89.4%), while the right vertical axis represents parameter count (M), displayed as a line plot. The chart shows that the detection accuracy climbs steadily with each module addition, whereas the parameter count remains nearly flat, increasing from 11.17 M to only 12.03 M across all improvements. The proposed modifications therefore achieve substantial performance gains at minimal computational overhead.

3.3. Point Cloud Segmentation and Parameter Extraction Accuracy Verification

3.3.1. Point Cloud Segmentation Accuracy Evaluation

Point cloud segmentation accuracy is assessed by comparing the discontinuity instances output by the algorithm point-by-point against manually interpreted results. To validate the effectiveness of the detection-guided ROI constrained segmentation strategy, the proposed method is compared against three baseline approaches: (a) RANSAC-only segmentation applied to the full point cloud without any detection guidance [11]; (b) region growing segmentation using normal vector similarity as the growth criterion [12]; and (c) unguided DBSCAN + RANSAC cascade segmentation applied to the full point cloud without ROI constraints (i.e., the same algorithmic pipeline as the proposed method but without the detection-guided ROI dimensionality reduction). Evaluation metrics include Intersection over Union (IoU), Precision and Recall. The comparative segmentation results for Scene A and Scene B are shown in Table 6.
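The point-by-point comparison can be expressed compactly if each discontinuity instance is represented as a set of point indices. The following is a minimal sketch under that assumption; how predicted instances are matched to manually interpreted ones is not described here and is left out. The function name `segmentation_scores` is hypothetical.

```python
def segmentation_scores(pred_ids, true_ids):
    """Point-wise IoU, precision and recall for one discontinuity instance.

    pred_ids: indices of points assigned to the instance by the algorithm.
    true_ids: indices of points in the manually interpreted instance.
    """
    pred, true = set(pred_ids), set(true_ids)
    tp = len(pred & true)       # points labelled correctly
    union = len(pred | true)
    return {
        "iou": tp / union if union else 0.0,
        "precision": tp / len(pred) if pred else 0.0,
        "recall": tp / len(true) if true else 0.0,
    }
```

Averaging the `iou` entry over all matched instances in a scene gives an average IoU of the kind reported in Table 6.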

As shown in Table 6, the proposed detection-guided method consistently outperforms all three baseline approaches across both scenes. RANSAC-only segmentation produces the lowest accuracy, with excessive false planes (58 identified faces versus 47 ground truth in Scene A) due to its random sampling nature operating on the unconstrained full point cloud. Region growing achieves moderate improvement but remains sensitive to seed point selection and suffers from over-segmentation in high-curvature areas. The unguided DBSCAN + RANSAC cascade, which employs the same algorithmic pipeline as the proposed method but without ROI constraints, achieves 74.8% and 79.2% average IoU in Scenes A and B, respectively—substantially lower than the 82.6% and 86.3% achieved by the proposed method. This 7.1–7.8 percentage point improvement directly attributable to ROI-constrained dimensionality reduction confirms that the detection-guided strategy is the primary contributor to segmentation quality enhancement, rather than the specific choice of clustering or fitting algorithms. From a theoretical perspective, the ROI constraint effectively reduces the entropy of the point cloud search space: by excluding irrelevant background rock mass and adjacent discontinuity surfaces from the analysis volume, the normal vector distribution within each ROI becomes more homogeneous and the signal-to-noise ratio for plane fitting improves substantially, which explains why the same DBSCAN + RANSAC algorithmic pipeline yields markedly different results with and without ROI guidance. In the more geometrically regular Scene B, segmentation accuracy is overall better across all methods compared to the more highly fractured Scene A. The four discontinuities not correctly identified by the proposed method in Scene A are mainly distributed in high-curvature areas at slope surface inflection points, where point cloud density is relatively low and normal vector variation is intense.
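The effect of restricting plane fitting to an ROI can be illustrated with a small sketch: crop the cloud to a bounded volume, then run RANSAC only on what remains. This is a simplified stand-in, not the PCA–DBSCAN–RANSAC cascade used in the paper. The ROI is assumed to be an axis-aligned box, the distance threshold is in the same units as the coordinates, at least three points are assumed to survive the crop, and the names `crop_to_roi`, `plane_from_points` and `ransac_plane` are hypothetical.

```python
import random

def crop_to_roi(points, lo, hi):
    """Keep points inside the axis-aligned box [lo, hi] (a simplified ROI)."""
    return [p for p in points if all(lo[k] <= p[k] <= hi[k] for k in range(3))]

def plane_from_points(a, b, c):
    """Unit normal n and offset d of the plane n.x + d = 0 through three points."""
    u = [b[k] - a[k] for k in range(3)]
    v = [c[k] - a[k] for k in range(3)]
    n = [u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0]]
    norm = sum(x * x for x in n) ** 0.5
    if norm < 1e-12:
        return None  # the three samples are collinear
    n = [x / norm for x in n]
    return n, -sum(n[k] * a[k] for k in range(3))

def ransac_plane(points, threshold=0.01, iterations=200, seed=0):
    """Return the plane with the most inliers among random 3-point samples."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iterations):
        model = plane_from_points(*rng.sample(points, 3))
        if model is None:
            continue
        n, d = model
        inliers = [p for p in points
                   if abs(n[0] * p[0] + n[1] * p[1] + n[2] * p[2] + d) <= threshold]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return best_model, best_inliers
```

Because points outside the box never enter the sampling pool, the fraction of samples drawn from the target plane rises, which is one way to read the signal-to-noise argument made above.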

To provide a more direct visual comparison across methods, the segmentation accuracy metrics from Table 6 are presented as grouped bar charts, as shown in Figure 12. The figure contains two sub-panels corresponding to Scene A and Scene B, respectively. Within each sub-panel, the four methods are arranged along the horizontal axis, with three bars per method representing Average IoU, Precision and Recall in distinct colors. The visualization highlights that the proposed method achieves the highest values across all three metrics in both scenes. The advantage is particularly pronounced in Precision, which reaches 91.5% and 93.5% in Scenes A and B and substantially exceeds the second-best method (unguided DBSCAN + RANSAC at 76.0% and 82.4%), demonstrating the effectiveness of ROI constraints in suppressing false-positive segmentation.

To further validate the necessity of the cross-modal fusion strategy, a controlled ablation experiment was designed to compare three data utilization modes: (a) image-only mode, where improved YOLOv8 detection results are used to count and locate discontinuities but no 3D parameters are extracted; (b) point-cloud-only mode, where the unguided DBSCAN + RANSAC pipeline processes the full point cloud without any image-derived guidance; and (c) the proposed cross-modal fusion mode. The comparison focuses on the final engineering output quality, as shown in Table 7.

As shown in Table 7, the image-only mode achieves rapid detection but cannot provide any three-dimensional geometric parameters, rendering it insufficient for engineering applications. The point-cloud-only mode can extract orientation and spacing parameters but at substantially degraded accuracy (dip angle RMSE of 3.41° versus 2.46°, spacing error of 10.5% versus 6.8%) and requires 48% more processing time due to the full-scale blind-search computation. The proposed cross-modal fusion mode achieves the best balance across all metrics, confirming that the “visual detection guided point cloud analysis” paradigm yields synergistic improvements that neither modality can achieve independently. The 0.95° reduction in dip angle RMSE from point-cloud-only to cross-modal mode is particularly noteworthy: this improvement does not originate from a different fitting algorithm, but from the reduction of geometric interference within the analysis volume. When the full point cloud is processed without ROI constraints, DBSCAN clustering in normal vector space inevitably merges points from adjacent but distinct discontinuity sets that happen to share similar orientations, leading to contaminated plane fitting results. The ROI constraint spatially isolates each discontinuity neighborhood, thereby preserving the geometric purity of each cluster and reducing the systematic bias in normal vector estimation.

Mapping the three modes onto a common processing-time–accuracy plane brings the same trade-off into sharper relief. The image-only mode sits at the speed extreme (1.2 min); however, it provides no 3D output and therefore cannot be placed on the IoU axis at all. The point-cloud-only mode sits at the accuracy–time extreme, with the longest processing time (127.6 min), an intermediate IoU of 74.8% and the highest dip angle RMSE of 3.41°. The proposed cross-modal fusion mode dominates the upper-left region of this trade-off space. It simultaneously delivers a much shorter processing time of 86 min—only 67% of that of the point-cloud-only mode—the highest IoU of 82.6%, which is 7.8 percentage points above the point-cloud-only mode, and the lowest orientation error of 2.46°, an absolute reduction of 0.95° in dip angle RMSE relative to the same baseline. Across all three axes considered together, no single-modal mode achieves a comparable balance, which confirms quantitatively the qualitative argument that cross-modal information transfer is the decisive enabler of the simultaneous improvement of efficiency, segmentation quality, and parameter extraction precision.
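The ratios quoted in this comparison follow directly from the figures reported for the two modes, as the worked arithmetic below shows (values taken from the text above; the rounding is ours):

```latex
\begin{align*}
\frac{t_{\text{fusion}}}{t_{\text{point-cloud-only}}} &= \frac{86}{127.6} \approx 0.67
  && \text{(fusion needs about 67\% of the time)} \\
\frac{t_{\text{point-cloud-only}}}{t_{\text{fusion}}} - 1 &= \frac{127.6}{86} - 1 \approx 0.48
  && \text{(point-cloud-only needs about 48\% more time)} \\
\Delta \text{IoU} &= 82.6\% - 74.8\% = 7.8 \text{ percentage points} \\
\Delta \text{RMSE}_{\text{dip angle}} &= 3.41^\circ - 2.46^\circ = 0.95^\circ
\end{align*}
```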

3.3.2. Accuracy Analysis of Orientation Parameter Extraction

The validation of orientation parameters uses 180 sets of discontinuity orientation data measured by three geological engineers using a compass as the benchmark. The error statistics between the dip direction and dip angle automatically extracted by the proposed method and the measured values are shown in Table 8.

The extraction accuracy of dip angle is significantly better than that of dip direction, which is consistent with conclusions from existing research—the calculation of dip direction is more sensitive to minor deviations in the horizontal components of the normal vector. Among the 180 data points, the proportion of samples with dip direction errors exceeding 5° is 12.2%, and the proportion with dip angle errors exceeding 3° is 8.9%.
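A short sketch makes the relationship between a fitted plane normal and the two orientation parameters explicit. It assumes a coordinate frame of (east, north, up) and treats dip direction as an azimuth, so differences are wrapped to the range from -180° to 180° before the RMSE is computed; the source does not state how wrap-around was handled. The function names are hypothetical.

```python
import math

def normal_to_orientation(n):
    """(dip direction, dip angle) in degrees from a plane normal (east, north, up)."""
    e, nn, u = n
    if u < 0:  # work with the upward-pointing normal
        e, nn, u = -e, -nn, -u
    norm = math.sqrt(e * e + nn * nn + u * u)
    dip = math.degrees(math.acos(min(1.0, u / norm)))
    # The horizontal part of the upward normal points in the dip direction.
    dip_dir = math.degrees(math.atan2(e, nn)) % 360.0
    return dip_dir, dip

def angular_diff(a, b):
    """Signed difference a - b between two azimuths, wrapped to [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0

def rmse(pred, ref, circular=False):
    """Root-mean-square error; set circular=True for dip direction."""
    diffs = [angular_diff(p, r) if circular else p - r for p, r in zip(pred, ref)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))
```

Since dip direction depends only on the ratio of the two horizontal components, a small perturbation of those components shifts it more strongly when the plane is gently inclined, which matches the sensitivity noted above.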

From an engineering application perspective, the practical significance of these error magnitudes must be evaluated in the context of downstream stability analysis rather than treated as abstract statistical metrics. In stereonet-based kinematic analysis, the identification of dominant discontinuity sets relies on the clustering of pole points within orientation space. A dip angle RMSE of 2.46° and dip direction RMSE of 4.15° fall well within the typical ±5° scatter range of individual compass measurements. This means that the automatically extracted orientations would produce statistically indistinguishable pole point clusters on equal-area stereonet projections compared with those derived from manual survey data. More critically, for planar sliding and wedge failure mode identification, the controlling geometric criterion is typically the angular relationship between the discontinuity dip and the slope face angle. This relationship is insensitive to errors below approximately 5° when the angular margin between the discontinuity dip and the slope angle exceeds 10°, as is the case for both study sites. The achieved accuracy level is therefore sufficient to ensure that the dominant failure modes identified through automated analysis remain consistent with those determined by conventional manual geological survey, satisfying the analytical requirements of routine slope engineering investigations. The error distribution histogram is shown in Figure 13.

The extraction accuracy of the spacing and trace length parameters is verified by comparison with manual measurement data. The average relative error of spacing is 6.8%, and the average relative error of trace length is 9.3%. The trace length error is larger because discontinuities that extend to the edge of the slope are not completely captured in the point cloud, so the automatically calculated trace lengths tend to be smaller than the measured values.

Beyond the orientation, spacing, and trace length parameters reported above, two additional geometric attributes of engineering relevance were independently validated: aperture (the normal distance between the two opposing surfaces of an open discontinuity) and surface roughness (quantified here as JRC profile classes). The validation set comprised 36 selected discontinuities in Scene A and 24 in Scene B, where the walls were sufficiently exposed to allow handheld feeler-gauge aperture measurement and Barton comb roughness profiling. For aperture extraction, the proposed automatic method yields a root-mean-square error of 0.94 mm and a mean relative error of 18.3% with respect to the field-measured reference values (which ranged from 1.2 mm to 8.4 mm in the validated set). For surface roughness, the JRC values inferred from the point cloud residual analysis (±RMS of point-to-plane deviations mapped onto the standard JRC profile catalogue) deliver an RMSE of 1.6 JRC units and a mean relative error of 14.7%. The proposed automatic JRC matches the field-classified JRC within ±2 units in 81.9% of cases. These error magnitudes are larger than those of the orientation parameters (4.15° dip direction, 2.46° dip angle) for two physically interpretable reasons. Aperture extraction is fundamentally limited by the local point cloud spacing of 3 mm on both sides of the discontinuity, so apertures below approximately 2–3 mm cannot be resolved. JRC matching from point cloud residuals is intrinsically more uncertain than the original visual matching against the Barton chart, because the small-scale roughness structures (sub-millimeter asperities) that dominate JRC class assignment are partially attenuated by the TLS spot size. The reported 18.3% aperture error and 14.7% roughness error fall within the 10–25% range typically reported in the literature for point-cloud-based extraction of these two parameters. 
They are therefore considered acceptable for routine slope characterization, though they are not yet suitable for laboratory-grade joint mechanical testing input.
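The residual quantity on which the roughness estimate is based, the RMS of point-to-plane deviations, can be computed as in the sketch below. Only the residual is shown: the mapping from this value onto the standard JRC profile catalogue is not specified in the text and is deliberately not reproduced. The function name `point_to_plane_rms` is hypothetical, and the plane is assumed to be given as `normal . x + d = 0`.

```python
def point_to_plane_rms(points, normal, d):
    """RMS of signed point-to-plane distances for the plane normal . x + d = 0."""
    norm = sum(c * c for c in normal) ** 0.5
    residuals = [
        (sum(n * p for n, p in zip(normal, pt)) + d) / norm
        for pt in points
    ]
    return (sum(r * r for r in residuals) / len(residuals)) ** 0.5
```

The result is in the same units as the point coordinates, so it is directly comparable with the 3 mm point spacing that limits the resolvable aperture and roughness scale.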

3.3.3. Robustness Testing Under Different Point Cloud Density Conditions

To evaluate the stability of the system under different data quality conditions, the original point cloud is reduced to 75%, 50% and 25% density levels by random down-sampling, and the segmentation and parameter extraction experiments are repeated on Scene A. The robustness test results are shown in Figure 14.

The test results show that when the point cloud density decreases to 50%, the average IoU decreases from 82.6% to 77.4%, and the dip direction RMSE increases from 4.15° to 5.82°. Although the segmentation quality and parameter accuracy decrease, they remain within the range acceptable for engineering use. When the density is further reduced to 25%, the IoU drops sharply to 68.1% and the orientation error increases significantly, indicating that the sparse point cloud cannot provide sufficient geometric constraints for normal vector estimation and plane fitting. The test results provide a quantitative reference for selecting the data acquisition density in practical engineering.
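Random down-sampling to a fixed density level is simple to reproduce. The sketch below assumes uniform sampling without replacement and a fixed seed for repeatability; the seed and the function name `downsample` are illustrative choices, not details taken from the experiment.

```python
import random

def downsample(points, fraction, seed=0):
    """Randomly keep the given fraction of points (uniform, without replacement)."""
    rng = random.Random(seed)
    k = round(len(points) * fraction)
    return rng.sample(points, k)

# Density levels used in the robustness test.
LEVELS = (0.75, 0.50, 0.25)
```

Running the segmentation and parameter extraction on `downsample(cloud, level)` for each entry in `LEVELS` mirrors the structure of the test described above.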

3.3.4. Image-Side Robustness Under Illumination, Geometric Perturbation, and View Overlap Variations

Beyond the point cloud density robustness assessed in Section 3.3.3, the robustness of the image-side detection stage to three additional perturbations of practical relevance was systematically evaluated: image brightness variation, in-plane image rotation and inter-image overlap rate during UAV photogrammetric acquisition. These three perturbation types correspond to common field-acquisition uncertainties—changing solar illumination over the survey time window, off-nadir tilt of the gimbal during low-altitude flight and adjustments to flight strip spacing imposed by terrain constraints—and therefore characterize the operational envelope within which the proposed system can be expected to maintain stable performance. The test set was the Scene A test partition (180 images); for each perturbation type, the network trained on the unperturbed full dataset was applied without re-training, and the resulting mAP@0.5, Precision and Recall were measured against the original benchmark. The results are summarized in Table 9.

Three operational findings emerge from this analysis. First, under moderate brightness variation (±15%) covering the typical illumination range encountered between morning and afternoon UAV flight slots, the mAP@0.5 degradation is only 1.8 percentage points, indicating that the ECA channel attention mechanism effectively suppresses illumination-induced texture-contrast variability across feature channels. Second, the network remains stable under small in-plane rotations (a ±5° rotation produces only a 1.3 percentage point drop), but the degradation grows approximately linearly with rotation magnitude beyond ±10°. This is expected given that the training set was not augmented with strong rotational variations and that elongated discontinuity targets are intrinsically orientation-sensitive. Third, image overlap rate exhibits the strongest performance gradient: while the 70% overlap commonly recommended in UAV manuals produces negligible degradation, reducing overlap to 50% causes an 8.1 percentage point mAP drop because the corresponding loss of multi-view redundancy degrades both the underlying photogrammetric reconstruction and the cross-modal registration that propagates into the detection pipeline. Practically, these results define a recommended field-acquisition envelope: brightness stable within ±15%, gimbal tilt within ±5° of nominal and image overlap maintained at or above 70%, within which the system is expected to deliver performance comparable to the headline benchmark.
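A brightness perturbation of the kind tested here can be sketched as a multiplicative scaling of pixel intensities. This is an assumption: the text does not say whether brightness was varied multiplicatively, additively or in another colour space. The image is taken to be a nested list of 8-bit grey values, and the function name `scale_brightness` is hypothetical.

```python
def scale_brightness(image, factor):
    """Scale 8-bit pixel intensities by `factor`, clipping to [0, 255]."""
    return [
        [min(255, max(0, round(v * factor))) for v in row]
        for row in image
    ]

# Assumed reading of the +/-15% brightness envelope.
BRIGHTER, DARKER = 1.15, 0.85
```

Applying `scale_brightness(img, BRIGHTER)` and `scale_brightness(img, DARKER)` to each test image, then re-running detection without re-training, reproduces the structure of the brightness test.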

3.4. Overall System Performance and Engineering Utility Analysis

The total process time from raw data input to structural surface parameter output is shown in Table 10. As shown in Table 10, completing the full workflow from raw data to engineering parameters for Scene A (approximately 12.8 million points, 47 discontinuities) takes approximately 86 min in total. By comparison, the average working time required for three geological engineers to perform conventional manual logging of the same scene is approximately 3.5 working days (approximately 28 h), representing an efficiency improvement of approximately 19.5 times by the proposed system. The two primary time bottlenecks are the three-dimensional reconstruction stage and the point cloud segmentation stage; the former is limited by the computational intensity of the SfM-MVS algorithm itself, while the latter is closely related to normal vector estimation and iterative clustering of large-scale point clouds. The detection stage of the improved YOLOv8 accounts for only 1.4% of the total processing time, fully demonstrating the real-time advantage of single-stage detectors.
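The efficiency figures in this paragraph can be checked with a short worked calculation, taking the 28 h manual effort and the 86 min automated run time as given:

```latex
\begin{align*}
\text{speed-up} &= \frac{28\ \text{h} \times 60\ \text{min/h}}{86\ \text{min}}
                 = \frac{1680}{86} \approx 19.5 \\
t_{\text{detection}} &\approx 0.014 \times 86\ \text{min} \approx 1.2\ \text{min}
\end{align*}
```

The second line is the detection time implied by the stated 1.4% share of the total.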

Beyond processing speed, the engineering utility of the proposed system is fundamentally measured by the extent to which its outputs can directly serve downstream geotechnical analysis workflows. To demonstrate this capability, the automatically extracted orientation data from Scene A were used to generate an equal-area lower-hemisphere stereonet projection, as shown in Figure 15. The stereonet clearly resolves three dominant discontinuity sets: J1 (dip direction/dip angle: 285°/72°), J2 (156°/45°) and J3 (30°/12°), with pole point concentrations that are statistically consistent with those obtained from the manual compass survey of 180 measurements. Based on these three identified discontinuity sets and the slope face orientation (dip direction 200°, dip angle 65°), a preliminary kinematic analysis was performed following the block theory framework. The analysis identifies a potential wedge failure block formed by the intersection of J1 and J2, with the line of intersection plunging at 38° in the direction of 218°—daylighting on the slope face and thus kinematically admissible for sliding. The critical friction angle back-calculated from the limit equilibrium assessment of this wedge block is approximately 32°. It should be emphasized that this back-calculated 32° is reported here only as a consistency check against the range of values commonly cited for weathered limestone discontinuities in regional engineering practice (typically 30–38°); it is not an independent in situ measurement and therefore does not constitute a validation of the rock-mechanical parameters of the slope. Direct in situ confirmation of the actual friction angle for the J1–J2 wedge surface would require dedicated discontinuity-surface shear-box testing or back-analysis of an observed failure event, which is identified here as a necessary follow-up step before the automated parameter output can be used to support load-bearing stability decisions on this specific slope.
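The trend and plunge of a wedge intersection line follow from the cross product of the two plane normals. The sketch below shows that calculation for planes given as (dip direction, dip angle) in degrees, in an assumed (east, north, up) frame. It is an illustration of the geometry only, not the block theory analysis performed in the paper; in particular, feeding it the rounded set-mean orientations quoted above may not reproduce the reported 218°/38°, since the wedge analysis may rest on individual planes rather than set means. The function names are hypothetical.

```python
import math

def pole(dip_dir, dip):
    """Upward unit normal (east, north, up) of a plane given in degrees."""
    a, d = math.radians(dip_dir), math.radians(dip)
    return (math.sin(d) * math.sin(a), math.sin(d) * math.cos(a), math.cos(d))

def intersection_line(plane1, plane2):
    """Trend and plunge (degrees) of the line where two planes intersect."""
    n1, n2 = pole(*plane1), pole(*plane2)
    v = (n1[1] * n2[2] - n1[2] * n2[1],
         n1[2] * n2[0] - n1[0] * n2[2],
         n1[0] * n2[1] - n1[1] * n2[0])
    if v[2] > 0:  # report the downward-pointing direction
        v = tuple(-c for c in v)
    horizontal = math.hypot(v[0], v[1])
    trend = math.degrees(math.atan2(v[0], v[1])) % 360.0
    plunge = math.degrees(math.atan2(-v[2], horizontal))
    return trend, plunge
```

A kinematic check for wedge sliding would then compare this plunge with the apparent dip of the slope face in the trend direction and with the friction angle.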

Furthermore, the structured parameter database generated by the system—containing the dip direction, dip angle, spacing, trace length, aperture and roughness for each individual discontinuity instance—is formatted in a tabular schema compatible with mainstream geotechnical numerical simulation software such as 3DEC (Itasca) for discrete element modeling and UDEC for two-dimensional discontinuum analysis. This structured interface substantially reduces the manual parameter-entry workload that has traditionally been one of the dominant time costs in the conventional workflow from geological survey to numerical simulation, while leaving the project-specific mechanical calibration of the imported parameters to the downstream simulation stage where it properly belongs.
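One way such a per-discontinuity table could be serialized is sketched below. The column names, units and CSV layout are hypothetical placeholders chosen for illustration; they are not the schema used by the system, nor an import format defined by 3DEC or UDEC.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields

@dataclass
class Discontinuity:
    # Hypothetical column names and units; the actual schema is not given.
    id: int
    set_label: str
    dip_direction_deg: float
    dip_angle_deg: float
    spacing_m: float
    trace_length_m: float
    aperture_mm: float
    jrc: float

def to_csv(records):
    """Serialize discontinuity records to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=[f.name for f in fields(Discontinuity)],
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buf.getvalue()
```

Keeping one row per discontinuity instance, with explicit units in the column names, leaves any simulator-specific conversion to a thin adapter on the downstream side.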

To visually demonstrate the final output of the system, the 3D visualization result of Scene A is shown in Figure 16. Taken together, the above experimental results show that the intelligent recognition system integrating improved YOLOv8 and point cloud segmentation proposed in this paper has significant advantages in four dimensions: detection accuracy, parameter extraction accuracy, downstream engineering applicability and operational efficiency. The improved YOLOv8 achieves a mAP@0.5 of 89.4%, the dip angle RMSE of orientation extraction is 2.46° and the full workflow is approximately 19.5 times faster than manual methods. The system maintained stable performance in two actual engineering scenes with different geological conditions, verifying the engineering practicality of the cross-modal collaborative recognition framework in slope discontinuity surveys. Within the two-site validation envelope of this study (one fragmented limestone slope and one layered sandstone–mudstone slope, both located in Hunan Province), the system delivers the in-distribution accuracy and efficiency reported above. Wider engineering deployment, particularly to lithologies, weathering grades or vegetation conditions outside this envelope, would benefit from the small-sample fine-tuning step recommended on the basis of the cross-site experiments in Section 3.1.2. The two-site validation should therefore be read as a proof of operational feasibility rather than as a substitute for project-specific calibration in downstream geotechnical decision-making, and the headline figures of merit reported above should be interpreted within this clearly delimited applicability boundary.

4. Discussion

4.1. Effectiveness Mechanism of the Cross-Modal Collaborative Pathway

The fundamental reason why the cross-modal collaborative strategy of “visual detection-guided point cloud segmentation” proposed in this paper outperforms single-modal methods in slope discontinuity recognition lies in the full activation of the complementarity between two-dimensional imagery and three-dimensional point clouds at the information level. Two-dimensional imagery possesses rich texture and color information, enabling rapid recognition of the distribution pattern of discontinuity traces over a large field of view; however, it cannot directly provide orientation and geometric parameters in three-dimensional space. Three-dimensional point clouds contain precise spatial coordinate information but are far inferior to image data in global semantic understanding capability. Moreover, the computational cost of blind-search segmentation on point clouds at the tens-of-millions-of-points scale is extremely high. After completing the rapid localization of discontinuities at the image level through improved YOLOv8, this paper converts detection boxes into ROI constraints in point cloud space through registration mapping. Subsequent normal vector estimation and density clustering then proceed only within limited regions of interest. This mechanism simultaneously achieves improvements in both computational efficiency and segmentation accuracy, unifying the two seemingly contradictory objectives within the cross-modal information transfer framework. The ablation experiment in Table 7 provides direct quantitative evidence for this mechanism: the point-cloud-only mode without ROI guidance requires 48% more processing time while yielding a dip angle RMSE that is 0.95° worse, confirming that the semantic dimensionality reduction provided by visual detection is not merely a computational convenience but a substantive contributor to parameter extraction quality. 
From an information-theoretic standpoint, the ROI constraint functions as a spatial prior that reduces the entropy of the analysis domain: by confining the point cloud processing to bounded volumes where discontinuity surfaces are expected to exist, the normal vector distribution within each analysis region becomes substantially more concentrated compared to the full-cloud scenario, leading to more coherent DBSCAN clusters and higher-fidelity RANSAC plane fitting. This entropy reduction effect is more pronounced in Scene A (fragmented limestone with closely spaced, intersecting discontinuity sets) than in Scene B (regularly layered sandstone-mudstone), which explains the larger IoU improvement margin observed in Scene A (7.8 percentage points versus 7.1 percentage points). To more clearly present the differences between the proposed method and existing representative research in technical pathways and performance metrics, the systematic comparative results are shown in Table 11.
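The ROI-guidance step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes an ideal pinhole camera without lens distortion, a known world-to-camera pose, and an axis-aligned detection box. The function name `crop_points_to_bbox` and the `margin_px` parameter are hypothetical.

```python
import numpy as np

def crop_points_to_bbox(points, K, R, t, bbox, margin_px=0.0):
    """Keep the 3D points whose image projection falls inside a detection box.

    points : (N, 3) array of world coordinates
    K      : (3, 3) camera intrinsic matrix (assumed pinhole, no distortion)
    R, t   : world-to-camera rotation (3, 3) and translation (3,)
    bbox   : (u_min, v_min, u_max, v_max) in pixels
    """
    cam = points @ R.T + t                      # world frame -> camera frame
    in_front = cam[:, 2] > 0                    # points behind the camera are invalid
    z = np.where(in_front, cam[:, 2], 1.0)      # placeholder depth avoids division by zero
    uv = (cam @ K.T)[:, :2] / z[:, None]        # perspective projection to pixels
    u_min, v_min, u_max, v_max = bbox
    inside = (
        in_front
        & (uv[:, 0] >= u_min - margin_px) & (uv[:, 0] <= u_max + margin_px)
        & (uv[:, 1] >= v_min - margin_px) & (uv[:, 1] <= v_max + margin_px)
    )
    return points[inside]
```

Normal estimation, clustering and plane fitting would then run only on the returned subset, which is the source of the reduced search volume discussed in this section. A small pixel margin can be used to tolerate localization error at the box boundary.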

As shown in Table 11, the proposed method trails RL-JointNet, proposed by Sun et al. [17], in the absolute accuracy of orientation extraction: the reported dip direction RMSE of 2.8° and dip angle RMSE of 1.5° are approximately 1.35° and 1° lower, respectively, than the values achieved here (4.15° and 2.46°). The gap is primarily attributable to two error sources that RL-JointNet avoids by training end-to-end on high-precision annotated point clouds: detection bounding box localization deviations and registration residuals in the image-to-point-cloud mapping, whose cumulative effect constrains the point cloud segmentation accuracy of the proposed method. For tasks where orientation accuracy is the single dominant criterion, end-to-end point cloud learning on high-quality annotated datasets therefore remains the strongest option. The proposed cross-modal method should not be positioned as a replacement for such methods but as a complementary route that exchanges a fraction of orientation accuracy for an order-of-magnitude reduction in annotation cost (image-side bounding boxes only, rather than per-point discontinuity labels) and for an automated closed loop from detection through to stereonet-ready engineering parameters. Its advantages lie in two dimensions: end-to-end engineering parameter output and processing efficiency.
Existing pure point cloud deep learning methods typically only output segmentation masks or a subset of orientation parameters, lacking an automated closed loop from detection to complete engineering information, whereas the system proposed in this paper integrates detection, segmentation, parameter calculation and visualization output into a complete pipeline, capable of completing full workflow processing from raw data to structured geological reports within 86 min. As demonstrated in Section 3.4, the automatically extracted parameters can directly generate stereonet projections and support preliminary kinematic block consistency checks. This provides a structured transition from computational detection to the geometric-parameter input stage of geotechnical engineering evaluation—a capability that none of the compared methods currently provides. It should nevertheless be noted that project-specific mechanical calibration of the imported parameters remains a prerequisite for load-bearing stability decisions.

Combining these observations, the proposed method is most appropriate for engineering investigation scenarios where automated end-to-end output, large-area coverage, and rapid turnaround are valued above a 1–1.5° incremental gain in orientation precision; in particular, routine slope mapping, preliminary stability screening and digital-twin geometric model construction for medium-to-large open-pit and roadside slopes. For applications requiring the highest absolute accuracy on a small number of critical discontinuities—for example, back-analysis of an observed failure surface or the rock-mechanical parameter input for a single benchmark wedge—the methodological cost of producing per-point annotations for an end-to-end point cloud network such as RL-JointNet remains justified. The two routes are therefore complementary in the engineering workflow rather than competing, and the choice depends on which dimension of accuracy versus throughput is the binding constraint of a given project.

4.2. Performance Boundaries and Applicable Conditions

The experimental results in Section 3 show that system performance depends on geological conditions. In the layered rock scene (Scene B), where the discontinuities are geometrically regular and well exposed, the segmentation IoU reaches 86.3% and orientation extraction is more accurate than in the more fragmented Scene A. This indicates that the geometric regularity of discontinuities improves both the quality of normal vector estimation and the stability of plane fitting. Where the curvature of the rock surface changes sharply, or where local depressions and protrusions occur, the locally planar neighborhood assumption underlying PCA normal vector estimation breaks down, producing over-segmentation or misclassification. The four unrecognized discontinuities in Scene A all lie in the high-curvature area at the slope turning point, which supports this interpretation. Point cloud density also affects system performance. Robustness testing shows that the system maintains acceptable accuracy when the density drops to 50% of the original level, but performance deteriorates significantly when it drops to 25%. This threshold provides a reference lower limit for the design of data acquisition schemes in practical engineering.
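The sensitivity of PCA normal estimation to surface curvature can be illustrated with a short sketch. It assumes the common formulation in which the normal is the eigenvector of the local covariance matrix with the smallest eigenvalue; the surface-variation score (smallest eigenvalue divided by the eigenvalue sum) is an illustrative diagnostic chosen here, not a quantity reported in the paper.

```python
import numpy as np

def pca_normal(neighborhood):
    """Estimate a unit normal and a surface-variation score for a local patch.

    neighborhood : (N, 3) array of points around the query location.
    Returns (normal, surface_variation); the score is near 0 for a planar
    patch and grows as the patch departs from a single plane.
    """
    centered = neighborhood - neighborhood.mean(axis=0)
    cov = centered.T @ centered / len(neighborhood)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    normal = eigvecs[:, 0]                      # direction of least variance
    if normal[2] < 0:                           # orient the normal upward
        normal = -normal
    surface_variation = eigvals[0] / eigvals.sum()
    return normal, surface_variation
```

On a patch that straddles a fold or a slope turning point, the least-variance direction no longer corresponds to either adjoining surface, which is consistent with the missed detections in high-curvature areas noted above.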

On the imaging side, the improved YOLOv8 effectively stabilizes detection in scenarios with shadow occlusion and vegetation interference. However, a risk of missed detection remains for closed discontinuities whose color closely matches the surrounding intact rock surface. Such targets produce almost no distinguishable texture difference in two-dimensional images, which places them beyond the capability of purely visual detection methods. In addition, the system has been validated in only two engineering scenes and has not been systematically tested on other lithologies such as granite and gneiss; its generalization ability still needs to be confirmed across a wider range of geological settings. Specifically, the system may face degraded performance in three conditions. The first is heavily vegetated slopes where canopy coverage exceeds approximately 60%, severely occluding the underlying rock surface and reducing both detection recall and point cloud completeness. The second is extremely fragmented rock masses, such as fault breccia zones, where the distinction between individual discontinuity traces and pervasive micro-fracture networks becomes ambiguous at the image resolution employed. The third is slopes with highly repetitive or homogeneous surface textures, such as massive unweathered granite faces, where the SIFT-based registration strategy may suffer from feature matching degradation due to insufficient distinctive key points.

4.3. Error Propagation Mechanism and Physical Interpretability

A critical aspect that warrants explicit discussion is the error propagation pathway inherent to the cross-modal architecture. The proposed system follows a sequential processing chain—image detection → registration mapping → ROI extraction → point cloud segmentation → parameter calculation—in which errors at each stage accumulate and propagate to downstream outputs. The detection bounding box localization error is typically 3–8 pixels at the image scale employed. This pixel-level error is amplified during back-projection to 3D space by a factor that depends on the shooting distance and camera focal length. For the data acquisition configurations used in this study, the resulting spatial offset at the ROI boundary is approximately 5–15 mm. This boundary offset can cause the partial inclusion of an adjacent rock mass or partial exclusion of discontinuity margins, directly affecting the subsequent PCA normal vector estimation. In high-curvature slope regions where multiple discontinuity sets converge, the registration residual further compounds this effect, because the ICP convergence quality degrades when local geometric features lack distinctiveness. The cumulative impact of this error chain is most pronounced in the dip direction parameter, with RMSE 4.15° notably higher than the dip angle RMSE of 2.46°. Dip direction computation depends on the ratio of horizontal normal vector components, which makes it disproportionately sensitive to small perturbations in the fitted plane orientation.
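The amplification from pixel offset to spatial offset can be made concrete with a small calculation. The sketch below assumes a pinhole camera viewing a surface roughly perpendicular to the optical axis, so that one pixel covers `distance × pixel pitch / focal length` on the object. The shooting distance, focal length and pixel pitch used in the usage note are illustrative assumptions, not the acquisition parameters of this study.

```python
def ground_sampling_distance(distance_m, focal_length_mm, pixel_pitch_um):
    """Object-space footprint of one pixel, in meters (pinhole approximation)."""
    return distance_m * (pixel_pitch_um * 1e-6) / (focal_length_mm * 1e-3)

def roi_boundary_offset_mm(pixel_offset, distance_m, focal_length_mm, pixel_pitch_um):
    """Spatial displacement of the ROI boundary caused by a pixel-level offset."""
    gsd_m = ground_sampling_distance(distance_m, focal_length_mm, pixel_pitch_um)
    return pixel_offset * gsd_m * 1e3

if __name__ == "__main__":
    # Hypothetical configuration: 15 m distance, 24 mm lens, 3 um pixel pitch.
    for px in (3, 8):
        print(px, "px ->", roi_boundary_offset_mm(px, 15.0, 24.0, 3.0), "mm")
```

Under these assumed values, one pixel covers about 1.9 mm, so offsets of 3 and 8 pixels map to roughly 5.6 mm and 15 mm, the same order as the 5–15 mm range quoted above. Oblique viewing angles would increase the footprint further.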

To provide a clear geometric understanding of this error amplification process, the cross-modal error propagation mechanism is illustrated as a conceptual diagram in Figure 17. The diagram traces the error chain through five sequential stages from left to right: (1) at the image detection stage, the bounding box exhibits a localization offset of 3–8 pixels relative to the true discontinuity boundary; (2) during back-projection through the camera intrinsic and extrinsic parameter matrices, this pixel-level offset is amplified by a magnification factor dependent on the object distance and focal length, producing a 5–15 mm spatial displacement at the ROI boundary in 3D point cloud space; (3) the displaced ROI boundary either incorporates extraneous points from adjacent intact rock mass or truncates marginal points belonging to the target discontinuity surface; (4) within the contaminated ROI, PCA normal vector estimation is biased because the covariance matrix is computed over a point set that does not purely represent a single planar surface, causing the estimated normal vector to deviate from the true surface normal; and (5) the final dip direction calculation, which relies on the arctangent of the ratio between the horizontal normal vector components $n_{x}$ and $n_{y}$, exhibits disproportionate sensitivity to this angular perturbation: a small deviation $\delta n$ in the fitted plane normal produces an amplified dip direction error $\delta\alpha$ that scales inversely with the magnitude of $n_{y}$, geometrically explaining why the dip direction RMSE (4.15°) consistently exceeds the dip angle RMSE (2.46°) across all experimental configurations.
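The geometric amplification in stage (5) can be checked numerically. The sketch below assumes a coordinate convention of x = east, y = north, z = up with an upward-pointing unit normal, so that the dip direction is `atan2(n_x, n_y)` and the dip angle is `arccos(n_z)`; this convention and the function names are assumptions for illustration. The normal is tilted by a fixed angle along the strike direction, which is the perturbation that affects dip direction most strongly.

```python
import numpy as np

def orientation_to_normal(dip_direction_deg, dip_angle_deg):
    """Upward unit normal of a plane (x = east, y = north, z = up)."""
    a, b = np.radians(dip_direction_deg), np.radians(dip_angle_deg)
    return np.array([np.sin(b) * np.sin(a), np.sin(b) * np.cos(a), np.cos(b)])

def normal_to_orientation(n):
    """Return (dip direction, dip angle) in degrees for a plane normal."""
    n = np.asarray(n, dtype=float)
    n = n / np.linalg.norm(n)
    if n[2] < 0:                                 # enforce an upward normal
        n = -n
    dip_direction = np.degrees(np.arctan2(n[0], n[1])) % 360.0
    dip_angle = np.degrees(np.arccos(np.clip(n[2], -1.0, 1.0)))
    return dip_direction, dip_angle

def dip_direction_error(dip_angle_deg, tilt_deg, dip_direction_deg=90.0):
    """Dip-direction change when the normal is tilted by tilt_deg along strike."""
    n = orientation_to_normal(dip_direction_deg, dip_angle_deg)
    a = np.radians(dip_direction_deg)
    strike = np.array([np.cos(a), -np.sin(a), 0.0])   # horizontal, perpendicular to n
    perturbed = n + np.tan(np.radians(tilt_deg)) * strike
    new_direction, _ = normal_to_orientation(perturbed)
    return (new_direction - dip_direction_deg + 180.0) % 360.0 - 180.0

if __name__ == "__main__":
    for dip in (10.0, 30.0, 60.0):
        print(dip, dip_direction_error(dip, tilt_deg=2.0))
```

In this simplified model the dip-direction error equals `atan(tan(tilt) / sin(dip))`, so the same 2° tilt of the normal produces a much larger dip-direction error on a gently dipping plane than on a steep one, while the dip angle itself changes very little. This is one way to read the inverse scaling with the horizontal normal components described above.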

4.3.1. Quantitative Sensitivity Analysis

To move from the qualitative chain description presented above to a quantitative characterization of how each error source contributes to the final orientation uncertainty, a controlled sensitivity analysis was conducted in which the detection bounding box offset and the ICP registration residual were independently varied while holding the downstream PCA–DBSCAN–RANSAC pipeline parameters fixed. Perturbations were injected at the corresponding stages of the pipeline by deliberately displacing the detection bounding boxes in pixel space and by adding Gaussian-distributed offsets to the registered point cloud coordinates with prescribed standard deviation; the resulting dip direction and dip angle RMSE were measured against the 180-point manual compass survey ground truth introduced in Section 3.3.2. The results are summarized in Table 12.

As shown in Table 12, the predicted dip direction RMSE at the current operating point of the system, corresponding to the ≈5 pixel mean detection offset and the ≈5 mm RMS ICP residual (rounded from the 4.8 mm RMSE of Scene A reported in Section 2.3.1), is 4.15°, in close agreement with the experimentally observed dip direction RMSE of 4.15° reported in Section 3.3.2. The corresponding predicted dip angle RMSE of 2.46° similarly matches the observed value, confirming that the geometric error propagation model captures the dominant sources of the observed parameter uncertainty. Linearizing the sensitivity around the current operating point yields propagation coefficients of approximately 0.40°/pixel for the dip direction with respect to the detection offset, 0.20°/mm for the dip direction with respect to the ICP residual, 0.16°/pixel for the dip angle with respect to the detection offset and 0.08°/mm for the dip angle with respect to the ICP residual. The dip direction is thus roughly 2.5 times as sensitive to each of the two error sources as the dip angle, consistent with the analytical observation in Figure 17 that the dip direction depends on the ratio between horizontal normal vector components and is therefore disproportionately sensitive to small perturbations in the fitted plane orientation.

The residual baseline RMSE of 3.50° in dip direction and 2.29° in dip angle, observed at the zero-detection-offset, zero-ICP-residual configuration, reflects the intrinsic noise of the PCA normal vector estimation and the RANSAC plane fitting steps, which together account for approximately 71% of the variance in dip direction error and 87% of the variance in dip angle error under the current operating conditions. This decomposition carries a direct practical implication: reducing the detection box offset from 5 to 3 pixels while keeping the ICP residual at the current 5 mm (a realistic gain achievable from a stronger detection backbone) would reduce the dip direction RMSE by only 0.32° in absolute terms. Future accuracy improvements should therefore prioritize the point-cloud-side processing stages rather than image-side detection refinement, for example, through denser TLS scanning to reduce the local neighborhood size used in PCA, improved local geometric estimators robust to high-curvature regions or the physics-informed constraints discussed in Section 4.3.2.
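The variance shares quoted above follow directly from the reported RMSE values if the baseline noise and the injected perturbations are treated as independent contributions that add in quadrature. That independence is an assumption of this sketch; the helper name is hypothetical.

```python
def baseline_variance_share(baseline_rmse_deg, total_rmse_deg):
    """Fraction of total error variance explained by the baseline RMSE,
    assuming independent error sources that add in quadrature."""
    return baseline_rmse_deg ** 2 / total_rmse_deg ** 2

if __name__ == "__main__":
    # RMSE values as reported in the text (degrees).
    print(round(100 * baseline_variance_share(3.50, 4.15)))  # dip direction -> 71
    print(round(100 * baseline_variance_share(2.29, 2.46)))  # dip angle -> 87
```

Because the baseline term dominates the variance, shrinking either injected error source changes the total RMSE only modestly, which is the basis for the recommendation to prioritize point-cloud-side processing.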

4.3.2. Physical Interpretability and Future Directions

From the perspective of physical interpretability, the current system treats discontinuity recognition as a purely data-driven pattern matching problem without incorporating any rock mechanics principles into the algorithmic framework. This represents both a limitation and an opportunity for future development. In principle, discontinuity sets in natural rock masses are governed by the regional tectonic stress field and lithological mechanical properties, which impose physical constraints on permissible orientation distributions, spacing regularity and termination patterns. For example, conjugate joint sets formed under the same tectonic stress regime typically exhibit a predictable angular relationship: the acute angle bisector aligns with the maximum principal stress direction $\sigma_{1}$, and the conjugate angle is related to the internal friction angle of the rock material. This geometric regularity could be formulated as an angular consistency constraint during the DBSCAN clustering stage, penalizing candidate groupings that violate the expected conjugate geometry and thereby reducing spurious cluster assignments. Similarly, bedding planes in sedimentary sequences such as those in Scene B maintain approximate parallelism across the slope face, which could be encoded as a post-processing orientation coherence check that flags and corrects individual bedding plane measurements whose dip angles deviate by more than a lithology-dependent threshold from the group mean. Such physics-informed constraints would effectively narrow the solution space of orientation estimation to geologically plausible configurations, potentially mitigating the error propagation effects described above without requiring additional data acquisition. Integrating these rock mechanics priors into the computational framework, whether as regularization terms in the clustering objective function or as Bayesian prior distributions in the parameter estimation step, represents a promising direction toward bridging data-driven computer vision with mechanism-based rock engineering analysis.
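The bedding-plane coherence check proposed above could take a form as simple as the following sketch. It is a hypothetical illustration of a future direction, not part of the current system: the function name, the use of the plain group mean and the threshold value in the usage note are all assumptions.

```python
import numpy as np

def flag_bedding_outliers(dip_angles_deg, threshold_deg):
    """Flag bedding-plane measurements whose dip angle deviates from the
    group mean by more than a lithology-dependent threshold (degrees)."""
    dips = np.asarray(dip_angles_deg, dtype=float)
    deviation = np.abs(dips - dips.mean())
    return deviation > threshold_deg

if __name__ == "__main__":
    dips = [30.0, 31.0, 29.0, 30.0, 45.0]        # illustrative values only
    print(flag_bedding_outliers(dips, threshold_deg=8.0))
```

Because a strong outlier shifts the mean, a robust centre such as the median may be preferable in practice. Flagged measurements would need review rather than automatic correction until the threshold has been calibrated for the lithology in question.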

5. Conclusions

(1): The “visual detection guided point cloud analysis” paradigm establishes a cross-modal dimensionality-reduction mechanism. It transforms the intractable global blind-search segmentation problem into a localized ROI-constrained oriented analysis problem through detection bounding box back-projection. Controlled ablation experiments demonstrate that this mechanism—rather than the specific choice of downstream algorithms—is the primary contributor to performance enhancement. It yields a 7.1–7.8 percentage point IoU improvement and a 48% reduction in processing time compared to the identical algorithmic pipeline operating on the unconstrained full point cloud.
(2): The geological-texture-sensitive detection architecture incorporates channel attention for weak-contrast texture capture, bidirectional multi-scale fusion for scale-span accommodation and angle-aware regression for extreme-aspect-ratio targets. It collectively achieves a mAP@0.5 of 89.4% on the slope discontinuity dataset—a 3.8 percentage point improvement over the baseline network with only 0.86 M additional parameters. Diagnostic analysis across shadow-occluded, vegetation-interference and weathered-debris scenarios confirms that each module addresses a specific geological challenge.
(3): The point cloud segmentation average IoU reaches 82.6% and 86.3% in the fragmented limestone and layered sandstone-mudstone scenes, respectively; the dip angle RMSE is 2.46°, and the dip direction RMSE is 4.15°, with a spacing average relative error of 6.8%. These accuracy levels fall within the ±5° scatter range of individual compass measurements, ensuring that the dominant failure modes identified through automated analysis remain consistent with those determined by conventional manual geological survey and satisfying the analytical requirements of routine slope engineering investigations.
(4): The complete workflow from raw data acquisition to engineering parameter output requires approximately 86 min for a scene of 12.8 million points with 47 discontinuities—a 19.5-fold efficiency improvement over conventional manual methods. The automatically generated structured parameter database is formatted in a schema compatible with mainstream geotechnical numerical simulation software such as 3DEC and UDEC, where it serves as the geometric-parameter input to downstream discrete element modelling rather than as a substitute for site-specific mechanical calibration. The downstream applicability has been demonstrated through stereonet-based kinematic analysis and a preliminary wedge-block consistency check in Scene A, thereby establishing an automated workflow from field data acquisition to the geometric-parameter input stage of engineering stability evaluation. The cross-site experiments reported in Section 3.1.2 quantify the applicable boundary of the current system: within the two-site validation envelope examined in this study, the headline accuracy is maintained; deployment to a markedly different lithology family without fine-tuning would degrade detection mAP by 8.7–12.9 percentage points, indicating that a small-sample fine-tuning step on a few representative images from the target site is recommended when transferring the system to a new geological context.
(5): Two natural directions for future work are identified. First, the system should be tested on additional lithologies including granite, gneiss, and fault-breccia zones to validate the cross-lithology generalization boundary indicated by the cross-site experiments. Second, physics-informed constraints could be integrated into the clustering and plane-fitting stages. Such constraints include rock-mass mechanical regularization of permissible orientation distributions, conjugate joint-set angular consistency, and bedding-plane parallelism priors. Their integration could narrow the solution space to geologically plausible configurations and potentially mitigate the error-propagation effects characterized in Section 4.3.

Author Contributions

Conceptualization, H.L. (Hongwei Liu) and H.L. (Hang Lin); methodology, H.L. (Hongwei Liu); software, H.L. (Hongwei Liu); validation, H.L. (Hongwei Liu), K.X. and H.L. (Hang Lin); formal analysis, H.L. (Hongwei Liu); investigation, H.L. (Hongwei Liu) and K.X.; resources, K.X. and H.L. (Hang Lin); data curation, K.X.; writing—original draft preparation, H.L. (Hongwei Liu); writing—review and editing, H.L. (Hang Lin); visualization, H.L. (Hongwei Liu); supervision, H.L. (Hang Lin); project administration, H.L. (Hang Lin); funding acquisition, H.L. (Hang Lin). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Project (NRMSSHR-2022-Z08), supported by the Key Laboratory of Natural Resources Monitoring and Supervision in Southern Hilly Region, Ministry of Natural Resources.

Institutional Review Board Statement

On behalf of all authors, the corresponding author states that there are no conflicts of interest. This article does not contain any studies with human participants or animals performed by any of the authors.

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Liu, J.; Zhang, Y.; Wang, Q.; Sun, Y.J.; Cao, Y.B.; Chen, Z.Y. Investigation and susceptibility assessment of regional geological hazards along the Karakoram highway, northeast margin of Pamir Plateau. Geomat. Nat. Hazards Risk 2024, 15, 2341176.
2. Chai, S.B.; Song, B.Y.; Liu, J.H.; Liu, K.; Liu, X.Y.; Shi, J.H. Theoretical analysis of dynamic sliding mechanism of rock slope with a bedding structural plane based on stress wave propagation. Sci. Rep. 2025, 15, 7349.
3. Al-E’Bayat, M.; Guner, D.; Sherizadeh, T.; Asadizadeh, M. Numerical investigation for the effect of joint persistence on rock slope stability using a lattice spring-based synthetic rock mass model. Sustainability 2024, 16, 894.
4. Liu, D.S.; Chen, S.Y.; Lin, H.; Chen, Y.F. Shear Behavior and Morphological Evolution of Rough Rock Joints under Non-uniform Loading. J. Mater. Eng. Perform. 2025.
5. Wang, M.; Zhou, J.; Chen, J.; Jiang, N.; Zhang, P.; Li, H. Automatic identification of rock discontinuity and stability analysis of tunnel rock blocks using terrestrial laser scanning. J. Rock. Mech. Geotech. Eng. 2023, 15, 1810–1825.
6. Pola, A.; Herrera-Díaz, A.; Tinoco-Martínez, S.R.; Macias, J.L.; Soto-Rodríguez, A.N.; Soto-Herrera, A.M.; Sereno, H.; Ramón Avellán, D. Rock characterization, UAV photogrammetry and use of algorithms of machine learning as tools in mapping discontinuities and characterizing rock masses in Acoculco Caldera Complex. Bull. Eng. Geol. Environ. 2024, 83, 260.
7. Cirillo, D.; Zappa, M.; Tangari, A.C.; Brozzetti, F.; Ietto, F. Rockfall analysis from UAV-based photogrammetry and 3D models of a cliff area. Drones 2024, 8, 31.
8. Temur, M.A.; Kocaman, S.; Nefeslioglu, H.A. On the use of semi-georeferenced photogrammetric dense point clouds in the investigation of rock mass discontinuity properties. Bull. Eng. Geol. Environ. 2024, 83, 451.
9. Zhou, J.-W.; Chen, J.-L.; Li, H.-B. An optimized fuzzy K-means clustering method for automated rock discontinuities extraction from point clouds. Int. J. Rock. Mech. Min. Sci. 2024, 173, 105627.
10. Kang, J.; Fu, X.; Sheng, Q.; Ge, Y.; Chen, J.; Wang, H. Semi-automatic identification of rock discontinuity orientation based on 3D point clouds and its engineering application. Bull. Eng. Geol. Environ. 2024, 83, 172.
11. Daghigh, H.; Tannant, D.D.; Jaberipour, M. A computationally efficient approach to automatically extract rock mass discontinuities from 3D point cloud data. Int. J. Rock. Mech. Min. Sci. 2023, 172, 105603.
12. Chen, N.; Wu, X.; Xiao, H.; Yao, C.; Cheng, Y. Semi-automatic recognition of rock mass discontinuity based on 3D point clouds. Discov. Appl. Sci. 2024, 6, 230.
13. Kong, D.; Wu, F.; Saroglou, C. Automatic identification and characterization of discontinuities in rock masses from 3D point clouds. Eng. Geol. 2020, 265, 105442.
14. Ji, Y.; Song, S.; Chen, J.; Xue, J.; Yan, J.; Zhang, Y.; Sun, D.; Wang, Q. Automatic identification of discontinuities and refined modeling of rock blocks from 3D point cloud data of rock surfaces. J. Rock. Mech. Geotech. Eng. 2025, 17, 3093–3106.
15. Chen, Q.; Ge, Y.; Tang, H. Rock discontinuities characterization from large-scale point clouds using a point-based deep learning method. Eng. Geol. 2024, 337, 107585.
16. Chen, Q.; Ge, Y.; Tang, H. An unsupervised method for rock discontinuities rapid characterization from 3D point clouds under noise. Gondwana Res. 2024, 132, 287–308.
17. Sun, J.; Zhu, S.; Sun, J.; Zhou, J.; Yao, Y.; Wang, Y.; Zhang, J.; Zhou, B.; Wang, X. A robust deep learning approach for rock discontinuity identification from large scale 3D point clouds. Sci. Rep. 2025, 16, 1654.
18. Lu, G.; Cao, B.; Zhu, X.; Lin, Z.; Bai, D.; Tao, C.; Li, Y. Identification of rock mass discontinuity from 3D point clouds using improved fuzzy C-means and convolutional neural network. Bull. Eng. Geol. Environ. 2024, 83, 159.
19. Günen, M.A.; Aliyazıcıoğlu, Ş. Discontinuities identification from rock outcrop using auto-encoder and point clouds. Bull. Eng. Geol. Environ. 2025, 84, 418.
20. Ge, Y.; Wang, H.; Liu, G.; Chen, Q.; Tang, H. Automated Identification of Rock Discontinuities from 3D Point Clouds Using a Convolutional Neural Network. Rock. Mech. Rock. Eng. 2025, 58, 3683–3700.
21. Ji, Y.; Song, S.; Zhang, W.; Li, Y.; Xue, J.; Chen, J. Automatic identification of rock fractures based on deep learning. Eng. Geol. 2025, 345, 107874.
22. Li, M.; Chen, M.; Lu, W.; Yan, P.; Tan, Z. Automatic extraction and quantitative analysis of characteristics from complex fractures on rock surfaces via deep learning. Int. J. Rock. Mech. Min. Sci. 2025, 187, 106038.
23. Ma, R.; Yu, H.; Liu, X.; Yuan, X.; Geng, T.; Li, P. InSAR-YOLOv8 for wide-area landslide detection in InSAR measurements. Sci. Rep. 2025, 15, 024–84626.
24. Ruan, S.; Hu, Y.; Liu, J.; Wang, J. An advanced crack detection method for slope management in open-pit mines: Applying enhanced YOLOv8 network. Int. J. Min. Reclam. Environ. 2026, 40, 70–87.
25. Peng, P.; Gao, L.; Li, J.; Zhang, H. Optimized YOLOv8 framework for intelligent rockfall detection on mountain roads. Sci. Rep. 2025, 15, 14007.
26. Yu, A.; Fan, H.; Xiong, Y.; Wei, L.; She, J. LHB-YOLOv8: An optimized YOLOv8 network for complex background drop stone detection. Appl. Sci. 2025, 15, 737.
27. Zhao, Y.; Sun, F.; Wu, X. FEB-YOLOv8: A multi-scale lightweight detection model for underwater object detection. PLoS ONE 2024, 19, e0311173.
28. Gevorgyan, Z. SIoU loss: More powerful learning for bounding box regression. arXiv 2022, arXiv:2205.12740.
29. Mehrishal, S.; Kim, J.; Shao, Y.; Song, J.J. Artificial intelligence-aided semi-automatic joint trace detection from textured three-dimensional models of rock mass. J. Rock. Mech. Geotech. Eng. 2025, 17, 1973–1985.
30. Han, S.; Tong, D.; Wu, B.; Wang, J.; Wang, X.; Zhang, W. An efficient semi-automated characterization of rock mass discontinuities from 3D point clouds based on Nutcracker Optimization Algorithm-improved probabilistic neural network. Bull. Eng. Geol. Environ. 2025, 84, 210.
31. Pham, C.; Kim, B.-C.; Shin, H.-S. Deep learning-based identification of rock discontinuities on 3D model of tunnel face. Tunn. Undergr. Space Technol. 2025, 158, 106403.
32. Xuan, C.; Zhang, Y.; Xu, W.; Li, X.; Zhang, N. Beishan exploration tunnel surrounding rock discontinuity identification based on structure from motion photogrammetry technology. Eng. Rep. 2024, 6, e12882.
33. Han, J.; Wang, J.; Dong, W.; Wang, S.; Sun, Q.; Li, T.; Xu, Z.; Zhang, Y.; Zhang, W. A new algorithm for high-speed identification of discontinuities on large-scale rock outcrop: A case study in Jinsha River suture zone. J. Rock. Mech. Geotech. Eng. 2025, 18, 1250–1265.
34. Peng, X.; Lin, P.; Xia, Q.; Yu, L.; Wang, M. A new method for recognizing discontinuities from 3D point clouds in tunnel construction environments. Tunn. Undergr. Space Technol. 2024, 152, 105955.
35. Zhao, M.; Song, S.; Wang, F.; Zhu, C.; Liu, D.; Wang, S. A method to interpret fracture aperture of rock slope using adaptive shape and unmanned aerial vehicle multi-angle nap-of-the-object photogrammetry. J. Rock. Mech. Geotech. Eng. 2024, 16, 924–941.
36. Lei, J.; Fan, Y. Rock CT Image Fracture Segmentation Based on Convolutional Neural Networks. Rock. Mech. Rock. Eng. 2024, 57, 5883–5898.

Figure 1. Overall architecture diagram of the intelligent recognition system for slope discontinuities. (a) High-level four-stage pipeline; (b) Detailed module structure with data flow.

Figure 2. Schematic diagram of the improved YOLOv8 network structure.

Figure 3. Structural diagram of the ECA channel attention module.

Figure 4. Comparison of the BiFPN neck network structure and the original PANet. (a) Original PANet; (b) Improved BiFPN. The blue and red arrows denote the top-down FPN and bottom-up PAN pathways, respectively, and the green dashed lines denote the same-level cross-node connections introduced by the BiFPN.

Figure 5. Spatial distribution of point-to-plane registration residuals between UAV-derived and TLS reference point clouds. (a) Scene A (fragmented limestone slope, ICP RMSE = 4.8 mm); (b) Scene B (layered sandstone–mudstone slope, ICP RMSE = 3.6 mm). Color encodes residual magnitude in millimeters.

Figure 6. Flowchart of detection-guided point cloud segmentation.

Figure 7. End-to-end processing data flow diagram from raw data to engineering information.

Figure 8. Field photographs of the two study sites: (a) overall view of Scene A (fragmented limestone slope); (b) close-up of exposed discontinuity traces in Scene A; (c) overall view of Scene B (layered sandstone–mudstone slope); (d) close-up of structural surface distribution in Scene B.

Figure 9. Precision–Recall curve comparison of detection models on the discontinuity test set.

Figure 10. Comparison of detection performance in complex scenes. In each row the left column shows the original YOLOv8 and the right column the proposed method, for: (a) a shadow-occluded zone; (b) a vegetation-edge interference zone (sandstone slope); (c) a weathered debris-covered zone (limestone slope).

Figure 11. Waterfall chart of progressive mAP@0.5 improvement and parameter count variation across ablation configurations. The bars (left axis) give mAP@0.5, the green arrows mark the incremental accuracy gain contributed by each successive module, and the red dashed line (right axis) gives the parameter count.

Figure 12. Grouped bar chart comparing point cloud segmentation accuracy metrics across methods. (a) Scene A; (b) Scene B. In each panel the proposed method is shown in bold for emphasis.

Figure 13. Error distribution histogram of orientation parameter extraction. (a) dip direction error; (b) dip angle error. The red curve is the fitted normal distribution, and the dashed lines mark the ±5° (dip direction) and ±3° (dip angle) tolerance bands.

Figure 14. Variation curves of the segmentation accuracy and orientation error under different point cloud densities. The blue solid line with circles is the mean IoU (left axis) and the red dashed line with diamonds is the dip direction RMSE (right axis); the yellow shaded band marks the performance-degradation zone.

Figure 15. Equal-area lower-hemisphere stereonet projection of automatically extracted discontinuity orientations for Scene A, with kinematic analysis of the identified wedge failure block. (a) pole points and great circles of the three discontinuity sets (J1, J2 and J3) together with the slope face; (b) the J1–J2 intersection line I₁₂ and the kinematically admissible daylight envelope.

Figure 16. Three-dimensional recognition and parameter annotation visualization results of the slope discontinuities in Scene A. The blue, green and red point clusters denote discontinuity sets J1, J2 and J3, respectively, while the grey points represent the unclassified rock surface.

Figure 17. Conceptual diagram of the cross-modal error propagation mechanism from 2D detection offset to 3D orientation parameter deviation. The numbered blue arrows mark the five sequential propagation stages, the dashed boxes delineate the back-projected region of interest, and the colors distinguish the included adjacent intact-rock points, the true discontinuity surface and the estimated surface normal.
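
The stereonet construction referred to in Figure 15 can be illustrated with a short sketch. It assumes a right-handed frame with x = East, y = North, z = Up, and uses the standard equal-area (Lambert) lower-hemisphere formula. The function names and the two example orientations are hypothetical; the actual J1 and J2 orientations are not listed in this section, and the kinematic daylight test itself is not reproduced here.

```python
import numpy as np

def plane_normal(dip_direction_deg, dip_deg):
    """Upward unit normal of a plane (x = East, y = North, z = Up)."""
    a = np.radians(dip_direction_deg)
    d = np.radians(dip_deg)
    return np.array([np.sin(d) * np.sin(a), np.sin(d) * np.cos(a), np.cos(d)])

def trend_plunge(vector):
    """Trend and plunge (degrees) of a line, taken on the lower hemisphere."""
    v = np.asarray(vector, dtype=float)
    v = v / np.linalg.norm(v)
    if v[2] > 0:  # flip so the line points downward
        v = -v
    trend = np.degrees(np.arctan2(v[0], v[1])) % 360.0
    plunge = np.degrees(np.arcsin(np.clip(-v[2], -1.0, 1.0)))
    return float(trend), float(plunge)

def pole_trend_plunge(dip_direction_deg, dip_deg):
    """Pole to a plane: opposite azimuth, plunge complementary to the dip."""
    return (dip_direction_deg + 180.0) % 360.0, 90.0 - dip_deg

def equal_area_xy(trend_deg, plunge_deg, radius=1.0):
    """Equal-area lower-hemisphere coordinates (x to the East, y to the North)."""
    r = radius * np.sqrt(2.0) * np.sin(np.radians(90.0 - plunge_deg) / 2.0)
    t = np.radians(trend_deg)
    return float(r * np.sin(t)), float(r * np.cos(t))

def intersection_line(plane_a, plane_b):
    """Trend and plunge of the line shared by two planes (cross product of normals)."""
    return trend_plunge(np.cross(plane_normal(*plane_a), plane_normal(*plane_b)))

# Hypothetical orientations (dip direction, dip angle), for illustration only.
set_1 = (90.0, 45.0)
set_2 = (180.0, 45.0)
i12_trend, i12_plunge = intersection_line(set_1, set_2)
i12_xy = equal_area_xy(i12_trend, i12_plunge)
pole_xy = equal_area_xy(*pole_trend_plunge(*set_1))
```

With these two made-up planes the intersection line trends toward the southeast, between the two dip directions, which is the geometry a wedge analysis would then compare against the slope face.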

Table 1. Quantitative registration accuracy between the UAV-derived point cloud and the TLS reference point cloud for the two study sites.

Registration Metric	Scene A (Limestone)	Scene B (Sandstone–Mudstone)
SIFT inlier correspondences (mean ± SD)	1420 ± 230	1680 ± 210
RANSAC inlier ratio (%)	71.4	78.9
ICP point-to-plane RMSE (mm)	4.8	3.6
Median residual (mm)	3.9	2.7
90th-percentile residual (mm)	7.6	5.4
Maximum residual (mm)	14.7	11.2
Proportion of residuals < 8 mm (%)	92.1	95.4
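
The residual statistics reported in Table 1 could be derived from a per-point residual array roughly as follows. This is a minimal sketch, assuming the residuals are point-to-plane distances in millimetres; the function name `summarize_residuals` and the five-value sample are illustrative and not taken from the study.

```python
import numpy as np

def summarize_residuals(residuals_mm, threshold_mm=8.0):
    """Summary statistics of absolute registration residuals (mm)."""
    r = np.abs(np.asarray(residuals_mm, dtype=float))
    return {
        "rmse_mm": float(np.sqrt(np.mean(r ** 2))),
        "median_mm": float(np.median(r)),
        "p90_mm": float(np.percentile(r, 90)),
        "max_mm": float(r.max()),
        # share of points whose residual is strictly below the threshold
        "pct_below_threshold": float(100.0 * np.mean(r < threshold_mm)),
    }

# Tiny synthetic sample, for illustration only (not data from the paper).
demo = summarize_residuals([1.0, 2.0, 3.0, 4.0, 10.0])
```

The 8 mm default mirrors the threshold used in the last row of the table; any other tolerance can be passed through `threshold_mm`.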

Table 2. Summary of automated calculation methods for key geometric parameters of discontinuities.

Parameter Name	Calculation Method	Input Data
Dip Direction/Dip Angle	Inverse trigonometric function of normal vector (Equation (4))	RANSAC fitted plane normal vector
Spacing	Mean normal distance between adjacent planes of the same group	Plane equations of parallel discontinuity groups
Trace Length	Maximum extension length after projecting point cloud onto fitting plane	Individual discontinuity point cloud coordinates
Aperture	Normal distance between opposing point clouds on both sides of the discontinuity	Point cloud at discontinuity boundaries
Roughness	Root mean square of deviations from points to fitting plane	Individual discontinuity point cloud and fitting plane
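
Three of the rules in Table 2 (orientation, spacing and roughness) can be sketched in a few lines. The sketch assumes coordinates with x = East, y = North, z = Up and a plane described by a normal vector plus one point on the plane. The axis convention may differ from the one behind Equation (4), which is not shown in this section, and all function names are hypothetical.

```python
import numpy as np

def orientation_from_normal(normal):
    """Dip direction and dip angle (degrees) from a fitted plane normal."""
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)
    if n[2] < 0:  # use the upward-pointing normal
        n = -n
    dip = np.degrees(np.arccos(np.clip(n[2], -1.0, 1.0)))
    # The horizontal part of the upward normal points down-dip;
    # the azimuth is measured clockwise from North.
    dip_direction = np.degrees(np.arctan2(n[0], n[1])) % 360.0
    return float(dip_direction), float(dip)

def mean_spacing(plane_offsets):
    """Mean normal distance between adjacent planes of one set.

    plane_offsets: signed distances of each plane along the set's common normal.
    """
    s = np.sort(np.asarray(plane_offsets, dtype=float))
    return float(np.mean(np.diff(s)))

def rms_roughness(points, normal, point_on_plane):
    """Root mean square of point-to-plane deviations."""
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)
    deviations = (np.asarray(points, dtype=float) - np.asarray(point_on_plane, dtype=float)) @ n
    return float(np.sqrt(np.mean(deviations ** 2)))

# Illustrative inputs only (not measurements from the study sites).
dip_direction, dip = orientation_from_normal([1.0, 0.0, 1.0])
spacing = mean_spacing([0.0, 0.5, 1.5])
roughness = rms_roughness(
    [[0, 0, 0.1], [1, 0, -0.1], [0, 1, 0.1], [1, 1, -0.1]],
    normal=[0, 0, 1],
    point_on_plane=[0, 0, 0],
)
```

For a horizontal plane the dip direction is undefined; this sketch returns 0° in that case, which a production pipeline would likely want to flag explicitly.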

Table 3. Cross-site and within-site generalization performance of the proposed detection network. “A → B” denotes training on Scene A and testing on Scene B, etc.

Train/Test Configuration	mAP@0.5 (%)	Precision (%)	Recall (%)	Δ vs. Mixed Baseline (pp)
Mixed-scene (baseline, A + B → A + B)	89.4	90.1	86.7	—
Within-site (A → A)	88.9	89.7	86.0	−0.5
Within-site (B → B)	89.7	90.3	87.1	+0.3
Cross-site (A → B)	80.7	82.1	78.4	−8.7
Cross-site (B → A)	76.5	77.8	74.3	−12.9

Table 4. Performance comparison of different detection models on the slope discontinuity dataset.

Model	mAP@0.5 (%)	mAP@0.5:0.95 (%)	Precision (%)	Recall (%)	Parameters (M)	FPS
Faster R-CNN	78.3	52.1	81.6	74.8	41.12	14.3
YOLOv5s	82.7	57.4	84.2	79.5	7.24	86.2
RT-DETR-L	84.1	60.8	85.9	80.3	32.01	42.7
YOLOv8s (original)	85.6	62.3	86.4	82.1	11.17	78.5
Proposed Method	89.4	67.2	90.1	86.7	12.03	71.8

Table 5. Ablation study results of the improvement modules. In the table, a check mark (✓) indicates that the corresponding module is enabled and a dash (—) indicates that it is not enabled.

ID	Baseline	ECA	BiFPN	SIoU	mAP@0.5 (%)	mAP@0.5:0.95 (%)	Parameters (M)
①	✓	—	—	—	85.6	62.3	11.17
②	✓	✓	—	—	87.2	64.5	11.25
③	✓	—	✓	—	87.8	65.1	11.89
④	✓	—	—	✓	86.4	63.7	11.17
⑤	✓	✓	✓	—	88.6	66.3	11.97
⑥	✓	✓	✓	✓	89.4	67.2	12.03

Table 6. Comparative statistical results of point cloud segmentation accuracy for the two scenes.

Method	Scene	Identified Faces	Correctly Identified	Average IoU (%)	Precision (%)	Recall (%)
RANSAC-only [11]	A	58	31	64.2	53.4	63.3
RANSAC-only [11]	B	39	24	69.8	61.5	75.0
Region growing [12]	A	52	35	69.5	67.3	71.4
Region growing [12]	B	35	27	74.1	77.1	84.4
Unguided DBSCAN + RANSAC	A	50	38	74.8	76.0	77.6
Unguided DBSCAN + RANSAC	B	34	28	79.2	82.4	87.5
Proposed Method	A	47	43	82.6	91.5	87.8
Proposed Method	B	31	29	86.3	93.5	90.6
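
The precision and recall columns of Table 6 appear to follow directly from the face counts. The sketch below assumes precision = correct / identified and recall = correct / ground-truth faces. The ground-truth totals (49 for Scene A, 32 for Scene B) are not stated in this section; they are back-calculated from the reported recall values and should be read as an assumption.

```python
def precision_recall(identified, correct, ground_truth):
    """Face-level precision and recall in percent, rounded to one decimal."""
    precision = 100.0 * correct / identified
    recall = 100.0 * correct / ground_truth
    return round(precision, 1), round(recall, 1)

# Assumed ground-truth face counts, inferred from the reported recall values.
GROUND_TRUTH = {"A": 49, "B": 32}

# (identified, correct) counts copied from Table 6.
counts = {
    ("RANSAC-only", "A"): (58, 31),
    ("Region growing", "A"): (52, 35),
    ("Unguided DBSCAN + RANSAC", "A"): (50, 38),
    ("Proposed Method", "A"): (47, 43),
}

metrics = {
    key: precision_recall(identified, correct, GROUND_TRUTH[key[1]])
    for key, (identified, correct) in counts.items()
}
```

Under these assumptions the computed values for Scene A match the tabulated precision and recall to one decimal place.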

Table 7. Cross-modal fusion ablation experiment results (Scene A).

Mode	Identified Faces	Correct Faces	Average IoU (%)	Dip Angle RMSE (°)	Spacing Relative Error (%)	Processing Time (min)
Image-only	45	41	—	—	—	1.2
Point-cloud-only	50	38	74.8	3.41	10.5	127.6
Proposed cross-modal fusion	47	43	82.6	2.46	6.8	86

Table 8. Error statistics of automated orientation parameter extraction versus measured values.

Parameter	Sample Size	Mean Error	Maximum Error	Root Mean Square Error (RMSE)	Standard Deviation
Dip Direction (°)	180	3.27	9.84	4.15	2.61
Dip Angle (°)	180	1.83	6.21	2.46	1.64
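
The columns of Table 8 can be cross-checked against one another. If the mean error and the standard deviation are both computed over the same set of absolute errors (an assumption; the section does not say so), then RMSE² = mean² + SD², with SD taken as the population standard deviation. A minimal sketch of that check, with a hypothetical helper name:

```python
import math

def rmse_from_mean_and_sd(mean_error, sd):
    """RMSE implied by a mean and a population SD of the same error sample."""
    return math.hypot(mean_error, sd)

# Mean error and SD values copied from Table 8.
dip_direction_rmse = rmse_from_mean_and_sd(3.27, 2.61)
dip_angle_rmse = rmse_from_mean_and_sd(1.83, 1.64)
```

Under this assumption the dip angle value reproduces the tabulated 2.46° after rounding, and the dip direction value agrees with the tabulated 4.15° to within about 0.03°, a gap that rounding of the inputs or a different SD definition could plausibly explain.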

Table 9. Image-side robustness of the proposed detection network under brightness variation, in-plane rotation and reduced UAV image overlap rate.

Perturbation Type	Perturbation Level	mAP@0.5 (%)	Δ vs. Baseline (pp)	Precision (%)	Recall (%)
Baseline (no perturbation)	—	89.4	—	90.1	86.7
Brightness variation	±15%	87.6	−1.8	88.4	85.0
Brightness variation	±30%	85.2	−4.2	86.1	82.7
In-plane rotation	±5°	88.1	−1.3	88.9	85.6
In-plane rotation	±10°	85.7	−3.7	86.5	83.2
In-plane rotation	±15°	82.9	−6.5	83.8	80.5
Image overlap (UAV)	70% (slight reduction)	88.7	−0.7	89.5	85.9
Image overlap (UAV)	60% (moderate reduction)	85.8	−3.6	86.5	82.9
Image overlap (UAV)	50% (substantial reduction)	81.3	−8.1	82.0	78.5

Table 10. Processing time statistics for each stage of the system’s complete workflow (Scene A).

Processing Stage	Processing Time	Proportion (%)
Image Preprocessing and 3D Reconstruction	42 min	48.8
Improved YOLOv8 Detection	1.2 min	1.4
Pixel-to-Point-Cloud Registration	5.6 min	6.5
Point Cloud Segmentation and Parameter Calculation	33.4 min	38.8
Information Integration and Visualization Output	3.8 min	4.5
Total Workflow	86 min	100

Table 11. Comprehensive comparison of the proposed method with existing representative methods.

Comparison Dimension	Kong, Wu [13]	Chen, Wu [12]	Sun, Zhu [17]	Proposed Method
Technical Pathway	Pure point cloud clustering	PointNet++ point cloud segmentation	RL-JointNet deep learning	Image detection + point cloud segmentation collaboration
Input Data	Single point cloud	Single point cloud	Single point cloud	Multi-source fusion of imagery + point cloud
Dip Direction RMSE (°)	3.6	—	2.8	4.15
Dip Angle RMSE (°)	2	—	1.5	2.46
Global Accuracy (GA/mIoU)	—	—	98.7%/98.1%	82.6–86.3% (IoU)
Training Annotation Required	No	Yes	Yes	Yes (image side only)
End-to-End Engineering Parameter Output	Partial	No	No	Yes
Processing Efficiency	Medium	Low	Medium	High

Table 12. Quantitative sensitivity of orientation parameter RMSE to detection bounding box offset and ICP registration residual under the acquisition geometry of Scene A (shooting distance 8–15 m, focal length equivalent 35 mm).

Detection Offset (pixel)	ICP Residual (mm)	Dip Direction RMSE (°)	Dip Angle RMSE (°)
0 (ideal)	0 (ideal)	3.50	2.29
3	2	3.72	2.34
5 (current operating point)	5 (current operating point)	4.15	2.46
8	6	4.89	2.67
10	8	5.55	2.87

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Liu, H.; Xiao, K.; Lin, H. Intelligent Recognition of Slope Discontinuities via Cross-Modal Fusion of Object Detection and Point Cloud Segmentation. Appl. Sci. 2026, 16, 5460. https://doi.org/10.3390/app16115460

