A Multi-Perspective Recursive Slice Framework with Cross-Slice Attention for Plant Point Cloud Instance Segmentation

Liu, Shan; Fang, Shilin; Zhang, Luhao; Wang, Pengcheng; Cheng, Xiaorong; Xu, Lei; Sun, Jian; Jiang, Tengping

doi:10.3390/agriculture16090956


by Shan Liu ^1,2,3; Shilin Fang ^1,2; Luhao Zhang ^1,2; Pengcheng Wang ^4; Xiaorong Cheng ^4; Lei Xu ^3; Jian Sun ^1,* and Tengping Jiang ^2,3,5,*

^1 Jiangsu Center for Collaborative Innovation in Geographical Information Resource Development and Application, Nanjing 210093, China

^2 State Key Laboratory of Climate System Prediction and Risk Management, Nanjing Normal University, Nanjing 210093, China

^3 National Engineering Research Center of Geographic Information System, China University of Geosciences, Wuhan 430074, China

^4 Jiangsu Yushu Information Technology Co., Ltd., Nanjing 210012, China

^5 State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101, China

^* Authors to whom correspondence should be addressed.

Agriculture 2026, 16(9), 956; https://doi.org/10.3390/agriculture16090956

Submission received: 1 March 2026 / Revised: 27 March 2026 / Accepted: 22 April 2026 / Published: 27 April 2026

(This article belongs to the Section Artificial Intelligence and Digital Agriculture)


Abstract

Instance segmentation of plant point clouds is challenging due to intricate structures, non-uniform density, and large intra-class variation. Conventional methods often suffer from blurred boundaries, instance adhesion, and insufficient coupling of semantic and instance features. To address these issues, this paper proposes MPRSF-CSA, a novel network integrating recursive slice-based feature extraction with an attention-embedding mechanism. The method first transforms disordered point clouds into ordered sequences via a multi-directional recursive slicing strategy and models inter-slice dependencies using BiLSTM. Parallel decoding branches for semantic and instance segmentation are constructed, and a core attention-embedding module facilitates bidirectional fusion of semantic and instance features. Instance segmentation is achieved via clustering and semantic-aware optimization. Experiments on two public datasets demonstrate that MPRSF-CSA outperforms existing approaches in segmentation accuracy, boundary preservation, and adaptability to complex plant scenes.

Keywords:

plant phenotyping; 3D point cloud processing; deep learning segmentation; instance segmentation; attention mechanism

1. Introduction

Plant phenotyping is a cornerstone of modern agricultural research and precision farming, with the accurate parsing of three-dimensional (3D) plant architecture being a primary objective [1]. Advances in 3D sensing technologies, such as LiDAR and photogrammetry, have made point clouds a predominant representation for capturing the intricate geometrical and structural details of plants [2]. Within this context, instance segmentation of plant point clouds—which entails identifying and delineating each individual plant organ (e.g., leaves, stems)—is a critical step for enabling automated, high-throughput measurement of traits like leaf count, size, and angle [3]. However, the practical deployment of such segmentation systems in agricultural settings is confronted by substantial challenges. Agricultural computer vision systems must operate under highly variable environmental conditions, including occlusion from overlapping foliage, illumination changes across different times of day and weather conditions, and the inherent structural complexity of plant canopies. As highlighted in recent studies on vineyard monitoring, deep learning models for plant analysis must generalize across diverse spectral domains, imaging conditions, and heterogeneous canopy structures to achieve robust performance in real-world scenarios [4]. These environmental variabilities significantly complicate the task of point cloud instance segmentation, demanding methods that are resilient to such perturbations.

1.1. Analysis of Existing Work

Despite its significance, instance segmentation of plant point clouds remains a formidable challenge due to several inherent complexities [5]. Plant structures are often highly intricate, with components exhibiting significant intra-class variation in shape and size. Severe occlusions and self-similarity between adjacent organs are common, leading to blurred boundaries and instance adhesion in the point cloud data. Furthermore, the non-uniform density of points, resulting from scanning limitations and complex surface geometries, complicates consistent feature extraction [6]. While deep learning-based methods have revolutionized point cloud processing, existing instance segmentation approaches—often built upon proposal refinement [7] or metric learning paradigms [8]—frequently struggle in such demanding plant environments. A critical limitation, which directly motivates our work, is the insufficient coupling of semantic and instance features. Many methods treat these tasks in a relatively isolated manner, failing to leverage their mutual information effectively. This shortcoming is particularly detrimental in plant scenes, where disambiguating touching instances of the same semantic class is essential for accurate segmentation [9].

Recent years have witnessed considerable progress in point cloud deep learning architectures. The processing of 3D point clouds with deep learning has primarily evolved along three paradigms: projection-based, voxel-based, and point-based methods. Each paradigm offers distinct advantages but also exhibits specific limitations that inform the design of our approach. Projection-based approaches project unordered 3D points onto 2D planes to leverage the power of well-established 2D Convolutional Neural Networks [10,11]. While efficient, they often suffer from information loss during projection and may struggle with severe occlusions prevalent in dense plant canopies. This information loss compromises the preservation of fine geometric details, a critical requirement for distinguishing closely spaced plant organs. Voxel-based methods partition the 3D space into regular grids and then apply 3D CNNs for processing [12,13]. Although effective in structuring data, they are computationally intensive and memory-bound, with resolution limited by voxel size, making it difficult to capture fine-grained details of thin plant structures. Point-based networks like PointNet [14] and its successor [15] directly process raw point sets, using shared Multi-Layer Perceptrons and symmetric aggregation functions to achieve permutation invariance. Subsequent work advanced this paradigm by explicitly modeling local geometric structures [16,17]. While these methods preserve precise geometric information, they typically focus on local neighborhoods and are challenged by capturing long-range dependencies across a plant’s entire structure [18]. This limitation is significant because plant architecture often involves spatially distributed yet structurally correlated components (e.g., leaves along a stem), necessitating global contextual understanding.

Instance segmentation methods for general point clouds can be broadly categorized into proposal-based, clustering-based, and learning-based approaches. A careful examination of their limitations reveals the specific gaps our method aims to address. Proposal-based approaches often generate 3D bounding box proposals first and then segment instances within them [7,19], but their reliance on axis-aligned boxes makes them ill-suited for the irregular shapes of plant organs. A popular line of research learns discriminative point embeddings such that points belonging to the same instance are close in the embedding space, with final instances obtained through clustering algorithms like DBSCAN [20]. A common challenge for these methods is “instance adhesion”: adjacent instances of the same semantic class are difficult to separate because effective feature constraints are lacking at instance boundaries.

Recent works have recognized the importance of leveraging interactions between semantic and instance representations [21]. For example, SoftGroup [22] introduces a semantic-guided grouping strategy that first performs semantic segmentation and then groups points into instances using geometric cues. HAIS [23] proposes a hierarchical aggregation mechanism that combines semantic and instance features through a joint learning framework. Similarly, several instance-aware semantic segmentation methods [24,25] have explored bidirectional feature fusion through concatenation or summation operations. However, these approaches typically employ relatively simple interaction mechanisms—such as concatenation, summation, or unidirectional guidance—which may not fully exploit the mutual dependencies between semantic and instance information. In particular, they often lack explicit attention mechanisms that can dynamically weight feature interactions based on local context, and they are not specifically designed to handle the unique geometric complexities of plant structures, such as severe self-occlusion and high morphological similarity between adjacent organs.

A critical insight, often overlooked in general scene segmentation, is the tight coupling between semantic and instance information. For plants, knowing that a cluster of points is a “leaf” strongly informs the grouping process [26]. Our proposed architecture is explicitly designed to exploit this coupling through a bidirectional attention-embedding mechanism that fuses semantic and instance features, thereby enhancing discriminative capability. Unlike prior methods that rely on static concatenation or unidirectional feature fusion, our approach introduces two parallel attention sub-modules—Semantic-guided Instance Attention (SGIA) and Instance-guided Semantic Attention (IGSA)—that enable symmetric, bidirectional interaction. This design ensures that semantic information guides instance discrimination while instance distinctions simultaneously refine semantic understanding, creating a mutually reinforcing cycle that is particularly beneficial for resolving ambiguous boundaries in dense plant scenes.

Recurrent Neural Networks (RNNs) and their well-known variants, such as Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs), have achieved remarkable success in sequence modeling tasks due to their ability to capture long-range dependencies and contextual information [27,28]. RSNet [29] pioneered the adaptation of RNNs for point cloud semantic segmentation through a slicing strategy. However, points within each slice remain unordered, and intra-slice pooling leads to loss of fine-grained local geometric details. RSSNet [30] addressed this by introducing Graph Convolutional Networks and attention mechanisms within slices. Building upon these advances, our method adopts a multi-directional recursive slicing strategy that preserves local geometric fidelity while enabling the modeling of inter-slice dependencies via BiLSTM. Bidirectional structures and gating mechanisms prove particularly adept at capturing long-range, ordered spatial context, which is especially beneficial for structures with dominant orientations, such as plants. The synergy between recurrent networks and attention mechanisms enables powerful global context understanding, which is crucial for interpreting complex 3D scenes [31]. Motivated by this synergy, we integrate a core attention-embedding module that facilitates bidirectional fusion of semantic and instance features, a capability absent in prior recurrent-based point cloud networks. In contrast to existing joint-learning frameworks [21,22,23,24,25] that typically treat semantic and instance features as separate streams merged at specific layers, our bidirectional fusion module operates iteratively throughout the decoding process, enabling continuous refinement of both feature types. This iterative bidirectional interaction is a key differentiator that allows our method to more effectively resolve instance adhesion in challenging plant scenes.

The unique challenges of plant point clouds have spurred the development of specialized algorithms, and the limitations of these plant-specific methods further justify the need for our approach. Early methods relied heavily on hand-crafted geometric features and traditional region-growing techniques [32,33], which were sensitive to parameter tuning and point density. With the rise of deep learning, several studies have adapted general point cloud networks for plant phenotyping, with PointNet++ [34] widely used as a backbone for leaf segmentation and organ counting [35]. However, these direct applications often fail to address the fundamental issues of boundary ambiguity and instance adhesion, mainly because they are not designed to handle the unique geometric complexities of plant structures, such as severe self-occlusion and high morphological similarity between adjacent organs. Some plant-specific methods have incorporated physiological rules [36] or skeleton-based processing [37], but their reliance on domain-specific priors can limit generalizability across different plant species and growth stages. Furthermore, while recent plant phenotyping approaches have begun to explore deep learning-based instance segmentation, they typically adopt general-purpose architectures without specialized mechanisms for semantic-instance feature interaction. Our method addresses this gap by introducing a plant-aware bidirectional attention mechanism that operates in concert with a multi-directional slicing strategy, enabling both global structural understanding and precise instance discrimination. Unlike prior-driven methods, it does not depend on species-specific assumptions, instead learning generalizable features through a data-driven architecture that jointly optimizes semantic and instance segmentation.

1.2. Contributions of Our Work

To address the aforementioned challenges, this paper proposes a novel multi-perspective recursive slice framework with cross-slice attention, named MPRSF-CSA, for instance-aware plant point cloud segmentation. First, a sequential, slice-based representation is introduced for plant point clouds through a novel multi-directional recursive slicing strategy that systematically converts unstructured plant point clouds into ordered sequences. Then, a bidirectional LSTM is employed to recursively aggregate slice-level features and capture global inter-slice dependencies. Next, semantic and instance features are explicitly and bidirectionally fused through a dedicated attention-embedding module to resolve the critical challenge of instance adhesion. Finally, the proposed framework is validated through comprehensive experiments on two public plant point cloud datasets.

The main contributions of this work are summarized as follows:

(1): We propose a novel multi-directional recursive slicing strategy that systematically converts unstructured plant point clouds into ordered sequences, providing a robust foundation for sequential feature learning.
(2): We develop a bidirectional LSTM-based architecture to recursively aggregate slice-level features, enabling the extraction of global representations that encapsulate critical local contextual cues.
(3): We introduce an attention-embedding module that explicitly and bidirectionally fuses semantic and instance features, strengthening their discriminative capacity and improving the model’s ability to resolve instance boundaries.
(4): We validate the proposed framework through comprehensive experiments on two public plant point cloud datasets. The results demonstrate that our method outperforms state-of-the-art approaches in terms of segmentation accuracy, boundary preservation, and adaptability to complex vegetation scenarios.

The remainder of this paper is organized as follows. Section 2 provides a detailed description of the proposed methodology. Section 3 presents the experimental setup, results, and a thorough discussion. Finally, Section 4 concludes the paper and suggests directions for future research.

2. Materials and Methods

As shown in Figure 1, the core of our approach lies in a fundamental transformation of the problem representation. We introduce a multi-directional recursive slicing strategy that converts the disordered plant point cloud into an ensemble of ordered sequences. This transformation allows us to model inter-slice dependencies using a Bidirectional LSTM, effectively extracting global features that are richly infused with local contextual information. Subsequently, the network constructs parallel decoding branches for semantic and instance segmentation. The pivotal attention-embedding module facilitates bidirectional fusion of semantic and instance features, ensuring mutual informativeness and enhancing the discriminative power of the final instance embeddings. The precise instance segmentation is ultimately achieved through a combination of instance-embedding clustering and a semantic-aware optimization process.

2.1. Multi-Directional Recursive Slicing and Feature Encoding

The primary challenge in plant point cloud processing arises from the inherent disorder of the data (the ordering of points conveys no semantic information) and its irregularity (a non-structured spatial distribution) [7,14]. Consequently, conventional 3D convolution methods cannot directly and efficiently model long-range dependencies between points, which limits their capacity to interpret the complex topological structures of plants. To address this issue, this paper proposes a novel multi-directional recursive slicing strategy. With this method (see Figure 2), the point cloud space is recursively partitioned from multiple viewpoints, transforming the disordered and unstructured data into an ordered and regular sequence representation. This transformation preserves the original geometric structure of the plant and, more critically, establishes a sequential order [38]. The ordered representation provides a robust foundation for subsequent feature extraction using sequence models, enabling the effective capture of long-range contextual dependencies among plant organs.

2.1.1. Multi-Directional Recursive Slicing Strategy

Given an input plant point cloud

P = {\{p_{i} \in ℝ^{3}\}}_{i = 1}^{N}

(where N denotes the total number of points in the point cloud), conventional methods typically conduct feature learning directly in three-dimensional Euclidean space, which hinders the explicit capture of the complex spatial arrangement patterns of plant organs. To overcome this limitation, a Multi-Directional Recursive Slicing Module (MDRSM) is designed. By recursively partitioning the point cloud space along multiple orthogonal directions, this module transforms the unstructured point cloud into an ordered sequence structure, effectively revealing the hierarchical arrangement patterns of plant organs across multiple dimensions.

Specifically, three principal slicing directions are defined as

D = \{d_{x}, d_{y}, d_{z}\}

, corresponding to the x, y, and z axes of the coordinate space, respectively. These three orthogonal directions are selected to capture the spatial distribution of plant organs from complementary viewpoints: the z-axis captures the vertical growth pattern from base to top, while the x and y axes capture the lateral spread of branches and leaves. For each direction

d \in D

, a total of K (where K is a hyperparameter denoting the number of recursive levels) recursive slicing operations are performed sequentially [29]. The value of K determines the granularity of the slicing: larger K yields finer slices (

2^{K}

slices in total) that capture more detailed local geometry, while smaller K produces coarser slices that emphasize global structure. In this work, K is set to 4 based on empirical validation balancing detail preservation and computational efficiency. Taking the vertical direction

d_{z}

as an illustrative example, during the k-th slicing operation, the projection interval

[z_{\min}, z_{\max}]

of the point cloud along the z-axis is evenly partitioned into

2^{k}

consecutive sub-intervals. Consequently, a set of slices is obtained, denoted as

S_{k}^{d_{z}} = \{s_{k, 1}^{d_{z}}, s_{k, 2}^{d_{z}}, \dots, s_{k, 2^{k}}^{d_{z}}\}

. Here, each individual slice

s_{k, j}^{d_{z}}

is defined as the collection of points whose projections fall within the j-th z-axis sub-interval. Through this recursive partitioning process, a series of ordered sequences aligned along each principal axis is generated at multiple granularities for every main direction. By this means, a foundational data structure is established, facilitating the subsequent capture of the hierarchical organization of the point cloud across various dimensions.
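The slicing procedure for one direction can be sketched as follows. This is a minimal NumPy illustration rather than the authors' implementation: the function name `recursive_slices`, the 0-based slice indices, and the random stand-in cloud are assumptions, while K = 4 follows the text.

```python
import numpy as np

def recursive_slices(points, axis, num_levels):
    """Return, for each level k = 1..num_levels, the slice index of every point.

    At level k the extent [min, max] of the cloud along `axis` is split
    into 2**k equal sub-intervals; slice indices are 0-based here.
    """
    coord = points[:, axis]
    lo, hi = coord.min(), coord.max()
    extent = max(hi - lo, 1e-12)  # guard against a degenerate (flat) cloud
    levels = []
    for k in range(1, num_levels + 1):
        n_slices = 2 ** k
        idx = np.floor((coord - lo) / extent * n_slices).astype(int)
        # The point at the maximum would land in bin n_slices; fold it back.
        idx = np.clip(idx, 0, n_slices - 1)
        levels.append(idx)
    return levels

rng = np.random.default_rng(0)
cloud = rng.uniform(size=(1000, 3))   # hypothetical stand-in for a plant scan
K = 4                                 # number of recursive levels used in the text
# One list of per-level slice assignments for each of the x, y, and z directions.
slices = {axis: recursive_slices(cloud, axis, K) for axis in range(3)}
```

Because every level halves the intervals of the previous one, each level-k slice is the union of two level-(k+1) slices, which is what makes the partition hierarchical. With K = 4, a direction yields 2, 4, 8, and 16 slices at the four levels.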

Whereas existing slicing methods typically operate at a single scale or along a single direction, the proposed MDRSM offers several notable advantages. First, its multi-directional nature enables the network to capture the spatial distribution patterns of plant organs from diverse viewpoints. For instance, slicing along the vertical axis (z-axis) helps separate overlapping leaves, alleviating the occlusion problem, while slicing along the horizontal axes (x- and y-axes) differentiates branches and leaves that extend laterally. Second, the recursive partitioning strategy constructs a hierarchical representation spanning coarse to fine granularities. Coarse-grained slices capture the overall topological structure and organ distribution of the plant, whereas fine-grained slices preserve local geometric details and surface texture information with high precision. In this way, global structural patterns and local features are effectively balanced. Finally, transforming the unordered point cloud into an ordered sequence of slices lays the foundation for modeling long-range dependencies between slices with sequence models such as LSTM. Consequently, the capacity to understand complex plant structures is significantly enhanced.

The above recursive slicing strategy, being based on fixed Cartesian coordinate axes (x, y, z), is inherently not rotation-invariant. In practical plant point cloud acquisition scenarios, the orientation of plants may vary significantly due to planting methods, scanning angles, or natural growth postures. If the network relies solely on slices from fixed directions, its generalization capability may be constrained. To address this issue, a rotation data augmentation strategy is employed during training. Specifically, in the data preprocessing stage, each point cloud sample is randomly rotated about the vertical axis (z-axis) by an angle between 0° and 360°. This augmentation encourages the network to learn orientation-invariant feature representations, thereby reducing its sensitivity to the absolute orientation of plants.
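The augmentation described above amounts to multiplying the coordinates by a rotation matrix about the z-axis. The sketch below is an illustration under assumptions: the function name `random_z_rotation` and the random test cloud are hypothetical, and the angle is drawn uniformly, which the text implies but does not state.

```python
import numpy as np

def random_z_rotation(points, rng):
    """Rotate an (N, 3) cloud by a random angle in [0, 2*pi) about the z-axis."""
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])
    # Points are stored as rows, so right-multiply by the transpose.
    return points @ rot.T

rng = np.random.default_rng(0)
cloud = rng.normal(size=(500, 3))     # hypothetical stand-in for a plant scan
augmented = random_z_rotation(cloud, rng)
```

Heights and horizontal distances from the z-axis are unchanged by this transform; only the azimuth of each point moves. If per-point normals are stored alongside the coordinates, the same rotation would need to be applied to them so that the two stay consistent.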

2.1.2. Intra-Slice Feature Encoding and Sequential Representation

Following the acquisition of the multi-directional and multi-scale slice sets, feature aggregation is performed on the points within each slice to extract local geometric information. For a given slice

s_{k, j}^{d_{z}}

, which contains a set of M points

{\{p_{m}\}}_{m = 1}^{M}

(where M varies per slice), geometric features are first extracted for each point. These features include the 3D coordinates (x, y, z), the normal vector

({\hat{n}}_{x}, {\hat{n}}_{y}, {\hat{n}}_{z})

, and the local curvature

κ

, by which an initial feature vector is formed. Subsequently, a lightweight PointNet architecture is employed for feature aggregation within the slice. By this architecture, independent feature transformations are applied to each point through shared MLPs, after which all point features are aggregated using a symmetric function (max pooling). As a result, a feature vector

f_{k, j}^{d} \in ℝ^{C}

(Equation (1)) is obtained for the slice, where

C

denotes the output feature dimension (set to 128 in our experiments). Through this process, the unordered point set within each slice is compressed into a compact vector representation, thereby laying the foundation for subsequent sequence modeling.

f_{k, j}^{d} = {MAX}_{m \in [1, M]} (MLP (p_{m})),

(1)

where MAX denotes the max pooling operation, and MLP refers to the shared Multi-Layer Perceptron with a hidden dimension of 256 followed by ReLU activation. By means of this process, each slice is abstracted into a feature vector, through which the original point cloud is transformed into an ordered sequence of features.
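Equation (1) can be illustrated with a small NumPy sketch. The two-layer structure, the random weights, and the 7-dimensional input (coordinates, normal, curvature) are assumptions made for the example; only the hidden width of 256, the ReLU activation, the output dimension C = 128, and the max pooling come from the text.

```python
import numpy as np

def slice_feature(point_feats, w1, b1, w2, b2):
    """Shared MLP on every point, then max pooling over the slice (Equation (1))."""
    hidden = np.maximum(point_feats @ w1 + b1, 0.0)   # ReLU, applied per point
    per_point = hidden @ w2 + b2                      # (M, C)
    return per_point.max(axis=0)                      # symmetric aggregation -> (C,)

rng = np.random.default_rng(0)
in_dim, hidden_dim, C = 7, 256, 128   # 7 = xyz + normal + curvature
w1 = rng.normal(scale=0.1, size=(in_dim, hidden_dim))
b1 = np.zeros(hidden_dim)
w2 = rng.normal(scale=0.1, size=(hidden_dim, C))
b2 = np.zeros(C)

feats = rng.normal(size=(50, in_dim))   # one slice with M = 50 points
f = slice_feature(feats, w1, b1, w2, b2)
# Reordering the points inside the slice should not change the pooled vector.
f_shuffled = slice_feature(feats[rng.permutation(50)], w1, b1, w2, b2)
```

Since the same weights are applied to every point and the maximum ignores ordering, the pooled vector is the same for any permutation of the slice's points. An empty slice has no maximum, so a practical implementation would need a fallback for that case, which the text does not specify.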

For each direction d and each scale k, a feature sequence

F_{k}^{d} = [f_{k, 1}^{d}, f_{k, 2}^{d}, \dots, f_{k, 2^{k}}^{d}] \in ℝ^{2^{k} \times C}

is obtained. This sequence is characterized by an explicit spatial order, where adjacent slices are geometrically proximate in 3D space, thereby reflecting the continuous distribution of plant organs along a specific direction. Taking the sequence along the z-axis as an example, its order naturally corresponds to the arrangement of plant organs from the base to the top. By means of this spatial correspondence, important prior information is provided for subsequent sequence modeling.

2.1.3. Contextual Modeling Across Slices with Bidirectional LSTM

Rich spatial contextual information is embedded within the slice sequence

F_{k}^{d}

. Adjacent slices exhibit significant local continuity, as they typically contain different parts of the same organ. Meanwhile, long-range geometric dependencies that span distinct organs, such as the structural correlation between the main stem and distal branches or leaves, may exist between non-adjacent slices. To model these complex inter-slice dependencies, a Bidirectional Long Short-Term Memory (BiLSTM) network is introduced for sequence encoding. Its forward and backward LSTM layers capture contextual information from both directions along the sequence, and their hidden states are concatenated so that each slice’s hidden representation aggregates features from all preceding and succeeding slices. Consequently, a comprehensive modeling of global spatial dependencies is achieved.

The BiLSTM architecture comprises two subnetworks: a forward LSTM and a backward LSTM, by which contextual dependencies are captured along opposite directions of the sequence. Specifically, the forward LSTM is employed to encode the slice features in sequential order, i.e.,

f_{k, 1}^{d} \to f_{k, 2}^{d} \to \dots \to f_{k, 2^{k}}^{d}

. By this means, historical information is aggregated into the hidden state at each time step, enabling the dependency of the current slice on its predecessors to be captured. Conversely, the backward LSTM encodes the sequence in reverse order, i.e.,

f_{k, 2^{k}}^{d} \to f_{k, 2^{k} - 1}^{d} \to \dots \to f_{k, 1}^{d}

, through which the dependency of the current slice on its successors is modeled. Once the encoding processes in both directions are completed, the forward hidden state

{\vec{h}}_{k, j}^{d}

and the backward hidden state

{\overset{\leftarrow}{h}}_{k, j}^{d}

corresponding to each slice are concatenated. As a result, a feature representation that integrates bidirectional contextual information is obtained, denoted as

h_{k, j}^{d} = [{\vec{h}}_{k, j}^{d}; {\overset{\leftarrow}{h}}_{k, j}^{d}] \in ℝ^{2 H}

(see Equation (2)), where H represents the dimensionality of the LSTM hidden states (set to 256 in our experiments). Through this bidirectional encoding mechanism, the final representation of each slice is enriched not only with its own local geometric information but also with contextual information aggregated from all preceding and succeeding slices across the entire sequence. Consequently, a comprehensive modeling of the global spatial dependencies among plant organs is achieved.

h_{k, j}^{d} = [\overrightarrow{LSTM} (f_{k, j}^{d}, {\vec{h}}_{k, j - 1}^{d}); \overleftarrow{LSTM} (f_{k, j}^{d}, {\overset{\leftarrow}{h}}_{k, j + 1}^{d})],

(2)

where [;] denotes the vector concatenation operation. Through encoding by the BiLSTM, the feature

h_{k, j}^{d}

of each slice is enriched not only with the local geometric information contained within that slice but also with the contextual information integrated from the entire sequence. By this means, an effective combination of local and global information is achieved.
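A compact NumPy sketch of Equation (2) is given below to make the data flow concrete. It uses a textbook LSTM cell with random weights; the gate ordering, the initialization, and all function names are assumptions for illustration, and only C = 128, H = 256, and the concatenation of forward and backward states follow the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_pass(seq, params, hidden_dim):
    """Run a standard LSTM over seq of shape (T, C); return hidden states (T, H)."""
    w, u, b = params                      # shapes (C, 4H), (H, 4H), (4H,)
    h = np.zeros(hidden_dim)
    c = np.zeros(hidden_dim)
    out = []
    for x in seq:
        gates = x @ w + h @ u + b
        i, f, g, o = np.split(gates, 4)   # input, forget, candidate, output
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        out.append(h)
    return np.stack(out)

def bilstm(seq, fwd_params, bwd_params, hidden_dim):
    forward = lstm_pass(seq, fwd_params, hidden_dim)
    # Encode the reversed sequence, then flip back so row j matches slice j.
    backward = lstm_pass(seq[::-1], bwd_params, hidden_dim)[::-1]
    return np.concatenate([forward, backward], axis=1)   # (T, 2H)

def init_params(rng, in_dim, hidden_dim):
    return (rng.normal(scale=0.1, size=(in_dim, 4 * hidden_dim)),
            rng.normal(scale=0.1, size=(hidden_dim, 4 * hidden_dim)),
            np.zeros(4 * hidden_dim))

rng = np.random.default_rng(0)
C, H, k = 128, 256, 4
F = rng.normal(size=(2 ** k, C))          # stand-in slice sequence F_k^d
fwd = init_params(rng, C, H)
bwd = init_params(rng, C, H)
h_seq = bilstm(F, fwd, bwd, H)            # one 2H-dimensional vector per slice
```

The first H entries of each row summarize the slices up to and including slice j, and the last H entries summarize slice j and everything after it, so each row sees the whole sequence.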

To integrate the multi-directional and multi-scale features, a feature fusion layer is designed, by which all BiLSTM outputs from every direction and scale are aggregated. For each point

p_{i}

, given that it may belong to multiple slices simultaneously (originating from different directions and scales), the slice-level features are back-projected onto the point through weighted summation based on the point’s proximity to the slice center. In this manner, a global-local fused feature

x_{i} \in ℝ^{C_{j}}

is obtained for the point. Specifically, for each slice feature

h_{k, j}^{d}

, a weight is assigned based on the contribution of the point within that slice. The final point representation is then generated by aggregating features from all relevant slices. By means of this fused feature, not only is the original geometric information of the point preserved, but also the contextual priors derived from multi-directional recursive slicing are embedded. Consequently, a rich and comprehensive input representation is provided for subsequent parallel segmentation tasks (e.g., organ-level semantic segmentation), through which the model’s capacity to perceive complex plant structures is effectively enhanced.
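The back-projection step can be sketched as follows. The text states only that the weights depend on a point's proximity to the slice center, so the Gaussian weighting, the bandwidth `sigma`, the normalization of the weights, and the dictionary layout of `slice_feats` are all assumptions made for this example. The sketch also omits any concatenation with the point's own geometric features.

```python
import numpy as np

def back_project(points, slice_feats, num_levels, sigma=0.5):
    """Fuse slice features into per-point features by proximity-weighted summation.

    slice_feats[(axis, k)] holds an array of shape (2**k, D) with the encoded
    feature of every slice for that direction and level.
    The Gaussian weighting and `sigma` are assumptions, not taken from the paper.
    """
    weights, feats = [], []
    for axis in range(3):
        coord = points[:, axis]
        lo, hi = coord.min(), coord.max()
        extent = max(hi - lo, 1e-12)
        for k in range(1, num_levels + 1):
            n_slices = 2 ** k
            t = (coord - lo) / extent * n_slices        # position in slice units
            idx = np.clip(np.floor(t).astype(int), 0, n_slices - 1)
            dist = np.abs(t - (idx + 0.5))              # offset from the slice centre
            weights.append(np.exp(-(dist / sigma) ** 2))
            feats.append(slice_feats[(axis, k)][idx])   # (N, D)
    w = np.stack(weights, axis=1)                       # (N, 3 * num_levels)
    w = w / w.sum(axis=1, keepdims=True)                # weights sum to 1 per point
    return np.einsum("ns,nsd->nd", w, np.stack(feats, axis=1))

rng = np.random.default_rng(0)
cloud = rng.uniform(size=(200, 3))      # hypothetical stand-in for a plant scan
K, D = 4, 512                           # D = 2H for H = 256
slice_feats = {(a, k): rng.normal(size=(2 ** k, D))
               for a in range(3) for k in range(1, K + 1)}
x = back_project(cloud, slice_feats, K)
```

Each point lies in exactly one slice per direction and level, so with three directions and K = 4 its fused feature is a weighted combination of twelve slice features.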

2.2. Bidirectional Fusion Module with Cross-Slice Attention Embedding

Effectively leveraging multi-directional recursive slice features of plant point clouds for semantic and instance segmentation, while promoting information interaction between these two tasks, is critical for improving leaf instance segmentation performance. Existing methods often rely on simple concatenation or summation of semantic and instance features, which struggles to fully exploit their underlying correlations [39]. To address this issue, a cross-slice attention-embedding module (CSAEM) is proposed in this paper, enabling bidirectional adaptive fusion of semantic and instance features through an attention mechanism. Dynamic dependencies across tasks are captured by this module, allowing semantic information to guide the refinement of leaf boundaries, while instance features are fed back to optimize semantic consistency. The detailed procedure of the proposed module is shown in Figure 3, which illustrates the overall architecture of the CSAEM and its integration with the parallel decoding branches.

2.2.1. Parallel Decoding Branches and Feature Initialization

Following the encoder, two parallel decoding branches, a semantic branch and an instance branch, are constructed in the network [40]. The semantic branch predicts a category label, such as stem or leaf, for each point. The instance branch learns a high-dimensional embedding vector for each point such that points belonging to the same leaf instance lie close together in the embedding space while points from different instances are pushed apart, so that instances can later be recovered by clustering. As shown in Figure 3, these two branches operate in parallel but are subsequently connected through the dual attention fusion mechanism described in Section 2.2.2.

Specifically, the point features

x_{i}

obtained from the encoding stage are fed into the semantic decoder and the instance decoder, respectively. The semantic decoder, composed of multiple MLP layers, is employed to output semantic features

F_{s e m}^{i} \in ℝ^{C_{s}}

, where

C_{s}

denotes the number of semantic classes (set to 2, stem and leaf, in our experiments). These features capture the category information of each point. The instance decoder, also implemented as a multi-layer MLP, outputs initial instance embeddings

F_{i n s}^{i} \in ℝ^{C_{i}}

, where

C_{i}

is the instance-embedding dimension (set to 128 in our experiments). These embeddings provide the high-dimensional representations used to distinguish different leaf instances. The two branches thus encode the semantic category attributes and the instance-discriminative attributes of each point, respectively, providing foundational representations for the subsequent segmentation tasks. At this initial stage, however, the semantic and instance features do not interact. Consequently, semantic information cannot guide the refinement of instance boundaries, and instance features cannot be fed back to improve semantic consistency. This independent feature extraction limits the segmentation accuracy for mutually occluded leaves in complex plant data. Promoting bidirectional interaction and adaptive fusion between semantic and instance features, while preserving their task-specific characteristics, is therefore key to improving leaf instance segmentation performance.
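The two decoders can be pictured with a minimal NumPy sketch. It only illustrates the shapes involved: the encoder width, the hidden layer sizes, and the random weights are assumptions made for the example, not values reported in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(dims):
    """Random weights for an MLP with layer sizes dims[0] -> dims[1] -> ... (illustrative only)."""
    return [(rng.normal(0.0, 0.1, (a, b)), np.zeros(b)) for a, b in zip(dims[:-1], dims[1:])]

def mlp(x, layers):
    """Apply linear layers with ReLU between them (no ReLU after the last layer)."""
    for k, (W, b) in enumerate(layers):
        x = x @ W + b
        if k < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x

N, C_ENC = 1024, 64      # number of points and encoder width: hypothetical values
C_S, C_I = 2, 128        # semantic classes and embedding dimension, as stated in the text

x = rng.normal(size=(N, C_ENC))             # stand-in for the encoder point features x_i
sem_decoder = init_mlp([C_ENC, 64, C_S])    # hidden width 64 is an assumption
ins_decoder = init_mlp([C_ENC, 128, C_I])   # hidden width 128 is an assumption

F_sem = mlp(x, sem_decoder)   # (N, C_S) semantic features
F_ins = mlp(x, ins_decoder)   # (N, C_I) initial instance embeddings
```

Both decoders read the same encoder features but share no weights, which is why the two outputs carry no cross-task information until the fusion step of Section 2.2.2.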

2.2.2. Dual Parallel Attention Fusion Mechanism

To facilitate the deep fusion of semantic features and instance features for leaf instance segmentation in plant point clouds, a Dual Attention Fusion Module is designed in this paper. This module consists of two parallel attention sub-modules: (1) Semantic-guided Instance Attention (SGIA), where semantic context is utilized to enhance the discriminative power of instance features at leaf boundaries; and (2) Instance-guided Semantic Attention (IGSA), where semantic features are optimized through instance consistency constraints to effectively resolve segmentation ambiguities in scenes with mutually occluded leaves. The computational flow of these two sub-modules is illustrated in Figure 3, where the bidirectional interaction between the semantic and instance branches is clearly shown.

The SGIA module is designed to enhance the discriminative ability of instance embeddings by leveraging semantic features, addressing the challenge of distinguishing different instances based solely on geometric features. Specifically, in vegetation scenes, multiple leaves often exhibit similar local geometric structures, leading to ambiguous instance boundaries. To mitigate this, semantic context information is dynamically incorporated into instance embeddings via SGIA by computing the semantic similarity between each point and its neighbors. For a given point i, its enhanced instance feature

{\tilde{F}}_{i n s}^{i}

(see Equation (3)) is obtained by aggregating features from semantically similar points within its neighborhood through an attention mechanism. Here,

N (i)

denotes the set of neighboring points within a radius r = 0.1 of point i (in the normalized coordinate space); r controls the local receptive field. This aggregation reinforces intra-instance semantic consistency while enlarging semantic discrepancies across different instances. The semantic-guided attention thereby suppresses geometric ambiguities, yielding instance embeddings that are more compact within semantically consistent regions and better separated at semantic boundaries. In the ablation study on PP3D (Section 3.3.2), removing SGIA lowers the AP from 53.02% to 50.82%, which supports its contribution to leaf instance segmentation in complex plant point clouds.

\begin{matrix} \tilde{F}_{ins}^{i} = F_{ins}^{i} + \sum_{j \in N(i)} α_{ij} \cdot W_{ins} \cdot F_{ins}^{j} \\ α_{ij} = \frac{\exp(β \cdot \operatorname{sim}(F_{sem}^{i}, F_{sem}^{j}))}{\sum_{k \in N(i)} \exp(β \cdot \operatorname{sim}(F_{sem}^{i}, F_{sem}^{k}))} \end{matrix},

(3)

where

N (i)

denotes the set of neighboring points for point i,

W_{i n s}

is a learnable linear transformation matrix, sim(·,·) is the cosine similarity function, and

β

is a scaling coefficient that acts as an inverse temperature (set to 10 in our experiments): larger values sharpen the attention distribution. By this mechanism, points sharing similar semantic contexts, such as those belonging to the same leaf region, are brought closer together in the instance-embedding space. Consequently, the awareness of semantic priors is enhanced within the instance features.
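Equation (3) can be sketched in NumPy as follows. This is a brute-force illustration under stated assumptions: neighbors are found by exhaustive distance search, the neighborhood N(i) is taken to include point i itself (the text does not say), and the helper names are hypothetical.

```python
import numpy as np

def cosine_sim(a, B):
    """Cosine similarity between vector a (C,) and each row of B (K, C)."""
    eps = 1e-12
    return (B @ a) / (np.linalg.norm(B, axis=1) * np.linalg.norm(a) + eps)

def radius_neighbors(xyz, i, r):
    """Indices of points within distance r of point i (brute force; includes i itself)."""
    d = np.linalg.norm(xyz - xyz[i], axis=1)
    return np.flatnonzero(d <= r)

def sgia(xyz, F_sem, F_ins, W_ins, r=0.1, beta=10.0):
    """Semantic-guided update of the instance features, following Equation (3)."""
    out = np.empty_like(F_ins, dtype=float)
    for i in range(len(xyz)):
        nbrs = radius_neighbors(xyz, i, r)
        logits = beta * cosine_sim(F_sem[i], F_sem[nbrs])
        logits -= logits.max()                        # numerical stability
        alpha = np.exp(logits) / np.exp(logits).sum() # attention weights over N(i)
        # Residual connection plus attention-weighted, linearly transformed neighbors.
        out[i] = F_ins[i] + alpha @ (F_ins[nbrs] @ W_ins.T)
    return out
```

The attention weights depend only on the semantic features, whereas the aggregated values are instance features, which is what makes the update semantic-guided.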

The IGSA module is designed to facilitate information flow from instance features to semantic features in the opposite direction, thereby establishing a bidirectional interaction mechanism that complements the SGIA. In plant point clouds, points within the same instance are expected to share consistent semantic labels. However, due to factors such as mutual occlusion between leaves, scanning noise, or uneven sampling at edges, points located on instance boundaries or in occluded regions often exhibit semantic ambiguity, leading to fragmented predictions. To address this, semantic prediction is smoothed by the IGSA module, where the aggregation of semantic features is guided by the similarity of instance embeddings. Specifically, the similarity of instance features between neighboring points is first computed within a local neighborhood, generating attention weights that reflect instance affiliation. These weights are then used to perform a weighted aggregation of the semantic features from the neighboring points. For points with highly similar instance features—those that are highly likely to belong to the same leaf instance—a higher fusion weight is assigned to their semantic features. Consequently, the feature representations of these intra-instance points are brought closer together within the semantic space. Semantic fluctuations caused by occlusion and noise are effectively suppressed by this instance-guided attention mechanism. As a result, semantic features are maintained with high consistency within each instance, while a clear semantic transition is preserved at instance boundaries. Through the application of IGSA (see Equation (4)), a more regular and continuous semantic segmentation output is achieved, providing a more reliable semantic prior for subsequent instance segmentation. 
It has been demonstrated experimentally that the semantic labeling accuracy in edge regions is significantly improved by this module in complex vegetation scenes, with a notable reduction in over-segmentation and under-segmentation errors.

\begin{matrix} \tilde{F}_{sem}^{i} = F_{sem}^{i} + \sum_{j \in N(i)} γ_{ij} \cdot W_{sem} \cdot F_{sem}^{j} \\ γ_{ij} = \frac{\exp(η \cdot \operatorname{sim}(F_{ins}^{i}, F_{ins}^{j}))}{\sum_{k \in N(i)} \exp(η \cdot \operatorname{sim}(F_{ins}^{i}, F_{ins}^{k}))} \end{matrix},

(4)

where

W_{s e m}

is defined as a learnable weight matrix, and

η

is a scaling coefficient (set to 5 in our experiments) that controls the sharpness of the attention distribution in the same way as β in Equation (3). By this mechanism, points that are close in the instance-embedding space, and thus likely to belong to the same instance, mutually reinforce their semantic features. Consequently, semantic ambiguities, particularly those occurring in boundary regions, are corrected, and the consistency of semantic labels within each instance is preserved.
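A small worked example may help to show how the two coefficients shape the attention weights. Consider a hypothetical point i with exactly two neighbors whose cosine similarities to it are 0.9 and 0.1 (illustrative values, not taken from the experiments). The weight on the more similar neighbor is then, for η = 5 in Equation (4) and β = 10 in Equation (3):

```latex
\gamma_{i1} = \frac{e^{5 \cdot 0.9}}{e^{5 \cdot 0.9} + e^{5 \cdot 0.1}} = \frac{1}{1 + e^{-4}} \approx 0.982,
\qquad
\alpha_{i1} = \frac{e^{10 \cdot 0.9}}{e^{10 \cdot 0.9} + e^{10 \cdot 0.1}} = \frac{1}{1 + e^{-8}} \approx 0.9997
```

Under these assumed similarities, the larger coefficient concentrates almost all of the weight on the most similar neighbor, while the smaller one leaves slightly more room for the less similar neighbor to contribute.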

2.2.3. Multi-Scale Cross-Slice Attention

While the aforementioned dual attention mechanism enables feature interaction primarily within the local neighborhood of each point—allowing for the capture of fine-grained geometric details—it is inherently limited in its capacity to perceive the global distribution patterns of leaves within plant point clouds. To further leverage the global contextual information derived from multi-directional recursive slicing, the attention mechanism is extended to the slice level in this section. This is achieved through the proposed multi-scale cross-slice attention (MCSA) module. As illustrated in the right portion of Figure 3, parallel attention computation is performed by this module across slice sequences from different orientations. Consequently, long-range dependencies of leaf instances across various slices are effectively captured by the model. As a result, segmentation robustness is significantly enhanced, particularly for occluded and overlapping leaves within complex canopy structures.

For each direction d and scale k, a slice-level feature sequence

H_{k}^{d} = [h_{k, 1}^{d}, h_{k, 2}^{d}, \dots, h_{k, 2^{k}}^{d}]

is obtained. A self-attention mechanism is then applied to these slice sequences by the MCSA module, enabling each slice to attend to the features of slices at other positions. With this design, the constraint of a local receptive field is overcome, allowing for the interactive integration of global contextual information. Semantic correlations between different slices are dynamically captured through the self-attention computation, thereby enhancing the model’s capacity to model long-range dependencies within the feature representation. Furthermore, input sequences of arbitrary lengths are supported by this mechanism, offering flexibility for the effective aggregation of multi-scale features.

\hat{h}_{k,j}^{d} = \sum_{t=1}^{2^{k}} \operatorname{softmax}\left(\frac{(W_{Q} h_{k,j}^{d})^{T} (W_{K} h_{k,t}^{d})}{\sqrt{C_{h}}}\right) W_{V} h_{k,t}^{d},

(5)

where

W_{Q}

,

W_{K}

and

W_{V}

are the projection matrices, and

C_{h}

is the feature dimension. After being processed by the MCSA, the features of each slice are further enriched with contextual information aggregated from the entire slice sequence. Long-range dependencies, particularly between non-adjacent slices, are thereby captured. For instance, a semantic correlation may exist between a slice at the base of a plant and one at the top (e.g., both are part of the same long stem), a type of long-range dependency that is difficult to model through local neighborhoods alone.
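The slice-level self-attention of Equation (5) can be sketched for a single direction d and scale k as below. The sketch assumes square projection matrices that keep the feature dimension C_h unchanged and a single attention head; neither detail is specified in the text.

```python
import numpy as np

def slice_self_attention(H, W_Q, W_K, W_V):
    """Scaled dot-product self-attention over one slice sequence (Equation (5)).

    H: (S, C_h) slice features for one direction and scale, with S = 2**k slices.
    W_Q, W_K, W_V: (C_h, C_h) projection matrices (square shape is an assumption).
    Returns the enhanced slice features (S, C_h) and the attention matrix (S, S).
    """
    Q, K, V = H @ W_Q.T, H @ W_K.T, H @ W_V.T
    scores = Q @ K.T / np.sqrt(H.shape[1])        # row j: slice j scored against every slice t
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)       # softmax over t
    return attn @ V, attn

# Usage with random stand-in features: k = 3 gives 2**3 = 8 slices.
rng = np.random.default_rng(0)
H = rng.normal(size=(8, 16))
W_Q, W_K, W_V = (rng.normal(scale=0.1, size=(16, 16)) for _ in range(3))
H_hat, attn = slice_self_attention(H, W_Q, W_K, W_V)
```

Because every row of the attention matrix spans all slices, a slice near the base of the plant can draw on a slice near the top in a single step, which is the long-range behavior described above.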

Finally, the enhanced slice-level features are back-projected onto the points and subsequently fused with the point-level features generated by the dual attention module. Through this process, the final point feature representation is obtained.

F_{final}^{i} = \operatorname{MLP}([\tilde{F}_{sem}^{i}; \tilde{F}_{ins}^{i}; \hat{x}_{i}]),

(6)

where

{\hat{x}}_{i}

denotes the point-level feature obtained through interpolation from the enhanced slice features. By means of this multi-scale, cross-level feature fusion mechanism, the final features are ensured to contain both fine-grained local geometric details and integrated global contextual priors. High-quality inputs are thus provided for the subsequent instance clustering stage.

2.3. Semantic-Aware Instance Clustering and Joint Optimization

After the acquisition of the final feature representation for each point, the generation of the ultimate instance segmentation results must be performed based on these features. Traditional threshold-based clustering methods are often found to be inadequate for handling the significant variations in instance scale and density commonly observed in plant point clouds—where leaf sizes differ considerably and their spatial distribution is irregular, making fixed thresholds prone to either over-segmentation or under-segmentation. To address this challenge, a semantic-aware instance clustering strategy is proposed in this paper. With this approach, candidate points are first filtered using semantic priors. Clustering thresholds are then dynamically adjusted within the embedding space by incorporating semantic consistency, thereby enabling robust instance separation. On this basis, through the joint optimization of the objective function, both the semantic segmentation loss and the instance clustering loss are constrained, ensuring the synergistic improvement of semantic segmentation and instance segmentation. As a result, more accurate and complete leaf instance segmentation outcomes are achieved in complex vegetation scenes. The detailed procedure of the above steps is shown in Figure 4, which depicts the complete workflow from feature extraction to final instance segmentation.

2.3.1. Semantic-Aware Mean Shift Clustering

Mean-Shift clustering, recognized as a non-parametric, density-gradient-based algorithm, is advantageous primarily because the number of clusters does not need to be pre-specified. This characteristic makes it particularly well-suited for the complex scenarios found in plant point clouds, where the number of leaf instances is often variable. However, the conventional Mean-Shift algorithm, which relies solely on distance metrics within the feature space, is prone to interference from geometrically similar instances. To address this limitation, an improved version of the traditional Mean-Shift algorithm [41] is proposed in this paper, where semantic information is introduced to guide the clustering process. Specifically, the confidence scores of point-wise semantic labels are incorporated into the drift vector calculation. By this modification, cluster centers are preferentially shifted toward semantically consistent regions. Consequently, adjacent leaf instances belonging to different plants are more effectively distinguished, leading to an improvement in segmentation accuracy.

Specifically, for each point i, its instance-embedding vector

e_{i} \in ℝ^{C_{i}}

is derived from the instance-specific component of the final feature

F_{f i n a l}^{i}

. During Mean-Shift clustering, a semantic-aware kernel function is used, as defined in Equation (7). This kernel accounts for both the distance in the embedding space and the consistency of semantic labels, so that the drift is preferentially directed toward semantically homogeneous regions.

K (i, j) = \exp (- \frac{{‖ e_{i} - e_{j} ‖}^{2}}{2 σ_{e}^{2}}) \cdot I ({sem}_{i} = {sem}_{j}),

(7)

where

σ_{e}

is defined as the bandwidth parameter controlling the scale of the kernel (initialized as 0.5 and adaptively adjusted as described in Section 2.3.2),

I (\cdot)

denotes the indicator function, and

{sem}_{i}

represents the predicted semantic label for point i. This kernel computes similarities only between points that share the same semantic category, which prevents erroneous clustering across different semantic classes (such as stems and leaves). This design is consistent with basic botanical knowledge: an instance is necessarily composed of points from a single semantic category, whereas points from the same semantic category may belong to distinct instances (e.g., multiple individual leaves).

The density center for each point is iteratively updated by the Mean-Shift clustering process until convergence is achieved.

e_{i}^{(t + 1)} = \frac{\sum_{j \in N_{e} (i)} K (i, j) \cdot e_{j}^{(t)}}{\sum_{j \in N_{e} (i)} K (i, j)},

(8)

where

N_{e} (i)

is the set of neighboring points whose distance in the embedding space is less than the threshold

τ

(set to 0.8 in our experiments). Upon convergence, points that fall within the same density attraction basin are grouped together as a single instance.
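Equations (7) and (8) can be combined into the following NumPy sketch. It rests on several assumptions: all points are updated synchronously from the current embeddings, a fixed bandwidth is used instead of the adaptive one of Section 2.3.2, and the final grouping step with its `merge_dist` tolerance is a hypothetical helper, since the text does not state how converged modes are merged.

```python
import numpy as np

def semantic_mean_shift(E, sem, sigma=0.5, tau=0.8, n_iter=50, tol=1e-5):
    """Iterate Equation (8) using the semantic-aware kernel of Equation (7).

    E: (N, C) instance embeddings; sem: (N,) predicted semantic labels.
    Returns the shifted embeddings (one mode estimate per point).
    """
    modes = np.asarray(E, dtype=float).copy()
    same_sem = sem[:, None] == sem[None, :]               # indicator I(sem_i = sem_j)
    for _ in range(n_iter):
        d2 = ((modes[:, None, :] - modes[None, :, :]) ** 2).sum(-1)
        K = np.exp(-d2 / (2.0 * sigma ** 2)) * same_sem   # Equation (7)
        K *= d2 < tau ** 2                                # keep only neighbors in N_e(i)
        new = (K @ modes) / K.sum(axis=1, keepdims=True)  # Equation (8)
        shift = np.abs(new - modes).max()
        modes = new
        if shift < tol:
            break
    return modes

def group_modes(modes, sem, merge_dist=0.1):
    """Greedily group converged modes into instance ids (merge_dist is an assumption)."""
    labels = -np.ones(len(modes), dtype=int)
    centers = []
    for i, m in enumerate(modes):
        for c, (cm, cs) in enumerate(centers):
            if cs == sem[i] and np.linalg.norm(m - cm) < merge_dist:
                labels[i] = c
                break
        else:
            centers.append((m, sem[i]))
            labels[i] = len(centers) - 1
    return labels
```

Since each point always has itself as a neighbor with kernel value 1, the denominator of Equation (8) never vanishes. The indicator term keeps two groups with different semantic labels apart even when their embeddings overlap.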

2.3.2. Bandwidth Adaptation Mechanism

The performance of Mean-Shift clustering is highly dependent on the selection of the bandwidth parameter

σ_{e}

. If

σ_{e}

is set too small, the algorithm becomes overly sensitive to local variations, making it prone to over-segmentation where a single leaf is fragmented into multiple parts. Conversely, if

σ_{e}

is too large, the clustering window covers an excessively broad area, which tends to cause under-segmentation through the erroneous merging of adjacent but distinct leaves. Given that point density varies significantly across different regions of a plant point cloud—particularly with sparse distributions in the outer canopy and dense concentrations in the interior—a single global bandwidth is often inadequate for handling all scenarios.

To address this issue, an adaptive bandwidth mechanism is proposed in this paper, where the bandwidth size is dynamically adjusted according to the local point density. For a given point i, its local density

ρ_{i}

is defined as the number of points within its neighborhood. The adaptive bandwidth is then computed as follows:

σ_{e} (i) = σ_{0} \cdot (1 + λ \cdot \frac{ρ_{m a x} - ρ_{i}}{ρ_{m a x} - ρ_{m i n}}),

(9)

where

σ_{0}

is the base bandwidth set to 0.5,

λ

is a regulating factor set to 0.3 controlling the adaptation strength, and

ρ_{m a x}

and

ρ_{m i n}

represent the global maximum and minimum densities, respectively. By this mechanism, a larger bandwidth is applied in sparse regions to broaden the clustering range, thereby preventing over-segmentation. In dense regions, a smaller bandwidth is utilized to enhance clustering precision, thus avoiding under-segmentation.
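Equation (9) amounts to a linear interpolation between σ₀ in the densest region and σ₀(1 + λ) in the sparsest one. A minimal sketch follows; the fallback for the degenerate case of uniform density is an assumption, since Equation (9) is undefined when the maximum and minimum densities coincide.

```python
import numpy as np

def adaptive_bandwidth(rho, sigma0=0.5, lam=0.3):
    """Per-point bandwidth from local densities rho, following Equation (9)."""
    rho = np.asarray(rho, dtype=float)
    span = rho.max() - rho.min()
    if span == 0:                       # uniform density: fall back to the base bandwidth (assumption)
        return np.full_like(rho, sigma0)
    return sigma0 * (1.0 + lam * (rho.max() - rho) / span)

# Hypothetical neighbor counts for a sparse, an intermediate, and a dense point.
sigmas = adaptive_bandwidth([10, 55, 100])   # sparse points receive the larger bandwidth
```

With the reported settings, the bandwidth therefore ranges from 0.5 in the densest region to 0.65 in the sparsest.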

2.3.3. Multi-Task Joint Loss Function

To enable end-to-end training, the semantic segmentation and instance-embedding tasks are jointly optimized within the network. The total loss function is composed of three terms:

L_{t o t a l} = L_{s e m} + λ_{1} L_{i n s t} + λ_{2} L_{r e g},

(10)

where

λ_{1}

= 0.5 and

λ_{2}

= 0.1 are balancing hyperparameters determined through grid search on the validation set.

The semantic segmentation loss

L_{s e m}

is formulated as the standard cross-entropy loss, by which the discrepancy between the predicted semantics and the ground truth label for each point is measured.

L_{s e m} = - \frac{1}{N} \sum_{i = 1}^{N} \sum_{c = 1}^{C} y_{i, c}^{s e m} \log ({\hat{y}}_{i, c}^{s e m}),

(11)

where

y_{i, c}^{s e m}

denotes the ground-truth indicator of whether point i belongs to category c, and

{\hat{y}}_{i, c}^{s e m}

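represents the predicted probability. Here, N is the number of points and C is the number of semantic classes (denoted C_{s} in Section 2.2.1).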

The instance-embedding loss

L_{i n s t}

is defined using a discriminative loss function, which consists of both a pulling term and a pushing term.

L_{i n s t} = \frac{1}{M} \sum_{m = 1}^{M} {[\frac{1}{|S_{m}|} \sum_{i \in S_{m}} ‖ e_{i} - μ_{m} ‖ - δ_{v}]}_{+}^{2} + \frac{1}{M (M - 1)} \sum_{m \neq n} {[δ_{d} - ‖ μ_{m} - μ_{n} ‖]}_{+}^{2},

(12)

where M denotes the number of instances,

S_{m}

is the set of points belonging to the m-th instance,

μ_{m}

represents the embedding center of that instance,

δ_{v}

and

δ_{d}

are the margin thresholds for pulling and pushing, respectively (set to 0.5 and 1.5 based on empirical validation), and

[x]_{+} = \max(0, x)

. By the first term, the embeddings of points within the same instance are encouraged to converge toward their center, while by the second term, the centers of different instances are pushed apart.
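Equation (12) can be sketched directly in NumPy. The sketch takes the embedding center of each instance to be the mean of its point embeddings, which is an assumption, and it applies the hinge to the mean distance per instance, as the equation is written.

```python
import numpy as np

def instance_embedding_loss(E, inst, delta_v=0.5, delta_d=1.5):
    """Pull and push terms of Equation (12).

    E: (N, C) point embeddings; inst: (N,) ground-truth instance ids.
    """
    ids = np.unique(inst)
    M = len(ids)
    mus = np.stack([E[inst == m].mean(axis=0) for m in ids])   # instance centers (assumed to be means)

    # Pulling term: hinge on the mean distance of each instance's points to its center.
    pull = 0.0
    for m, mu in zip(ids, mus):
        mean_dist = np.linalg.norm(E[inst == m] - mu, axis=1).mean()
        pull += max(mean_dist - delta_v, 0.0) ** 2
    pull /= M

    # Pushing term: hinge on the distance between every ordered pair of distinct centers.
    push = 0.0
    if M > 1:
        for a in range(M):
            for b in range(M):
                if a != b:
                    push += max(delta_d - np.linalg.norm(mus[a] - mus[b]), 0.0) ** 2
        push /= M * (M - 1)
    return pull + push
```

The loss is zero once every instance is tighter than the pulling margin and every pair of centers is farther apart than the pushing margin, so well-separated embeddings are not penalized further.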

A regularization loss

L_{r e g}

is employed to constrain the overall distribution of the embedding space, by which excessive feature dispersion is prevented.

L_{r e g} = \frac{1}{N} \sum_{i = 1}^{N} {‖ e_{i} ‖}^{2},

(13)

By jointly optimizing the loss functions described above, the network simultaneously learns to discriminate semantic categories (such as leaves and stems) and to separate leaf instance embeddings. Through this multi-task constraint, the semantic and instance features reinforce each other: semantic information refines instance boundaries, while instance features in turn improve semantic consistency. Notably, the attention weights within the CSAEM and SGIA/IGSA modules also participate in backpropagation. The network can therefore learn the fusion weights between semantic and instance features adaptively, based on the local geometry and density distribution of the plant point cloud, which yields more accurate leaf instance segmentation within complex canopy structures.
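The way the three terms of Equation (10) combine can be illustrated with a short sketch. The cross-entropy and regularization helpers follow Equations (11) and (13); the instance term is passed in as a number, and the small constant added inside the logarithm is an implementation assumption for numerical safety.

```python
import numpy as np

def semantic_ce(probs, labels):
    """Equation (11): mean cross-entropy from probabilities (N, C) and integer labels (N,)."""
    eps = 1e-12   # guards against log(0); not part of the paper's formula
    return float(-np.log(probs[np.arange(len(labels)), labels] + eps).mean())

def embedding_reg(E):
    """Equation (13): mean squared norm of the point embeddings."""
    return float((E ** 2).sum(axis=1).mean())

def total_loss(l_sem, l_inst, l_reg, lam1=0.5, lam2=0.1):
    """Equation (10) with the reported weights lambda_1 = 0.5 and lambda_2 = 0.1."""
    return l_sem + lam1 * l_inst + lam2 * l_reg

# Toy values, for illustration only.
probs = np.array([[0.5, 0.5], [0.5, 0.5]])
labels = np.array([0, 1])
E = np.array([[3.0, 4.0], [0.0, 0.0]])
loss = total_loss(semantic_ce(probs, labels), l_inst=2.0, l_reg=embedding_reg(E))
```

With these weights, the semantic term dominates unless the instance or regularization losses are comparatively large.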

Although high-quality initial segmentation results can be generated by the end-to-end network, boundary ambiguity or instance adhesion may still occur in complex scenarios such as leaf edges and occluded regions. To address this issue, a post-processing optimization step based on a Conditional Random Field (CRF) is introduced in this section.

Each point is treated as a node in the CRF, and an energy function is defined as follows [42]:

E (L) = \sum_{i} ψ_{u} (l_{i}) + \sum_{i, j} ψ_{p} (l_{i}, l_{j}),

(14)

where the unary term

ψ_{u} (l_{i})

is determined by the semantic and instance confidences predicted by the network, while the pairwise term

ψ_{p} (l_{i}, l_{j})

imposes a penalty for label inconsistency between neighboring points.

ψ_{p} (l_{i}, l_{j}) = ω_{1} \exp (- \frac{{‖ p_{i} - p_{j} ‖}^{2}}{2 θ_{pos}^{2}}) + ω_{2} \exp (- \frac{{‖ p_{i} - p_{j} ‖}^{2}}{2 θ_{pos}^{2}} - \frac{{‖ c_{i} - c_{j} ‖}^{2}}{2 θ_{color}^{2}}),

(15)

In this formulation,

c_{i}

represents the color information of point i (applicable to photogrammetric point clouds), and

θ_{pos}

and

θ_{color}

are bandwidth parameters. The position of point i is denoted by p_{i}, and ω_{1} and ω_{2} weight the two kernels. By minimizing the energy function, the CRF smooths the segmentation results, eliminates small isolated regions, and sharpens instance boundaries.
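The pairwise term of Equation (15) can be evaluated for a single pair of points as sketched below. The kernel weights and bandwidths are not reported in the text, so all default values here are hypothetical, as is the function name.

```python
import numpy as np

def pairwise_potential(p_i, p_j, c_i, c_j, w1=1.0, w2=1.0, theta_pos=0.05, theta_color=0.1):
    """Value of the pairwise term in Equation (15) for one pair of points.

    p_*: 3D positions; c_*: colors. All default parameter values are hypothetical.
    """
    d_pos = np.sum((np.asarray(p_i, dtype=float) - np.asarray(p_j, dtype=float)) ** 2)
    d_col = np.sum((np.asarray(c_i, dtype=float) - np.asarray(c_j, dtype=float)) ** 2)
    smooth = w1 * np.exp(-d_pos / (2.0 * theta_pos ** 2))            # position-only kernel
    appear = w2 * np.exp(-d_pos / (2.0 * theta_pos ** 2)
                         - d_col / (2.0 * theta_color ** 2))         # position-and-color kernel
    return float(smooth + appear)
```

The value is largest for nearby points of similar color and decays quickly with distance, so assigning different labels to such pairs is what the energy of Equation (14) discourages.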

3. Results

3.1. Datasets Description

The efficacy of the proposed framework was rigorously evaluated on two public plant point cloud benchmarks—Soybean-MVS (data were collected in 2023) [43] and PP3D (data were collected in 2025) [44]—which collectively encompass a wide spectrum of plant species, developmental stages, and data acquisition modalities.

Soybean-MVS [43] is an annotated 3D soybean dataset constructed using multiple-view stereo (MVS) reconstruction. It contains 102 soybean plant models spanning five varieties and 13 growth stages across the whole developmental period, with manual organ-level annotations for leaves, main stems, and stems. Owing to its full-growth-stage coverage, multi-variety variability, and detailed organ annotations, this dataset provides a valuable benchmark for evaluating the robustness of segmentation methods under pronounced morphological variation and frequent organ occlusion. PP3D [44] consists of about 500 potted plant scans acquired via terrestrial LiDAR over multiple growing seasons, totaling 3.2 billion points characterized by substantial density fluctuations, atmospheric noise, and motion-induced artifacts. Of the two, PP3D constitutes the more demanding testbed for instance segmentation, owing to its uncontrolled environmental conditions, pronounced point density inhomogeneity, and pervasive background clutter, all of which present considerable obstacles for the accurate isolation of individual plant organs. Representative scenes from these two datasets are illustrated in Figure 5.

3.2. Implementation Details

The proposed multi-perspective recursive slice framework with cross-slice attention was implemented using the PyTorch machine learning library (version 1.12.1). All training and evaluation procedures were carried out on a high-performance computing server equipped with the Ubuntu 22.04 operating system. The computational infrastructure comprised an AMD (Advanced Micro Devices, Santa Clara, CA, USA) EPYC 7502 32-Core processor operating at 2.50 GHz, 128 GB of system memory, and two NVIDIA GeForce RTX 5090D GPUs (NVIDIA Corporation, Santa Clara, CA, USA) each featuring 32 GB memory, though only a single GPU was utilized for standard training runs. The software stack included Python 3.8 as the primary programming language, with CUDA 11.6 and cuDNN 8.4 providing GPU-accelerated computation. Additional scientific computing dependencies encompassed PyTorch Geometric (2.2.0) for specialized point cloud operations, Open3D (0.16.0) for point cloud visualization and preprocessing, and scikit-learn (1.1.3) for clustering algorithms and evaluation metrics [45].

The network was optimized using the AdamW optimizer with weight decay set to 1 × 10⁻⁴. AdamW was chosen because it decouples weight decay from the gradient update, providing more effective regularization than standard Adam and helping to prevent overfitting given the relatively small size of the plant point cloud datasets. The learning rate followed a cosine annealing schedule with warm restarts, starting from an initial value of 5 × 10⁻⁴ and annealed down to 1 × 10⁻⁶ over the training course. This scheduler was selected to help the optimizer escape local minima and converge better than with fixed or step-wise decay schedules. A batch size of 6 was used due to GPU memory constraints given the high-resolution point cloud inputs, and gradients were accumulated over 2 steps, giving an effective batch size of 12. This accumulation strategy maintained a sufficiently large effective batch size for stable gradient estimation while respecting hardware limitations. Training ran for at most 200 epochs, with early stopping based on the validation performance after each epoch: training was terminated if no improvement occurred for 30 consecutive epochs. The limit of 200 epochs was determined empirically from preliminary experiments, in which validation metrics typically plateaued between 160 and 180 epochs. The total training duration on the combined plant datasets was approximately 22 h with Automatic Mixed Precision (AMP) enabled.
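The learning rate schedule and the effective batch size can be illustrated with a few lines of plain Python. The formula below is the usual cosine annealing rule with restarts of fixed length; the restart period of 50 epochs is an assumed value, since the text does not report it.

```python
import math

def cosine_warm_restart_lr(epoch, lr_max=5e-4, lr_min=1e-6, period=50):
    """Cosine annealing with warm restarts; `period` (epochs per cycle) is an assumed value."""
    t = epoch % period   # position inside the current cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t / period))

BATCH_SIZE, ACCUM_STEPS = 6, 2                 # values reported in the text
effective_batch = BATCH_SIZE * ACCUM_STEPS     # gradients are summed over 2 mini-batches

schedule = [cosine_warm_restart_lr(e) for e in range(200)]
```

Within each cycle the rate decays smoothly from its maximum toward its minimum, then jumps back to the maximum at the start of the next cycle.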

For both datasets, the data were divided into training, validation, and testing sets using a stratified split at the plant level. Specifically, we allocated 70% of plant individuals for training, 15% for validation, and 15% for testing, ensuring that point clouds from the same plant individual did not appear in multiple sets. This strategy prevents data leakage and provides a more realistic evaluation of cross-plant generalization capability. Data preprocessing constituted a critical component of the implementation pipeline. Raw point clouds were initially normalized to zero mean and unit variance within a bounding sphere of radius 1.0. To enhance model generalization and robustness, online data augmentation techniques were systematically applied during training, including random rotation within the full 360-degree range along the vertical axis, random scaling between 0.8 and 1.2, Gaussian noise injection with standard deviation of 0.01, and random point dropout with probability 0.05 to simulate sensor noise and occlusion patterns commonly encountered in real-world plant scanning scenarios.
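The plant-level split and the online augmentations can be sketched as follows. The sketch performs a plain random split over plant identifiers, because the stratification variable is not specified in the text, and the function names are hypothetical.

```python
import numpy as np

def split_by_plant(plant_ids, seed=0, frac=(0.70, 0.15, 0.15)):
    """Split unique plant ids 70/15/15 so that no plant appears in more than one set."""
    rng = np.random.default_rng(seed)
    ids = rng.permutation(np.unique(plant_ids))
    n_train = int(round(frac[0] * len(ids)))
    n_val = int(round(frac[1] * len(ids)))
    return (set(ids[:n_train].tolist()),
            set(ids[n_train:n_train + n_val].tolist()),
            set(ids[n_train + n_val:].tolist()))

def augment(xyz, rng):
    """Apply the four augmentations listed in the text to an (N, 3) point cloud."""
    theta = rng.uniform(0.0, 2.0 * np.pi)                  # rotation about the vertical (z) axis
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    out = xyz @ R.T
    out = out * rng.uniform(0.8, 1.2)                      # random isotropic scaling
    out = out + rng.normal(0.0, 0.01, size=out.shape)      # Gaussian noise, std 0.01
    keep = rng.random(len(out)) >= 0.05                    # drop each point with probability 0.05
    return out[keep]
```

Splitting over plant identifiers rather than over individual scans is what keeps all point clouds of one plant inside a single subset.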

3.3. Result Evaluation and Analysis

3.3.1. Result Display and Evaluation

For dense plant point clouds, instance segmentation requires that individual leaf integrity be preserved while different leaves are distinguished from one another. The segmentation results of the proposed method on two datasets are presented in Figure 6. As can be observed from the visualizations, mutually occluded and spatially adjacent leaves are successfully differentiated by the proposed method, while the complete contour of each leaf is maintained. It is particularly noteworthy that, for severely occluded leaves with only small portions exposed, these are still recognized as independent instances by the network based on contextual information, rather than being erroneously merged into foreground leaves. This observation suggests that the fusion of semantic features and instance features is effectively facilitated by the attention-embedding modules, by which the network is enabled to leverage semantic priors of leaves (such as shape and orientation) to guide instance discrimination.

For the objective evaluation of the proposed method’s performance on plant point cloud segmentation tasks, standard evaluation metrics from the instance segmentation domain are adopted. These include Average Precision (AP), as well as AP@50 and AP@25 under different Intersection-over-Union thresholds [46], by which the segmentation accuracy of the method is comprehensively assessed at varying levels of strictness. All results are reported as the average values obtained from three independent runs on the test set.
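The role of the IoU threshold in these metrics can be illustrated with a simplified sketch. It computes the IoU between predicted and ground-truth instances over points and matches them greedily at a given threshold. It is not a full AP computation: confidence ranking and averaging over thresholds are omitted, and the matching rule is an assumption.

```python
import numpy as np

def instance_iou(pred_mask, gt_mask):
    """Intersection-over-Union of two boolean point masks."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / union if union else 0.0

def precision_recall_at(pred_labels, gt_labels, thr=0.5):
    """Greedy one-to-one matching at an IoU threshold; returns (precision, recall)."""
    preds = [pred_labels == p for p in np.unique(pred_labels)]
    gts = [gt_labels == g for g in np.unique(gt_labels)]
    matched, tp = set(), 0
    for pm in preds:
        # Ground-truth instances that are already matched are excluded with a score of -1.
        ious = [instance_iou(pm, gm) if k not in matched else -1.0 for k, gm in enumerate(gts)]
        k = int(np.argmax(ious))
        if ious[k] >= thr:
            matched.add(k)
            tp += 1
    return tp / len(preds), tp / len(gts)

# Toy example: one boundary point is assigned to the wrong leaf.
gt = np.array([0, 0, 0, 0, 1, 1, 1, 1])
pred = np.array([0, 0, 0, 1, 1, 1, 1, 1])
```

In this toy case both predictions count as correct at a threshold of 0.5 and neither does at 0.9, which mirrors why AP@25 is at least as high as AP@50 for the same set of predictions.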

The Soybean-MVS dataset comprises point clouds of soybean plants at various growth stages, characterized by dense foliage, severe occlusion, and diverse morphology, which pose significant challenges for instance segmentation methods. The segmentation performance of the proposed method on the Soybean-MVS dataset is summarized in Table 1. Under the relatively lenient IoU threshold of 25%, an AP@25 of 62.94% is achieved, indicating that most leaf and stem instances are detected. When the IoU threshold is raised to 50%, an AP@50 of 57.34% is obtained, a decrease of 5.60 percentage points, showing that localization and segmentation accuracy are largely maintained under the more stringent overlap requirement. The overall AP, which integrates multiple IoU thresholds, is 53.80%, reflecting stable performance across varying levels of strictness.

The PP3D dataset encompasses a greater diversity of potted plant varieties and planting densities, serving as an important benchmark for validating the generalization capability of the proposed method. On this dataset, an AP@25 of 72.11%, an AP@50 of 65.37%, and an overall AP of 53.02% are obtained. Compared with Soybean-MVS, AP@25 and AP@50 are higher by 9.17 and 8.03 percentage points, respectively, whereas the overall AP is slightly lower (53.02% versus 53.80%). The gains at the 25% and 50% thresholds are attributed to the higher scanning quality and more uniform point cloud density of the PP3D dataset, which provides a better data foundation for fine-grained segmentation.

3.3.2. Ablation Study Analysis

To quantify the contribution of each core module, systematic ablation experiments were conducted on the PP3D dataset, in which key components were removed or replaced and the resulting performance changes were observed. As presented in Table 2, the full model achieves an overall AP of 53.02%. When the multi-directional recursive slicing module was replaced with a single-direction slicing strategy (model A), the AP decreased to 46.22%, a substantial drop of 6.8 percentage points. This confirms the necessity of multi-directional slicing for capturing the complex spatial structure of plants. When BiLSTM sequence modeling was replaced by simple concatenation of independent slice features (model B), the AP declined to 48.02%, a drop of 5.0 percentage points relative to the full model. This demonstrates the significant role played by the modeling of dependencies between slices in preserving instance integrity.

Although Transformer-based attention mechanisms have achieved significant progress in point cloud processing in recent years, enabling direct global relationship modeling, a comparative experiment was conducted to quantitatively evaluate the advantages of the BiLSTM architecture. The BiLSTM in the proposed method was replaced with a Transformer encoder (with all other modules kept unchanged, referred to as model B*), and the segmentation accuracy was compared on the PP3D dataset. The experimental results showed that the BiLSTM version achieved an AP of 48.02%, while the Transformer version achieved 48.42%, a difference of only 0.4 percentage points. However, the Transformer version increased the number of parameters by approximately 2.1 times and the training time by 1.8 times. Given this trade-off between performance gain and computational cost, the BiLSTM provides a better efficiency-accuracy balance in this application scenario.

When the cross-slice attention-embedding module was replaced with a simple concatenation of semantic and instance features (model C), the AP declined to 49.42%, a reduction of 3.60 percentage points. This validates the effectiveness of the attention mechanism in facilitating bidirectional fusion between the two feature types. When only the semantic-guided instance attention was removed (with the instance-guided semantic attention retained), an AP of 50.82% was obtained (model D). Conversely, when only the instance-guided semantic attention was removed, an AP of 51.52% was recorded (model E). Both results are lower than that of the full model, demonstrating the necessity of bidirectional fusion. Finally, when the semantic-aware Mean-Shift clustering was replaced with a traditional unguided clustering algorithm (model F), the AP decreased to 51.69%, a reduction of 1.33 percentage points. This suggests that incorporating semantic priors helps reduce instance adhesion. Taken together, these ablation results support the design of each module.
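The per-variant drops discussed in this subsection can be recomputed directly from Table 2. The following Python sketch is only a bookkeeping aid: the AP values are copied from Table 2, while the dictionary layout and the helper name `ap_drops` are illustrative choices and not part of the authors' code.

```python
# AP values (%) on PP3D, copied from Table 2.
ABLATION_AP = {
    "Model A": 46.22,   # single-direction slicing
    "Model B": 48.02,   # concatenated slice features instead of BiLSTM
    "Model B*": 48.42,  # Transformer encoder instead of BiLSTM
    "Model C": 49.42,   # concatenation instead of attention embedding
    "Model D": 50.82,   # without semantic-guided instance attention
    "Model E": 51.52,   # without instance-guided semantic attention
    "Model F": 51.69,   # unguided clustering
}
FULL_MODEL_AP = 53.02


def ap_drops(ablation_ap, full_ap):
    """Return each variant's AP drop (percentage points) versus the full model."""
    return {name: round(full_ap - ap, 2) for name, ap in ablation_ap.items()}


drops = ap_drops(ABLATION_AP, FULL_MODEL_AP)

# List the variants from the largest drop to the smallest.
for name, drop in sorted(drops.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<9} -{drop:.2f} points")
```

Sorting by drop makes the relative importance of the modules visible at a glance: the slicing strategy and the sequence model account for the largest losses, and the clustering variant for the smallest.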

To evaluate the robustness of the proposed method against variations in plant orientation, a set of supplementary experiments was designed. The test samples from the PP3D dataset were rotated about the vertical axis (z-axis) by six angles (0°, 30°, 60°, 90°, 120°, and 150°). Inference was performed using the trained model (which was trained with rotation data augmentation), and the mean average precision for instance segmentation was calculated for each rotation angle. Across these rotation angles, MPRSF-CSA maintained consistent segmentation performance, with a standard deviation of only 0.24. In contrast, the baseline model without rotation augmentation (RSSNet) exhibited larger performance fluctuations on the same test set, with a standard deviation of 0.98. These findings support the effectiveness of the rotation data augmentation strategy and also suggest that the proposed multi-directional recursive slicing strategy captures the intrinsic features of plant structures rather than relying excessively on absolute spatial orientation.
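The rotation protocol described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: `evaluate_ap` is a hypothetical caller-supplied function that runs inference on a rotated point cloud and returns its AP, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
import math
import statistics


def rotate_z(points, angle_deg):
    """Rotate (x, y, z) points about the z-axis by angle_deg degrees."""
    theta = math.radians(angle_deg)
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]


def rotation_robustness(points, evaluate_ap, angles=(0, 30, 60, 90, 120, 150)):
    """Return (AP per angle, standard deviation of AP across angles).

    evaluate_ap is a hypothetical callable: rotated points -> AP value.
    The population standard deviation is used here as an assumption.
    """
    ap_per_angle = {a: evaluate_ap(rotate_z(points, a)) for a in angles}
    spread = statistics.pstdev(ap_per_angle.values())
    return ap_per_angle, spread
```

A smaller returned spread indicates that segmentation quality depends less on how the plant happens to be oriented about the vertical axis.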

3.3.3. Running Time

In addition to segmentation accuracy, computational efficiency is also regarded as an important metric for evaluating the practicality of a method. The time overhead for both training and inference was measured under identical hardware configurations. In terms of efficiency, approximately 16 h are required for the method to complete 150 training epochs on the Soybean-MVS dataset (with a batch size of 8), averaging 6.4 min per epoch. During the inference phase, an average processing time of 0.45 s is achieved for a single soybean plant point cloud containing approximately 30,000 points. Of this inference time, approximately 0.08 s (17.8%) is consumed by the semantic-aware Mean-Shift clustering stage, which represents the primary computational overhead beyond the network forward pass. This additional cost is justified by the clustering stage’s critical role in refining instance boundaries and resolving instance adhesion, as demonstrated in the ablation study. On the PP3D dataset, given the comparable scale of the point clouds, the training and inference times are generally consistent with those on Soybean-MVS. Regarding memory consumption, peak GPU memory usage during training is approximately 10.5 GB, while only 3.2 GB is required during inference. The clustering stage contributes an additional 0.8 GB of CPU memory due to the construction of the neighborhood graph, which remains within acceptable limits for standard workstations. These results indicate that, while high accuracy is maintained by the method, its computational overhead remains within an acceptable range for practical applications.
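The derived figures in this subsection follow from simple arithmetic on the reported measurements. The short Python check below uses only numbers stated in the text; the variable names are illustrative, and the points-per-second throughput is a derived quantity that the text does not report.

```python
# Measurements reported in the text (Soybean-MVS, batch size 8).
TRAIN_HOURS = 16.0         # total training time
EPOCHS = 150               # number of training epochs
INFER_SECONDS = 0.45       # average inference time per plant
CLUSTER_SECONDS = 0.08     # semantic-aware Mean-Shift clustering stage
POINTS_PER_PLANT = 30_000  # approximate points per soybean plant

minutes_per_epoch = TRAIN_HOURS * 60 / EPOCHS
cluster_share = CLUSTER_SECONDS / INFER_SECONDS

# Everything in the inference time other than clustering.
non_cluster_seconds = INFER_SECONDS - CLUSTER_SECONDS

# Derived throughput (not reported in the text).
points_per_second = POINTS_PER_PLANT / INFER_SECONDS

print(f"Training: {minutes_per_epoch:.1f} min per epoch")
print(f"Clustering share of inference: {cluster_share:.1%}")
print(f"Non-clustering inference time: {non_cluster_seconds:.2f} s")
print(f"Throughput: {points_per_second:,.0f} points/s")
```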

3.4. Performance Comparison

To comprehensively evaluate the effectiveness of the proposed MPRSF-CSA, we compared it with three state-of-the-art point cloud segmentation methods on the Soybean-MVS and PP3D datasets. As shown in Table 1, MPRSF-CSA achieves the best performance on every evaluation metric for both datasets. Figure 7 visualizes representative comparative results.

Taking the PP3D dataset as an example, our method leads PointGroup [21] by 31.35 percentage points in AP. As a general clustering-based method, PointGroup relies on simple geometric clustering strategies, which struggle to accurately distinguish adjacent leaves and stems in complex plant structures with severe inter-instance adhesion. In contrast, our recursive slicing mechanism progressively disentangles such intricate spatial relationships from multiple perspectives, laying the foundation for precise instance separation.

Even when compared to SCNet [44], a method specifically designed for plant segmentation, our approach still achieves a notable improvement of 6.27 percentage points in AP on PP3D. This indicates that our multi-perspective recursive slicing framework holds a distinct advantage over traditional plant segmentation paradigms based on center prediction and Gaussian modeling. In regions with dense leaf occlusion, SCNet tends to fail in center prediction, whereas our method robustly captures the local geometric integrity of each instance through serialized analysis of multi-view 2D slices, enabling more accurate localization and segmentation of occluded leaves.

Furthermore, our method exhibits clear advantages over the slice-based method MRSliceNet [47]. Although MRSliceNet also leverages slice features for instance segmentation, it processes these slices independently and lacks a mechanism to explicitly model the contextual relationships across different slices. Consequently, it struggles to maintain instance consistency when dealing with discontinuities caused by occlusions or complex plant morphology, often leading to fragmented instance predictions. By contrast, the core innovation of our method, the Cross-Slice Attention mechanism, actively models the inherent continuity of the same instance across different slice sequences while differentiating between sequences of different instances. This facilitates a qualitative leap from slice-level feature extraction to comprehensive instance-aware reasoning.

In summary, through its unique multi-perspective recursive slicing framework, MPRSF-CSA transforms the complex 3D instance segmentation task into a series of interrelated 2D sequence analysis problems. The cross-slice attention mechanism effectively addresses the challenges of contextual modeling and instance consistency preservation. This framework is particularly suitable for handling complex topological structures and severe occlusions, as commonly encountered in plant scenes, achieving state-of-the-art instance segmentation performance while maintaining a reasonable parameter count.
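The AP margins quoted in this comparison refer to PP3D; the same calculation can be repeated for Soybean-MVS. The Python sketch below recomputes the margins for both datasets from the AP columns of Table 1. The data layout and the helper name `ap_margins` are illustrative assumptions.

```python
# Overall AP (%) copied from Table 1.
BASELINE_AP = {
    "PointGroup [21]": {"Soybean-MVS": 25.98, "PP3D": 21.67},
    "SCNet [44]": {"Soybean-MVS": 44.01, "PP3D": 46.75},
    "MRSliceNet [47]": {"Soybean-MVS": 46.75, "PP3D": 48.95},
}
OURS_AP = {"Soybean-MVS": 53.80, "PP3D": 53.02}


def ap_margins(baselines, ours):
    """Return the AP margin (percentage points) of MPRSF-CSA over each baseline."""
    return {
        method: {ds: round(ours[ds] - ap, 2) for ds, ap in per_dataset.items()}
        for method, per_dataset in baselines.items()
    }


margins = ap_margins(BASELINE_AP, OURS_AP)
for method, per_dataset in margins.items():
    for dataset, margin in per_dataset.items():
        print(f"{method:<16} {dataset:<12} +{margin:.2f} points")
```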

3.5. Limitations

Despite the strong performance demonstrated by the proposed MPRSF-CSA framework, we identify several limitations that warrant further investigation. This section provides a critical analysis of the method’s behavior in challenging scenarios, drawing directly from the experimental results presented above.

As observed in the segmentation results on the PP3D dataset, the performance of the proposed method degrades in regions characterized by extremely dense foliage and severe self-occlusion. In cases where multiple leaves are tightly stacked with minimal visible surface area, the recursive slicing strategy may fail to capture sufficient geometric cues to separate adjacent instances. The attention-embedding module, while effective at fusing semantic and instance features, relies on the availability of discriminative local geometries that may be absent under extreme occlusion.

The multi-directional recursive slicing strategy, while generally effective, has inherent limitations. The fixed set of slicing directions may not be optimal for all plant architectures. For species with highly anisotropic growth patterns—such as grasses with predominantly vertical orientation or trailing plants with horizontal sprawl—the current slicing configuration may not capture the dominant spatial structure equally well. Furthermore, the intra-slice pooling operation, necessary for generating slice-level features, inevitably discards some fine-grained local details, which may be critical for distinguishing morphologically similar but distinct organs.
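The loss of detail caused by intra-slice pooling can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes uniform-width slices along a single direction and element-wise max pooling of per-point features, and the function names `slice_indices` and `max_pool_slices` are hypothetical.

```python
import math


def slice_indices(points, direction, num_slices):
    """Assign each 3D point to one of num_slices uniform-width slices along direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    # Scalar position of every point along the slicing direction.
    proj = [sum(p[i] * unit[i] for i in range(3)) for p in points]
    lo, hi = min(proj), max(proj)
    width = (hi - lo) / num_slices or 1.0  # guard against a zero extent
    return [min(int((t - lo) / width), num_slices - 1) for t in proj]


def max_pool_slices(features, indices, num_slices):
    """Max-pool per-point feature vectors into one vector per slice (None if empty)."""
    pooled = [None] * num_slices
    for feat, idx in zip(features, indices):
        if pooled[idx] is None:
            pooled[idx] = list(feat)
        else:
            pooled[idx] = [max(a, b) for a, b in zip(pooled[idx], feat)]
    return pooled


# Four points sliced along the z-axis into two slices.
points = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.1), (0.0, 0.0, 1.0), (0.0, 0.0, 0.9)]
features = [[1, 5], [3, 2], [0, 0], [4, 1]]
indices = slice_indices(points, direction=(0.0, 0.0, 1.0), num_slices=2)
pooled = max_pool_slices(features, indices, num_slices=2)
print(indices, pooled)
```

In this toy case the first two points fall into the same slice, and their distinct feature vectors `[1, 5]` and `[3, 2]` collapse into the single vector `[3, 5]`, which matches neither original. This is the kind of fine-grained information that a slice-level representation cannot recover. Changing `direction` also changes which points share a slice, which is why a fixed set of directions may suit some plant architectures better than others.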

4. Conclusions

In this paper, we have presented a novel multi-perspective recursive slice framework with cross-slice attention for instance-aware plant point cloud segmentation. Confronting the inherent challenges of complex plant structures, non-uniform density, and blurred instance boundaries, our approach rethinks the problem by transforming disordered point clouds into ordered sequences. Extensive evaluations on two public datasets, Soybean-MVS and PP3D, confirm the superiority of our proposed framework over current plant segmentation methods. Qualitative results demonstrate that our model excels in preserving fine boundaries and exhibits robust adaptability across a variety of complex plant specimens and growth stages. These experimental findings validate the efficacy of our slice-based sequence modeling and the critical role of explicit semantic-instance feature coupling in addressing the unique challenges of plant point cloud segmentation.

Despite its strong performance, the current framework has certain limitations. The slicing process, while efficient, may still overlook extremely fine-grained details within a slice. Furthermore, the performance can be influenced by the chosen slicing directions, and an optimal set of directions might vary for plant species with vastly different growth habits. For future work, we plan to explore adaptive slicing strategies that can dynamically adjust based on the local point cloud density and plant structure. Investigating the integration of more powerful sequence models, such as Transformers, to replace the BiLSTM could further enhance the modeling of long-range dependencies. Finally, extending this framework to other challenging biological point cloud segmentation tasks, such as root system architecture analysis or whole-forest segmentation, presents a promising and impactful research direction.

Author Contributions

Conceptualization, S.L., J.S. and T.J.; methodology, S.L., S.F. and T.J.; software, S.L., S.F. and T.J.; validation, S.L., S.F., L.Z., P.W., X.C. and T.J.; formal analysis, S.L., J.S. and T.J.; investigation, S.L., L.Z., P.W., X.C. and T.J.; resources, P.W., X.C., L.X., J.S. and T.J.; data curation, S.L., P.W., X.C. and T.J.; writing—original draft preparation, S.L., S.F., J.S. and T.J.; writing—review and editing, S.L., S.F., J.S. and T.J.; visualization, S.L., S.F. and L.Z.; supervision, L.X., J.S. and T.J.; project administration, P.W., X.C., L.X., J.S. and T.J.; funding acquisition, P.W., X.C., L.X., J.S. and T.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Open Fund of the National Engineering Research Center of Geographic Information System (grant number NERCGIS-202409), the National Natural Science Foundation of China (grant number 42401552), the Natural Science Foundation of Jiangsu Province, China (grant number BK20240598), the Natural Science Foundation of the Higher Education Institutions of Jiangsu Province, China (grant number 24KJB420005), the Open Fund of Key Laboratory of Land Satellite Remote Sensing Application, Ministry of Natural Resources of China (grant number KLSMNR-K202305), and the grant from State Key Laboratory of Resources and Environmental Information System.

Data Availability Statement

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

Acknowledgments

The authors acknowledge all the reviewers for their valuable comments.

Conflicts of Interest

Pengcheng Wang and Xiaorong Cheng were employed by the Jiangsu Yushu Information Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Jin, S.; Li, D.; Yun, T.; Tang, J.; Wang, K.; Li, S.; Yang, H.; Yang, S.; Xu, S.; Cao, L.; et al. Deep learning for three-dimensional (3D) plant phenomics. Plant Phenomics 2025, 7, 100107.
2. Roggiolani, G.; Bailey, B.N.; Behley, J.; Stachniss, C. Generation of labeled leaf point clouds for plants trait estimation. Plant Phenomics 2025, 7, 100071.
3. Li, D.; Shi, G.; Kong, W.; Wang, S.; Chen, Y. A leaf segmentation and phenotypic feature extraction framework for Multiview stereo plant point clouds. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 2321–2336.
4. Rana, S.; Hensel, O.; Nasirahmadi, A. From vineyard to vision: Multi-domain analysis and mitigation of grape cluster detection failures in complex viticultural environments. Results Eng. 2026, 29, 108833.
5. Jin, S.; Su, Y.; Wu, F.; Pang, S.; Gao, S.; Hu, T.; Liu, J.; Guo, Q. Stem-Leaf Segmentation and Phenotypic Trait Extraction of Individual Maize Using Terrestrial LiDAR Data. IEEE Trans. Geosci. Remote Sens. 2019, 57, 1336–1346.
6. Yang, X.; Shang, Z.; Huang, H.; Liu, C.; Xu, T. SMFCA-Net: Sparse Multifrequency Cross-Attention Network for Single-Plant Point Cloud Classification and Segmentation. IEEE Trans. Geosci. Remote Sens. 2026, 64, 5700917.
7. Shin, S.; Zhou, K.; Vankadari, M.; Markha, A.; Trigoni, N. Spherical Mask: Coarse-to-Fine 3D Point Cloud Instance Segmentation with Spherical Representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024.
8. Wang, P.; Tang, Y.; Liao, Z.; Yan, Y.; Dai, L.; Liu, S.; Jiang, T. Road-Side Individual Tree Segmentation from Urban MLS Point Clouds Using Metric Learning. Remote Sens. 2023, 15, 1992.
9. Roh, W.; Jung, H.; Nam, G.; Yeom, J.; Park, H.; Yoon, S.; Kim, S. Edge-Aware 3D Instance Segmentation Network with Intelligent Semantic Prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024.
10. Fang, L.; You, Z.; Shen, G.; Chen, Y.; Li, J. A joint deep learning network of point clouds and multiple views for roadside object classification from lidar point clouds. ISPRS J. Photogramm. Remote Sens. 2022, 193, 115–136.
11. Lei, H.; Akhtar, N.; Mian, A. Spherical kernel for efficient graph convolution on 3D point clouds. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 3664–3680.
12. Jiang, T.; Sun, J.; Liu, S.; Zhang, X.; Wu, Q.; Wang, Y. Hierarchical semantic segmentation of urban scene point clouds via group proposal and graph attention network. Int. J. Appl. Earth Obs. Geoinf. 2021, 105, 102626.
13. Huang, S.; Ma, Z.; Mu, T.; Fu, H.; Hu, S. Supervoxel convolution for online 3D semantic segmentation. ACM Trans. Graph. 2021, 40, 1–15.
14. Charles, R.Q.; Su, H.; Kaichun, M.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
15. Jiang, T.; Wang, Y.; Liu, S.; Cong, Y.; Dai, L.; Sun, J. Local and global structure for urban ALS point cloud semantic segmentation with ground-aware attention. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5702615.
16. Luo, H.; Chen, C.; Fang, L.; Khoshelham, K.; Shen, G. MS-RRFSegNet: Multiscale regional relation feature segmentation network for semantic segmentation of urban scene point clouds. IEEE Trans. Geosci. Remote Sens. 2020, 58, 8301–8315.
17. Landrieu, L.; Simonovsky, M. Large-scale point cloud semantic segmentation with superpoint graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4558–4567.
18. Li, D.; Ahmed, F.; Wang, Z. 3D-NOD: 3D new organ detection in plant growth by a spatiotemporal point cloud deep segmentation framework. Plant Phenomics 2025, 7, 100002.
19. Yu, Q.; Du, H.; Liu, C.; Yu, X. When 3D Bounding-Box Meets SAM: Point Cloud Instance Segmentation with Weak-and-Noisy Supervision. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024.
20. Yan, M.; Zhang, J.; Zhu, Y.; Wang, H. MaskClustering: View Consensus Based Mask Graph Clustering for Open-Vocabulary 3D Instance Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024.
21. Jiang, L.; Zhao, H.; Shi, S.; Liu, S.; Fu, C.W.; Jia, J. PointGroup: Dual-Set Point Grouping for 3D Instance Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020.
22. Vu, T.; Kim, K.; Nguyen, T.; Luu, T.M.; Kim, J.; Yoo, C.D. Scalable SoftGroup for 3D Instance Segmentation on Point Clouds. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 1981–1995.
23. Chen, S.; Fang, J.; Zhang, Q.; Liu, W.; Wang, X. Hierarchical Aggregation for 3D Instance Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021.
24. Zhao, L.; Tao, W. JSNet++: Dynamic Filters and Pointwise Correlation for 3D Point Cloud Instance and Semantic Segmentation. IEEE Trans. Circuit Syst. Video Technol. 2025, 33, 1854–1867.
25. Fang, Z.; Zhuang, C.; Lu, Z.; Wang, Y.; Liu, L.; Xiao, J. BGPSeg: Boundary-Guided Primitive Instance Segmentation of Point Clouds. IEEE Trans. Image Process. 2025, 34, 1454–1468.
26. Li, D.; Shi, G.L.; Li, J.S.; Chen, Y.L.; Zhang, S.Y.; Xiang, S.Y.; Jin, S.C. PlantNet: A dual-function point cloud segmentation network for multiple plant species. ISPRS J. Photogramm. Remote Sens. 2022, 184, 243–263.
27. Yu, F.; Liu, K.; Zhang, Y.; Zhu, C.; Xu, K. PartNet: A Recursive Part Decomposition Network for Fine-grained and Hierarchical Shape Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
28. Jiang, T.; Liu, S.; Zhang, Q.; Zhao, L.; Sun, J.; Wang, Y. ShrimpSeg: A local-global structure for mantis shrimp point cloud segmentation network with contextual reasoning. Appl. Opt. 2023, 62, 97–103.
29. Huang, Q.; Wang, W.; Neumann, U. Recurrent Slice Networks for 3D Segmentation of Point Clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018.
30. Luo, Z.; Liu, D.; Li, J.; Chen, Y.; Xiao, Z.; Junior, J.; Gonçalves, W.; Wang, C. Learning sequential slice representation with an attention-embedding network for 3D shape recognition and retrieval in MLS point clouds. ISPRS J. Photogramm. Remote Sens. 2021, 176, 237–249.
31. Chen, X.; Wu, P.; Wu, Y.; Aboud, L.; Postolache, O.; Wang, Z. Ship trajectory prediction via a transformer-based model by considering spatial-temporal dependency. Intell. Robot. 2025, 5, 562–578.
32. Wang, D.; Takoudjou, S.M.; Casella, E. LeWoS: A universal leaf-wood classification method to facilitate the 3D modelling of large tropical trees using terrestrial LiDAR. Methods Ecol. Evol. 2020, 11, 376–389.
33. Li, L.; Li, Q.; Xu, G.; Zhou, P.; Tu, J.; Li, J.; Li, M.; Yao, J. A boundary-aware point clustering approach in Euclidean and embedding spaces for roof plane segmentation. ISPRS J. Photogramm. Remote Sens. 2024, 218, 518–530.
34. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5099–5108.
35. Wang, D. Unsupervised semantic and instance segmentation of forest point clouds. ISPRS J. Photogramm. Remote Sens. 2020, 165, 86–97.
36. Li, D.; Liu, L.; Xu, S.; Jin, S. TrackPlant3D: 3D organ growth tracking framework for organ-level dynamic phenotyping. Comput. Electron. Agric. 2024, 226, 109435.
37. Yang, X.; Miao, T.; Tian, X.; Wang, D.; Zhao, J.; Lin, L.; Zhu, C.; Yang, T.; Xu, T. Maize stem–leaf segmentation framework based on deformable point clouds. ISPRS J. Photogramm. Remote Sens. 2024, 211, 49–66.
38. Du, R.; Ma, Z.; Xie, P.; He, Y.; Cen, H. PST: Plant segmentation transformer for 3D point clouds of rapeseed plants at the podding stage. ISPRS J. Photogramm. Remote Sens. 2023, 195, 380–392.
39. Liu, D.; Zhang, Y.; Ren, Y.; Pan, D.; He, X.; Cong, M.; Yu, G. Multi-directional attention: A lightweight attention module for slender structures. Intell. Robot. 2025, 5, 827–843.
40. Zhao, L.; Tao, W. JSNet: Joint instance and semantic segmentation of 3d point clouds. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New York, NY, USA, 7–12 February 2020; pp. 12951–12958.
41. Jiang, T.; Wang, Y.; Liu, S.; Zhang, Q.; Zhao, L.; Sun, J. Instance recognition of street trees from urban point clouds using a three-stage neural network. ISPRS J. Photogramm. Remote Sens. 2023, 199, 305–334.
42. Wang, Y.; Jiang, T.; Yu, M.; Tao, S.; Sun, J.; Liu, S. Semantic-Based Building Extraction from LiDAR Point Clouds Using Contexts and Optimization in Complex Environment. Sensors 2020, 20, 3386.
43. Sun, Y.; Zhang, Z.; Sun, K.; Li, S.; Yu, J.; Miao, L.; Zhang, Z.; Li, Y.; Zhao, H.; Hu, Z.; et al. Soybean-MVS: Annotated Three-Dimensional Model Dataset of Whole Growth Period Soybeans for 3D Plant Organ Segmentation. Agriculture 2023, 13, 1321.
44. Zhao, L.; Wu, S.; Fu, J.; Fang, S.; Liu, S.; Jiang, T. Panoptic Plant Recognition in 3D Point Clouds: A Dual-Representation Learning Approach with the PP3D Dataset. Remote Sens. 2025, 17, 2673.
45. Li, Z.; Li, M.; Shi, L.; Li, D. A novel fatigue driving detection method based on whale optimization and Attention-enhanced GRU. Intell. Robot. 2024, 4, 230–243.
46. Kolodiazhnyi, M.; Vorontsova, A.; Konushin, A.; Rukhovich, D. OneFormer3D: One Transformer for Unified Point Cloud Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024.
47. Liu, S.; Wang, G.; Fang, H.; Huang, M.; Jiang, T.; Wang, Y. MRSliceNet: Multi-Scale Recursive Slice and Context Fusion Network for Instance Segmentation of Leaves from Plant Point Clouds. Plants 2025, 14, 3349.

Figure 1. The pipeline of the proposed MPRSF-CSA for instance-aware plant point cloud segmentation. The blue, purple, and yellow panels denote the point cloud processing module, bidirectional fusion module, and semantic-aware instance clustering module, respectively. The colored slice stacks represent recursive slicing along different directions, arrows indicate the processing flow and feature interactions between modules, and different colors in the clustering and final segmentation results denote different leaf instances.

Figure 2. The overview of the multi-directional recursive slicing and feature encoding.

Figure 3. The overview of the bidirectional fusion module with cross-slice attention embedding.

Figure 4. The overview of the semantic-aware instance clustering and joint optimization. Different colors indicate different instances or groups during the clustering and optimization process.

Figure 5. Examples of the plant point clouds in the two benchmarks. Different colors indicate different visible instances in the point cloud visualization.

Figure 6. Segmentation results of our method on two datasets. (Top): input point clouds; (Bottom): segmentation outputs.

Figure 7. Segmentation results of the Soybean-MVS and PP3D datasets. Different colors represent different instances [21,44,47].

Table 1. Plant point cloud segmentation performance evaluation. Note: AP denotes Average Precision. AP₂₅ and AP₅₀ are the AP values at IoU thresholds of 0.25 and 0.5, respectively.

Methods	Soybean-MVS AP (%)	Soybean-MVS AP₂₅ (%)	Soybean-MVS AP₅₀ (%)	PP3D AP (%)	PP3D AP₂₅ (%)	PP3D AP₅₀ (%)
PointGroup [21]	25.98	28.26	27.08	21.67	39.55	35.32
SCNet [44]	44.01	50.49	47.45	46.75	60.18	55.22
MRSliceNet [47]	46.75	50.63	48.95	48.95	69.61	60.56
MPRSF-CSA (ours)	53.80	62.94	57.34	53.02	72.11	65.37

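Table 2. Ablation study performance evaluation on the PP3D dataset. Model B* denotes the variant in which the BiLSTM is replaced by a Transformer encoder, while all other modules remain unchanged.

Methods	Configuration	AP (%)
Model A	Single-direction slicing in place of multi-directional recursive slicing	46.22
Model B	Concatenation of independent slice features in place of BiLSTM	48.02
Model B*	Transformer encoder in place of BiLSTM	48.42
Model C	Concatenation of semantic and instance features in place of cross-slice attention embedding	49.42
Model D	Without semantic-guided instance attention	50.82
Model E	Without instance-guided semantic attention	51.52
Model F	Unguided clustering in place of semantic-aware Mean-Shift clustering	51.69
Full Model	All modules	53.02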

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

