MASPC_Transform: A Plant Point Cloud Segmentation Network Based on Multi-Head Attention Separation and Position Code

Plant point cloud segmentation is an important step in 3D plant phenotyping research. Because the stems, leaves, flowers, and other organs of plants are often intertwined and small, plant point cloud segmentation is more challenging than many other segmentation tasks. In this paper, we propose MASPC_Transform, a novel plant point cloud segmentation network based on multi-head attention separation and position code. The proposed MASPC_Transform establishes connections between similar points scattered in different areas of the point cloud space through multiple attention heads. To avoid the aggregation of multiple attention heads, we propose a multi-head attention separation loss based on spatial similarity, so that the attention positions of different attention heads are dispersed as much as possible. To reduce the impact of point cloud disorder and irregularity on feature extraction, we propose a new point cloud position coding method and use a position coding network based on this method in the local and global feature extraction modules of MASPC_Transform. We evaluate MASPC_Transform on the ROSE_X dataset; compared with state-of-the-art approaches, it achieves better segmentation results.


Introduction
Plant phenotyping studies how to measure the shape characteristics of plants, such as plant height, leaf organ size, root distribution, and fruit weight. These traits are closely related to the yield, quality, and stress resistance of plants. The study of plant phenotypes has important value for modern agricultural breeding [1], crop water and fertilizer management [2], and pest control [3].
In the process of plant phenotypic feature extraction, accurate segmentation of plant data into different organs (stems, leaves, flowers, etc.) is the premise of high-precision plant phenotyping [4]. Plant organ segmentation based on 2D images is a mature technology [5][6][7][8]. In recent years, with the development of LiDAR technology, more and more 3D spatial information of plants has been collected [9]. A plant point cloud contains the 3D spatial positions, RGB colors, normal vectors, and other information of the collected object. Compared with 2D images, plant point clouds retain more spatial detail, are less affected by occlusion, and allow the plant structure to be extracted more accurately.
By surveying the existing segmentation methods for plant point clouds, we find that they perform poorly at the junctions of different plant organs. For example, in the segmentation result in the fifth row and first column of Figure 7, some stems are erroneously recognized as leaves, and this phenomenon is more obvious where stems contact leaves. In the segmentation result in row 6 and column 2 of Figure 7, part of the small calyx is erroneously classified as leaves. The reasons for these segmentation errors are as follows: (1) In the plant segmentation task, points belonging to the same organ may be far from each other and interwoven with the point clouds of other organs. For example, in Figures 5 and 6 the stems of the plants are distributed across almost the whole point cloud space and interweave with other organs. Segmentation networks often extract features of the whole plant without distinction and do not mine the relationships among points belonging to the same organ. (2) Plant point clouds are disordered and irregular, which hampers feature extraction.
To further improve the segmentation accuracy of plant point clouds, we use the Point Transformer [10] as the backbone of the proposed MASPC_Transform. The Point Transformer uses a multi-head attention mechanism in the process of local and global feature extraction. Multi-head attention can form associations between points of the same organ; it is composed of multiple parallel self-attention mechanisms, which split the whole feature into multiple sub-feature spaces and extract feature information along multiple dimensions. However, the features extracted by multiple attention heads may tend to be similar [11]: multiple attention heads establish connections between semantically similar points at different positions in the point cloud space, but these points may all be located in the same area (for example, on the same leaf of a plant). Therefore, we propose a multi-head attention separation loss based on spatial similarity, so that the attention positions of different attention heads are separated from each other as much as possible, establishing connections between points that are distant in the point cloud space but belong to the same organ. To suppress the influence of point cloud disorder and irregularity on feature extraction, we add a position coding network to the local and global feature extraction modules of MASPC_Transform.
The main contributions of this paper are summarized as follows:

1. We propose a plant point cloud segmentation network named MASPC_Transform and evaluate its segmentation performance on the ROSE_X dataset.

2. We propose a multi-head attention separation loss based on spatial similarity. This loss disperses the attention positions of different attention heads as much as possible and establishes connections between points that are far apart but belong to the same organ, thus providing more semantic information for accurate segmentation.

3. To reduce the impact of point cloud disorder and irregularity on feature extraction, we propose a position coding method that reflects the relative positions of points, and use the resulting position coding network in the local and global feature extraction modules of MASPC_Transform.
The rest of this paper is organized as follows. Section 2 introduces related work on plant point cloud segmentation. Section 3 describes the detailed structure of MASPC_Transform. Section 4 evaluates the segmentation performance of MASPC_Transform on the ROSE_X dataset and analyzes the experimental results. The last section concludes the paper.

Related Work
Traditional methods achieve plant point cloud segmentation through geometric features [12]. These methods use geometric information such as point cloud edge points, smoothness, plane-fitting residuals [13], and curvature gradients [14] to classify and aggregate points. On this basis, clustering and model fitting [15] are further applied to complete the segmentation of point cloud data. Lee et al. [16] developed an adaptive clustering method that can segment the point cloud data of a pine forest to manage individual pine trees. This method is suitable for canopies of different sizes, but it needs a large amount of data for pre-training. Tao et al. [17] completed single-tree segmentation by setting a reasonable spacing threshold based on the characteristics of different trees and combining it with a "growth" algorithm. Xu et al. [18] applied the traditional Dijkstra shortest-path algorithm to spatial point clouds to separate tree branches and leaves. Matheus et al. [19] fused a variety of algorithms to recognize geometric characteristics in tree point clouds and combined them with the shortest-path algorithm to segment the point cloud structure, which greatly improved robustness. Li et al. [20] designed a new algorithm to more accurately estimate the inclination and azimuth of leaves in a point cloud and constructed a new projection coefficient model. In a follow-up study, Li et al. [21] developed a new path discrimination method by improving the Laplacian contraction skeletonization algorithm to obtain the relevant parameters of the branch architecture. Traditional algorithms are easily affected by outliers and noise, which reduces segmentation accuracy. The design of such algorithms often depends on empirically designed geometric features, which are only effective for specific segmentation tasks.
Compared with traditional algorithms, deep learning methods are data-driven, do not need many hand-crafted features, and perform better. The deep learning methods that have been applied to point cloud segmentation include those based on multi-view images [22], voxels [23], and raw point clouds [24][25][26]. Point-based methods process point clouds directly and largely retain the data information, so they have gradually become the mainstream research direction. Qi et al. [24] first proposed PointNet for directly processing point cloud data. This network uses a multilayer perceptron (MLP) with shared parameters to learn features and a symmetric function to obtain global features. However, it cannot make full use of the local information of points to extract fine-grained features. To solve this problem, the improved PointNet++ [25] was proposed, which performs hierarchical, progressive learning on points from large local areas to obtain accurate geometric features near each point. To better extract point cloud features, Lee et al. [27] proposed an attention network that can deal with disordered sets by adjusting the internal parameters of the network. Engel et al. [10] designed the Point Transformer network for point cloud segmentation, used multi-head attention in the network, and designed the SortNet structure to ensure the permutation invariance of the extracted features.
Although great progress has been made in deep learning segmentation algorithms for point cloud data, there is still little research on segmenting plant point clouds with deep learning methods. Wu et al. [28] adjusted the PointNet architecture to make the framework more suitable for branch and leaf segmentation and proposed a contribution-score evaluation method. Jin et al. [29] voxelized corn point clouds and applied a convolutional neural network to the voxelized data to complete a series of tasks such as corn population segmentation and individual segmentation. Dutagaci et al. [30] provided a valuable rosebush dataset and benchmarks. Turgut et al. [31] verified the segmentation accuracy of various point-based deep learning methods based on the work of Dutagaci [30] and studied the feasibility of training networks with 3D synthetic models. Compared with other point cloud segmentation tasks, plant point cloud segmentation is more challenging because the stems, leaves, flowers, and other organs of plants are very small and intertwined, so the segmentation results of existing methods are not ideal. This study proposes MASPC_Transform for the segmentation of complex point clouds such as plant point clouds. In addition to plant point cloud segmentation, it is also applicable to other point clouds with complex interwoven structures, such as forest point clouds [32].

Architecture of MASPC_Transform
The architecture of MASPC_Transform is shown in Figure 1. We use Point Transformer [10] as the network framework of MASPC_Transform. The differences between the proposed MASPC_Transform and the Point Transformer are that the proposed position coding network is used in the PC_MSG and PC_SortNet modules, and the proposed multi-head attention separation loss based on spatial similarity is added to the loss function of the entire network. MASPC_Transform consists of a feature extraction part and a detection head. The feature extraction network has two branches, local feature generation and global feature generation, which are responsible for extracting the local and global features of plant point clouds. The global features (F_Global) and local features (F_Location) are aggregated in the detection head to obtain the segmentation results.

Multi-head attention [10] in MASPC_Transform is defined as follows:

Multihead(Q, K, V) = (head_1 ⊕ head_2 ⊕ . . . ⊕ head_h) W^O, head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V) (1)

A_MH(X, Y) = LayerNorm(S + Φ(S)), S = LayerNorm(X + Multihead(X, Y, Y)) (2)

In Equation (1), Q, K, and V respectively represent the query matrix, key matrix, and value matrix of attention, and their dimensions are d_k, d_k, and d_v. head_i represents the features output by the i-th attention head, and W_i^Q, W_i^K, W_i^V, and W^O are the learnable projection parameters. The symbol ⊕ indicates that the features output by different attention heads are concatenated together. In Equation (2), LayerNorm is layer normalization [33], S is defined as S = LayerNorm(X + Multihead(X, Y, Y)), and Φ is a network module with multiple MLPs that is responsible for further feature extraction of S. A_MH(X, Y) is the prototype of all multi-head attention in the network.
A_self(P) = A_MH(P, P) (3)

A_LG(P, Q) = A_MH(P, Q), A_cross(P, Q) = A_MH(P, Q) (4)

In Equations (3) and (4), A_self, A_LG, and A_cross are derived from A_MH. A_self performs the multi-head attention calculation among all elements of P, while A_LG and A_cross handle two different sets P and Q and perform the multi-head attention calculation between the two sets.
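The multi-head attention prototype above can be sketched as follows. This is an illustrative NumPy implementation, not the authors' code: the projection matrices are drawn at random in place of learned parameters, and the head count and dimensions are placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head(Q, K, V, num_heads, rng):
    """Scaled dot-product attention per head, outputs concatenated.

    Q: (n_q, d_model); K, V: (n_kv, d_model). The projections W_i^Q,
    W_i^K, W_i^V are random here purely for illustration.
    """
    d_model = Q.shape[-1]
    d_k = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) for _ in range(3))
        scores = (Q @ Wq) @ (K @ Wk).T / np.sqrt(d_k)  # scaled dot product
        heads.append(softmax(scores) @ (V @ Wv))       # (n_q, d_k)
    return np.concatenate(heads, axis=-1)              # concat heads
```

A_self corresponds to calling `multi_head(X, X, X, ...)` on one set, while A_LG and A_cross pass two different sets as query and key/value inputs.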
We propose a multi-head attention separation loss based on spatial similarity (the loss in Figure 1). This loss acts on all the multi-head attention modules in MASPC_Transform. We therefore refer to the three attention modules affected by the proposed loss as Div_A_self, Div_A_LG, and Div_A_cross. These three multi-head attention modules are responsible for establishing connections between similar features at different positions in the point cloud space. We discuss the multi-head attention separation loss based on spatial similarity in Section 3.3.

Position Code
Plant point cloud data are a collection of points in space. Point sets are disordered and irregularly distributed, so we propose a dedicated point cloud position coding method. The position code contains the relative position information of each point and its adjacent points, so as to avoid the interference of point cloud disorder with feature extraction. The position code function δ is defined as follows:

δ = ⊕_{i=1}^{n} θ( P_i ⊕ (P_i − P_i1) ⊕ (P_i − P_i2) ⊕ . . . ⊕ (P_i − P_ij) ) (5)

Suppose there are n points in the whole point cloud space. In Equation (5), P_i is a point in a subspace obtained by the ball query, and P_i, P_i1, P_i2, P_i3, . . . , P_ij ∈ P, where P is the set of all points in the subspace. (P_i − P_i1), . . . , (P_i − P_ij) are the relative position codes of point P_i, and ⊕_{i=1}^{n}( ) gathers the position codes of all points in the space. The function θ is a multi-layer perceptron (MLP) used for feature extraction of the position code. The symbol ⊕ indicates that two features are concatenated. Equation (5) indicates that the position code δ of the point cloud space is composed of the relative position code (RPC) and the absolute position code (APC) of each point in the space. The absolute position code of a point is its coordinates. The relative position code of a point is the difference between its coordinates and those of all points in its subspace. The relative position code remains invariant to the ordering of the point cloud, and it reflects the relationship between a point and its adjacent points, which makes the feature contain more local information. The position code network is shown in Figure 2.
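The APC ⊕ RPC construction of Equation (5), before the MLP θ is applied, can be sketched as follows. This is a minimal NumPy illustration: `neighbor_idx` stands in for the j neighbours each point receives from the ball query, which is not reimplemented here.

```python
import numpy as np

def position_code(points, neighbor_idx):
    """Concatenate absolute and relative position codes (Equation (5),
    before the MLP theta).

    points:       (n, 3) coordinates -> absolute position code (APC)
    neighbor_idx: (n, j) integer indices of each point's neighbours
    """
    apc = points                                      # (n, 3)
    rpc = points[:, None, :] - points[neighbor_idx]   # (n, j, 3): P_i - P_ij
    rpc = rpc.reshape(len(points), -1)                # flatten per point
    return np.concatenate([apc, rpc], axis=-1)        # APC concat RPC
```

Note that translating the whole cloud changes the APC part but leaves the RPC part untouched, which reflects the invariance property the text relies on.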

MSG and SortNet Based on Position Code Network
In MASPC_Transform, we improve the MSG [10] in Point Transformer and use the Position code-MSG (PC-MSG) module to extract global features. The structure of PC-MSG is shown in Figure 3. PC-MSG first performs farthest point sampling (FPS); then each sampled point is taken as a center point, and three different radii are selected for the ball query. Following the method in Section 3.1, the RPC of the points is calculated in the subspace of each scale in PC-MSG. After that, the RPC features of each scale are extracted using an MLP. In Figure 3, the orange rectangles represent the extracted RPC features of each scale, the blue rectangles represent the extracted APC features of each scale, and the high-D features are the features extracted by the high-dimensional feature extraction network before the PC-MSG network.

Finally, the RPC features, APC features, and high-D features are concatenated together. Because the network structures of the different scales in the MSG are the same, the feature extraction process of the second scale is omitted in Figure 3.
We also improved the SortNet in the Point Transformer network [10], replacing it with PC-SortNet, which uses the position code. As shown in Figure 4, in PC-SortNet the input features first pass through multiple MLPs, which reduce the feature dimension to 1. This feature yields a learnable importance score for each point in the point cloud space. After that, the k points with the highest scores are selected by the Top-k module. We take these k points as the centers of ball queries and extract the features of the regions within the balls. We use a method similar to a skip connection to concatenate the features of different stages. As shown by the red PC blocks in Figure 4, we use the position code proposed in Section 3.1 when performing the ball query and extracting features.
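The score-and-select step of PC-SortNet can be sketched as follows. This is an assumption-laden illustration: the score MLP is replaced by a single weight vector `w`, whereas in the real network the score is learned end-to-end.

```python
import numpy as np

def select_top_k(features, k, w):
    """Sketch of PC-SortNet's point selection: reduce each point's
    feature to a scalar importance score, then keep the Top-k points.
    `w` stands in for the score MLP.
    """
    scores = features @ w                 # (n,) one score per point
    top = np.argsort(scores)[-k:][::-1]   # indices of the k highest scores
    return top, scores
```

The returned indices would then serve as ball-query centers for local feature extraction.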

Multi-Head Attention Separation Loss Based on Spatial Similarity
When multi-head attention is used for feature extraction, the generated attention spaces may be similar [11], which causes multiple attention spaces to overlap each other, resulting in repeated extraction in some areas and insufficient feature extraction in others. Therefore, we propose a multi-head attention separation loss based on spatial similarity, which makes the attention positions of the segmentation network tend to separate. It is defined as follows:

Separation_Loss = − (1 / n²) Σ_{F_sa^i, F_sa^j ∈ F} ( 1 − (F_sa^i · F_sa^j) / (‖F_sa^i‖₂ ‖F_sa^j‖₂) ) (6)

In Equation (6), F_sa^i and F_sa^j are different attention feature spaces output by the multi-head attention, F is the set of feature spaces output by the attention mechanism, and ‖ . ‖₂ denotes the 2-norm of a matrix. Equation (6) computes the average cosine distance between all pairs of output feature spaces. Cosine distance measures the difference of feature spaces in direction, so it can be used to evaluate their similarity. Dividing by n² keeps the calculated value in a reasonable range and avoids difficulties in network training. The negative sign makes Separation_Loss penalize network parameters that cause F_sa^i and F_sa^j to become similar. We take Separation_Loss as part of the loss function and train the network so that the attention features tend to be diverse. The loss function of MASPC_Transform is as follows:

Loss_CrossEntropy = − Σ_x p(x) log q(x) (7)

Loss = Loss_CrossEntropy + Loss_scal × Separation_Loss (8)

In Equation (7), p(x) is the true classification probability distribution of the input point cloud, and q(x) is the predicted probability distribution given by the network. Equation (7) measures the difference between the classification result and the ground truth; the smaller the value of Loss_CrossEntropy, the more realistic the prediction given by the network. As shown in Equation (8), we use Loss_CrossEntropy and Separation_Loss together as MASPC_Transform's loss function.
where Loss_scal is the weight of Separation_Loss in the loss function. Training MASPC_Transform with this loss function makes the multiple attention feature spaces distinct.
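One plausible reading of Equation (6) can be sketched as follows; the exact normalization of the original loss is not fully specified here, so treat this as an illustrative NumPy version rather than the authors' implementation.

```python
import numpy as np

def separation_loss(head_features):
    """Negated average cosine distance between head outputs: the loss
    decreases as the attention feature spaces of different heads grow
    apart, so minimizing it together with the task loss encourages
    diverse heads.

    head_features: (n_heads, ...) one feature space per attention head.
    """
    n = len(head_features)
    flat = head_features.reshape(n, -1)
    unit = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    cos_sim = unit @ unit.T                   # pairwise cosine similarity
    cos_dist = 1.0 - cos_sim                  # pairwise cosine distance
    return -cos_dist.sum() / n ** 2           # negated average distance
```

Under this reading, identical heads give a loss of 0 while orthogonal heads give a strictly lower (more negative) loss, matching the stated intent of the penalty.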

Data Set
We evaluated the performance of MASPC_Transform on the ROSE_X dataset [30]. The ROSE_X dataset contains a total of 11 rose point clouds. The rose point cloud data contain three semantic labels: flower, leaf, and stem. The petals, calyx, and buds of a rose are all marked with the "flower" label, and the stem and petioles are all marked with the "stem" label. We use nine rose point clouds to train the network and the other two to test the segmentation performance of the trained network. We denote the two roses used for testing as test_R1 and test_R2. Because a single rose point cloud is large and contains many points, while the amount of data that can be processed at one time is limited, it is necessary to divide the point cloud into smaller blocks. We adopt the same blocking method as [30]: the size and number of points of each block are kept as consistent as possible, and the structure within a block is kept as complete as possible. With this method, we divided the nine rose point clouds used for training into 596 blocks and the two point clouds used for testing into 143 blocks.
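The exact blocking procedure of [30] is not reproduced here; the following is only a minimal stand-in that chops a cloud along one axis into blocks with near-equal point counts, to illustrate the general idea.

```python
import numpy as np

def split_into_blocks(points, n_blocks):
    """Minimal stand-in for the blocking scheme of [30]: sort points
    along one axis and chop them into blocks with near-equal point
    counts. The actual method also tries to keep the structure within
    each block complete, which this sketch does not attempt.
    """
    order = np.argsort(points[:, 0])          # sort along x as a proxy
    return [points[idx] for idx in np.array_split(order, n_blocks)]
```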

Implementation Details
For the model training, the Adam optimizer is used to update and optimize the network parameters. The initial learning rate is set to 0.001 and the batch size is 16. The GPU model is NVIDIA GeForce RTX 2080Ti, operating system is Ubuntu 18.04 LTS, CUDA version is 11.0. The proposed model is implemented in PyTorch with Python version 3.6. When training MASPC_Transform network, the input point cloud only contains three-dimensional X-Y-Z coordinates, and the number of input points is 2048.

Evaluation Methodology
We use the Intersection over Union (IoU) and Mean Intersection over Union (MIoU) to evaluate the performance of all networks. IoU is the ratio of the intersection to the union of the predicted point set and the ground-truth point set, and MIoU is the average IoU over all categories. The higher these two indicators, the better the segmentation of the point cloud. The mathematical definitions are as follows:

IoU_c = TP_c / (TP_c + FP_c + FN_c)

MIoU = (1/k) Σ_c IoU_c

where TP_c, FP_c, and FN_c are the number of positive samples of category c that are correctly identified, the number of negative samples that are misreported, and the number of positive samples that are missed, respectively; c ∈ {flower, stem, leaf}, and k is the number of categories.
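The two metrics above translate directly into code; a small self-contained NumPy version:

```python
import numpy as np

def iou_per_class(pred, gt, classes):
    """IoU_c = TP_c / (TP_c + FP_c + FN_c) for each class label."""
    ious = {}
    for c in classes:
        tp = np.sum((pred == c) & (gt == c))  # correctly identified positives
        fp = np.sum((pred == c) & (gt != c))  # misreported negatives
        fn = np.sum((pred != c) & (gt == c))  # missed positives
        ious[c] = tp / (tp + fp + fn)
    return ious

def miou(pred, gt, classes):
    """Mean IoU over the k classes."""
    return sum(iou_per_class(pred, gt, classes).values()) / len(classes)
```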

Segmentation Results
Table 1 shows the segmentation results of different segmentation networks on the ROSE_X dataset, including PointNet [24], PointNet++ [25], DGCNN [34], PointCNN [35], ShellNet [36], RIConv [37], and the proposed MASPC_Transform. In Table 1, we can see that MASPC_Transform has the highest MIoU and achieves the best segmentation results on both the flower and stem classes. As an improved version of PointNet, PointNet++ can flexibly extract local features by adjusting the neighborhood radius and thus can extract the features of small plant organs, so it achieves the best segmentation result on the leaf class. The IoU of MASPC_Transform on the leaf class is slightly lower than that of PointNet++, but its MIoU is higher.

Visual Effects
Figures 5 and 6 show the segmentation results of the different segmentation networks on test_R1 and test_R2, respectively. Figures 5a and 6a are the ground truth of test_R1 and test_R2. In Figures 5a and 6a, we can see that the stems, leaves, and flowers of the two plants interlace and occlude each other, which creates great difficulties for segmentation algorithms. In Figure 5d,f and Figure 6d,f, we can see that PointNet and DGCNN hardly segment the different plant organs at all. It can be seen from the areas within the dotted circles in Figures 5 and 6 that the comparison networks (Point Transformer, PointNet++, DGCNN, PointCNN, ShellNet, and RIConv) segment details worse than MASPC_Transform. As shown in Figure 5c, Point Transformer mistakenly classifies some petals as leaves. As shown in Figure 5e, PointNet++ mistakenly classifies part of the calyx at the top as leaves and stems. As shown in Figure 5g, PointCNN mistakenly classifies part of the calyx at the top as stems and the stems in the lowest red circle as leaves. As shown in Figure 5h, ShellNet mistakenly classifies the calyx in the red circle as leaves. As shown in Figure 5i, RIConv mistakenly classifies some flowers in the top red circle as leaves. In Figure 6, the comparison networks also show false segmentations. The proposed MASPC_Transform has the best segmentation results for the interlaced parts of different plant organs.
To show the segmentation results of each method more clearly, we extracted some regions from the segmented plant point clouds and enlarged them in Figure 7. As can be seen from the first column of Figure 7, the objects to be segmented are leaves and stems. Among the segmentation results of all methods, those of the proposed MASPC_Transform are the most similar to the ground truth. PointNet failed to segment stems and leaves. DGCNN and PointCNN hardly segment the stem and leaf correctly. The stems segmented by PointNet++, ShellNet, and RIConv are shorter than those segmented by MASPC_Transform, and these methods mistakenly classify the stems between two leaves as leaves. Point Transformer also mistakenly classifies some stems as leaves at the intersections of leaves. In the segmentation results of the second and third columns of Figure 7, MASPC_Transform also achieves the best results.
It can be seen from the segmentation results shown in Figures 5-7 that MASPC_Transform performs best. This is because the multi-head attention and the multi-head attention separation loss based on spatial similarity in MASPC_Transform establish connections between points of the same kind (points with similar semantics) scattered in different regions of the point cloud space. In areas where multiple categories interlace, these connections help MASPC_Transform achieve better segmentation of details.

Ablation Studies
Table 2 shows the results of our ablation studies on the ROSE_X dataset. In the ablation studies, we used the original Point Transformer [10] as the baseline. In Table 2, "Without RPC" denotes a network that does not use the RPC but is still trained with our Equation (8). "Without Separation_Loss" means that the proposed multi-head attention separation loss is not used and the network is trained only with the cross-entropy loss; note that the RPC is used in the "Without Separation_Loss" network. The last column presents the results of the proposed MASPC_Transform. From Table 2, we can see that the IoU and MIoU values of MASPC_Transform are the highest. The IoU and MIoU of each category for MASPC_Transform without the multi-head attention separation loss and for MASPC_Transform without the relative position code are lower than those of the full MASPC_Transform, but better than those of the Point Transformer.
The results of the ablation studies verify the effectiveness of the multi-head attention separation loss (Separation_Loss) and the position code (PC). According to the experimental results in Section 4.4, the proposed MASPC_Transform outperforms the state-of-the-art approaches, and the visualization results shown in Figures 5-7 confirm this. The visualization results of the comparison approaches on rose point clouds with interlaced stems, leaves, and flowers are not as good as those of MASPC_Transform. This shows that our multi-head attention separation loss can disperse the attention positions of different attention heads as much as possible and establish connections between points that are far apart but belong to the same organ. The comparison approaches do not have this ability, so they treat two flowers (stems or leaves) that are far apart as belonging to different categories.

Conclusions
We propose a plant point cloud segmentation network named MASPC_Transform. To make the attention positions of the different attention heads of MASPC_Transform as dispersed as possible, we propose a multi-head attention separation loss based on spatial similarity. To reduce the impact of point cloud disorder and irregularity on feature extraction, we use position coding in the local and global feature extraction modules of MASPC_Transform. We evaluated the proposed MASPC_Transform on the ROSE_X dataset, where it achieved better segmentation results than state-of-the-art approaches.