Article

TSINet: A Semantic and Instance Segmentation Network for 3D Tomato Plant Point Clouds

College of Information Science and Engineering, Shandong Agricultural University, Taian 271018, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(15), 8406; https://doi.org/10.3390/app15158406
Submission received: 6 June 2025 / Revised: 19 July 2025 / Accepted: 25 July 2025 / Published: 29 July 2025

Abstract

Accurate organ-level segmentation is essential for achieving high-throughput, non-destructive, and automated plant phenotyping. To address the challenge of intelligent acquisition of phenotypic parameters in tomato plants, we propose TSINet, an end-to-end dual-task segmentation network designed for effective and precise semantic labeling and instance recognition of tomato point clouds, based on the Pheno4D dataset. TSINet adopts an encoder–decoder architecture, where a shared encoder incorporates four Geometry-Aware Adaptive Feature Extraction Blocks (GAFEBs) to effectively capture local structures and geometric relationships in raw point clouds. Two parallel decoder branches are employed to independently decode shared high-level features for the respective segmentation tasks. Additionally, a Dual Attention-Based Feature Enhancement Module (DAFEM) is introduced to further enrich feature representations. The experimental results demonstrate that TSINet achieves superior performance in both semantic and instance segmentation, particularly excelling in challenging categories such as stems and large-scale instances. Specifically, TSINet achieves 97.00% mean precision, 96.17% recall, 96.57% F1-score, and 93.43% IoU in semantic segmentation and 81.54% mPrec, 81.69% mRec, 81.60% mCov, and 86.40% mWCov in instance segmentation. Compared with state-of-the-art methods, TSINet achieves balanced improvements across all metrics, significantly reducing false positives and false negatives while enhancing spatial completeness and segmentation accuracy. Furthermore, we conducted ablation studies and generalization tests to systematically validate the effectiveness of each TSINet component and the overall robustness of the model. This study provides an effective technological approach for high-throughput automated phenotyping of tomato plants, contributing to the advancement of intelligent agricultural management.

1. Introduction

Recognized for its high nutritional content, rapid growth cycle, and strong environmental adaptability, the tomato has become a crop of significant economic value, contributing substantially to both the fresh produce market and the food processing industry [1]. Beyond its agricultural value, the tomato serves as a model plant in genetics, molecular biology, and physiological ecology, providing a crucial foundation for research on modern agricultural technologies, genetic improvement, and precision breeding [2,3]. Accurate acquisition of organ-level phenotypic traits, such as leaf area and plant height, is essential for monitoring growth status, evaluating environmental responses, and supporting breeding decisions. Traditional phenotyping techniques mostly rely on manual measurements, which are time-consuming, labor-intensive, and destructive, making them unsuitable for dynamic, large-scale, non-invasive applications [4].
With the advancement of precision agriculture, the demand for high-throughput, automated, and non-destructive phenotyping technologies is increasing, making automated plant phenotyping a research hotspot. Early approaches primarily utilized 2D image-based visual perception, traditional learning algorithms, and neural network techniques [5,6,7]. However, 2D imaging suffers from limitations such as occlusion and a lack of depth information, making it inadequate for capturing the complex 3D structures of plants, especially in cases of overlapping leaves and variable postures, such as in tomato plants.
To address these challenges, 3D vision technologies have been introduced into plant phenotyping. 3D plant modeling provides rich spatial structure information suitable for non-invasive growth monitoring and structural analysis. For example, Miao et al. [8] proposed a method combining skeleton extraction and stem-leaf classification for automatic segmentation of maize seedling point clouds. Liang et al. [9] provided a 3D point cloud-based approach combining RANSAC, region expansion, and greedy projection meshing algorithms to precisely estimate phenotypic traits such as leaf area in tomato seedlings. Xing et al. [10] introduced a non-destructive image-based measurement system for automatic detection and size estimation of tomato leaves, combining 2D and 3D data from a Zivid 3D camera to enable 3D monitoring during plant growth.
With the rapid development of deep learning in 3D vision, point cloud-based neural networks such as PointNet, PointNet++, and DGCNN have been utilized for segmenting plant point clouds. For example, Qiao et al. [11] proposed a phenotyping framework that integrates 3D plant reconstruction based on the Nerfacto neural radiance field model with a lightweight point cloud segmentation network (PointSegNet), which combines global-local set abstraction and edge-aware feature propagation to achieve precise partitioning of stem and leaf structures as well as robust extraction of phenotypic features. Sun et al. [12] introduced Win-Former, a window-based Transformer for efficient semantic segmentation of maize point clouds, achieving 83.45% mIoU on the Pheno4D dataset. Zhang et al. [13] presented a framework based on an improved Red-Blue Magpie Optimization (ES-RBMO) strategy and a four-layer convolutional model for semantic classification of tomato stems and leaves. Yan et al. [14] designed PEPNet, a 3D deep learning network for extracting traits and segmenting plant organs, significantly improving inference speed and throughput in cotton stem-leaf segmentation. Yang et al. [15] trained PointNet++ and HAIS on augmented datasets for deformable point cloud-based segmentation of maize stems and leaves. Hao et al. [16] proposed a point cloud-based approach for cotton phenotyping using the PointSegAt network and an active boundary segmentation algorithm. Liu et al. [17] introduced FACNet for high-precision segmentation of pumpkin seedlings, incorporating a dual-branch feature extractor and an adaptive multi-scale fusion module to handle overlapping leaves and morphological diversity. Yao et al. [18] developed an automated pipeline using CAFPoint for accurate tomato organ segmentation and trait extraction by fusing geometric, normal, and color features. Song et al. [19] introduced CotSegNet with enhanced attention and region-growing for precise cotton organ segmentation. Xie et al. [20] proposed Plant-MAE, a self-supervised framework that reduces annotation needs while ensuring accurate trait extraction. Liu et al. [21] combined PointNet++ with Transformer to build TPointNetPlus for high-accuracy cotton segmentation. Dong et al. [22] introduced a dual-phase method integrating PointNeXt with Quickshift++ to perform plant organ instance segmentation, achieving accurate and generalizable results across monocot and dicot crops through integrated semantic segmentation and spatial clustering.
Despite these advances, 3D plant point cloud analysis still faces several challenges. Existing methods often treat semantic and instance segmentation as separate tasks, lacking synergy and leading to inconsistent results. Local geometric details and spatial contextual information are underutilized, limiting performance in complex structures such as intersecting stems and occluded organs. Moreover, there is a lack of 3D deep learning models specifically designed for tomatoes—a morphologically complex dicot—making it difficult for general-purpose models to accurately capture their phenotypic characteristics. Therefore, developing a segmentation framework that integrates semantic and instance information, is structure-aware, and is tailored to the unique features of tomato point clouds is of great importance.
To address these issues, we propose TSINet, an end-to-end dual-task deep learning network for 3D tomato plant point clouds. TSINet is designed to perform both semantic and instance segmentation to achieve fine-grained recognition and separation of tomato organs. The main contributions of this work are as follows:
  • We design a Geometry-Aware Adaptive Feature Extraction Block (GAFEB) within a shared encoder, incorporating EdgeConv, PAConv, and residual connections to enhance the extraction of local and contextual geometric features.
  • A Dual Attention-Based Feature Enhancement Module (DAFEM) is introduced, combining spatial and channel attention mechanisms to model salient regions in decoder features, improving perception in structurally complex areas.
  • We propose TSINet, a point-based dual-task segmentation network tailored to tomato plants, capable of precise semantic and instance segmentation of stems and leaves.

2. Materials and Methods

2.1. Data Acquisition

For the semantic and instance segmentation tasks in this research, we utilized 3D tomato plant point cloud data from the Pheno4D [23] dataset. Pheno4D is a large-scale spatio-temporal plant point cloud dataset that includes two crop types: maize (seven plants over 12 days) and tomato (seven plants over 20 days). All plants were cultivated in pots within a greenhouse and scanned using a high-precision 3D laser scanning system with spatial accuracy reaching the sub-millimeter level (less than 0.1 mm). Figure 1 shows three tomato plants at different growth stages.
The dataset contains continuous scanning data of seven tomato plants over multiple growth stages during a three-week period following initial sprouting, resulting in 140 high-resolution tomato plant point clouds. Among them, 77 point clouds were manually annotated, with each point labeled as “soil”, “stem”, or “leaf”. In addition, Pheno4D provides temporally consistent organ labeling, meaning that the same plant organ (e.g., a specific leaf) is assigned an identical instance label across different time points. Each individual leaf on a given plant is annotated with a distinct instance label to set it apart from other leaves, as illustrated in Figure 2b.

2.2. Data Preprocessing

Each point in the Pheno4D dataset is labeled as “soil”, “stem”, or “leaf”, with individual leaves assigned unique instance labels to distinguish them from others on the same plant. As this study focuses on structural understanding and segmentation of tomato plants, all points labeled as “soil” were removed during preprocessing to eliminate background noise and concentrate on the plant body. In addition to lowering computational complexity, this step increased the precision of semantic and instance segmentation by focusing on pertinent categories (“stem” and “leaf”). Moreover, a significant class imbalance exists in the raw data: most points belong to the “soil” and “leaf” categories, while the “stem” class is underrepresented. Such imbalance can bias deep learning models toward dominant classes and impair learning for minority classes like “stem”. By removing soil points, we mitigated this issue and ensured the model focused on essential plant structures. The final preprocessed dataset contains only “stem” and “leaf” points, which serve as input for subsequent model training and evaluation.
Since the original dataset only provides instance labels without explicit semantic labels, we derived semantic annotations from the instance information. Specifically, the semantic labels distinguish only between “stem” and “leaf”, without differentiating between individual leaves of the same plant. Figure 2 illustrates examples of raw and annotated point clouds.
To improve the deep neural network’s capacity for feature learning and prevent overfitting, we further partitioned and augmented the dataset. The 77 annotated point cloud samples were randomly divided into training and testing sets at an 8:2 ratio, ensuring that both subsets covered tomato plants from all growth stages. This strategy prevents bias in the training set toward mature plants or specific developmental stages and enables a more comprehensive evaluation of the model’s generalization ability across different stages of tomato growth. For each point cloud, we applied Farthest Point Sampling (FPS) [24] 20 times, using random initialization to both restrict the point count and augment the dataset. Each augmented point cloud was constrained to 4096 points, reducing data density while improving computational efficiency and accuracy. Although the augmented samples originate from the same raw point cloud, they exhibit significant differences in local point distributions. Table 1 summarizes the data distribution before and after augmentation. Finally, as the original Pheno4D dataset is stored in TXT format with low I/O efficiency, we converted the data into HDF5 format to accelerate loading during model training.
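The FPS-based downsampling and augmentation described above (random initialization, 20 copies per cloud, 4096 points each) can be sketched as follows. This is an illustrative NumPy implementation, not the authors' code; function names are our own.

```python
import numpy as np

def farthest_point_sampling(points, n_samples, seed=None):
    """Greedy FPS: iteratively pick the point farthest from the chosen set."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    selected = np.empty(n_samples, dtype=np.int64)
    selected[0] = rng.integers(n)  # random initialization, as in the augmentation scheme
    # Distance of every point to the nearest already-selected point.
    dist = np.linalg.norm(points - points[selected[0]], axis=1)
    for k in range(1, n_samples):
        selected[k] = np.argmax(dist)
        dist = np.minimum(dist, np.linalg.norm(points - points[selected[k]], axis=1))
    return points[selected]

def augment_by_fps(points, n_samples=4096, n_copies=20):
    """Each copy starts FPS from a different random point, so local distributions differ."""
    return [farthest_point_sampling(points, n_samples, seed=i) for i in range(n_copies)]
```

Because FPS is deterministic once the first point is fixed, varying only the random start is enough to produce copies with noticeably different local point distributions.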

2.3. Network Architecture

We propose TSINet, a dual-branch network based on an encoder–decoder architecture, specifically designed to achieve efficient and accurate semantic and instance segmentation of tomato plant point clouds. As illustrated in Figure 3, the upper branch is responsible for semantic segmentation, while the lower branch handles instance segmentation. TSINet consists of three main components: a shared encoder, parallel decoders, and a dual attention-based feature enhancement module. The shared encoder is composed of stacked Geometry-Aware Adaptive Feature Extraction Blocks (GAFEBs), which progressively encode raw point cloud coordinates into hierarchical high-dimensional features. This enables effective extraction of both local geometric structures and global contextual information, providing unified feature representation for both segmentation tasks. The parallel decoder comprises two symmetric upsampling paths, which reconstruct semantic and instance features, respectively, and facilitate their fusion. To enhance the discriminative power and contextual awareness of the fused features, we introduce a dual attention module that combines spatial attention (SA) and channel attention (CA). The final semantic predictions are obtained via an argmax operation, while instance segmentation is achieved by applying the MeanShift clustering [25] algorithm to the instance-aware features.

2.3.1. Geometry-Aware Adaptive Feature Extraction Block (GAFEB)

To effectively capture the local structures and geometric relationships in the raw tomato plant point cloud, we propose a Geometry-Aware Adaptive Feature Extraction Block (GAFEB), as illustrated in Figure 4a. GAFEB consists of four key components: Farthest Point Sampling (FPS) [24], EdgeConv [26], Position Adaptive Convolution (PAConv) [27], and residual-based feature fusion.
For each GAFEB, we first perform FPS to downsample the input point cloud by a factor of four, retaining only 25% of the points while preserving their corresponding C-dimensional features. FPS selects representative key points with uniform spatial distribution, reducing computational redundancy.
EdgeConv dynamically constructs local adjacency graphs in the feature space and performs convolution over the relational features between each point and its neighboring points, effectively capturing local geometric structures and spatial dependencies within the point cloud. As illustrated in Figure 4b, a k-nearest neighbor (kNN) graph is constructed for each sampled point, forming a graph structure $G(V, E)$, where $V = \{1, \dots, n\}$ denotes the set of sampled points and $E \subseteq V \times V$ denotes the set of edges representing spatial relationships between neighboring points. The edge feature between a point $x_i$ and its neighbor $x_{ij}$ is defined as

$$e_{ij} = h_\theta([x_i, \; x_{ij} - x_i])$$

where $x_i$ is the feature of the center point, $x_{ij}$ is the feature of a neighboring point, and $h_\theta(\cdot)$ denotes a multilayer perceptron (MLP) with shared parameters. The edge features within each neighborhood are aggregated via max pooling to produce a local structure-aware representation.
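The kNN graph construction and edge-feature assembly that feed the shared MLP can be sketched as follows. This illustrative NumPy version computes only the $[x_i, x_{ij} - x_i]$ edge inputs, not the full convolution or max pooling.

```python
import numpy as np

def knn_indices(feats, k):
    """Indices of the k nearest neighbors of each point in feature space."""
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # exclude the point itself from its neighborhood
    return np.argsort(d, axis=1)[:, :k]

def edge_features(feats, k):
    """Per-edge input [x_i, x_ij - x_i] consumed by the shared MLP h_theta."""
    idx = knn_indices(feats, k)                       # (n, k)
    center = np.repeat(feats[:, None, :], k, axis=1)  # x_i broadcast to each edge
    neighbor = feats[idx]                             # x_ij
    return np.concatenate([center, neighbor - center], axis=-1)  # (n, k, 2C)
```

A full EdgeConv would apply $h_\theta$ to the last axis and max-pool over the $k$ edges of each point.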
To enhance adaptability to varying local geometries, we introduce PAConv after EdgeConv. As shown in Figure 4c, PAConv adaptively learns convolutional kernel weights based on the spatial configuration of point clouds. It leverages a learnable ScoreNet to generate dynamic weights over a set of predefined kernels, enabling spatially adaptive local feature extraction.
$$y_i = \sum_{m=1}^{M} \alpha_m(i) \, \mathrm{Conv}_m\big(x_i, \mathcal{N}(x_i)\big)$$

where $M$ is the number of predefined kernels, $\alpha_m(i)$ are the weights generated by ScoreNet for point $x_i$, and $\mathcal{N}(x_i)$ denotes the local neighborhood of $x_i$.
Compared with static MLPs or fixed graph structures, PAConv provides improved flexibility and generalization in modeling diverse local structures. To further enrich feature representations, the outputs of EdgeConv and PAConv are concatenated and fused using an additional PAConv layer. A residual connection is then employed to add the input features to the fused features, resulting in enhanced representations. This residual design stabilizes gradient flow, alleviates feature degradation, and facilitates effective multi-scale feature integration.
By stacking four consecutive GAFEB modules, the encoder progressively abstracts hierarchical geometric features, offering a consistent and discriminative shared representation for tasks involving instance and semantic segmentation.

2.3.2. Dual-Branch Feature Decoder

In the decoding stage, two parallel branches are employed to decode the shared high-level features for different tasks. The upper branch is designed for semantic segmentation, while the lower branch focuses on instance embedding. Although both branches adopt the same architectural design, their parameters are independently optimized. Each branch performs progressive feature upsampling using feature interpolation, which restores features from subsampled point sets to denser ones. To enhance spatial detail recovery and multi-scale representation, skip connections are introduced between corresponding layers of the encoder and decoder. Finally, lightweight 1D convolutional layers are used to transform the fused features into task-specific representations. Through repeated upsampling, skip connections, and point-wise MLPs, the number of points is restored to the original input resolution, with each point represented by a 128-dimensional feature vector. Figure 5 shows the decoder’s detailed structure, using the semantic segmentation branch as an example, including the skip connections between encoder and decoder layers.
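The per-stage upsampling is described only as "feature interpolation"; a common choice in point-based encoder-decoder networks is PointNet++-style inverse-distance-weighted k-nearest-neighbor interpolation, sketched here under that assumption (illustrative NumPy, not the authors' code):

```python
import numpy as np

def interpolate_features(sparse_xyz, sparse_feats, dense_xyz, k=3, eps=1e-8):
    """Inverse-distance-weighted k-NN interpolation from a subsampled level
    back to a denser one (the upsampling step of one decoder stage)."""
    d = np.linalg.norm(dense_xyz[:, None, :] - sparse_xyz[None, :, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, :k]                  # k nearest sparse points
    w = 1.0 / (np.take_along_axis(d, idx, axis=1) + eps)
    w /= w.sum(axis=1, keepdims=True)                   # normalized inverse-distance weights
    return (sparse_feats[idx] * w[..., None]).sum(axis=1)
```

In the network, the interpolated features would then be concatenated with the skip-connected encoder features and passed through the point-wise MLPs.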

2.3.3. Dual Attention-Based Feature Enhancement Module (DAFEM)

To effectively fuse and enhance task-relevant features prior to the final prediction, we propose a Dual Attention-Based Feature Enhancement Module (DAFEM), which integrates feature fusion and attention-based refinement within a unified framework. Specifically, the semantic feature map F S and instance feature map F I output from the decoder are first processed via 1D convolution layers and combined using element-wise addition to obtain the fused feature F f . In the semantic segmentation branch, the feature map from the previous layer of F S is upsampled to match the number of points in F f and concatenated with F f . A subsequent 1D convolution is then applied to extract more semantically expressive features. Similarly, in the instance segmentation branch, the previous-layer feature map of F I is upsampled and concatenated with F f , followed by a 1D convolution to generate features with stronger instance-level discriminability.
The final fused features from each branch are then refined by two complementary attention mechanisms: spatial attention (SA) and channel attention (CA). The SA module generates a spatial attention map to emphasize geometrically salient regions, enhancing the model’s capacity to perceive spatial structure and contextual dependencies. In parallel, the CA module models inter-channel dependencies by computing a channel attention matrix that quantifies the influence among channels, enabling adaptive re-weighting of feature channels based on their importance. This mechanism enhances the expressiveness and discriminability of the learned features.
Ultimately, the attention-enhanced features from the two branches are used separately for semantic classification (via Argmax) and instance embedding (via MeanShift clustering). By jointly modeling both spatial and channel-wise dependencies, DAFEM improves global context modeling, strengthens feature representation, and enhances the robustness and accuracy of the network in both semantic and instance segmentation tasks.
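The channel attention described here, a matrix quantifying inter-channel influence, matches the formulation popularized by the Dual Attention Network; the following is a minimal sketch under that assumption. The residual scale `gamma` is shown as a fixed illustrative value, whereas a full implementation would make it a learnable parameter.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(F, gamma=0.1):
    """F: (C, N) point features. Re-weights channels by their pairwise
    affinity, then adds the result back to the input (residual form)."""
    A = softmax(F @ F.T, axis=-1)   # (C, C) channel attention matrix
    return gamma * (A @ F) + F
```

Spatial attention follows the same pattern with the affinity computed over points rather than channels.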

2.3.4. Loss Function

TSINet consists of two independent branches designed for dual tasks: predicting semantic labels of points for semantic segmentation and generating discriminative instance embeddings for instance segmentation. Accordingly, the overall loss function of TSINet is defined as the sum of the losses from these two branches:
$$L = L_{\mathrm{sem}} + L_{\mathrm{ins}}$$

where $L_{\mathrm{sem}}$ denotes the semantic segmentation loss and $L_{\mathrm{ins}}$ the instance embedding loss.
The semantic segmentation loss $L_{\mathrm{sem}}$ adopts the standard cross-entropy loss, defined as

$$L_{\mathrm{sem}} = -\sum_{i=1}^{N} \sum_{j=1}^{C} y_j^i \log \hat{y}_j^i$$

where $y_j^i$ is the one-hot encoding of the ground-truth semantic label for point $p_i$, $\hat{y}_j^i$ is the predicted probability that point $p_i$ belongs to class $j$, $N$ is the number of points, and $C$ is the number of semantic classes.
Since the input tomato plant point clouds contain an unknown number of instances, the instance embedding loss $L_{\mathrm{ins}}$ is formulated as a composite loss comprising three weighted components:

$$L_{\mathrm{ins}} = \alpha \cdot L_{\mathrm{pull}} + \beta \cdot L_{\mathrm{push}} + \gamma \cdot L_{\mathrm{reg}}$$
where L p u l l encourages points from the same instance to be close in the embedding space, L p u s h enforces separation between different instances, and L r e g is a regularization term. Their formulations are as follows:
$$L_{\mathrm{pull}} = \frac{1}{|I|} \sum_{t=1}^{|I|} \frac{1}{N_t} \sum_{i=1}^{N_t} \big[\, \lVert \mu_t - e_i \rVert - \delta_s \,\big]_+^2$$

$$L_{\mathrm{push}} = \frac{1}{|I|(|I|-1)} \sum_{t_A=1}^{|I|} \sum_{\substack{t_B = 1 \\ t_B \neq t_A}}^{|I|} \big[\, 2\delta_d - \lVert \mu_{t_A} - \mu_{t_B} \rVert \,\big]_+^2$$

$$L_{\mathrm{reg}} = \frac{1}{|I|} \sum_{t=1}^{|I|} \lVert \mu_t \rVert$$

where $[x]_+ = \max(0, x)$, $|I|$ denotes the number of instances, $N_t$ is the number of points in the $t$-th instance, $\mu_t$ is the mean embedding (centroid) of the $t$-th instance, and $e_i$ is the embedding of point $i$. $\delta_s$ is the margin for intra-instance compactness, while $2\delta_d$ is the margin for inter-instance separability. $\lVert \cdot \rVert$ denotes a distance metric, such as the $L_2$ norm.
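A minimal reference implementation of the three loss terms, using the hinge margins $\delta_s$ and $\delta_d$ defined above (illustrative NumPy, not the training code; margin values shown are arbitrary defaults):

```python
import numpy as np

def discriminative_loss(emb, inst, delta_s=0.5, delta_d=1.5):
    """emb: (n, d) point embeddings; inst: (n,) instance ids.
    Returns (L_pull, L_push, L_reg) per the hinged formulation."""
    ids = np.unique(inst)
    mu = np.stack([emb[inst == t].mean(axis=0) for t in ids])  # instance centroids
    # pull: draw points toward their own centroid, hinged at delta_s
    pull = np.mean([
        np.mean(np.maximum(np.linalg.norm(mu[k] - emb[inst == t], axis=1) - delta_s, 0) ** 2)
        for k, t in enumerate(ids)])
    # push: drive centroids apart, hinged at 2 * delta_d
    n_inst = len(ids)
    push = 0.0
    if n_inst > 1:
        for a in range(n_inst):
            for b in range(n_inst):
                if a != b:
                    push += max(2 * delta_d - np.linalg.norm(mu[a] - mu[b]), 0) ** 2
        push /= n_inst * (n_inst - 1)
    # reg: keep centroids near the origin
    reg = np.mean(np.linalg.norm(mu, axis=1))
    return pull, push, reg
```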
Finally, the semantic labels are obtained via an argmax operation over predicted class probabilities, while the instance labels are generated by applying mean-shift clustering on the learned embeddings.
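For the clustering step, scikit-learn's MeanShift can serve as a drop-in implementation; the embeddings and bandwidth below are illustrative, and in practice the bandwidth would be tuned against the embedding margins.

```python
import numpy as np
from sklearn.cluster import MeanShift

# Toy stand-in for learned instance embeddings: two well-separated groups,
# mimicking a well-trained embedding space where instances form tight clusters.
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0.0, 0.1, (50, 3)),
                 rng.normal(5.0, 0.1, (50, 3))])

# Bandwidth is a hyperparameter; each resulting cluster is one instance label.
labels = MeanShift(bandwidth=1.0).fit_predict(emb)
```

Because mean-shift does not require the number of clusters in advance, it fits the setting where the number of plant organs per point cloud is unknown.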

3. Experimental Results and Analysis

3.1. Experimental Platform

Every experiment in this study was conducted on a server running Ubuntu 22.04, equipped with an NVIDIA GeForce RTX 3090 GPU (24 GB of VRAM), a 14-core Intel Xeon Gold 6330 CPU (2.00 GHz), and 90 GB of RAM. The deep learning framework was PyTorch 2.1.0, running on Python 3.10 with CUDA 12.1. During training, the batch size was fixed at 8 and the initial learning rate at 0.002; every 20 epochs, the learning rate decreased by a factor of 0.7. The network was trained using the Adam optimizer with a momentum parameter of 0.9. Training ran for 200 epochs, and the model with the smallest validation loss was kept as the best-performing model.
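The step decay described above has a simple closed form, sketched here; in PyTorch the same schedule is typically expressed as `torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.7)`.

```python
def scheduled_lr(epoch, base_lr=0.002, gamma=0.7, step=20):
    """Learning rate under the step decay used in training:
    multiplied by gamma once every `step` epochs."""
    return base_lr * gamma ** (epoch // step)
```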

3.2. Model Evaluation Metrics

For the semantic segmentation task on tomato plant point clouds, model performance is evaluated using four widely used metrics: Precision, Recall, F1-score, and Intersection over Union (IoU). These metrics are defined as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$
where $TP$, $FP$, and $FN$ denote the numbers of true positive, false positive, and false negative points for a particular semantic class, respectively.
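These four point-wise metrics can be computed directly from label arrays; an illustrative sketch (our own helper, not the evaluation code):

```python
import numpy as np

def semantic_metrics(pred, gt, cls):
    """Per-class Precision, Recall, F1, and IoU from point-wise labels."""
    tp = np.sum((pred == cls) & (gt == cls))
    fp = np.sum((pred == cls) & (gt != cls))
    fn = np.sum((pred != cls) & (gt == cls))
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    iou = tp / (tp + fp + fn)
    return prec, rec, f1, iou
```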
Four assessment metrics are used for the instance segmentation task, both at the instance and point levels: mean precision (mPrec), mean recall (mRec), mean coverage (mCov), and mean weighted coverage (mWCov). Descriptions for these metrics are given below:
$$\mathrm{mPrec} = \frac{1}{N} \sum_{i=1}^{N} \frac{|TP_i|}{|P_i|}$$

$$\mathrm{mRec} = \frac{1}{N} \sum_{i=1}^{N} \frac{|TP_i|}{|G_i|}$$

$$\mathrm{mCov} = \frac{1}{|I|} \sum_{m=1}^{|I|} \max_n \, \mathrm{IoU}(G_m, P_n)$$

$$\mathrm{mWCov} = \sum_{m=1}^{|I|} w_m \max_n \, \mathrm{IoU}(G_m, P_n)$$
where $N$ denotes the number of semantic classes involved in the calculation of mPrec and mRec; in this study, all points are categorized into two classes, stem and leaf, so $N$ is set to 2. $|TP_i|$ is the number of predicted instances in semantic class $i$ whose IoU with some ground-truth instance exceeds 0.5, $|P_i|$ is the total number of predicted instances in class $i$, and $|G_i|$ is the total number of ground-truth instances in class $i$. For a given semantic class, $|I|$ denotes the number of ground-truth instances, $G_m$ the point set of the $m$-th ground-truth instance, and $P_n$ the point set of the $n$-th predicted instance; $w_m = |G_m| / \sum_k |G_k|$ weights each ground-truth instance by its share of the class's points.
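The coverage metrics reduce to a best-IoU match per ground-truth instance; a sketch treating each instance as a set of point indices (our own helper, with $w_m$ taken as the instance's share of points):

```python
import numpy as np

def coverage(gt_instances, pred_instances):
    """mCov / mWCov for one semantic class. Each instance is a set of point ids."""
    sizes = np.array([len(g) for g in gt_instances], dtype=float)
    best = np.array([
        max(len(g & p) / len(g | p) for p in pred_instances)  # best IoU per GT instance
        for g in gt_instances])
    mcov = best.mean()
    mwcov = np.sum(sizes / sizes.sum() * best)  # weight w_m = |G_m| / sum_k |G_k|
    return mcov, mwcov
```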

3.3. Semantic Segmentation Results

To intuitively evaluate the semantic segmentation performance of TSINet on tomato plant point clouds, we visualized part of the segmentation results, as shown in Figure 6. The tomato is a dicotyledonous plant with a complex canopy structure. As illustrated, TSINet demonstrates excellent segmentation performance, with most of the leaf and stem structures accurately classified. Only a few points near boundary regions were misclassified, which may be attributed to the inherent ambiguity of such transitional areas.
We further conducted a comparative analysis between TSINet and several state-of-the-art deep learning-based point cloud semantic segmentation models, including PointNet [28], PointNet++ [24], DGCNN [26], ASIS [29], and JSNet [30]. The quantitative results are presented in Table 2.
As the earliest deep learning method for point clouds, PointNet exhibits lower overall performance, especially with poor recall and IoU for the stem category. Subsequent methods such as PointNet++, DGCNN, ASIS, and JSNet show progressively better results, indicating that improvements in network architectures and feature extraction capabilities greatly enhance segmentation performance. TSINet outperformed all baseline methods across all evaluation metrics. Specifically, it achieved 97.00% in precision, 96.17% in recall, 96.57% in F1-score, and 93.43% in mean Intersection-over-Union (IoU), demonstrating its superior capability in extracting accurate semantic features from the complex geometry of tomato plant point clouds. Furthermore, the segmentation metrics for the leaf category are consistently higher than those for the stem category, suggesting that leaf segmentation is easier, while stem segmentation is more challenging due to its fewer points or more complex morphology. Notably, TSINet shows remarkable improvements in the stem category, achieving a recall of 93.59% and an IoU of 90.17%, highlighting its strong ability to handle small or difficult-to-segment structures and its robustness in differentiating complex plant geometries.

3.4. Instance Segmentation Results

Figure 7 shows the instance segmentation results produced by TSINet. The model demonstrates strong capability in accurately separating individual leaf instances. Although a small number of misclassified points remain near the interfaces between stems, leaves, and overlapping regions, the majority of both leaf and stem structures are effectively segmented into distinct instances, which is crucial for fine-grained phenotypic analysis.
To further evaluate the instance segmentation performance of TSINet, we conducted a comparative analysis with two representative dual-task segmentation models, ASIS and JSNet. As indicated in Table 3, TSINet outperformed both baselines across all assessment metrics, with a mean precision (mPrec) of 81.54%, mean recall (mRec) of 81.69%, mean coverage (mCov) of 81.60%, and mean weighted coverage (mWCov) of 86.40%. Notably, mRec improved by 13.77% compared to ASIS. These results demonstrate that TSINet is highly effective in handling the complex spatial relationships and overlapping structures present in tomato plant point clouds. Its superior capabilities in feature extraction, instance discrimination, and spatial structure modeling enable accurate and robust instance-level segmentation, providing strong support for subsequent organ-level growth monitoring and structural analysis of plants.

4. Discussion

4.1. Ablation Study

To assess the contribution of the main TSINet components, namely the Geometry-Aware Adaptive Feature Extraction Block (GAFEB) and the Dual Attention-Based Feature Enhancement Module (DAFEM), we conducted two groups of ablation experiments. In particular, the ablation of DAFEM (Group A) was conducted by directly removing its parallel attention mechanisms, namely the channel attention (CA) and spatial attention (SA). For the ablation of GAFEB (Group B), we replaced the EdgeConv and PAConv operations with standard multi-layer perceptrons (MLPs), while maintaining the overall network architecture to ensure fair comparisons.
Table 4 presents the semantic segmentation performance of each group. The results show that removing DAFEM (Group A) led to a consistent decline in all evaluation metrics compared to the full model (Group C), with a particularly notable drop in the “Stem” category. For example, the IoU for Stem decreased from 90.17% to 87.20%, indicating that the dual attention module plays a crucial role in accurately segmenting plant organs with complex structures and ambiguous boundaries.
Similarly, replacing EdgeConv and PAConv in GAFEB with MLPs (Group B) also resulted in a significant performance degradation. For instance, the mean IoU dropped from 93.43% to 90.16%, and the F1-score decreased from 96.57% to 94.76%. These results suggest that, even when the network architecture remains unchanged, MLPs alone are insufficient for effectively capturing the local geometric features and global spatial structures inherent in point cloud data.
In summary, the complete model (Group C) outperformed both ablated versions across all metrics, demonstrating the complementary and effective roles of GAFEB and DAFEM in feature extraction. Their integration significantly enhances the semantic segmentation performance for both leaf and stem components.
In line with the semantic segmentation results, the instance segmentation performance reported in Table 5 further confirms the effectiveness and complementary nature of GAFEB and DAFEM. The complete model (Group C) achieved the best overall results, with consistently superior performance across all evaluation metrics. These findings underscore the pivotal role of the proposed modules in improving TSINet’s capability for accurate 3D plant instance segmentation.

4.2. Cross-Species Evaluation for Model Generalization

We utilized the Pheno4D dataset to perform cross-species validation in order to assess TSINet’s generalization capability. Specifically, the TSINet model pretrained on tomato plants was tested on maize plants. The maize point clouds in Pheno4D underwent preprocessing procedures similar to those applied to the tomato dataset. These included removing ground noise, limiting the number of points per plant to 4096, and converting the data into HDF5 format for testing. We focused on the instance segmentation task of maize plants, as it poses a greater challenge and better reflects the underlying performance of semantic segmentation.
The cross-species experimental results are presented in Table 6. Although the model was trained solely on tomato plants, TSINet still exhibited strong generalization when applied to maize plants in the Pheno4D dataset, achieving an mPrec of 58.82%, mRec of 60.04%, mCov of 62.34%, and mWCov of 64.48% in the instance segmentation task. In summary, these results demonstrate that TSINet not only achieves excellent segmentation performance for tomato plants but also maintains strong generalization in cross-species scenarios, highlighting its potential for broader application to phenotypic analysis of various crop species.
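For reference, the coverage metrics used here (mCov and mWCov) are commonly computed by matching each ground-truth instance to its best-overlapping prediction, with mWCov weighting each instance by its point count. A minimal sketch, assuming instances are given as boolean point masks (function names are ours):

```python
import numpy as np

def inst_iou(a, b):
    """IoU between two boolean point masks of equal length."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def coverage(gt_masks, pred_masks):
    """mCov / mWCov as commonly defined for point cloud instance
    segmentation: each ground-truth instance is covered by its
    best-matching prediction; mWCov weights by instance size."""
    best = [max(inst_iou(g, p) for p in pred_masks) for g in gt_masks]
    sizes = np.array([g.sum() for g in gt_masks], dtype=float)
    mcov = float(np.mean(best))
    mwcov = float(np.sum(sizes / sizes.sum() * best))
    return mcov, mwcov
```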

4.3. Limitations and Future Works

Despite the promising results achieved in this study, several limitations remain. First, the Pheno4D dataset used consists of tomato seedling point clouds collected under greenhouse conditions, which may limit the model’s generalization to more complex field environments and constrain its segmentation performance to the seedling stage. Second, the current method relies on supervised learning, which requires manually annotated data—a process that is labor-intensive and time-consuming, particularly for large-scale datasets. Additionally, although TSINet performs well in segmenting stems and leaves, it may still encounter challenges when dealing with occlusion, noise, or highly overlapping plant structures.
Future work will focus on extending TSINet to field-grown tomato plants under more complex environmental conditions, including drought, high temperatures, and other stress scenarios, and to plants at later developmental stages. We will also explore the integration of semi-supervised and self-supervised learning strategies to reduce the dependency on large annotated datasets. Furthermore, the incorporation of domain adaptation techniques and the development of more advanced feature aggregation modules are expected to enhance the network’s robustness and generalizability. Lastly, real-time deployment and efficiency optimization will be pursued to enable high-throughput, accurate, and non-destructive measurement of tomato phenotypes, thereby facilitating its practical application in precision agriculture and contributing to the advancement of intelligent agriculture.

5. Conclusions

Accurate organ-level segmentation is essential for phenotypic analysis and growth monitoring of plants. This study focuses on the segmentation of tomato plants using the Pheno4D dataset. We develop TSINet, an end-to-end dual-task network that performs both semantic and instance segmentation of tomato plant point clouds. TSINet adopts an encoder–decoder architecture consisting of a shared encoder and two parallel decoder branches dedicated to the respective segmentation tasks. A Geometry-Aware Adaptive Feature Extraction Block (GAFEB) and a Dual Attention-Based Feature Enhancement Module (DAFEM) are further incorporated to enhance feature representations. To address the limitations of the dataset, we employ extensive data augmentation, achieving a 20-fold increase in data volume, which enhances the model’s generalization capability. The experimental results show that our model performs exceptionally well: semantic segmentation reaches an average precision of 97.00%, recall of 96.17%, F1-score of 96.57%, and IoU of 93.43%, while instance segmentation attains a mean precision (mPrec) of 81.54%, mean recall (mRec) of 81.69%, mean coverage (mCov) of 81.60%, and mean weighted coverage (mWCov) of 86.40%, surpassing mainstream point cloud segmentation baselines. These results highlight TSINet as an effective and practical solution for high-throughput, automated phenotyping of tomato plants, contributing to the advancement of intelligent agriculture.
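As a rough illustration of how a 20-fold augmentation can be realized for point clouds, the sketch below keeps the original cloud and generates 19 randomized copies via rotation about the vertical axis, global scaling, and per-point jitter. The specific transforms and their ranges are our assumptions; the paper’s exact augmentation recipe is not restated here.

```python
import numpy as np

def augment(points, rng):
    """One randomized copy of an (N, 3) cloud: z-rotation, scaling,
    jitter. Transform set and ranges are illustrative assumptions."""
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    scale = rng.uniform(0.9, 1.1)        # mild global rescaling
    jitter = rng.normal(0.0, 0.005, points.shape)  # small Gaussian noise
    return points @ rot.T * scale + jitter

def augment_20x(points, seed=0):
    """Original cloud plus 19 randomized copies: a 20-fold increase."""
    rng = np.random.default_rng(seed)
    return [points.copy()] + [augment(points, rng) for _ in range(19)]
```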

Author Contributions

Conceptualization, L.Z. and X.L.; methodology, S.M.; software, S.M.; validation, S.M.; formal analysis, S.M.; data curation, S.M.; writing—original draft preparation, S.M.; writing—review and editing, L.Z. and X.L.; visualization, S.M.; supervision, L.Z. and X.L.; funding acquisition, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 62202281) and Shandong Province College Student Innovation and Entrepreneurship Training Program (No. S202410434023).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Roohanitaziani, R.; De Maagd, R.A.; Lammers, M.; Molthoff, J.; Meijer-Dekens, F.; Van Kaauwen, M.P.W.; Finkers, R.; Tikunov, Y.; Visser, R.G.F.; Bovy, A.G. Exploration of a Resequenced Tomato Core Collection for Phenotypic and Genotypic Variation in Plant Growth and Fruit Quality Traits. Genes 2020, 11, 1278.
  2. Kumar, M.; Tomar, M.; Bhuyan, D.J.; Punia, S.; Grasso, S.; Sá, A.G.A.; Carciofi, B.A.M.; Arrutia, F.; Changan, S.; Radha; et al. Tomato (Solanum lycopersicum L.) Seed: A Review on Bioactives and Biomedical Activities. Biomed. Pharmacother. 2021, 142, 112018.
  3. Roșca, M.; Mihalache, G.; Stoleru, V. Tomato Responses to Salinity Stress: From Morphological Traits to Genetic Changes. Front. Plant Sci. 2023, 14, 1118383.
  4. Murphy, K.M.; Ludwig, E.; Gutierrez, J.; Gehan, M.A. Deep Learning in Image-Based Plant Phenotyping. Annu. Rev. Plant Biol. 2024, 75, 771–795.
  5. Tian, K.; Li, J.; Zeng, J.; Evans, A.; Zhang, L. Segmentation of Tomato Leaf Images Based on Adaptive Clustering Number of K-Means Algorithm. Comput. Electron. Agric. 2019, 165, 104962.
  6. Ivanovska, M.; Struc, V.; Pers, J. TomatoDIFF: On-Plant Tomato Segmentation with Denoising Diffusion Models. In Proceedings of the 18th International Conference on Machine Vision and Applications (MVA), Hamamatsu, Japan, 23–25 July 2023.
  7. Niu, Z.; Huang, T.; Xu, C.; Sun, X.; Taha, M.F.; He, Y.; Qiu, Z. A Novel Approach to Optimize Key Limitations of Azure Kinect DK for Efficient and Precise Leaf Area Measurement. Agriculture 2025, 15, 173.
  8. Miao, T.; Zhu, C.; Xu, T.; Yang, T.; Li, N.; Zhou, Y.; Deng, H. Automatic Stem-Leaf Segmentation of Maize Shoots Using Three-Dimensional Point Cloud. Comput. Electron. Agric. 2021, 187, 106310.
  9. Liang, X.; Yu, W.; Qin, L.; Wang, J.; Jia, P.; Liu, Q.; Lei, X.; Yang, M. Stem and Leaf Segmentation and Phenotypic Parameter Extraction of Tomato Seedlings Based on 3D Point. Agronomy 2025, 15, 120.
  10. Xing, Y.; Pham, D.; Williams, H.; Smith, D.; Ahn, H.S.; Lim, J.; MacDonald, B.A.; Nejati, M. Look How They Have Grown: Non-Destructive Leaf Detection and Size Estimation of Tomato Plants for 3D Growth Monitoring. arXiv 2023.
  11. Qiao, G.; Zhang, Z.; Niu, B.; Han, S.; Yang, E. Plant Stem and Leaf Segmentation and Phenotypic Parameter Extraction Using Neural Radiance Fields and Lightweight Point Cloud Segmentation Networks. Front. Plant Sci. 2025, 16, 1491170.
  12. Sun, Y.; Guo, X.; Yang, H. Win-Former: Window-Based Transformer for Maize Plant Point Cloud Semantic Segmentation. Agronomy 2023, 13, 2723.
  13. Zhang, L.; Huang, Z.; Yang, Z.; Yang, B.; Yu, S.; Zhao, S.; Zhang, X.; Li, X.; Yang, H.; Lin, Y.; et al. Tomato Stem and Leaf Segmentation and Phenotype Parameter Extraction Based on Improved Red Billed Blue Magpie Optimization Algorithm. Agriculture 2025, 15, 180.
  14. Yan, J.; Tan, F.; Li, C.; Jin, S.; Zhang, C.; Gao, P.; Xu, W. Stem–Leaf Segmentation and Phenotypic Trait Extraction of Individual Plant Using a Precise and Efficient Point Cloud Segmentation Network. Comput. Electron. Agric. 2024, 220, 108839.
  15. Yang, X.; Miao, T.; Tian, X.; Wang, D.; Zhao, J.; Lin, L.; Zhu, C.; Yang, T.; Xu, T. Maize Stem–Leaf Segmentation Framework Based on Deformable Point Clouds. ISPRS J. Photogramm. Remote Sens. 2024, 211, 49–66.
  16. Hao, H.; Wu, S.; Li, Y.; Wen, W.; Fan, J.; Zhang, Y.; Zhuang, L.; Xu, L.; Li, H.; Guo, X.; et al. Automatic Acquisition, Analysis and Wilting Measurement of Cotton 3D Phenotype Based on Point Cloud. Biosyst. Eng. 2024, 239, 173–189.
  17. Liu, Z.; Zhao, J.; Hu, Y.; Li, R.; Deng, Q.; Guan, R.; Yang, R.; Xu, Z.; Zhou, G. FACNet: A High-Precision Pumpkin Seedling Point Cloud Organ Segmentation Method. Comput. Electron. Agric. 2025, 231, 110049.
  18. Yao, J.; Gong, Y.; Xia, Z.; Nie, P.; Xu, H.; Zhang, H.; Chen, Y.; Li, X.; Li, Z.; Li, Y. Facility of Tomato Plant Organ Segmentation and Phenotypic Trait Extraction via Deep Learning. Comput. Electron. Agric. 2025, 231, 109957.
  19. Song, J.; Ma, B.; Xu, Y.; Yu, G.; Xiong, Y. Organ Segmentation and Phenotypic Information Extraction of Cotton Point Clouds Based on the CotSegNet Network and Machine Learning. Comput. Electron. Agric. 2025, 236, 110466.
  20. Xie, K.; Cui, C.; Jiang, X.; Zhu, J.; Liu, J.; Du, A.; Yang, W.; Song, P.; Zhai, R. Automated 3D Segmentation of Plant Organs via the Plant-MAE: A Self-Supervised Learning Framework. Plant Phenomics 2025, 7, 100049.
  21. Liu, F.-Y.; Geng, H.; Shang, L.-Y.; Si, C.-J.; Shen, S.-Q. A Cotton Organ Segmentation Method with Phenotypic Measurements from a Point Cloud Using a Transformer. Plant Methods 2025, 21, 37.
  22. Dong, S.; Fan, X.; Li, X.; Liang, Y.; Zhang, M.; Yao, W.; Yang, X.; Wang, Z. Automatic 3D Plant Organ Instance Segmentation Method Based on PointNeXt and Quickshift++. Plant Phenomics 2025, 7, 100065.
  23. Schunck, D.; Magistri, F.; Rosu, R.A.; Cornelißen, A.; Chebrolu, N.; Paulus, S.; Léon, J.; Behnke, S.; Stachniss, C.; Kuhlmann, H.; et al. Pheno4D: A Spatio-Temporal Dataset of Maize and Tomato Plant Point Clouds for Phenotyping and Advanced Plant Analysis. PLoS ONE 2021, 16, e0256340.
  24. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. arXiv 2017, arXiv:1706.02413.
  25. Comaniciu, D.; Meer, P. Mean Shift: A Robust Approach toward Feature Space Analysis. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 603–619.
  26. Wang, Y.; Sun, Y.; Liu, Z.; Sarma, S.E.; Bronstein, M.M.; Solomon, J.M. Dynamic Graph CNN for Learning on Point Clouds. ACM Trans. Graph. 2019, 38, 146.
  27. Xu, M.; Ding, R.; Zhao, H.; Qi, X. PAConv: Position Adaptive Convolution with Dynamic Kernel Assembling on Point Clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021.
  28. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
  29. Wang, X.; Liu, S.; Shen, X.; Shen, C.; Jia, J. Associatively Segmenting Instances and Semantics in Point Clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019.
  30. Zhao, L.; Tao, W. JSNet: Joint Instance and Semantic Segmentation of 3D Point Clouds. AAAI 2020, 34, 12951–12958.
Figure 1. Demonstration of some point clouds from Pheno4D.
Figure 2. Examples of raw and annotated tomato plant point clouds from Pheno4D. (a) Raw 3D point cloud of a tomato seedling. (b) Instance-level annotation distinguishing individual leaves using unique colors. (c) Annotated point cloud showing semantic labels (stem in red, leaf in blue).
Figure 3. The architecture of TSINet. It adopts a dual-branch encoder–decoder structure for simultaneous semantic and instance segmentation of tomato plant point clouds. The network consists of a shared encoder, parallel decoders, and a dual attention module integrating spatial and channel attention mechanisms. N0 > N1 > N2 > N3 > N4 denote the numbers of points at successive encoder stages.
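As a loose illustration of the spatial- and channel-attention idea behind the dual attention module (not the exact DAFEM design, which is specified in the paper’s methods section), the following sketch reweights per-point features along both the point and channel dimensions and fuses the results additively:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dual_attention(feats):
    """Minimal dual attention over features of shape (N points, C
    channels); a sketch of the general idea, not the exact DAFEM."""
    # Spatial branch: point-to-point affinities reweight the features.
    spatial = softmax(feats @ feats.T, axis=-1) @ feats
    # Channel branch: channel-to-channel affinities reweight channels.
    channel = feats @ softmax(feats.T @ feats, axis=-1)
    # Fuse both branches with a residual connection.
    return feats + spatial + channel
```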
Figure 4. Architecture and key operations of the proposed Geometry-Aware Adaptive Feature Extraction Block (GAFEB). (a) Overall structure of GAFEB. (b) Illustration of the EdgeConv operation: a k-nearest neighbor (kNN) graph is constructed for each sampled point, and edge features are computed based on the relative features between the point and its neighbors. (c) Illustration of the PAConv operation: convolution kernels are adaptively weighted based on spatial relationships within each local neighborhood, enhancing the representation of diverse geometric structures.
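The EdgeConv operation in Figure 4b can be illustrated with a small sketch: build a kNN graph, then form edge features by concatenating each center feature with its offsets to neighbors. Function names are ours, and a real EdgeConv would follow this with a shared MLP and max-pooling over neighbors:

```python
import numpy as np

def knn(points, k):
    """Indices of the k nearest neighbors of each point (self excluded)."""
    d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    return np.argsort(d, axis=1)[:, 1:k + 1]

def edge_features(feats, idx):
    """EdgeConv-style edge features: for each point i and neighbor j,
    concatenate [x_i, x_j - x_i], giving shape (N, k, 2C)."""
    k = idx.shape[1]
    center = np.repeat(feats[:, None, :], k, axis=1)  # (N, k, C)
    neighbor = feats[idx]                             # (N, k, C)
    return np.concatenate([center, neighbor - center], axis=-1)
```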
Figure 5. The architecture of the semantic segmentation decoder branch, including skip connections between encoder and decoder layers for multi-scale feature fusion and spatial detail recovery.
Figure 6. Visualization of semantic segmentation results by TSINet. In the color legend, red denotes leaf regions, and green denotes stem regions. Sem.GT refers to the point clouds of plants annotated with semantic ground truth labels.
Figure 7. Visualization of instance segmentation results by TSINet. The colors of the instance labels are assigned randomly, with each color representing a distinct plant organ instance. Ins.GT refers to the point clouds of plants annotated with instance ground truth labels.
Table 1. Point cloud distribution before and after data augmentation.
Dataset | Overall | Training Set | Test Set
Initial dataset | 77 | 61 | 15
Augmented dataset | 1540 | 1220 | 320
Table 2. Performance comparison of different semantic segmentation models.
Index | Part | PointNet [28] | PointNet++ [24] | DGCNN [26] | ASIS [29] | JSNet [30] | TSINet (Ours)
Precision (%) | Leaf | 95.35 | 97.16 | 97.35 | 97.27 | 97.34 | 97.89
Precision (%) | Stem | 96.10 | 94.99 | 95.53 | 95.19 | 95.27 | 96.10
Precision (%) | Mean | 95.72 | 96.07 | 96.44 | 96.23 | 96.30 | 97.00
Recall (%) | Leaf | 98.85 | 98.40 | 98.57 | 98.47 | 98.49 | 98.74
Recall (%) | Stem | 85.46 | 91.32 | 91.91 | 91.67 | 91.88 | 93.59
Recall (%) | Mean | 92.15 | 94.86 | 95.24 | 95.07 | 95.18 | 96.17
F1-score (%) | Leaf | 97.07 | 97.78 | 97.96 | 97.87 | 97.91 | 98.32
F1-score (%) | Stem | 90.47 | 93.12 | 93.69 | 93.40 | 93.54 | 94.83
F1-score (%) | Mean | 93.77 | 95.45 | 95.82 | 95.63 | 95.73 | 96.57
IoU (%) | Leaf | 94.30 | 95.65 | 96.00 | 95.82 | 95.90 | 96.69
IoU (%) | Stem | 82.59 | 87.12 | 88.13 | 87.62 | 87.87 | 90.17
IoU (%) | Mean | 88.45 | 91.39 | 92.06 | 91.72 | 91.89 | 93.43
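The per-class semantic scores in Table 2 follow the standard definitions. Given per-point predicted and ground-truth labels, they can be computed as in this small sketch (the function name is ours):

```python
import numpy as np

def class_metrics(pred, gt, cls):
    """Precision, recall, F1, and IoU for one semantic class, computed
    from per-point predicted and ground-truth label arrays."""
    tp = np.sum((pred == cls) & (gt == cls))  # true positives
    fp = np.sum((pred == cls) & (gt != cls))  # false positives
    fn = np.sum((pred != cls) & (gt == cls))  # false negatives
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return prec, rec, f1, iou
```

The "Mean" rows in Table 2 are the unweighted averages of the leaf and stem scores.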
Table 3. Performance comparison of different instance segmentation models.
Methods | mPrec (%) | mRec (%) | mCov (%) | mWCov (%)
ASIS [29] | 72.20 | 67.92 | 70.92 | 74.62
JSNet [30] | 74.61 | 70.04 | 72.59 | 79.12
TSINet (ours) | 81.54 | 81.69 | 81.60 | 86.40
Table 4. Ablation study of semantic segmentation.
Index | Group | GAFEB | DAFEM | Leaf | Stem | Mean
Precision (%) | A | √ | × | 97.15 | 95.11 | 96.13
Precision (%) | B | × | √ | 96.91 | 93.63 | 95.27
Precision (%) | C | √ | √ | 97.89 | 96.10 | 97.00
Recall (%) | A | √ | × | 98.44 | 91.29 | 94.86
Recall (%) | B | × | √ | 97.96 | 90.57 | 94.26
Recall (%) | C | √ | √ | 98.74 | 93.59 | 96.17
F1-score (%) | A | √ | × | 97.79 | 93.16 | 95.48
F1-score (%) | B | × | √ | 97.43 | 92.08 | 94.76
F1-score (%) | C | √ | √ | 98.32 | 94.83 | 96.57
IoU (%) | A | √ | × | 95.68 | 87.20 | 91.44
IoU (%) | B | × | √ | 94.99 | 85.32 | 90.16
IoU (%) | C | √ | √ | 96.69 | 90.17 | 93.43
Note: “√” indicates the component is included; “×” indicates the component is excluded.
Table 5. Ablation study of instance segmentation.
Index | Group | GAFEB | DAFEM | Mean
mPrec (%) | A | √ | × | 80.14
mPrec (%) | B | × | √ | 79.86
mPrec (%) | C | √ | √ | 81.54
mRec (%) | A | √ | × | 78.43
mRec (%) | B | × | √ | 78.10
mRec (%) | C | √ | √ | 81.69
mCov (%) | A | √ | × | 81.82
mCov (%) | B | × | √ | 79.82
mCov (%) | C | √ | √ | 81.60
mWCov (%) | A | √ | × | 83.97
mWCov (%) | B | × | √ | 82.14
mWCov (%) | C | √ | √ | 86.40
Note: “√” indicates the component is included; “×” indicates the component is excluded.
Table 6. Cross-species instance segmentation performance: training on tomato, testing on maize.
Table 6. Cross-species instance segmentation performance: training on tomato, testing on maize.
Methods | mPrec (%) | mRec (%) | mCov (%) | mWCov (%)
TSINet (testing on maize) | 58.82 | 60.04 | 62.34 | 64.48