GL-SeqNet: Global–Local Fusion for Intra-Cluster Tomato Harvesting Sequence Optimization

Meng, Zhichao; Du, Shan; Wang, Bo; Pan, Jun; Hu, Dong; Du, Xiaoqiang; Yang, Qinghua

doi:10.3390/agriculture16121322


by Zhichao Meng ¹, Shan Du ², Bo Wang ¹, Jun Pan ¹,², Dong Hu ¹, Xiaoqiang Du ¹,³,* and Qinghua Yang ¹,*

¹ College of Optical, Mechanical and Electrical Engineering, Zhejiang A&F University, Hangzhou 311300, China

² Innuovo Technology Co., Ltd., Jinhua 322118, China

³ Zhejiang Key Laboratory of Intelligent Sensing and Robotics for Agriculture, Hangzhou 310018, China

* Authors to whom correspondence should be addressed.

Agriculture 2026, 16(12), 1322; https://doi.org/10.3390/agriculture16121322 (registering DOI)

Submission received: 17 April 2026 / Revised: 25 May 2026 / Accepted: 11 June 2026 / Published: 15 June 2026

(This article belongs to the Section Artificial Intelligence and Digital Agriculture)


Abstract

Tomato harvesting in protected horticulture is a critical task that faces challenges due to labor shortages, high environmental stress in greenhouses, and the complex nature of clustered fruit arrangements. This study proposes the global–local sequence ranking network (GL-SeqNet), a novel model designed to optimize intra-cluster tomato harvesting sequences by integrating global and local features using a deep learning approach. GL-SeqNet fuses the global structural information of tomato clusters with local fruit attributes to dynamically update the harvesting sequence, ensuring the optimal target is selected at each step. The model features a dual-stream architecture, comprising separate global and local backbones, and utilizes a ranking head for prioritizing intra-cluster targets. An AI-based image object-removal tool was used to simulate the dynamic structural changes within a cluster during harvesting, facilitating the creation of a state evolution dataset. The experimental results showed that the best overall performance was achieved with a global resolution of 112 × 112 and a local resolution of 56 × 56, yielding a Top-1 accuracy of 0.950, a position match rate (PMR) of 0.970, and an inference time of only 22.6 ms, along with faster convergence. The results underscore the potential of global and local fusion strategies and ranking-based learning for effective harvesting sequence optimization.

Keywords: tomato harvesting sequence optimization; global and local feature fusion; global–local sequence ranking network (GL-SeqNet); deep learning; AI-based image object-removal

1. Introduction

Tomatoes are a primary crop in protected horticulture, serving both the fresh market and the processing industry, and are widely cultivated worldwide [1,2,3]. However, the continuous decline in labor supply, coupled with accelerated population aging, has steadily increased the cost of manual harvesting [4]. In addition, the high temperature and humidity typical of greenhouse environments markedly intensify the difficulty of harvesting. Under the dual pressures of rising costs and challenging environmental conditions, harvesting practices are rapidly shifting from manual to mechanized and intelligent modes. Meanwhile, the large-scale and structured development of protected horticulture has provided a solid foundation for the transition of harvesting robots from research to practical application [5].

Mechanized harvesting not only helps alleviate labor shortages and ensure operational efficiency and consistent quality but also provides conditions for data-driven and standardized production management [6]. However, tomatoes are mostly distributed in clusters, where severe occlusion among fruits and limited operating space pose significant challenges for grasping and path planning [7,8]. One of the primary challenges lies in selecting intra-cluster harvesting targets and optimizing harvesting sequences. If the harvesting sequence is improperly designed, the manipulator is easily disturbed by neighboring fruits, leading to failure. Therefore, it is urgent to establish a method for stepwise optimal target selection and dynamic sequence updating, enabling the system to determine the best target at each harvesting step.

In recent years, convolutional neural networks (CNNs) have significantly enhanced the visual capabilities of harvesting robots in tasks such as detection [9,10,11,12,13], pose estimation [14,15,16,17,18], and keypoint localization [19,20,21,22]. Zhao et al. [23] augmented YOLOv7 with a keypoint head and a mask head, and proposed the YOLOv7-hv model for cucumber pose keypoint detection and instance segmentation. Experimental results showed a keypoint detection OKS of 0.882, a segmentation mIoU of 93.8%, and an overall network speed of 43 FPS. Liu et al. [24] proposed SIAB and the Simpleformer, which were integrated with CNNs into CS-Net, achieving higher segmentation accuracy and faster inference than state-of-the-art (SOTA) methods. Zhao et al. [25] proposed a lightweight end-to-end model, YOLO-GP, using Ghost Bottleneck and adding keypoint prediction to achieve joint detection of grape bunches and picking points. Experiments on the Grape-PP dataset showed 93.27% mAP for cluster detection and a picking-point distance error of less than 40 pixels. These studies demonstrate that CNNs are highly effective in extracting features from fruits and vegetables. However, their applications remain largely limited to the perception level, such as object pose estimation, and they still lack sequence decision-making mechanisms specifically designed for harvesting tasks.

Accordingly, harvesting sequence optimization has been studied in two main directions: path or time-optimal planning, which focuses on global travel efficiency [26,27], and intra-cluster sequence planning, which focuses on local geometric constraints. For example, Lin et al. [28] employed YOLOX-S for tea bud detection and modeled the harvesting sequence as a traveling salesman problem (TSP), which was solved using a modified pointer network. Experiments demonstrated that the algorithm achieved a runtime of 1.69 ms for sequence planning of up to 100 tea buds, satisfying real-time requirements. Wang et al. [29] proposed a safflower filament harvesting sequence planning method based on 3D TSP and Ptr-Nets-AC, which demonstrated higher training efficiency on multi-scale datasets and shorter, faster paths during testing. Dai et al. [30] optimized tomato cluster harvesting sequences by combining density clustering with the shortest motion path, and selected the tomato closest to the end-effector within the cluster as the harvesting target. Greenhouse experiments achieved a collision-free success rate of 70.9% with an average operation cycle of 8.6 s. Although a considerable body of research has focused on path or time-optimal models, these approaches are not directly applicable to intra-cluster harvesting sequence optimization of clustered fruits. Moreover, existing intra-cluster sequence planning methods for clustered fruits often rely on heuristic or empirically defined rules, which suffer from limited robustness and adaptability.

This study focuses on optimizing the intra-cluster harvesting sequence for clustered fruits, emphasizing the integration of both cluster-level global structural information and fruit-specific local features in the decision-making process. In recent years, multimodal fusion methods based on convolutional neural networks have provided a feasible approach for unified modeling of such information. Huang et al. [31] proposed a multimodal fusion network, CornMFN, which employed ConvNeXt to extract image features and an MLP to represent meteorological variables, while cross-attention was used to fuse heterogeneous information. This approach achieved an accuracy of 91.19% for maize phenology classification. Liu et al. [32] developed a multimodal instance segmentation network, YOLACTFusion, which used dual ResNet-50 backbones to extract RGB and NIR features, and integrated them through parallel attention for multi-scale cross-modal fusion. On the tomato stem segmentation task, YOLACTFusion improved the mAP by 7.09% compared with YOLACT.

Based on the above, this paper proposes the GL-SeqNet model, a novel method for optimizing tomato harvesting sequences within clusters. The model leverages dual-channel features of cluster structure and individual fruit geometry, and employs deep learning to achieve stepwise optimal target selection with dynamic sequence updating after each grasp. After each simulated harvesting action, the remaining cluster structure is updated dynamically, allowing the model to continuously adjust the subsequent picking order. The main contributions of this study are as follows:

(1): A novel GL-SeqNet model is proposed that integrates feature information from the entire tomato cluster and individual tomatoes to realize stepwise optimal target selection.
(2): An AI-based image object-removal tool is introduced to simulate intra-cluster structural changes in tomatoes after each harvesting action, enabling the construction of a sequence state evolution dataset that better aligns training with real-world conditions.
(3): A systematic comparison is conducted among RankNet, ListNet, and mean squared error (MSE) loss functions, as well as different input resolutions, to analyze their effects on ranking quality and system real-time performance.

2. Materials and Methods

2.1. Image Acquisition

Data collected under different regions and greenhouse conditions have a significant impact on the generalization performance of the model. To improve its applicability and robustness, sampling was conducted at two locations: the Yangdu Base in Jiaxing, Zhejiang Province (collected in 2022), and the Wuwangnong Farm in Hangzhou, Zhejiang Province (collected in 2024). The data acquisition devices included a Redmi K30 smartphone (Xiaomi Corporation, Beijing, China) and an Intel RealSense L515 camera (Intel Corporation, Santa Clara, CA, USA), with the original image resolutions being 1080 × 1920 and 1280 × 720, respectively. All images were stored in JPEG (Joint Photographic Experts Group) format. Figure 1a,b show representative tomato plant samples collected from the two sites.

2.2. Dataset Construction

In this study, a tomato cluster is defined as a group of adjacent fruits within an image. The dataset construction process is illustrated in Figure 2. First, the tomato clusters were manually segmented from the original images to obtain cluster images that preserve complete geometric relationships, while samples with severe blur, overexposure, or unclear boundaries were removed. Subsequently, the mature fruits within each cluster were also manually segmented. The annotations were handled by a researcher with extensive practical experience in tomato-harvesting robots. Since the model input consists of one global image and a set of local images, the number of local images must remain consistent, whereas the number of mature fruits within a cluster varies. To address this, a black-image padding strategy was adopted, unifying the number of local images to five by adding pure black images. The choice of five images was based on the observation that the number of mature fruits in a single cluster rarely exceeded five. The annotation strategy was designed to simulate human decision-making preferences and to serve as the learning objective of the model. Sequence labels started from 1, while the padded black images were uniformly labeled as 6. To further approximate the state evolution in real harvesting processes, an AI-based image object-removal tool (FliFlik KleanOut for Photo v6.2.0) was used to remove the selected optimal harvesting target and fill the masked region with background content, thereby generating post-harvesting images. In this new state, the remaining mature fruits were re-annotated with sequence labels, and this process was iterated until all mature fruits within the cluster were processed. This ensured that the training of GL-SeqNet better matched real harvesting scenarios.
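The padding and labeling convention described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `pad_cluster`, the string placeholder standing in for a black image, and the list-based representation are all assumptions made for the example. Only the count of five local images and the padding label 6 come from the text.

```python
MAX_FRUITS = 5   # fixed number of local images per cluster (from the text)
PAD_LABEL = 6    # sequence label given to padded black images (from the text)


def pad_cluster(local_images, ranks, black_image="BLACK"):
    """Pad a cluster's local images and rank labels to MAX_FRUITS entries.

    local_images: per-fruit crops (any objects; hypothetical placeholders here).
    ranks: harvesting-order labels starting from 1, one per crop.
    black_image: placeholder standing in for a pure black image.
    """
    if len(local_images) != len(ranks):
        raise ValueError("each local image needs exactly one rank label")
    if len(local_images) > MAX_FRUITS:
        raise ValueError("cluster has more mature fruits than MAX_FRUITS")
    n_pad = MAX_FRUITS - len(local_images)
    padded_images = list(local_images) + [black_image] * n_pad
    padded_ranks = list(ranks) + [PAD_LABEL] * n_pad
    return padded_images, padded_ranks


# A cluster with three mature fruits is padded with two black images.
images, labels = pad_cluster(["fruit_a", "fruit_b", "fruit_c"], [2, 1, 3])
print(labels)  # [2, 1, 3, 6, 6]
```

After each simulated harvesting step, the same function would be applied to the remaining fruits with their re-annotated labels, so every state in the evolution dataset keeps the same input shape.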

A total of 200 tomato clusters were obtained and divided into training and validation sets with a ratio of 4:1, yielding 160 clusters for training and 40 clusters for validation.

2.3. GL-SeqNet Model

Currently, numerous algorithms have been developed for object classification [33,34], detection [35,36], segmentation [37,38], and keypoint regression [39,40], making significant contributions to visual perception. However, visual perception alone is insufficient for accurate decision-making in mechanized tomato harvesting. The lack of decision modeling often causes repeated obstructions of the manipulator in clustered fruit scenarios, thereby reducing harvesting success rates. Existing deep learning models remain limited in performing optimal target selection and dynamic sequence updating. To address these challenges, this study proposes GL-SeqNet, which enables intra-cluster tomato sequential harvesting planning through the fusion of global and local information and learning-to-rank.

2.3.1. GL-SeqNet Model Architecture

In selecting the optimal harvesting target within a tomato cluster, both global and local information are indispensable. Global information characterizes cluster topology and the relationships with the main-stem and fruit pedicels, thereby providing macroscopic guidance for harvesting sequence decisions. In contrast, local information captures fruit attributes such as maturity, surface appearance, occlusion intensity, and potential collision risks, offering cues for target prioritization. Based on this, a novel GL-SeqNet model is proposed, with its architecture illustrated in Figure 3. The model comprises four main components: a global backbone, a local backbone, a global and local feature fusion module, and a ranking head. The design concept is as follows: the global backbone captures cluster structural information, while the local backbone extracts detailed fruit features. These are integrated through cross-fusion to achieve complementary information, and the ranking head subsequently outputs harvesting priorities, thereby generating a rational intra-cluster harvesting sequence.

(1): Input and dual-stream encoding: The cluster-cropped image is used as the input to the global branch, while five mature fruit local images are provided as inputs to the local branch. Features are extracted by lightweight modules, the global backbone and the local backbone, producing a global feature $g_{feat} \in \mathbb{R}^{B \times D}$ and a local feature $l_{feat} \in \mathbb{R}^{B \times N \times D}$. The activations use Leaky ReLU with a dropout rate of 0.2. Both the global backbone and the local backbone extract features through four cascaded Residual Blocks. Each Residual Block comprises two 3 × 3 convolutional layers (Conv), each followed by batch normalization (BN) and a Leaky ReLU activation, with skip connections configured according to the input–output dimensions and stride. The channel width is progressively expanded to 16, 32, and 64, ensuring sufficient representational capacity while alleviating gradient vanishing through residual connections.
(2): Global and local feature fusion: After feature extraction by the global and local branches, feature concatenation (cat) is applied for fusion. Specifically, the global feature vector $g_{feat}$ is broadcast along the candidate dimension to [B, N, D], and concatenated with each local feature $l_{feat}$ along the channel dimension, resulting in a fused representation of [B, N, 2D].
(3): Ranking head: Following the above extraction and fusion, a fully connected layer outputs a priority score $scores \in \mathbb{R}^{B \times N}$ for each candidate fruit, where a larger score indicates a higher harvesting priority. During training, the scores can be optimized with RankNet [41], ListNet [42,43], or MSE [44] losses. During inference, the scores are sorted in descending order to generate the intra-cluster harvesting sequence.
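The fusion and ranking steps in items (2) and (3) can be illustrated with a small dependency-free sketch. It is not the authors' PyTorch implementation: the batch dimension is dropped, the backbones are replaced by hand-written feature vectors, and the function names, the weight values, and the choice to place the global feature before the local one in the concatenation are all assumptions made for illustration.

```python
def fuse(g_feat, l_feat):
    """Broadcast one global vector over N candidates and concatenate.

    g_feat: list of D floats (global feature of one cluster).
    l_feat: list of N lists, each with D floats (local features).
    Returns N fused vectors of length 2D.
    """
    return [list(g_feat) + list(local) for local in l_feat]


def rank_head(fused, weights, bias=0.0):
    """Single linear layer mapping each fused vector to one priority score."""
    return [sum(w * x for w, x in zip(weights, vec)) + bias for vec in fused]


def harvest_order(scores):
    """Candidate indices sorted by descending score (highest priority first)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


g = [0.5, -1.0]                                 # D = 2, toy global feature
l = [[0.2, 0.1], [0.9, 0.4], [0.1, 0.8]]        # N = 3, toy local features
w = [0.0, 0.0, 1.0, -1.0]                       # hypothetical weights, length 2D

fused = fuse(g, l)            # shape [N, 2D]
scores = rank_head(fused, w)  # shape [N]
order = harvest_order(scores)
print(order)                  # [1, 0, 2]
```

In a stepwise setting, only the first index of `order` would be harvested; the cluster image would then be updated and the remaining candidates scored again.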

2.3.2. Model Loss Function

The loss function measures the deviation between the model output and the supervisory target, and serves as the driving signal for backpropagation in training. In this study, three types of loss functions are employed: RankNet, ListNet, and MSE. Among them, RankNet loss is a pairwise ranking loss based on neural networks, which optimizes the model by minimizing the relative ranking errors of sample pairs with precedence relationships. Specifically, RankNet maps the score difference in candidate pairs into a logistic surrogate loss, thereby approximating discrete ranking criteria in a continuous space. The pairwise labels are defined as follows:

$$y_{ij} = \operatorname{sign}(g_{i}^{*} - g_{j}^{*}) \in \{-1, 0, +1\}, \quad i < j \tag{1}$$

where $i$ and $j$ represent the indices of two different candidates within the same cluster; $y_{ij}$ represents the label of the pair $(i, j)$; +1 indicates that $i$ precedes $j$; −1 indicates that $j$ precedes $i$; 0 indicates a tie (in which case the pair is skipped); and $g_{i}^{*}$ represents the gain of the i-th candidate obtained by mapping the annotation rank.

Based on these definitions, the RankNet loss is given as follows:

$$\begin{cases} P = \{(i, j) \mid 1 \leq i < j \leq L,\ g_{i}^{*} \neq g_{j}^{*}\} \\ \mathrm{Loss}_{\mathrm{RankNet}} = \dfrac{1}{|P|} \sum_{(i, j) \in P} \log\left(1 + \exp\left[-y_{ij}(s_{i} - s_{j})\right]\right) \end{cases} \tag{2}$$

where $P$ represents the set of effective pairs; $L$ represents the number of candidates within the cluster; and $s_{i}$ represents the predicted score of the i-th candidate.
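As a concrete reading of Equations (1) and (2), the following sketch computes the RankNet loss for a single cluster in plain Python. The function name and the example gain values are assumptions; the text does not specify how annotation ranks are mapped to gains, so the example only assumes that a larger gain means an earlier harvest. The exponential is evaluated naively, which is adequate for small score differences but not numerically hardened.

```python
import math


def ranknet_loss(scores, gains):
    """Mean logistic loss over candidate pairs (i < j) with different gains."""
    total, n_pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            if gains[i] == gains[j]:
                continue  # y_ij = 0: a tie, so the pair is skipped
            y = 1.0 if gains[i] > gains[j] else -1.0  # sign(g_i* - g_j*)
            total += math.log1p(math.exp(-y * (scores[i] - scores[j])))
            n_pairs += 1
    return total / n_pairs if n_pairs else 0.0


gains = [3, 2, 1]  # illustrative gains: candidate 0 should be picked first
loss_good = ranknet_loss([2.0, 1.0, 0.0], gains)  # scores agree with gains
loss_bad = ranknet_loss([0.0, 1.0, 2.0], gains)   # scores reverse the order
print(loss_good < loss_bad)  # True
```

Because only the sign of the gain difference enters the loss, any monotone rank-to-gain mapping gives the same RankNet value for a given score vector.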

ListNet is a listwise ranking loss function. Its underlying idea is to normalize the candidate score vector into a probability distribution and minimize the cross-entropy between the target distribution and the predicted distribution, thereby approximating the correct order at the list level.

The ListNet loss formula is given as follows:

$$\begin{cases} q_{i} = \dfrac{\exp(g_{i}^{*})}{\sum_{k=1}^{L} \exp(g_{k}^{*})}, & i = 1, \dots, L \\ \hat{p}_{i} = \dfrac{\exp(s_{i})}{\sum_{k=1}^{L} \exp(s_{k})}, & i = 1, \dots, L \\ \mathrm{Loss}_{\mathrm{ListNet}} = -\sum_{i=1}^{L} q_{i} \log \hat{p}_{i} \end{cases} \tag{3}$$

where $i$ represents the index of the i-th candidate within the same cluster; $L$ represents the number of candidates within the cluster; $s_{i}$ represents the predicted score of the i-th candidate; $g_{i}^{*}$ represents the gain of the i-th candidate obtained by mapping the annotation rank; $q_{i}$ represents the target distribution obtained by applying softmax to the gain vector; and $\hat{p}_{i}$ represents the predicted distribution obtained by applying softmax to the score vector.
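Equation (3) can be sketched in a few lines of plain Python. The helper names and the example gains are assumptions for illustration; in particular, the gain values below are not taken from the paper, which does not state the rank-to-gain mapping.

```python
import math


def softmax(values):
    """Normalize a list of real numbers into a probability distribution."""
    m = max(values)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def listnet_loss(scores, gains):
    """Cross-entropy between softmax(gains) and softmax(scores)."""
    q = softmax(gains)       # target distribution q_i
    p_hat = softmax(scores)  # predicted distribution p_hat_i
    return -sum(qi * math.log(pi) for qi, pi in zip(q, p_hat))


gains = [3.0, 2.0, 1.0]  # illustrative gains, larger means earlier harvest
loss_match = listnet_loss([3.0, 2.0, 1.0], gains)    # scores equal the gains
loss_reverse = listnet_loss([1.0, 2.0, 3.0], gains)  # scores reverse the order
print(loss_match < loss_reverse)  # True
```

Unlike the pairwise loss, this value depends on the magnitudes of the gains, so the choice of rank-to-gain mapping affects how sharply the top candidate is emphasized.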

MSE follows a pointwise supervision paradigm, in which the model output scores are directly aligned with the annotated gains. The learning objective is to minimize the element-wise squared error so that the scores match the gains in absolute scale. The formulation is given as follows:

$$\mathrm{Loss}_{\mathrm{MSE}} = \dfrac{1}{L} \sum_{i=1}^{L} (s_{i} - g_{i}^{*})^{2} \tag{4}$$

where $i$ represents the index of the i-th candidate within the same cluster; $L$ represents the number of candidates within the cluster; $s_{i}$ represents the predicted score of the i-th candidate; and $g_{i}^{*}$ represents the gain of the i-th candidate obtained by mapping the annotation rank.

RankNet minimizes relative order errors at the pairwise level and directly approximates the optimal ranking objective. ListNet aligns the target and predicted probability distributions at the listwise level, providing a closer approximation of overall ranking quality. MSE performs pointwise regression from scores to gains, which offers stability and ease of training but imposes weaker direct constraints on relative order.
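The remark that MSE constrains relative order only weakly can be made concrete with a toy example. The sketch below uses made-up scores and gains (not values from the paper) to show that a score vector with a perfectly correct order can have a much larger MSE than one whose top two candidates are swapped.

```python
def mse_loss(scores, gains):
    """Mean of element-wise squared differences between scores and gains."""
    return sum((s - g) ** 2 for s, g in zip(scores, gains)) / len(scores)


def order_of(values):
    """Indices sorted by descending value (highest priority first)."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)


gains = [3.0, 2.0, 1.0]           # illustrative gains
right_order = [30.0, 20.0, 10.0]  # correct ranking, wrong scale
wrong_order = [2.1, 2.2, 0.9]     # close in value, top two swapped

print(order_of(right_order) == order_of(gains))  # True
print(order_of(wrong_order) == order_of(gains))  # False
print(mse_loss(right_order, gains) > mse_loss(wrong_order, gains))  # True
```

A pointwise objective therefore prefers the second vector even though it would send the manipulator to the wrong fruit first, which is consistent with the weaker ranking performance reported for MSE.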

2.4. Evaluation Metrics

Tomato harvesting robots operate on a cluster-by-cluster basis, where the mature fruits within each cluster must be prioritized to identify the optimal harvesting target. In this study, the accuracy of the optimal harvesting target is regarded as the primary evaluation metric, while the correctness of the entire sequence is considered a secondary metric. Top-1 accuracy [45] is used to assess the accuracy of the optimal harvesting target, and Position Match Rate (PMR) is employed to evaluate the overall ranking quality.

Top-1 accuracy is a widely used metric in both classification and ranking tasks. In this study, the Top-1 accuracy of the optimal harvesting target is defined as follows:

$$\begin{cases} \hat{y}_{n} = \arg\max_{i} s_{n,i} \\ \text{Top-1} = \dfrac{1}{N} \sum_{n=1}^{N} \mathbb{1}(\hat{y}_{n} = y_{n}) \end{cases} \tag{5}$$

where $y_{n}$ represents the index of the annotated optimal harvesting target within the n-th cluster; $\hat{y}_{n}$ represents the index of the predicted optimal harvesting target for the n-th cluster; $n$ represents the cluster index; $N$ represents the total number of clusters in the validation set; $\mathbb{1}(\cdot)$ represents the indicator function, which equals 1 if the condition is true and 0 otherwise; and Top-1 represents the accuracy of the optimal harvesting target.
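A minimal sketch of Equation (5) is given below. The function name and the toy scores are assumptions for illustration; ties in the maximum score are resolved in favor of the lowest index, a detail the text does not specify.

```python
def top1_accuracy(score_lists, targets):
    """Fraction of clusters whose highest-scoring candidate is the annotated one.

    score_lists: one list of candidate scores per cluster.
    targets: index of the annotated optimal target for each cluster.
    """
    hits = 0
    for scores, y in zip(score_lists, targets):
        y_hat = max(range(len(scores)), key=lambda i: scores[i])  # argmax
        hits += int(y_hat == y)
    return hits / len(targets)


toy_scores = [
    [0.9, 0.1, 0.3],  # predicted target 0
    [0.2, 0.8, 0.5],  # predicted target 1
    [0.4, 0.3, 0.6],  # predicted target 2 (annotated target is 0)
    [0.7, 0.2, 0.1],  # predicted target 0
]
toy_targets = [0, 1, 0, 0]
print(top1_accuracy(toy_scores, toy_targets))  # 0.75
```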

PMR was first used as an evaluation metric in learning-to-rank and sequence prediction tasks. This metric measures the consistency ratio between the predicted sequence and the ground-truth sequence at each position. A larger PMR indicates a higher correctness of the predicted sequence. Its formula is defined as follows:

$$\begin{cases} m(y_{i}^{j}, \hat{y}_{i}^{j}) = \begin{cases} 1, & y_{i}^{j} = \hat{y}_{i}^{j} \\ 0, & y_{i}^{j} \neq \hat{y}_{i}^{j} \end{cases} \\ \mathrm{PMR} = \dfrac{1}{N} \sum_{i=1}^{N} \dfrac{1}{L} \sum_{j=1}^{L} m(y_{i}^{j}, \hat{y}_{i}^{j}) \end{cases} \tag{6}$$

where $N$ represents the total number of clusters in the validation set; $L$ represents the number of tomato local images within each cluster; $y_{i}^{j}$ represents the element at position j of the ground-truth sequence for the i-th sample; and $\hat{y}_{i}^{j}$ represents the element at position j of the predicted sequence for the i-th sample.
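Equation (6) can be sketched as follows. The function name and the toy sequences are assumptions; the example also assumes that padded positions (label 6) are counted like any other position, since the text defines L as the number of local images per cluster.

```python
def position_match_rate(true_seqs, pred_seqs):
    """Average, over clusters, of the fraction of positions that match."""
    per_cluster = []
    for y, y_hat in zip(true_seqs, pred_seqs):
        matches = sum(1 for a, b in zip(y, y_hat) if a == b)
        per_cluster.append(matches / len(y))
    return sum(per_cluster) / len(per_cluster)


# Each sequence holds the rank label of the five local images of one cluster.
true_seqs = [[1, 2, 3, 6, 6], [2, 1, 3, 4, 6]]
pred_seqs = [[1, 2, 3, 6, 6], [1, 2, 3, 4, 6]]  # second cluster swaps two ranks

pmr = position_match_rate(true_seqs, pred_seqs)
print(round(pmr, 3))  # 0.8
```

Under this convention a single swapped pair costs two positions out of five, so PMR is sensitive to errors anywhere in the sequence, whereas Top-1 accuracy only looks at the first pick.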

2.5. Experimental Setups

To evaluate the performance of GL-SeqNet, experiments were conducted on the dataset. All experiments were conducted on a computer running Ubuntu 20.04 LTS, equipped with an NVIDIA GeForce RTX 2060 SUPER GPU, an Intel Core i5-12490F CPU, and 16 GB of RAM. The program was implemented in Python 3.10, using PyTorch 1.12.1, CUDA 11.3, Torchvision 0.13.1, NumPy 1.23.0, and OpenCV 4.12.

To ensure reproducibility, the random seed was fixed at 1024, and the random number generators of Python, NumPy, and PyTorch were controlled. The cluster was regarded as the minimal unit of data samples. The batch size was set to 1, with each batch consisting of one global image and five corresponding local images. Training was performed for 175 epochs. Except for the compared factors, all other hyperparameters were kept constant. Experiments were conducted to compare the effects of different input resolutions and different loss functions.
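The seeding step described above might look like the sketch below. The helper name `seed_everything` is an assumption, and NumPy and PyTorch are imported only if they are installed so that the snippet runs in a bare Python environment. The text does not state whether additional determinism settings (for example, for GPU kernels) were used, so none are shown.

```python
import random

SEED = 1024  # seed value reported in the text


def seed_everything(seed=SEED):
    """Seed Python's RNG, plus NumPy and PyTorch when they are available."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed in this environment
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch not installed in this environment


# Re-seeding reproduces the same pseudo-random draw.
seed_everything()
first = random.random()
seed_everything()
second = random.random()
print(first == second)  # True
```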

3. Results

3.1. Experiments on Different Resolutions and Loss Functions

This study first compared the effects of different input resolutions under various loss functions on the constructed dataset using the GL-SeqNet model described in Section 2.3, and the results are summarized in Table 1. From the perspective of resolution, when RankNet loss was applied, reducing the resolution from 224 × 224 to 56 × 56 improved the Top-1 accuracy from 0.925 to 0.950 and the PMR from 0.950 to 0.970, while the inference time decreased from 22.8 ms to 22.3 ms, demonstrating a clear performance gain. In contrast, when ListNet loss was employed, changes in resolution had little effect on the evaluation metrics. Under the MSE loss, reducing the resolution from 224 × 224 to 56 × 56 increased the Top-1 accuracy from 0.850 to 0.900 and the PMR from 0.920 to 0.945, while the inference time decreased from 23.0 ms to 22.3 ms. Overall, with the exception of ListNet, all models achieved their best performance at the resolution of 56 × 56 under the same loss function.

From the perspective of loss functions, both RankNet and ListNet achieved high and stable performance, with Top-1 accuracy and PMR remaining in the ranges of 0.925 to 0.950 and 0.950 to 0.970, respectively. This indicates that learning-to-rank loss functions are more suitable for the task of optimizing intra-cluster harvesting sequences. In contrast, the overall performance of the MSE loss was notably lower, particularly at a resolution of 224 × 224, where it reached a Top-1 accuracy of 0.850 and a PMR of 0.920. This result indicates that a traditional pointwise regression loss struggles to model the relative priority relationships between fruits within a tomato cluster and is therefore limited in the ranking task.

Overall, when the global and local resolutions were kept the same, GL-SeqNet achieved its best comprehensive performance at a resolution of 56 × 56 with the RankNet loss function, reaching a Top-1 accuracy of 0.950, a PMR of 0.970, and an inference time of only 22.3 ms. These results confirm that the proposed global and local fusion ranking approach maintains high accuracy even under low-resolution inputs, and the ranking-based loss functions more effectively drive the model to capture intra-cluster fruit harvesting priorities. Therefore, the proposed method not only demonstrates superior accuracy but also ensures efficiency, highlighting its potential for deployment in practical greenhouse harvesting scenarios.

Figure 4 presents the training loss curves of the nine experimental configurations. The training losses of all nine configurations exhibit a stable downward trend, indicating that the models were well fitted without obvious underfitting or overfitting.

Figure 5 presents the performance curves of the nine experimental configurations during training. For the same loss function, different resolutions exhibited similar trends in performance metrics. However, the best results were achieved when both global and local images were set to 56 × 56 with the RankNet loss function.

When the RankNet loss function was applied, the ranking visualization results at resolutions of 224 × 224, 112 × 112, and 56 × 56 are shown in Figure 6a, Figure 6b, and Figure 6c, respectively. In each subfigure, the left panel displays the global image and the right panel displays the local images. The red label “P” indicates the predicted harvesting sequence, while the green label “G” denotes the ground-truth sequence. As illustrated, the model fully predicted the correct harvesting sequence at a resolution of 56 × 56. In contrast, at resolutions of 224 × 224 and 112 × 112, the model incorrectly predicted the first and second harvesting targets, failing to effectively capture the subtle occlusion relationship between them. This visualization further validated the results reported in Table 1.

At a resolution of 56 × 56, the ranking visualization results using RankNet, ListNet, and MSE loss functions are shown in Figure 7a, Figure 7b, and Figure 7c, respectively. The model correctly predicted the entire harvesting sequence with RankNet loss, whereas with ListNet and MSE losses, it incorrectly predicted the second and third harvesting targets, failing to effectively capture the influence of green unripe tomatoes on harvesting priority. This visualization further validated the results reported in Table 1.

3.2. Comparison of Different Global and Local Resolution Combinations

The original global image usually has a higher resolution than the local images, whereas the experiments in Section 3.1 showed that the best performance was achieved when both global and local images were set to 56 × 56. To further evaluate the effect of global image resolution with the local images fixed at 56 × 56, the global image resolution was set to 112 × 112 and 224 × 224 for comparative experiments, with the results summarized in Table 2. The results showed that increasing the global resolution from 56 × 56 to 112 × 112 or 224 × 224 had little impact on the ranking performance, with Top-1 accuracy and PMR remaining unchanged. However, the computational cost increased substantially with higher global resolutions. Specifically, FLOPs increased from 0.734 G at 56 × 56 to 1.100 G at 112 × 112 and further to 2.567 G at 224 × 224, that is, to 1.499 times and 3.498 times the 56 × 56 cost, respectively.
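The cost ratios quoted above can be re-derived from the reported FLOPs with a short calculation. The dictionary layout is an assumption for illustration, and because the inputs are the rounded figures from the text, the last digit of a ratio may differ slightly from the published value.

```python
# GFLOPs reported in the text for each global resolution (local fixed at 56 x 56)
flops_g = {56: 0.734, 112: 1.100, 224: 2.567}

baseline = flops_g[56]
ratios = {res: value / baseline for res, value in flops_g.items()}

for res in (112, 224):
    print(f"{res} x {res}: {ratios[res]:.3f} times the 56 x 56 cost")
```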

Figure 8 shows the evolution of the validation set performance metrics during training. As shown, the performance curves under the three global resolutions exhibit highly consistent overall trends. However, at a resolution of 112 × 112, the model reached its optimal performance more quickly and maintained higher stability. Therefore, the combination of a global resolution of 112 × 112 and a local resolution of 56 × 56 can be considered the most effective setting.

4. Discussion

In this study, a novel GL-SeqNet model was proposed for optimizing the intra-cluster tomato harvesting sequence, enabling the effective prediction of optimal harvesting targets. Systematic comparative experiments were conducted to evaluate its performance under different resolutions and loss functions.

Regarding the impact of resolution on performance, the results showed that when the global and local image resolutions were identical, the model achieved the best balance between accuracy and efficiency at a resolution of 56 × 56. In particular, under RankNet and MSE losses, reducing the resolution from 224 × 224 to 56 × 56 significantly improved the Top-1 accuracy and PMR, while further shortening the inference time. This indicates that excessively high resolutions do not provide additional information gain for intra-cluster fruit ranking tasks; instead, they may introduce redundant details and noise, thereby interfering with the model’s ability to learn key topological relationships. By contrast, low-resolution inputs preserve the essential geometric and relative positional information while improving generalization ability and computational efficiency, thereby confirming the accuracy of the global and local fusion framework under lightweight input conditions.

In terms of loss function selection, RankNet and ListNet consistently maintained high Top-1 accuracy and PMR, clearly outperforming MSE. Ranking-based losses are better suited to modeling the relative priority relationships among fruits within a cluster, which aligns more closely with the logic of harvesting decisions. In contrast, MSE, as a pointwise regression loss, only measures numerical deviations between predictions and labels without modeling relative order relationships, leading to inherent limitations in performance.

With the local resolution fixed at 56 × 56, further comparisons were conducted using different global image resolutions. The results showed that increasing the global resolution from 56 × 56 to 112 × 112 or 224 × 224 did not yield significant differences in Top-1 accuracy or PMR, which remained stable at 0.950 and 0.970, respectively. The only observed differences were in inference time and convergence speed: the 112 × 112 resolution converged faster, whereas the 56 × 56 resolution yielded slightly faster inference. Overall, the combination of a global resolution of 112 × 112 with a local resolution of 56 × 56 provided a more balanced trade-off in terms of accuracy, convergence, and stability, making it the most favorable input configuration.

Overall, GL-SeqNet exhibited high accuracy and efficiency, underscoring its potential for deployment in greenhouse harvesting robots. Nevertheless, certain limitations remain.

(1): This study focused on static image-based ranking. It did not systematically assess robustness under dynamic conditions such as wind disturbances and foliage motion, which are common in practical greenhouse environments.
(2): Although GL-SeqNet effectively models intra-cluster harvesting sequence optimization, it does not incorporate joint optimization with related tasks such as manipulator grasp pose estimation.
(3): The current dataset scale remains relatively limited and was collected under specific greenhouse conditions. In addition, part of the dataset was constructed using AI-based object removal to simulate post-harvesting cluster states, which may still differ from real harvesting scenarios. Therefore, further validation using multi-greenhouse, multi-device, and real sequential harvesting datasets is still required to comprehensively evaluate the robustness and generalization capability of the proposed method.
(4): The current framework focuses on image-level intra-cluster harvesting sequence prediction and does not yet integrate key robotic components such as grasp pose estimation, motion planning, and collision avoidance. As a result, the proposed method should be regarded as a decision-level ranking module rather than a fully deployed robotic harvesting system. This also limits the direct evaluation of end-to-end robotic performance in real-world harvesting tasks.

Future research will focus on improving adaptability to dynamic environments, building large-scale multi-scene datasets, validating the model on real sequential harvesting images, and establishing a more complete robotic harvesting pipeline by integrating sequence prediction with grasp pose estimation and motion planning modules. In addition, multi-task collaborative optimization will be investigated to enhance the robustness, efficiency, and practical deployment capability of tomato harvesting robots.

5. Conclusions

This study proposed GL-SeqNet, a sequence ranking model that integrates global and local information to address the challenge of intra-cluster tomato harvesting sequence optimization. By introducing stepwise optimal target selection and dynamic sequence updating, together with a sequence state evolution dataset constructed using an AI-based image object-removal tool, the model can effectively simulate the actual harvesting process. The experimental results demonstrate that:

(1): With respect to resolution, low-resolution inputs maintained or improved prediction accuracy while slightly reducing inference time (e.g., from 22.8 ms to 22.3 ms under RankNet).
(2): With respect to loss functions, the ranking-based RankNet and ListNet substantially outperform MSE, better capturing the priority relationships among fruits within a cluster.
(3): The best overall performance was achieved with a global resolution of 112 × 112 and a local resolution of 56 × 56, yielding a Top-1 accuracy of 0.950, a PMR of 0.970, and an inference time of only 22.6 ms, along with faster convergence.

In summary, GL-SeqNet achieved high ranking accuracy with low inference time and can provide decision support for greenhouse harvesting robots.
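The stepwise optimal target selection and dynamic sequence updating summarized above can be sketched as a simple loop: score the fruits that remain, pick the top-ranked one, remove it, and re-score the updated cluster. This is an illustrative sketch only. The names `plan_harvest_sequence`, `score_fn`, and `toy_score`, as well as the occlusion table, are hypothetical; `toy_score` is a stand-in for the trained network and does not reproduce the GL-SeqNet ranking head.

```python
def plan_harvest_sequence(fruit_ids, score_fn):
    """Greedy stepwise planning: after every pick, re-score the remaining
    fruits so that the ranking reflects the updated cluster state."""
    remaining = list(fruit_ids)
    sequence = []
    while remaining:
        scores = score_fn(remaining)                 # re-evaluate current state
        best = max(remaining, key=lambda f: scores[f])
        sequence.append(best)
        remaining.remove(best)                       # simulate harvesting it
    return sequence

# Hypothetical occlusion table: each fruit maps to the fruits blocking it.
OCCLUDERS = {"A": {"B"}, "B": set(), "C": {"A", "B"}, "D": {"C"}}

def toy_score(remaining):
    """Stand-in scorer (assumption): fewer remaining occluders = higher priority."""
    present = set(remaining)
    return {f: -len(OCCLUDERS[f] & present) for f in remaining}

print(plan_harvest_sequence(["A", "B", "C", "D"], toy_score))
# ['B', 'A', 'C', 'D']
```

Because the scores are recomputed after each pick, fruit C moves ahead of fruit D once its occluders A and B have been removed, which a one-shot ranking of the initial state would not capture.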

Author Contributions

Conceptualization, Z.M.; methodology, Z.M., X.D. and Q.Y.; software, Z.M.; investigation, S.D., B.W., J.P., D.H., X.D. and Q.Y.; writing—original draft preparation, Z.M.; writing—review and editing, Z.M., S.D., B.W., J.P., D.H., X.D. and Q.Y.; funding acquisition, Z.M., J.P., D.H., X.D. and Q.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Zhejiang Provincial Natural Science Foundation (Grant No. LQN26C130002, Grant No. LD24E050006), the Talent Development Project (Grant No. 2025LFR071, Grant No. 2023LFR069, Grant No. 2024LFR054), the Central Guidance for Local Scientific and Technological Development Funding Project (Grant No. ZYYD2025CG21), and the China Postdoctoral Science Foundation (Grant No. 2024M752870).

Data Availability Statement

The data presented in this study are available on request from the corresponding authors.

Conflicts of Interest

Authors Shan Du and Jun Pan were employed by the company Innuovo Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figure 1. Data collection: (a) Samples collected at Yangdu Base, Jiaxing City, Zhejiang Province; (b) Samples collected at Wuwangnong Farm, Hangzhou City, Zhejiang Province.

Figure 2. Dataset construction workflow.

Figure 3. GL-SeqNet model architecture.

Figure 4. Training loss curves of nine experimental configurations: (a) RankNet, 224 × 224; (b) RankNet, 112 × 112; (c) RankNet, 56 × 56; (d) ListNet, 224 × 224; (e) ListNet, 112 × 112; (f) ListNet, 56 × 56; (g) MSE, 224 × 224; (h) MSE, 112 × 112; (i) MSE, 56 × 56.

Figure 5. Training performance curves of nine experimental configurations: (a) RankNet, 224 × 224; (b) RankNet, 112 × 112; (c) RankNet, 56 × 56; (d) ListNet, 224 × 224; (e) ListNet, 112 × 112; (f) ListNet, 56 × 56; (g) MSE, 224 × 224; (h) MSE, 112 × 112; (i) MSE, 56 × 56.

Figure 6. Ranking visualization results under different resolutions with the RankNet loss function: (a) RankNet, 224 × 224; (b) RankNet, 112 × 112; (c) RankNet, 56 × 56.

Figure 7. Ranking visualization results at a resolution of 56 × 56 under different loss functions: (a) RankNet, 56 × 56; (b) ListNet, 56 × 56; (c) MSE, 56 × 56.

Figure 8. Validation performance curves under different global resolutions with the local resolution fixed at 56 × 56: (a) Global 224 × 224, Local 56 × 56; (b) Global 112 × 112, Local 56 × 56; (c) Global 56 × 56, Local 56 × 56.

Table 1. Experimental results of GL-SeqNet under different input resolutions and loss functions.

Global Resolution	Local Resolution	Loss Function	Top-1	PMR	Inference Time (ms)
224 × 224	224 × 224	RankNet	0.925	0.950	22.8
112 × 112	112 × 112	RankNet	0.925	0.960	22.6
56 × 56	56 × 56	RankNet	0.950	0.970	22.3
224 × 224	224 × 224	ListNet	0.925	0.950	22.9
112 × 112	112 × 112	ListNet	0.925	0.950	22.7
56 × 56	56 × 56	ListNet	0.925	0.950	22.2
224 × 224	224 × 224	MSE	0.850	0.920	23.0
112 × 112	112 × 112	MSE	0.850	0.925	22.8
56 × 56	56 × 56	MSE	0.900	0.945	22.3
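To help interpret the Top-1 and PMR columns, the sketch below computes both metrics on toy sequences. It rests on assumed definitions, which are not taken from this section and may differ from the formal definitions in the Evaluation Metrics section: Top-1 is taken as the share of clusters whose first predicted fruit matches the ground truth, and PMR as the share of sequence positions at which the predicted and ground-truth fruits coincide. The fruit labels are hypothetical.

```python
def top1_accuracy(pred_seqs, true_seqs):
    """Assumed definition: fraction of clusters whose first predicted
    target equals the first ground-truth target."""
    hits = sum(1 for p, t in zip(pred_seqs, true_seqs) if p[0] == t[0])
    return hits / len(true_seqs)

def position_match_rate(pred_seqs, true_seqs):
    """Assumed definition: fraction of positions, pooled over all clusters,
    where the predicted fruit equals the ground-truth fruit."""
    matched = 0
    total = 0
    for p, t in zip(pred_seqs, true_seqs):
        matched += sum(1 for a, b in zip(p, t) if a == b)
        total += len(t)
    return matched / total

# Two toy clusters with hypothetical fruit labels.
true_seqs = [["A", "B", "C"], ["D", "E"]]
pred_seqs = [["A", "C", "B"], ["D", "E"]]

print(top1_accuracy(pred_seqs, true_seqs))        # 1.0
print(position_match_rate(pred_seqs, true_seqs))  # 0.6
```

Under these assumed definitions, a prediction can select the correct first target in every cluster and still lose PMR when later positions are swapped.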

Table 2. Performance of GL-SeqNet under different global and local resolution combinations.

Global Resolution	Local Resolution	Loss Function	Top-1	PMR	Inference Time (ms)	FLOPs (G)
224 × 224	56 × 56	RankNet	0.950	0.970	22.7	2.567
112 × 112	56 × 56	RankNet	0.950	0.970	22.6	1.100
56 × 56	56 × 56	RankNet	0.950	0.970	22.3	0.734
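Since the accuracy columns in Table 2 are identical, the trade-off rests on cost. The short calculation below expresses the reported FLOPs and inference times relative to the 56 × 56 global resolution. The numbers are copied from Table 2; the variable and function names are illustrative.

```python
# (global resolution, inference time in ms, FLOPs in G), copied from Table 2.
ROWS = [
    ("224 x 224", 22.7, 2.567),
    ("112 x 112", 22.6, 1.100),
    ("56 x 56", 22.3, 0.734),
]

def relative_cost(rows, baseline="56 x 56"):
    """Return {resolution: (FLOPs ratio, extra time in ms)} versus the baseline row."""
    base = next(r for r in rows if r[0] == baseline)
    return {name: (flops / base[2], time_ms - base[1]) for name, time_ms, flops in rows}

for name, (flops_ratio, extra_ms) in relative_cost(ROWS).items():
    print(f"{name}: {flops_ratio:.2f}x FLOPs, {extra_ms:+.1f} ms")
# 224 x 224: 3.50x FLOPs, +0.4 ms
# 112 x 112: 1.50x FLOPs, +0.3 ms
# 56 x 56: 1.00x FLOPs, +0.0 ms
```

In these measurements, the global resolution changes the FLOPs by up to about 3.5 times, while the reported inference time differs by less than half a millisecond.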

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Meng, Z.; Du, S.; Wang, B.; Pan, J.; Hu, D.; Du, X.; Yang, Q. GL-SeqNet: Global–Local Fusion for Intra-Cluster Tomato Harvesting Sequence Optimization. Agriculture 2026, 16, 1322. https://doi.org/10.3390/agriculture16121322
