A Deep Learning-Based Approach to Apple Tree Pruning and Evaluation with Multi-Modal Data for Enhanced Accuracy in Agricultural Practices

Hai, Tong; Wang, Wuxiong; Yan, Fengyi; Liu, Mingyu; Li, Chengze; Li, Shengrong; Hu, Ruojia; Lv, Chunli

doi:10.3390/agronomy15051242

by Tong Hai^1,†, Wuxiong Wang^1,†, Fengyi Yan^1,†, Mingyu Liu^1, Chengze Li^1, Shengrong Li^1, Ruojia Hu^1,2 and Chunli Lv^1,*

^1 China Agricultural University, Beijing 100083, China

^2 School of Foreign Studies, Hebei University, Baoding 071000, China

^* Author to whom correspondence should be addressed.

^† These authors contributed equally to this work.

Agronomy 2025, 15(5), 1242; https://doi.org/10.3390/agronomy15051242

Submission received: 28 April 2025 / Revised: 15 May 2025 / Accepted: 19 May 2025 / Published: 20 May 2025

(This article belongs to the Special Issue Application of Deep and Machine Learning in Crop Monitoring and Management)

Abstract

A deep learning-based tree pruning evaluation system is proposed in this study, which integrates hyperspectral images, sensor data, and expert system rules. The system aims to enhance the accuracy and robustness of tree pruning tasks through multimodal data fusion and online learning strategies. Various models, including Mask R-CNN, SegNet, Tiny-Segformer, Box2Mask, CS-Net, SVM, MLP, and Random Forest, were used in the experiments to perform tree segmentation and pruning evaluation, with comprehensive performance assessments conducted. The experimental results demonstrate that the proposed model excels in the tree segmentation task, achieving a precision of 0.94, recall of 0.90, F1 score of 0.92, and mAP@50 and mAP@75 of 0.91 and 0.90, respectively, outperforming other comparative models. These results confirm the effectiveness of multimodal data fusion and dynamic optimization strategies in improving the accuracy of tree pruning evaluation. The experiments also highlight the critical role of sensor data in pruning evaluation, particularly when combined with the online learning strategy, as the model can progressively optimize pruning decisions and adapt to environmental changes. Through this work, the potential and prospects of the deep learning-based tree pruning evaluation system in practical applications are demonstrated.

Keywords: agricultural management; tree pruning; sensor data evaluation; multi-modal data; deep learning

1. Introduction

Apples are among the world’s most important economic fruits, and their cultivation plays a pivotal role in agricultural production. With a vast cultivation area and significant yield, apple production makes a substantial contribution to farmers’ income and regional economic development. Tree pruning is a key process in ensuring the healthy growth of apple trees, improving fruit quality, and enhancing yield [1]. Traditional pruning methods rely mainly on farmers’ experience and manual labor, which can lead to improper or inefficient pruning and, in turn, to disorganized tree structure, uneven fruit distribution, and pest outbreaks that reduce the overall yield and quality of apples [2]. Therefore, developing efficient and precise pruning technologies is of great significance for the scientific and intelligent management of apple cultivation. However, achieving precise and efficient pruning is not a simple task, as shown in Figure 1.

The complexity of the tree structure significantly increases the difficulty of pruning. Apple tree branches are often densely intertwined, and there are substantial variations in the health status of different branches. The accurate identification of diseased, weak, and improperly growing branches during pruning is particularly challenging under uneven natural light and considerable background interference (e.g., the ground or other trees) [3]. For instance, pest-infested branches may be hidden behind healthy branches, and weak branches, being small and of similar color, may be difficult to detect quickly [4]. Furthermore, pruning strategies need to be adjusted according to the tree’s needs at different growth stages. For example, during the early growth stages, the focus of pruning is on optimizing the tree structure, while in the fruit maturation stage, fruit distribution and light conditions must be balanced [5]. These varying needs place high demands on the scientific rigor and flexibility of pruning plans. Finally, systematic evaluation of and feedback on pruning effects are often lacking [6]. Currently, most orchards find it difficult to evaluate the specific impact of pruning on tree health and fruit yield immediately after pruning. This lack of feedback hampers the optimization of pruning strategies and reduces the long-term management effectiveness of orchards.

Apple tree pruning traditionally relies on individual farmer experience, but this method has notable drawbacks. First, since the pruning process depends on personal experience, ensuring consistency in pruning quality is difficult, and improper pruning can negatively affect tree health and fruit development [7]. Moreover, in scenarios where branches are densely intertwined, traditional methods struggle to accurately identify branches that need pruning, potentially overlooking pest-infested branches or mistakenly pruning healthy branches, thus weakening the overall orchard management [8]. Researchers have attempted to introduce automation and intelligence, such as rule-based pruning systems and mechanically assisted pruning tools [9,10]. However, these methods still face significant limitations in practical applications. Rule-based pruning systems typically rely on pre-established rules and lack the flexibility to adapt to the orchard’s complexity and the dynamic growth needs of the trees, making them difficult to apply across the different growth stages of apple trees. While mechanical pruning equipment improves pruning efficiency to some extent, it cannot precisely identify tree structure, branch health status, or fruit distribution, which makes it prone to over-pruning or insufficient pruning and further affects tree health and fruit quality [11]. There is therefore an urgent need for an intelligent system that combines dynamic tree structure evaluation with precise pruning strategies. Although existing research has made some progress, several limitations remain. For example, Sapkota et al. studied the performance of YOLOv8 and Mask R-CNN in segmenting apple tree trunks and branches during the dormant period [12]. Tong et al. proposed a method for locating apple tree pruning points based on RGB-D images, using the SOLOv2 instance segmentation model to segment trunks, branches, and supports, and accurately locating pruning points using depth information [7]. Majeed et al. used the SegNet network to achieve the semantic segmentation of apple tree trunks, branches, and support lines, providing technical support for automated pruning [1]. You et al. combined deep learning with geometric constraints and proposed a point-cloud-data-based tree skeletonization method, significantly improving the ability to model tree structures [13]. Chen et al. explored the application of U-Net, DeepLabv3, and Pix2Pix for segmenting apple tree branches in partially occluded scenes, demonstrating the potential of deep learning models in complex scenarios [14].

Despite the important progress achieved in these studies, several technical bottlenecks still exist in practical applications. First, the fusion and processing of multispectral data remain one of the key challenges in current technological development. In apple tree pruning tasks, single-spectral data often fail to provide sufficient feature information. While multispectral data can improve resolution and capture more details, the high-dimensional computational burden and model complexity brought by multi-channel data processing have not yet been fully addressed [15]. Secondly, recognizing complex tree structures is another significant challenge. Apple tree branches are often intertwined densely and exhibit diverse shapes. Traditional deep learning models perform poorly when handling such irregular objects, leading to false negatives or false positives that compromise pruning quality [16]. In addition, environmental factors such as lighting and seasonal changes significantly affect image quality in actual orchards, limiting the model’s ability to generalize across different scenarios [17].

To address the complexity of apple tree branch pruning and the limitations of the aforementioned methods, a pruning strategy and evaluation system based on multimodal data fusion and dynamic learning mechanisms is proposed. By combining advanced sensor technologies with real-time feedback mechanisms, it addresses the key challenges through the following contributions:

(1) By introducing hyperspectral data and the Trim-SegTransformer model, multispectral information is utilized to enhance the distinction between branches and background, improving instance segmentation accuracy. Combining the Path Attention Mechanism and multimodal Transformer for feature fusion enables the pruning model to accurately recognize tree structures even in complex environments, providing reliable input for subsequent pruning decisions.
(2) The Trim Scheme is proposed, integrating pruning rules provided by expert systems and historical scores from the pruning evaluation model to construct a Transformer-based multimodal fusion network for intelligent pruning strategy optimization. Initially, the model relies on expert rules, while later, it utilizes online learning through the Path–Trim Loss Function, allowing pruning strategies to continuously adapt to different tree growth environments, improving flexibility and accuracy.
(3) A pruning evaluation model based on a temporal Transformer is designed, incorporating sensor data (such as light intensity, temperature, humidity, and branch angles) along with pruning history to achieve a dynamic evaluation of pruning effectiveness. A weighted-time update strategy is employed to gradually increase the weight of historical data over time, ensuring a more scientifically sound evaluation of pruning schemes, optimizing future pruning strategies, and enhancing tree health and yield.

The structure of this paper is arranged as follows. Section 1 introduces the research background, existing methods and limitations, research innovations and solutions, as well as the research objectives and major contributions. Section 2 elaborates on the dataset collection process, the deep learning-based health assessment, multispectral data fusion, pruning strategy formulation and optimization, and the design and implementation of the space-state-based attention mechanism. Section 3 presents the experimental process, results, and analysis. Section 4 summarizes the research findings, discusses the limitations, and proposes future research directions.

2. Materials and Methods

2.1. Hyperspectral Data Collection

To ensure the comprehensiveness and effectiveness of the data, data were collected from apple orchards in Qixia City, Shandong Province, and Linhe District, Bayannur City, Inner Mongolia Autonomous Region, from March 2023 to March 2024. The equipment used was the “Headwall Photonics Nano-Hyperspec”, which features an excellent spectral range from 400 nm to 1000 nm with a resolution of 2.5 nm. Each scan collects data across 2048 spectral bands, with a spatial resolution of 1.5 mm, allowing for the rapid acquisition of large-scale hyperspectral data. However, the full hyperspectral data possess high dimensionality and considerable information redundancy. Directly using such data as model input would substantially increase computational costs during training and may introduce noise, thereby impairing model convergence stability. Based on prior knowledge of spectral responses in the agricultural domain and insights from relevant studies, three key visible bands were selected: blue (450–520 nm), green (520–590 nm), and red (620–760 nm). To standardize the input format and enhance the accuracy of spectral feature representation, spectral interpolation and resampling at 1 nm intervals were applied to the selected visible bands. For model input construction, individual spectral slices corresponding to each target object were extracted from the resampled data within the three selected spectral regions. This approach significantly reduced feature dimensionality, thereby improving training efficiency and enhancing the feasibility of model deployment. The number of samples corresponding to the three spectral bands was 2781, 2593, and 2964, respectively. Detailed statistics are provided in Table 1.
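The band selection and 1 nm resampling described above can be sketched in a few lines. This is a minimal illustration, assuming plain linear interpolation (the text does not name the interpolation method) and a synthetic spectrum; the names `BANDS_NM`, `interp`, and `resample_band` are hypothetical and not taken from the authors' code.

```python
from bisect import bisect_left

# Visible bands named in the text, in nanometres.
BANDS_NM = {"blue": (450, 520), "green": (520, 590), "red": (620, 760)}


def interp(x, xs, ys):
    """Linearly interpolate the samples (xs, ys) at x; xs must be ascending."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    j = bisect_left(xs, x)  # xs[j - 1] < x <= xs[j]
    x0, x1 = xs[j - 1], xs[j]
    y0, y1 = ys[j - 1], ys[j]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def resample_band(wavelengths, reflectance, band, step_nm=1):
    """Return (grid, values) for one band, resampled every step_nm nanometres."""
    lo, hi = BANDS_NM[band]
    grid = list(range(lo, hi + 1, step_nm))
    return grid, [interp(w, wavelengths, reflectance) for w in grid]


# Synthetic spectrum (assumption): one sample every 2.5 nm from 400 to 1000 nm.
wavelengths = [400 + 2.5 * i for i in range(241)]
reflectance = [0.001 * w for w in wavelengths]  # toy linear ramp, not real data

grid, values = resample_band(wavelengths, reflectance, "green")
```

Each selected band then becomes a fixed-length vector (71 values for the green band at 1 nm spacing), which is what makes the per-object spectral slices uniform in size.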

To minimize the influence of lighting variations, reference objects are typically used for calibration. However, since the angle of sunlight dynamically changes, traditional reference-plate methods may not always accurately correct lighting. Therefore, a reference sphere 80 mm in diameter was used in place of the traditional reference plate, as shown in Figure 2.

To generate a complete and accurate hyperspectral image reflecting the overall shape and spectral information of the trees, multiple collected images must be stitched together. To overcome lighting differences, the reference-object calibration method was employed to standardize the lighting intensity across the images, using the aforementioned 80 mm diameter reference sphere. After completing the light difference correction, spatial alignment is necessary to ensure the accurate stitching of different images into a complete one. Feature-matching methods based on color, texture, and geometric shapes were used. Harris corner detection was first applied to extract significant feature points from each image. Then, the Scale-Invariant Feature Transform (SIFT) algorithm was used to find matching feature points between different images. By matching the feature points, displacement and rotation information between images could be initially determined, followed by geometric transformations. After image alignment, weighted averaging was used to fuse the images, seamlessly stitching them into a large image. To simplify the data and improve subsequent processing efficiency, Principal Component Analysis (PCA) and Partial Least Squares (PLS) methods were used for dimensionality reduction.
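The weighted-averaging fusion step can be illustrated on a single pixel row. The sketch below assumes two images that are already aligned and share a known overlap, and it assumes a linear weight ramp across the overlap; the text does not specify the weighting scheme, and `blend_rows` is a hypothetical helper.

```python
def blend_rows(left, right, overlap):
    """Fuse two aligned pixel rows whose last/first `overlap` samples coincide."""
    if not 0 < overlap <= min(len(left), len(right)):
        raise ValueError("overlap must be between 1 and the shorter row length")
    fused = list(left[:-overlap])  # part covered only by the left image
    for i in range(overlap):
        # Assumed weighting: the right image's weight ramps linearly upward.
        w = (i + 1) / (overlap + 1)
        a = left[len(left) - overlap + i]
        b = right[i]
        fused.append((1 - w) * a + w * b)
    fused.extend(right[overlap:])  # part covered only by the right image
    return fused


left = [10.0, 10.0, 10.0, 10.0]
right = [20.0, 20.0, 20.0, 20.0]
row = blend_rows(left, right, overlap=2)
```

With a ramp of this kind, intensity changes gradually across the seam instead of jumping at the image boundary, which is the purpose of weighted averaging in stitching.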

2.2. Hyperspectral Data Annotation

In order to ensure the accuracy of the annotations, a combination of manual annotation and automatic auxiliary annotation was adopted in this study. To improve efficiency and ensure consistency, ENVI 6.1 software was selected as the primary annotation platform. The spectral data from multiple bands of the hyperspectral images were loaded into ENVI software to observe the reflectance information across different bands. Based on the spectral features in each band, the trees’ leaves, branches, background, and other parts were gradually annotated. For complex areas, such as the boundary between leaves and branches, cross-validation using multi-band information was employed to ensure annotation accuracy.

In addition to manual annotation, automatic auxiliary annotation methods were also employed to improve annotation efficiency. Therefore, the study integrated machine learning and deep learning techniques on top of ENVI for auxiliary annotation. Specifically, a Support Vector Machine (SVM)-based classification algorithm and a Convolutional Neural Network (CNN)-based image segmentation algorithm were used for pre-annotation. The SVM classification method performed a preliminary classification of different regions based on the spectral features of the hyperspectral images. SVM was able to differentiate regions such as leaves, branches, and background using the spectral features from the training data, providing initial labels for manual annotation. Although these preliminary annotations were not entirely accurate, they served as a reference for experts, reducing the workload for manual intervention. With the help of CNN algorithms, a more precise segmentation of regions such as leaves and branches was achieved, especially at the boundaries between the branches and leaves. Deep learning models were able to recognize features of these complex areas by learning from large-scale annotated data. The CNN-based automatic annotation tool quickly labeled new images based on the existing training data and was further refined by experts through human–computer interaction.

2.3. Sensor Data Collection

In this study, a variety of sensor devices were employed, and through proper configuration and placement, environmental data closely related to pruning effects were collected in real time, as shown in Figure 3. For monitoring light intensity, a photosynthetically active radiation (PAR) sensor (a LI-COR LI-250A Light Meter paired with an LI-190R Quantum Sensor) was selected as the primary detection device. The sensors were distributed on the branches and suspended in the upper canopy, with readings taken every two hours to capture the dynamic changes in canopy light. A capacitive temperature and humidity sensor, SHT30, was also chosen. This sensor monitors humidity through changes in capacitance and measures temperature using a thermistor. These sensors were placed at the upper, middle, and lower levels of the canopy as well as inside it; clamps and ties were used to attach them to the branches, keeping the sensing surface unobstructed and unpolluted. Data were recorded every hour and transmitted to the central data collection system for processing and analysis. Additionally, the ADXL345 tilt angle sensor was used to monitor the growth angle of the branches in real time, measuring the inclination of the branches using gravity-sensing principles. The sensors were installed at the base and middle of the branches, collecting data once a month.
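Because these sensors report at different intervals, their logs can contain gaps once placed on a common timeline; Section 2.5.2 states that missing time points were imputed by sliding-window averaging. A minimal sketch of that idea follows, assuming a centered window and averaging only the observed neighbours (both are assumptions, as is the helper name `impute_sliding_mean`).

```python
def impute_sliding_mean(series, window=3):
    """Replace None entries with the mean of observed values within +/- window steps."""
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        lo = max(0, i - window)
        hi = min(len(series), i + window + 1)
        neighbours = [x for x in series[lo:hi] if x is not None]
        # If nothing was observed inside the window, the gap is left as None.
        filled[i] = sum(neighbours) / len(neighbours) if neighbours else None
    return filled


# Toy hourly temperature log in degrees Celsius; None marks a missing reading.
hourly_temp_c = [18.0, None, 19.0, 20.0, None, None, 23.0]
filled = impute_sliding_mean(hourly_temp_c, window=1)
```

Only original observations are averaged here, so an imputed value never feeds into another imputed value.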

2.4. Tree Pruning Strategy and Evaluation System

The tree pruning strategy and evaluation system proposed in this paper combines hyperspectral image data, expert systems, and sensor data to optimize fruit tree pruning through intelligent methods, as shown in Figure 4.

The system is mainly divided into two modules: an instance segmentation-based tree pruning model and a sensor data-based pruning evaluation model. First, the processed hyperspectral image data are input into the Trim-SegTransformer model. This model extracts features from hyperspectral images and generates instance segmentation masks (Mask), accurately identifying the tree structure. Based on these masks, the model combines the pruning rules from the expert system to generate an initial pruning plan (Trim Plan). The pruning plan is then optimized by the Path Attention Mechanism to ensure detailed pruning, especially in branch extraction. The model training is optimized using the Path–Trim Loss Function, where Path Loss ensures the accuracy of the tree structure instance segmentation, and Trim Loss combines expert system rules with historical pruning scores from the evaluation system, gradually optimizing the pruning plan through online learning. Once the pruning result is generated, the sensor data-based pruning evaluation model takes over. This model evaluates pruning effects in real time using a Transformer-based multimodal network that combines sensor data such as light intensity, temperature, humidity, and branch angles. Sensor data are collected daily and fed back into the model. In the early stages, the evaluation relies more on the rules of the expert system. As time passes, the accumulated sensor data allow the evaluation model to gradually adjust its scores based on real environmental data, further optimizing the pruning strategy. The entire system continuously updates and optimizes the pruning plan to achieve precise and efficient tree pruning.

2.5. Trim-SegTransformer

As shown in Figure 5, the Trim-SegTransformer model in this paper is mainly used for tree pruning tasks. The core design idea is to leverage the powerful modeling capability of the Transformer model, combining hyperspectral image data and the Path Attention Mechanism to precisely segment tree branches and generate corresponding pruning plans.

In the entire network, the most important mathematical analysis lies in how the Path Attention and Transformer Encoder effectively combine hyperspectral image data with pruning tasks. The Path Attention mechanism, by focusing on the paths, enables the accurate extraction of key information within complex tree branch structures. This is particularly effective in cases where the tree structure is complicated and background interference is strong, as it can significantly reduce mis-segmentation and missed segmentation.

2.5.1. Path Attention Mechanism

The Path Attention Mechanism strengthens local structural relationships by introducing path information, allowing the model to focus more on important paths when processing tree structures, thus generating more accurate pruning plans. First, the input features are linearly transformed to generate query, key, and value matrices. Specifically, the input feature matrix X has dimensions

H \times W \times C

, where H is the height of the feature map, W is the width, and C is the number of channels. After linear transformation, the query, key, and value matrices have dimensions of

H \times W \times d

, where d is the feature dimension. In the Path Attention Mechanism, these generated features are concatenated into a new feature matrix, which introduces more information to capture the spatial relationships between branches. Then, these concatenated features are sent to the Path Attention module for further processing and weighting, ensuring that the model focuses on the key information and path structure of the branches, as shown in Figure 6.

The Path Attention Mechanism computes attention scores differently from the traditional self-attention mechanism, aiming to better capture the spatial relationships between branches. By introducing path information, the model can focus on the key paths between branches, generating more accurate pruning plans. First, the input feature matrix X undergoes a linear transformation to generate the query matrix Q, the key matrix K, and the value matrix V, which are computed as follows:

Q = X W_{Q}, K = X W_{K}, V = X W_{V}

(1)

Here,

W_{Q}

,

W_{K}

, and

W_{V}

are the learned weight matrices for query, key, and value, respectively, and X is the input feature matrix. After these linear transformations, Q, K, and V will have dimensions of

H \times W \times d

, where H and W are the height and width of the feature map, and d is the feature dimension. Next, unlike the standard self-attention mechanism, the dot product of Q and K is not used alone; instead, path information P is incorporated into the calculation of the attention scores. P indicates which paths in the tree structure are more crucial, so that the model assigns higher weights to important paths. The Path Attention score A is therefore calculated as follows:

A = PathAttention (\frac{Q K^{T}}{\sqrt{d}} + P)

(2)

Here,

\frac{Q K^{T}}{\sqrt{d}}

is the scaled dot product of the query and key matrices, and P is the path information, which tells the model which paths within the branch structure are important and helps it focus on them.

\sqrt{d}

is the scaling factor used to prevent the dot product from becoming too large, which would saturate the Softmax and could lead to vanishing gradients. The computed Path Attention score A is then passed through the Mask and Softmax operations to ensure that the final attention distribution is a valid probability distribution. This allows the model to generate a weighted pruning plan based on the importance of each path. This process is mathematically expressed as:

A^{'} = Softmax (Mask (\frac{Q K^{T}}{\sqrt{d}} + P))

(3)

After the Softmax operation, the resulting

A^{'}

is a normalized attention matrix representing the importance weight of each path. The model then generates the final pruning path by weighting the value matrix V:

Y = A^{'} V

. Here, Y is the weighted output, representing the feature information weighted according to the importance of each path, ultimately determining the choice of pruning paths. The Path Attention Mechanism not only considers the global dependencies between branches but also enhances the focus on local structures, especially in the extraction of branches and trunks. This reduces background interference and mis-segmentation, thus generating more precise pruning plans. This design is particularly well suited for tree pruning tasks, where branch structures are often complex and highly irregular.
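Equations (1)–(3) and the weighting Y = A' V can be traced with a small pure-Python sketch. It rests on several assumptions that go beyond the text: the H × W feature map is flattened to N tokens, P is an N × N additive bias, the mask is a boolean matrix (True = keep) with at least one kept entry per row, and Q, K, V are given directly instead of being produced by learned projections. The function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax; -inf entries receive zero weight."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def path_attention(Q, K, V, P, mask):
    """Compute Y = Softmax(Mask(Q K^T / sqrt(d) + P)) V for N flattened tokens.

    Q, K: N x d lists; V: N x d_v list; P: N x N path bias (assumed form);
    mask: N x N booleans, True = keep (each row must keep at least one entry).
    """
    n, d = len(Q), len(Q[0])
    Y = []
    for i in range(n):
        scores = []
        for j in range(n):
            s = sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d) + P[i][j]
            scores.append(s if mask[i][j] else float("-inf"))
        weights = softmax(scores)  # one row of A'
        Y.append([sum(w * V[j][c] for j, w in enumerate(weights))
                  for c in range(len(V[0]))])
    return Y


# Toy example: three tokens; the path bias links token 0 and token 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
P = [[0.0, 0.0, 2.0], [0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
mask = [[True, False, True], [True, True, True], [True, True, True]]
Y = path_attention(Q, K, V, P, mask)
```

In this toy case the bias P raises the score between tokens 0 and 2, so token 0's output is dominated by token 2's value vector, while the masked pair (0, 1) contributes nothing.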

2.5.2. Trim Scheme

The Trim Scheme proposed in this study is an intelligent pruning strategy that integrates pruning rules provided by the expert system and evaluation scores from the pruning evaluation model, dynamically optimizing pruning decisions through deep learning. The key aspect of this scheme is multimodal data fusion, where textual pruning rules from the expert system, real-time sensor data (such as light intensity, temperature and humidity, and branch angles), and pruning evaluation scores are jointly encoded. A decoder then processes this information to generate the final pruning mask, ensuring the rationality, interpretability, and dynamic optimization of the pruning plan, as shown in Figure 7.

At the implementation level, the Trim Scheme consists of three main components: multimodal encoding, feature fusion, and trim mask decoding. In the multimodal encoding phase, the expert system provides pruning rules in unstructured text form, whereas sensor data and pruning scores are structured as sequential numerical data. Different encoding methods are employed to integrate these modalities. Text-based pruning rules are processed using a Transformer-based Text Encoder, transforming the pruning rules

R

into a high-dimensional semantic vector representation

E_{r}

:

E_{r} = \sum_{i = 1}^{N_{r}} α_{i} \cdot W_{r} \cdot h_{i}, α_{i} = \frac{exp (q^{T} h_{i})}{\sum_{j = 1}^{N_{r}} exp (q^{T} h_{j})},

(4)

where

h_{i} \in R^{d^{'}}

denotes the encoded vector of the i-th word in the textual pruning rule,

q \in R^{d^{'}}

represents a trainable query vector,

W_{r} \in R^{d \times d^{'}}

is a learnable projection matrix, and

N_{r}

is the number of tokens in the rule sequence. This formulation enables the pruning rule to be transformed into a fixed-dimensional feature representation

E_{r} \in R^{d}

, which is then used for subsequent pruning decision making. For sensor data, to ensure temporal synchronization and completeness across different sensor modalities, a unified preprocessing procedure was applied to the raw sensor data. All sensor readings were first aligned based on their timestamps, and a consistent sampling frequency was established to achieve a uniform temporal resolution. For time points with missing values, a sliding-window averaging method was employed to impute the gaps, thereby smoothing noise and preserving local temporal trends. Given its temporal dependency, a temporal Transformer is used for modeling. Suppose there are M different sensor measurements (e.g., light intensity, temperature, humidity, and branch angles), each recorded over T time steps as

S_{m} = \{s_{m}^{1}, s_{m}^{2}, \ldots, s_{m}^{T}\}

; its encoding is formulated as:

E_{s} = \sum_{m = 1}^{M} \sum_{t = 1}^{T} β_{m, t} \cdot W_{s} \cdot f_{m, t}, β_{m, t} = \frac{exp (k_{m}^{T} f_{m, t})}{\sum_{j = 1}^{M} \sum_{τ = 1}^{T} exp (k_{j}^{T} f_{j, τ})}

(5)

where

f_{m, t} \in R^{d^{'}}

denotes the feature representation of the m-th sensor at time step t,

k_{m} \in R^{d^{'}}

is the query vector corresponding to that measurement type, and

W_{s} \in R^{d \times d^{'}}

represents a learnable weight matrix. The term

β_{m, t}

indicates the temporal attention weight. The output

E_{s} \in R^{d}

represents the time-weighted embedding aggregated across all sensor modalities, providing a compact representation that captures dynamic environmental trends and ensures that pruning decisions are adaptively aligned with environmental variations. For the pruning evaluation model’s output scores, given that they are structured numerical data, a Multi-Layer Perceptron (MLP) is used to map them into a feature representation

E_{p}

:

E_{p} = σ (W_{p} \cdot x_{p} + b_{p}),

(6)

where

x_{p} \in R^{d^{''}}

denotes the original pruning evaluation score vector,

W_{p} \in R^{d \times d^{''}}

and

b_{p} \in R^{d}

are learnable parameters, and

σ

represents a nonlinear activation function (e.g., ReLU). The resulting embedding

E_{p} \in R^{d}

captures the semantic contribution of historical pruning feedback, enabling the model to learn the relationship between pruning scores and pruning strategies, and to utilize past evaluation scores for optimizing future pruning plans. In the feature fusion phase, a unified temporal alignment mechanism was designed to effectively integrate heterogeneous information across modalities and establish global dependencies. The sensor data exhibit explicit temporal dependencies, and their encoded representation

E_{s} \in R^{T \times d}

corresponds to a dynamic feature sequence generated over T time steps. In contrast, the expert rules

E_{r} \in R^{d}

and pruning evaluation scores

E_{p} \in R^{d}

are static embedding vectors. To ensure structural consistency with the time-series data, both vectors were replicated T times along the temporal dimension, resulting in extended representations

{\tilde{E}}_{r}, {\tilde{E}}_{p} \in R^{T \times d}

. Subsequently, the three modalities were concatenated at each time step to form the fused input tensor

X \in R^{T \times 3 d}

, which was then fed into a Transformer encoder. Through the multi-head self-attention mechanism, high-order interactions were established across both temporal and modal dimensions. This enabled the model to effectively capture the dynamic relationships among pruning rules, environmental context, and historical feedback, thereby providing context-aware support for the generation of pruning strategies:

Q = W_{Q} [E_{r}; E_{s}; E_{p}],

(7)

K = W_{K} [E_{r}; E_{s}; E_{p}],

(8)

V = W_{V} [E_{r}; E_{s}; E_{p}],

(9)

where

W_{Q}

,

W_{K}

, and

W_{V}

are trainable query, key, and value projection matrices,

A

is the attention matrix computed from Q and K by the self-attention mechanism, and

F = A V

is the fused feature representation. By leveraging the multimodal Transformer structure, the pruning scheme can learn complex relationships between different modalities, such as which pruning rules are most relevant under current environmental conditions, which sensor data have the greatest influence on pruning decisions, and how historical scores impact current pruning strategies. In the trim mask decoding phase, the fused feature

F

is passed through a Transformer decoder to generate the final pruning mask

M

. The decoder consists of a Transformer and a convolutional neural network (CNN), ensuring that the pruning scheme maintains both global consistency and local precision:

H = Decoder (F), M = σ (Conv2D (H)),

(10)

where

Decoder (\cdot)

represents the Transformer decoder,

Conv2D (\cdot)

denotes a 2D convolution operation, and

σ

is a Sigmoid activation function that ensures the output is in the range

[0, 1]

, representing the pruning mask. The final output

M

is a pixel-level prediction of the pruning area, indicating which branches should be pruned and which should be preserved. Compared to traditional pruning strategies that rely on fixed rules, the proposed Trim Scheme offers adaptive adjustment, online optimization, and multimodal fusion. Initially, the model relies more on the expert system’s pruning rules. However, over time, historical evaluation scores become increasingly influential, allowing pruning decisions to dynamically align with real-world environmental demands. This deep learning-based Transformer approach enables the pruning scheme to autonomously adapt to various orchard growth environments, achieving more precise and efficient pruning strategies, thereby providing an innovative solution for intelligent agricultural management.

2.5.3. Dynamic Evaluation Strategy

The dynamic evaluation strategy quantifies the effectiveness of pruning schemes in real time, ensuring that the tree structure is optimized, the growth environment is improved, and the overall health of the fruit trees is enhanced. This strategy combines multimodal sensor data (light intensity, temperature and humidity, branch growth angles, etc.) with historical pruning scores, analyzed through deep learning models. The evaluation process adopts a weighted update scoring mechanism, where initial evaluations rely on expert system rules, and over time, sensor data are gradually incorporated to dynamically optimize the pruning scheme and improve the long-term reliability and accuracy of pruning results. Since sensor data in the early stages after pruning are still in the dynamic adjustment phase, their reference values are relatively low. However, as time progresses, sensor data can more comprehensively reflect pruning effects, and therefore, the weight of sensor data in the evaluation should gradually increase. To achieve this, a dynamic updating mechanism based on weighted averages has been designed, allowing the pruning evaluation scores to be continuously optimized by combining current sensor data with historical data. After each pruning event, sensor data are input into the pruning evaluation model to calculate an initial pruning effectiveness score. To define the pruning effectiveness score in a reliable and consistent manner, a binary labeling strategy was adopted, wherein pruning outcomes were categorized as either “effective” (label 1) or “ineffective” (label 0). To ensure high-quality supervision, the labeling criteria were developed and validated by a panel of three experts in fruit tree cultivation.
Specifically, the effectiveness of pruning was determined based on three structural and environmental indicators measured within 7–10 days post pruning: (1) the standard deviation of canopy light intensity decreased by more than 25% compared to pre-pruning conditions; (2) the main scaffold branch angles were adjusted to a biologically optimal range (30°–60°) with fluctuation less than 10% over the observation period; and (3) the localized temperature and humidity returned to stable levels (temperature fluctuation < 2 °C, humidity variation < 5%). If two or more of these criteria were met, the pruning outcome was labeled as effective (label 1); otherwise, it was labeled as ineffective (label 0). The entire labeling process was completed by the expert team through a comprehensive analysis of time-series sensor logs and on-site observations, ensuring both scientific validity and consistency across the dataset. Let the pruning effectiveness score be denoted as y_{t}, which is calculated based on the current sensor data D_{t}. As time passes, new sensor data continue to accumulate, and historical data gradually integrate into the evaluation system. During each evaluation update, the latest sensor data D_{t} are combined with the previous evaluation score y_{t} and adjusted through a weighting factor to compute the new pruning evaluation score y_{t + 1}. The update formula is as follows:

y_{t + 1} = α y_{t} + (1 - α) f (D_{t}),

(11)

where y_{t + 1} is the updated pruning evaluation score, y_{t} is the previous pruning evaluation score, f(D_{t}) is the pruning effectiveness evaluation function based on the current sensor data D_{t}, and α is a weighting factor that controls the relative influence of historical and current data. The value of α lies in the range [0, 1], with a larger α giving greater weight to the historical score and a smaller α giving greater weight to current data. In this updating mechanism, the weight of current sensor data gradually increases over time. To implement this, the value of α decreases over time, allowing new sensor data to have an increasing impact on the pruning effectiveness evaluation. The design of the score update strategy takes into account two factors: on the one hand, the importance of sensor data is lower in the early stages after pruning, so a larger weight for historical data ensures evaluation stability; on the other hand, as time progresses, new sensor data become more important, so the impact of current data is gradually increased to ensure that the pruning effectiveness evaluation better reflects the actual situation.
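The two-of-three labeling rule and the weighted update of Equation (11) can be sketched as follows. The thresholds are taken from the text above; everything else is an illustrative assumption: the function and argument names are hypothetical, a single representative scaffold angle stands in for the set of main scaffold branch angles, and the linear decay of α (from 0.9 to 0.3 over 10 updates) is a placeholder, since the text states only that α decreases over time.

```python
def label_pruning_outcome(light_std_drop_pct, angle_deg, angle_fluct_pct,
                          temp_fluct_c, humidity_var_pct):
    """Return 1 (effective) if at least two of the three criteria hold, else 0."""
    light_ok = light_std_drop_pct > 25.0                            # criterion (1)
    angle_ok = 30.0 <= angle_deg <= 60.0 and angle_fluct_pct < 10.0  # criterion (2)
    climate_ok = temp_fluct_c < 2.0 and humidity_var_pct < 5.0       # criterion (3)
    return int(sum([light_ok, angle_ok, climate_ok]) >= 2)


def alpha_schedule(step, alpha_start=0.9, alpha_end=0.3, decay_steps=10):
    """Hypothetical linear decay of alpha; the schedule itself is an assumption."""
    frac = min(step, decay_steps) / decay_steps
    return alpha_start + frac * (alpha_end - alpha_start)


def update_score(y_prev, f_current, alpha):
    """Equation (11): y_{t+1} = alpha * y_t + (1 - alpha) * f(D_t)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * y_prev + (1.0 - alpha) * f_current


def run_updates(y0, sensor_scores):
    """Apply Equation (11) repeatedly, with alpha shrinking at each step."""
    y = y0
    history = [y]
    for step, f_t in enumerate(sensor_scores):
        y = update_score(y, f_t, alpha_schedule(step))
        history.append(y)
    return history
```

With a large α early on, a single noisy sensor-based score moves the evaluation only slightly; as α shrinks, the score tracks f(D_{t}) more closely.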

2.6. Experimental Design

2.6.1. Experimental Environment

To validate the effectiveness of the proposed tree pruning strategy and evaluation system, a high-performance experimental environment was established, and hyperparameters were carefully tuned to ensure the scientific rigor and reliability of the experimental results. The hardware platform consisted of a high-performance computing server equipped with an NVIDIA A100 GPU (80 GB VRAM), an Intel Xeon Gold 6248R processor, and 512 GB of RAM, supporting large-scale data processing and efficient training of deep learning models. The operating system was Ubuntu 20.04, and the deep learning framework used was PyTorch 2.0, which provides robust support for multimodal data fusion and parallel processing. The integration of NVIDIA CUDA 11.8 and cuDNN 8.7.0 further optimized the underlying hardware performance, significantly improving training speed and model stability. In terms of software tools, Python 3.9 was adopted as the primary programming language. Data processing and visualization were carried out using NumPy 1.24.2, Pandas 1.5.3, Matplotlib 3.7.1, and Seaborn 0.12.2. Model evaluation and cross-validation were implemented via Scikit-learn 1.2.2. For image registration, feature extraction, and stitching, OpenCV 4.7.0 was employed to perform Harris corner detection, SIFT feature matching, image fusion, and geometric transformations. Dimensionality reduction was conducted using PCA and PLS algorithms provided by Scikit-learn. ENVI 5.6 served as the primary annotation platform for hyperspectral images, where a hybrid approach combining manual labeling and machine learning-assisted annotation was adopted to enhance annotation efficiency and consistency. During model training, the AdamW optimizer [18] was used, with the initial learning rate set to 1 \times 10^{-4} and gradually decayed using a cosine annealing scheduling strategy to ensure stability in the later training stages. The batch size was set to 32 to balance computational resource consumption and model performance. Mixed-precision training was introduced to reduce memory usage and accelerate training. A dropout rate of 0.1 was applied to alleviate overfitting. The dataset was split into training and validation sets in a ratio of 7:3, ensuring sufficient training samples while enabling the reliable evaluation of generalization performance. Additionally, a 10-fold cross-validation strategy was employed to enhance the robustness and statistical reliability of the experimental results. Specifically, the training set was randomly divided into ten subsets, with one subset used for validation and the remaining nine for training in each iteration. This process was repeated 10 times, and the average performance metrics were reported as the final results, thereby reducing performance fluctuations caused by data partitioning and increasing the credibility of the experimental conclusions.
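The learning-rate schedule and the data partitioning described above can be sketched in plain Python. The cosine formula is the standard closed form of cosine annealing; the total number of steps, the minimum learning rate of 0, the random seed, and all function names are assumptions, since they are not stated in the text.

```python
import math
import random


def cosine_lr(step, total_steps, lr_init=1e-4, lr_min=0.0):
    """Cosine-annealed learning rate at a given step (0 <= step <= total_steps)."""
    cos_term = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_init - lr_min) * cos_term


def train_val_split(n_samples, train_ratio=0.7, seed=0):
    """Shuffle sample indices and split them 7:3 into training and validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = int(round(n_samples * train_ratio))
    return indices[:cut], indices[cut:]


def k_fold(indices, k=10):
    """Yield (train, val) index lists; each fold serves as validation once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val
```

For example, `train_val_split(100)` gives 70 training and 30 validation indices, and `k_fold` over the 70 training indices yields ten partitions with 63 training and 7 validation indices each.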

2.6.2. Baseline Methods

To comprehensively evaluate the performance of the proposed Trim-SegTransformer model, several representative baseline models were selected for comparison. These models cover the two main tasks of instance segmentation and pruning evaluation, and are based on the latest advances in their respective fields, with wide applications. For the tree segmentation task, Mask R-CNN, a general-purpose object detection and instance segmentation model, was chosen. It is capable of multi-object detection and is widely used in smart agriculture scenarios. Mask R-CNN combines Region Proposal Networks (RPNs) and Fully Convolutional Networks (FCNs), enabling the efficient completion of detection and segmentation tasks, especially excelling in multi-object detection. Tiny-Segformer, a lightweight visual segmentation model, is designed for real-time disease detection in resource-constrained environments, with high inference efficiency. It optimizes the use of computational resources, making it suitable for mobile devices and edge computing scenarios. SegNet, a CNN-based segmentation model, is suitable for semantic segmentation tasks. Its symmetric encoder–decoder structure allows for effective pixel-level classification and is widely applied in various segmentation tasks. Box2Mask [19], an instance segmentation model based on bounding box annotations, achieves efficient instance segmentation through simple bounding box annotations, without requiring pixel-level mask annotations. Box2Mask integrates an Instance-aware Decoder (IAD) and Box-level Matching Assignment, demonstrating excellent performance in segmentation tasks with complex backgrounds or similar appearance objects. CS-Net [20], a hybrid architecture combining CNNs and Simpleformer, is designed to optimize agricultural image segmentation, addressing issues such as high computational complexity and dataset quality dependence. 
CS-Net, using Simple-Attention Block (SIAB) and CS modules, effectively combines local detail extraction with long-range dependency modeling, exhibiting superior segmentation performance and inference efficiency on small agricultural datasets.

For the pruning evaluation task, the following three baseline models were selected for comparison: Support Vector Machine (SVM), which projects data into high-dimensional space for classification, exhibiting good generalization ability and efficient classification performance, suitable for various classification tasks; Multi-Layer Perceptron (MLP), a simple and efficient neural network architecture suitable for both regression and classification tasks—by applying multiple layers of nonlinear transformations, MLP is able to effectively learn complex mappings between inputs and outputs; and Random Forest, an ensemble learning method that enhances model stability and precision by using multiple decision trees. Random Forest performs excellently in handling high-dimensional data and capturing complex relationships between features, and is widely applied in classification and regression tasks. By comparing these baseline models, the advantages and areas for improvement of the proposed method can be clearly identified.

2.6.3. Evaluation Metrics

To comprehensively evaluate the performance of the Trim-SegTransformer model, corresponding evaluation metrics were defined for different tasks. For the tree segmentation task, metrics such as mAP@50, mAP@75, accuracy, precision, recall, and F1 score were used to measure the model’s performance in segmentation tasks. For the pruning evaluation task, accuracy, precision, recall, and F1 score were used as the main evaluation metrics to assess the model’s overall performance in classification tasks. The calculation formulas are as follows:

mAP = \frac{1}{N} \sum_{i=1}^{N} AP_{i}

(12)

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}

(13)

Precision = \frac{TP}{TP + FP}

(14)

Recall = \frac{TP}{TP + FN}

(15)

F1\text{-}Score = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}

(16)

Here, mAP@50 refers to the mean average precision at an Intersection over Union (IoU) threshold of 50%, and mAP@75 refers to the mean average precision at an IoU threshold of 75%, which measures the model’s performance in detection and segmentation under stricter overlap requirements.

TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
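The metrics in Equations (12)–(16) can be computed directly from confusion-matrix counts, as in the following minimal sketch. The function names are hypothetical, the zero-division fallbacks are a choice made here, and `mask_iou` shows only the overlap measure that underlies the 50% and 75% thresholds, not a full mAP evaluation pipeline.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 score from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def mask_iou(pred, truth):
    """Intersection over Union of two equal-sized binary masks (nested 0/1 lists)."""
    inter = union = 0
    for prow, trow in zip(pred, truth):
        for p, t in zip(prow, trow):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else 0.0


def mean_average_precision(ap_values):
    """Equation (12): the mean of the individual AP values."""
    return sum(ap_values) / len(ap_values)
```

Under mAP@50, a predicted mask counts as a match only if `mask_iou` with the ground-truth mask is at least 0.5; mAP@75 raises this requirement to 0.75.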

3. Results and Discussion

3.1. Tree Segmentation Detection Results

The purpose of this experiment is to evaluate the performance of various models in tree segmentation tasks, with a particular focus on comparing the detection results across several models. The goal is to analyze the strengths and weaknesses of each model in terms of precision, recall, accuracy, F1 score, and mean average precision (mAP). To ensure scientific validity and reproducibility in density classification, the commonly used light penetration indicator in orchard management—Leaf Area Index (LAI)—was adopted to define the density levels. Specifically, low-density areas were defined as those with LAI values ranging from 1.0 to 2.5, medium density from 2.6 to 4.0, and high density as greater than 4.1. This classification criterion was established based on in-field measurement data and was further refined with expert knowledge from fruit tree cultivation specialists. Tree segmentation plays a crucial role in accurately detecting and identifying tree structures, especially in complex scenarios involving intricate tree shapes and background interference. The performance of the model directly affects the accuracy and efficiency of subsequent tasks such as pruning. By comparing the performance of various deep learning models, this experiment provides theoretical guidance for selecting the appropriate model for practical applications and lays the foundation for future optimization efforts.
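The LAI-based density levels defined above can be written as a small lookup. This is a sketch under the stated thresholds only: the function name is hypothetical, and because the text does not assign LAI values that fall between the stated ranges (for example 2.55 or exactly 4.1) or below 1.0, the function returns `None` for them rather than guessing a level.

```python
def density_level(lai):
    """Map a Leaf Area Index value to the density level defined in the text."""
    if 1.0 <= lai <= 2.5:
        return "low"
    if 2.6 <= lai <= 4.0:
        return "medium"
    if lai > 4.1:
        return "high"
    # Values outside the stated ranges are left unclassified (assumption).
    return None
```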

The results shown in Table 2, Figure 8 and Figure 9 indicate that there are notable differences in precision, recall, F1 score, and mAP across the models. Mask R-CNN, a classical object detection model, performs reasonably well in tree segmentation tasks, with a precision of 0.83, recall of 0.79, and an F1 score of 0.81, suggesting that it is somewhat effective in tree-shape recognition. However, compared to other models, Mask R-CNN’s performance is relatively lower, especially in mAP@50 and mAP@75, where the scores are 0.80 and 0.79, respectively. This indicates that there is room for improvement in segmentation accuracy and precision. SegNet, a typical semantic segmentation network, performs better in tree segmentation, particularly in recall (0.82) and mAP@75 (0.83), showing a clear improvement over Mask R-CNN. This is due to SegNet’s encoder–decoder structure, which adapts well to complex scenes and is better at capturing finer details between branches. However, its precision is slightly lower than that of Tiny-Segformer and CS-Net, suggesting some limitations in fine-grained segmentation. Tiny-Segformer performs excellently in this task, achieving some of the best results among the baselines in precision, recall, F1 score, and mAP, particularly in mAP@50 (0.88) and mAP@75 (0.86). This demonstrates that Tiny-Segformer is capable of balancing precision and recall, making it suitable for scenarios that require precise detection and high recall. CS-Net and Box2Mask perform similarly and effectively enhance the precision and accuracy of tree segmentation, particularly in handling complex branch structures and background interference, reducing false negatives. Finally, the proposed method outperforms all other models, achieving the highest precision of 0.94, recall of 0.90, F1 score of 0.92, and mAP@50 and mAP@75 of 0.91 and 0.90, respectively. 
This demonstrates that the proposed model integrates multimodal data more effectively and improves tree segmentation accuracy and stability through innovative network design.

From a structural perspective, the differences in these results can be attributed to the design of each model. Mask R-CNN uses a region proposal network (RPN) to generate candidate boxes and detects regions of interest, making it suitable for simpler and clearer segmentation tasks. However, it is weaker at modeling the complex relationships between branches. SegNet and Tiny-Segformer focus more on fine-grained processing of feature maps to improve segmentation accuracy, particularly in handling blurry boundaries, occlusions, and small objects, where they demonstrate superior performance. CS-Net and Box2Mask enhance the expression of local features with improved structures, optimizing segmentation between the background and target, thus excelling in complex scenes. The proposed method further enhances the model’s expressive power by deeply integrating expert system rules and sensor data, combined with the advanced Transformer architecture, especially in multimodal data processing and cross-modal learning. This approach significantly improves the segmentation accuracy and precision, making the model robust and effective. Therefore, the experimental results show that deep learning models based on Transformer, combined with Path Attention Mechanisms and pruning evaluation models, can effectively improve tree segmentation accuracy and robustness.

3.2. Pruning Evaluation Analysis

The purpose of this experiment is to compare the performance of different models in pruning evaluation tasks, with a particular focus on how the models evaluate pruning effects based on post-pruning data. As shown in Table 3, the pruning evaluation capabilities of the models vary. Among the baseline models, Support Vector Machine (SVM) and Multi-Layer Perceptron (MLP) both show relatively stable pruning evaluation results. The SVM model has a precision of 0.81, recall of 0.78, and F1 score of 0.79, the weakest performance among the compared models. This could be due to the limitations of SVM in handling nonlinear and complex data features, especially when dealing with high-dimensional and large-scale datasets where its feature learning ability is constrained. In contrast, the MLP model performs better than SVM, with a precision of 0.85, recall of 0.82, and F1 score of 0.83. MLP can better learn complex patterns and nonlinear relationships in data through its stacked nonlinear layers, showing strong adaptability, particularly when dealing with more complex pruning evaluation tasks, allowing it to capture more feature information. The Random Forest model performs better than both SVM and MLP, with a precision of 0.88, recall of 0.85, and F1 score of 0.86. Random Forest, by aggregating decisions from multiple decision trees, is more effective in handling noise and uncertainty in high-dimensional data, making it more stable and robust in tree pruning evaluation tasks. Finally, the proposed method outperforms all other models, achieving a precision of 0.92, recall of 0.89, and F1 score of 0.91, clearly higher than the others. This indicates that the proposed model integrates more multimodal data and is optimized through deep learning networks, enabling more accurate pruning evaluation. It is also more adaptable, particularly when facing complex environmental factors, providing more precise pruning suggestions.

3.3. Online Learning Strategy Analysis

The objective of this experiment is to analyze the impact of online learning strategies on model performance, particularly in the tree pruning task, by gradually updating model evaluation scores to reflect the improvement in pruning effects. The key to the experiment lies in simulating the pruning process, where new sensor data are continuously collected over time, and these data are used to optimize the pruning evaluation model. The results of the experiment, shown in Figure 10, are presented as accuracy changes at different time intervals, illustrating how the online learning strategy affects the model’s learning progress and pruning evaluation effectiveness. The data in the chart reflect a significant increase in model accuracy over time, especially in the first 30 days, where accuracy growth is particularly noticeable. This could be attributed to the model’s rapid adaptation to new data in the early stage. As time progresses and more sensor data accumulate, the model gradually optimizes evaluation results, making the pruning plan more accurate, with the accuracy stabilizing at a high level.

3.4. Performance Analysis Under Different Tree Density Levels

This experiment aims to analyze the impact of different tree densities on model performance to evaluate its segmentation and pruning assessment capabilities in complex environments. The experimental results, as shown in Figure 11, indicate a decreasing trend in all performance metrics as density increases, suggesting that high-density scenarios pose challenges to the model’s segmentation and pruning evaluation capabilities. In low-density environments, precision, recall, accuracy, mAP@50, and mAP@75 are all close to 1.0, indicating clear target segmentation and high pruning evaluation accuracy. However, under medium-density conditions, all metrics decline, with recall and mAP@75 showing the most significant drop, suggesting a reduction in the model’s recognition ability in more complex backgrounds. In high-density scenarios, all metrics further decrease, with mAP@75 approaching 0.85, demonstrating that dense foliage occlusion significantly affects model performance. The overall trend highlights that improving segmentation and evaluation capabilities in high-density environments remains a key direction for future optimization.

3.5. Ablation Experiment of Different Sensor Data for Evaluation

The purpose of this experiment is to analyze the impact of different sensor data on pruning evaluation model performance through ablation experiments, specifically evaluating the contribution of different data modalities to pruning effect assessment. By systematically removing various sensor data, the experiment aims to reveal the critical role of environmental factors, such as light intensity, temperature and humidity, and branch growth angle, in the model’s pruning performance. The experimental results demonstrate the variations in metrics like precision, recall, accuracy, and F1 score under different sensor data combinations, which help identify the sensor data that are crucial for pruning effect evaluation, thus providing a basis for future model optimization.

As shown in Table 4, the model achieved optimal performance when light intensity, temperature and humidity, and branch growth angle data were jointly incorporated into the evaluation. Specifically, the precision reached 0.92, recall was 0.89, accuracy was 0.91, and the F1 score was 0.91. These results demonstrate that the integration of multimodal sensor data significantly enhances the comprehensiveness and accuracy of pruning effect evaluation. Further ablation studies revealed substantial differences in the impact of individual sensor modalities on model performance. Among them, the exclusion of light intensity data resulted in the most pronounced degradation, with precision and recall dropping to 0.75 and 0.71, respectively. This highlights the critical importance of light-related features in pruning assessment. Such significance stems from the fact that light distribution directly reflects the internal light penetration within the canopy, while one of the primary objectives of pruning is to optimize branch and leaf structure to improve light uniformity. As a result, the model exhibits strong reliance on this modality. In comparison, temperature and humidity data provide contextual information on microclimatic variations. Although short-term fluctuations are relatively small, these data are vital for capturing the physiological stability of the tree after pruning. When temperature and humidity data were excluded, model performance also declined, with precision dropping to 0.73 and recall to 0.70, indicating its indispensable auxiliary role in long-term effect evaluation. Branch growth angle, serving as a direct indicator of spatial structural changes, captures the morphological adjustments induced by pruning. Its removal also caused performance deterioration, particularly under high-density canopy conditions, where structural variation plays a more prominent role. 
In summary, the three sensor types exhibit strong complementarity in pruning assessment: light intensity reflects energy distribution changes, temperature and humidity reveal physiological responses, and branch angle captures geometric restructuring. A collaborative modeling approach incorporating all three modalities is essential for achieving a comprehensive and robust evaluation of pruning quality, thereby providing a reliable foundation for input feature selection and system deployment.

3.6. Discussion

To achieve intelligent analysis and prediction of tree pruning, various 3D modeling-based techniques have been proposed in existing studies. Tagarakis et al. (2018) demonstrated the feasibility of using LiDAR sensors to capture olive tree structures prior to pruning, assisting managers in making more informed pruning decisions [24]. Strnad et al. (2020) further proposed a pruning optimization framework based on fully virtual trees, capable of simulating tree structure changes and subsequent growth under different pruning strategies, while supporting multi-objective optimization [25]. However, this method relies on complete and idealized virtual environments, which are difficult to adapt directly to real orchards with structurally complex and partially observed individual trees. On the deployment side, Underwood et al. (2016) developed a mobile 3D scanning platform for the efficient acquisition of fruit tree point-cloud data, suitable for continuous scanning tasks in outdoor environments [26]. Westling et al. (2018) proposed a light distribution analysis method based on low-resolution laser point clouds, which, when combined with public weather data, enables the rapid estimation of canopy light distribution while significantly reducing hardware requirements [27]. Building on this, Westling et al. introduced the SimTreeLS system, capable of generating realistic virtual fruit tree datasets simulating LiDAR point clouds, thereby providing a standardized testing platform for pruning decision algorithms [28]. Compared with the aforementioned studies, the method proposed in this paper focuses on unified modeling and transfer analysis between real and virtual data. Real tree point-cloud data are captured via handheld LiDAR scanners and combined with the SimTreeLS-generated virtual tree models to enable non-destructive simulation and analysis of pruning behaviors. 
Unlike optimization frameworks that rely on fully observed structures or idealized light estimation methods, our approach introduces local structural analysis and multi-scale light reconstruction mechanisms to effectively restore the impact of pruning on light accessibility in low-density point cloud scenarios. This enables accurate evaluation of pre- and post-pruning light distribution for individual trees. Moreover, the proposed method exhibits good deployability and transferability across different fruit species, laying a practical foundation for the development of intelligent pruning decision-support systems in future orchards.

Although the proposed system is based on deep learning and multi-modal data fusion techniques, its core purpose is straightforward: to improve the scientific accuracy and operational efficiency of apple tree pruning [29]. The experimental results demonstrated that the system maintains high pruning accuracy across various tree density levels. Notably, even in high-density canopy scenarios—where traditional manual pruning struggles due to occlusion—the system effectively identifies the target branches for pruning [30]. This addresses a common issue in orchard management: difficulty in visually distinguishing weak or diseased branches from healthy ones. From a practical standpoint, this result is especially valuable under modern high-density planting modes, where consistent pruning quality is critical for ensuring balanced light distribution and adequate ventilation, thereby enhancing overall tree health and fruit yield [31]. Moreover, by incorporating the continuous monitoring of light intensity, temperature, humidity, and branch angles, the system dynamically adjusts its pruning recommendations over time, effectively “learning” from the orchard’s evolving conditions [32]. This makes the solution adaptable to different fruit species, climatic regions, and orchard scales. While the technical implementation involves multiple model components, the end goal is to deliver a practical, intelligent, and adaptive decision-support tool for real-world agricultural applications [33]. It enables a shift from subjective, experience-driven pruning decisions to objective, data-driven practices, ultimately benefiting both productivity and sustainability in orchard management [7].

3.7. Limitation and Future Work

Despite the significant achievements of the deep learning-based tree pruning evaluation system proposed in this study, particularly in tree segmentation and pruning effect evaluation, several limitations remain, and there is still room for further improvements. First, the diversity and quality of data are critical to the performance of the model. In practical applications, differences in tree species, climate conditions, and growth states can impact data collection and processing, leading to insufficient representativeness in the training data. Although the multimodal data fusion strategy proposed in this study, which integrates hyperspectral images, sensor data, and expert system rules, was employed, the model’s performance may still be limited in certain extreme or unique environments. Therefore, future work should consider further validating the robustness and generalization of the model in more diverse scenarios. Additionally, in practical applications, the real-time update of sensor data requires substantial computational resources and a stable data transmission system. Particularly in large orchard environments, the layout of sensors and data collection may be constrained by spatial and technical limitations. Thus, how to effectively handle incomplete or erroneous data and ensure data quality during the online learning process remains a direction for future research.

4. Conclusions

Tree pruning is one of the key tasks in precision agricultural management, particularly in fruit tree pruning, where accurately identifying branches to be pruned, optimizing tree structure, and improving pruning effectiveness have always been challenges in the agricultural field. An intelligent system capable of real-time evaluation and optimization of tree pruning effects has been developed in this study. The innovations of this work are reflected in several aspects. First, the fusion of multimodal data offers a novel solution to the tree pruning task. By jointly encoding textual pruning rules provided by the expert system, real-time data from environmental sensors (such as light intensity, temperature and humidity, and branch angles), and historical scores from the pruning evaluation model, the proposed method leverages the complementary nature of multiple data sources, providing more comprehensive and accurate pruning recommendations. Second, the application of the Path Attention Mechanism in tree pruning tasks significantly enhances the model’s ability to focus on tree branch structures. By incorporating path information, the Path Attention Mechanism effectively strengthens the importance of different paths within the tree structure, ensuring the accuracy of branch segmentation and pruning paths. Experimentally, the proposed model achieved significant improvements in both tree segmentation and pruning evaluation tasks. In the tree segmentation task, the method outperformed other comparison models with precision (0.94), recall (0.90), and F1 score (0.92), with particularly outstanding performance in mAP@50 (0.91) and mAP@75 (0.90). This demonstrates that the fusion of multimodal data effectively enhances pruning precision, especially in scenarios with high branch complexity, where the model better adapts to environmental changes. 
Through ablation experiments evaluating different sensor data, the importance of various sensor data for pruning effect evaluation was further verified. The experimental results show that sensor data such as light intensity, temperature and humidity, and branch growth angle are complementary in pruning effect evaluation and can significantly improve evaluation accuracy. These works and experiments further support the effectiveness of the multimodal data fusion strategy and provide a theoretical basis for the future optimization of pruning evaluation models. The proposed system holds strong potential for real-world deployment, particularly in high-density orchards, where it can enhance pruning efficiency and consistency while minimizing losses from subjective errors. In the future, this system could be integrated with agricultural robots for automated pruning tasks and extended to other fruit species such as pear and peach trees. Furthermore, by adapting to multi-season pruning requirements, the system can support dynamic strategies across different growth stages, advancing the digitization and intelligence of modern agriculture.

Author Contributions

Conceptualization, T.H., W.W., F.Y. and C.L. (Chunli Lv); Methodology, T.H., W.W. and F.Y.; Software, T.H., W.W. and F.Y.; Validation, C.L. (Chengze Li); Formal analysis, C.L. (Chengze Li), S.L. and R.H.; Investigation, C.L. (Chengze Li), S.L. and R.H.; Resources, M.L.; Data curation, M.L.; Writing—original draft, T.H., W.W., F.Y., M.L., C.L. (Chengze Li), S.L. and R.H.; Visualization, M.L. and S.L.; Supervision, R.H. and C.L. (Chunli Lv); Project administration, C.L. (Chunli Lv); Funding acquisition, C.L. (Chunli Lv). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China (2024YFC2607600).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Majeed, Y.; Zhang, J.; Zhang, X.; Fu, L.; Karkee, M.; Zhang, Q.; Whiting, M.D. Deep learning based segmentation for automated training of apple trees on trellis wires. Comput. Electron. Agric. 2020, 170, 105277.
2. Tinoco, V.; Silva, M.F.; Santos, F.N.; Rocha, L.F.; Magalhães, S.; Santos, L.C. A Review of Pruning and Harvesting Manipulators. In Proceedings of the 2021 IEEE International Conference on Autonomous Robot Systems and Competitions (ICARSC), Santa Maria da Feira, Portugal, 28–29 April 2021; pp. 155–160.
3. Borrenpohl, D.; Karkee, M. Automated pruning decisions in dormant sweet cherry canopies using instance segmentation. Comput. Electron. Agric. 2023, 207, 107716.
4. Kong, X.; Li, X.; Zhu, X.; Guo, Z.; Zeng, L. Detection model based on improved faster-RCNN in apple orchard environment. Intell. Syst. Appl. 2024, 21, 200325.
5. Li, Y.; Zhang, Z.; Wang, X.; Fu, W.; Li, J. Automatic reconstruction and modeling of dormant jujube trees using three-view image constraints for intelligent pruning applications. Comput. Electron. Agric. 2023, 212, 108149.
6. Lodolini, E.; Polverigiani, S.; Giorgi, V.; Famiani, F.; Neri, D. Time and type of pruning affect tree growth and yield in high-density olive orchards. Sci. Hortic. 2023, 311, 111831.
7. Tong, S.; Zhang, J.; Li, W.; Wang, Y.; Kang, F. An image-based system for locating pruning points in apple trees using instance segmentation and RGB-D images. Biosyst. Eng. 2023, 236, 277–286.
8. Ryalls, J.M.; Garratt, M.P.; Spadaro, D.; Mauchline, A.L. The benefits of integrated pest management for apple depend on pest type and production metrics. Front. Sustain. Food Syst. 2024, 8, 1321067.
9. Zahid, A.; Mahmud, M.S.; He, L.; Choi, D.; Heinemann, P.; Schupp, J. Development of an integrated 3R end-effector with a cartesian manipulator for pruning apple trees. Comput. Electron. Agric. 2020, 179, 105837.
10. Kolmanič, S.; Strnad, D.; Kohek, Š.; Benes, B.; Hirst, P. An algorithm for automatic dormant tree pruning. Appl. Soft Comput. 2021, 99, 106931.
11. Zahid, A.; Mahmud, M.S.; He, L.; Heinemann, P.; Choi, D.; Schupp, J. Technological advancements towards developing a robotic pruner for apple trees: A review. Comput. Electron. Agric. 2021, 189, 106383.
12. Sapkota, R.; Ahmed, D.; Karkee, M. Comparing YOLOv8 and Mask R-CNN for instance segmentation in complex orchard environments. Artif. Intell. Agric. 2024, 13, 84–99.
13. You, A.; Grimm, C.; Silwal, A.; Davidson, J.R. Semantics-guided skeletonization of upright fruiting offshoot trees for robotic pruning. Comput. Electron. Agric. 2022, 192, 106622.
14. Chen, Z.; Ting, D.; Newbury, R.; Chen, C. Semantic segmentation for partially occluded apple trees based on deep learning. Comput. Electron. Agric. 2021, 181, 105952.
15. Moriya, É.A.S.; Imai, N.N.; Tommaselli, A.M.G.; Berveglieri, A.; Santos, G.H.; Soares, M.A.; Marino, M.; Reis, T.T. Detection and mapping of trees infected with citrus gummosis using UAV hyperspectral data. Comput. Electron. Agric. 2021, 188, 106298.
16. Kang, H.; Chen, C. Fruit detection, segmentation and 3D visualisation of environments in apple orchards. Comput. Electron. Agric. 2020, 171, 105302.
17. Maheswari, P.; Raja, P.; Karkee, M.; Raja, M.; Baig, R.U.; Trung, K.T.; Hoang, V.T. Performance Analysis of Modified DeepLabv3+ Architecture for Fruit Detection and Localization in Apple Orchards. Smart Agric. Technol. 2024, 10, 100729.
18. Loshchilov, I. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101.
19. Li, W.; Liu, W.; Zhu, J.; Cui, M.; Hua, R.Y.X.; Zhang, L. Box2mask: Box-supervised instance segmentation via level-set evolution. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5157–5173.
20. Liu, L.; Li, G.; Du, Y.; Li, X.; Wu, X.; Qiao, Z.; Wang, T. CS-net: Conv-simpleformer network for agricultural image segmentation. Pattern Recognit. 2024, 147, 110140.
21. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
22. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
23. Zhang, Y.; Lv, C. TinySegformer: A lightweight visual segmentation model for real-time agricultural pest detection. Comput. Electron. Agric. 2024, 218, 108740.
24. Tagarakis, A.; Koundouras, S.; Fountas, S.; Gemtos, T. Evaluation of the use of LIDAR laser scanner to map pruning wood in vineyards and its potential for management zones delineation. Precis. Agric. 2018, 19, 334–347.
25. Strnad, D.; Kohek, Š.; Benes, B.; Kolmanič, S.; Žalik, B. A framework for multi-objective optimization of virtual tree pruning based on growth simulation. Expert Syst. Appl. 2020, 162, 113792.
26. Underwood, J.P.; Hung, C.; Whelan, B.; Sukkarieh, S. Mapping almond orchard canopy volume, flowers, fruit and yield using lidar and vision sensors. Comput. Electron. Agric. 2016, 130, 83–96.
27. Westling, F.; Underwood, J.; Örn, S. Light interception modelling using unstructured LiDAR data in avocado orchards. Comput. Electron. Agric. 2018, 153, 177–187.
28. Westling, F.; Bryson, M.; Underwood, J. SimTreeLS: Simulating aerial and terrestrial laser scans of trees. Comput. Electron. Agric. 2021, 187, 106277.
29. He, L.; Schupp, J. Sensing and automation in pruning of apple trees: A review. Agronomy 2018, 8, 211.
30. Robinson, T.L.; Lakso, A.N.; Ren, Z. Modifying apple tree canopies for improved production efficiency. HortScience 1991, 26, 1005–1012.
31. Zhang, X.; He, L.; Majeed, Y.; Whiting, M.D.; Karkee, M.; Zhang, Q. A precision pruning strategy for improving efficiency of vibratory mechanical harvesting of apples. Trans. ASABE 2018, 61, 1565–1576.
32. Mhamed, M.; Zhang, Z.; Yu, J.; Li, Y.; Zhang, M. Advances in apple’s automated orchard equipment: A comprehensive research. Comput. Electron. Agric. 2024, 221, 108926.
33. Dong, X.; Kim, W.Y.; Yu, Z.; Oh, J.Y.; Ehsani, R.; Lee, K.H. Improved voxel-based volume estimation and pruning severity mapping of apple trees during the pruning period. Comput. Electron. Agric. 2024, 219, 108834.

Figure 1. The diagram illustrates three core challenges in the fruit tree pruning task.

Figure 2. The diagram illustrates the data collection scenario for fruit trees, where (a) represents the sensor mounted on a stand to capture the three-dimensional information of the tree; (b) is the laptop connected to the sensor for real-time data processing and storage; (c) is the side view of the fruit tree, showing the growth and fruit distribution of the tree; (d) is the reference sphere used for calibrating the data collected by the sensor, ensuring the accuracy of the data.

Figure 3. Multi-sensor deployment for fruit tree pruning evaluation, showing the arrangement of three types of sensors: (a) the PAR light sensor (LI-COR LI-250A Light Meter with LI-190R Quantum Sensor), placed at the top and sides of the tree canopy; (b) the SHT30 temperature and humidity sensor, distributed across the upper, middle, and lower layers of the canopy; and (c) the ADXL345 tilt sensor, installed on the main branches. This sensor network monitors the key parameters for pruning evaluation.

Figure 4. The diagram illustrates the overall architecture of the tree pruning and dynamic evaluation system.

Figure 5. The diagram illustrates the network structure flowchart of the Trim-SegTransformer.

Figure 6. The diagram illustrates the structure of the Path Attention Mechanism, which optimizes the pruning plan by introducing path information, allowing the model to focus on important paths between branches.

Figure 7. The diagram illustrates the schematic of the tree pruning strategy, showing the process from the input of sensor data and expert system rules to the generation of the final pruning plan.

Figure 8. Performance curves during training of different models.

Figure 9. Visualization results of tree segmentation across models.

Figure 10. Performance analysis of our method under the online learning strategy.

Figure 11. Performance of our method under different tree density levels.

Table 1. Data collection details for hyperspectral imaging in apple orchards.

Spectral Band	Wavelength Range	Number of Samples
Blue	450–520 nm	2781
Green	520–590 nm	2593
Red	620–760 nm	2964

Table 2. Tree segmentation detection results.

Model	Precision	Recall	Accuracy	F1 Score	mAP@50	mAP@75	FPS
Mask R-CNN [21]	0.83	0.79	0.81	0.81	0.80	0.79	18.2
SegNet [22]	0.86	0.82	0.84	0.84	0.85	0.83	22.1
Tiny-Segformer [23]	0.88	0.85	0.86	0.86	0.88	0.86	29.4
CS-net [20]	0.90	0.88	0.89	0.89	0.89	0.88	25.0
Box2Mask [19]	0.91	0.89	0.90	0.90	0.89	0.88	24.3
Proposed Method	0.94	0.90	0.92	0.92	0.91	0.90	31.2
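The F1 scores in Table 2 can be cross-checked against the reported precision and recall using the standard definition F1 = 2PR / (P + R). The sketch below is only a reader-side consistency check, not part of the study's pipeline; the names `TABLE_2`, `f1_score`, and `inconsistent_rows` are hypothetical, and the 0.01 tolerance is an assumption that allows for the inputs being rounded to two decimals.

```python
# (precision, recall, reported F1) copied from Table 2.
TABLE_2 = {
    "Mask R-CNN": (0.83, 0.79, 0.81),
    "SegNet": (0.86, 0.82, 0.84),
    "Tiny-Segformer": (0.88, 0.85, 0.86),
    "CS-net": (0.90, 0.88, 0.89),
    "Box2Mask": (0.91, 0.89, 0.90),
    "Proposed Method": (0.94, 0.90, 0.92),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def inconsistent_rows(rows, tol=0.01):
    """Return the names of rows whose reported F1 deviates from the
    recomputed value by more than tol (assumed rounding allowance)."""
    return [
        name
        for name, (p, r, reported) in rows.items()
        if abs(f1_score(p, r) - reported) > tol
    ]


if __name__ == "__main__":
    for name, (p, r, reported) in TABLE_2.items():
        print(f"{name}: recomputed {f1_score(p, r):.3f}, reported {reported:.2f}")
    print("Inconsistent rows:", inconsistent_rows(TABLE_2))
```

Every row of Table 2 passes this check. Applying the same formula to the full-sensor row of Tables 3 and 4 (precision 0.92, recall 0.89) gives about 0.905 against a reported 0.91, a gap that rounding of the two inputs can account for.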

Table 3. Pruning evaluation analysis results.

Model	Precision	Recall	Accuracy	F1 Score
SVM	0.81	0.78	0.80	0.79
MLP	0.85	0.82	0.83	0.83
Random Forest	0.88	0.85	0.86	0.86
Ours	0.92	0.89	0.91	0.91

Table 4. Ablation experiment of different sensor data for evaluation.

Light	Temperature and Humidity	Branch Growth Angle	Precision	Recall	Accuracy	F1 Score
✓	✕	✕	0.75	0.71	0.73	0.73
✕	✓	✕	0.73	0.70	0.71	0.71
✕	✕	✓	0.78	0.74	0.76	0.76
✕	✓	✓	0.86	0.83	0.84	0.84
✓	✓	✕	0.84	0.80	0.82	0.82
✓	✕	✓	0.81	0.78	0.80	0.79
✓	✓	✓	0.92	0.89	0.91	0.91
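One way to read Table 4 is to ask how much accuracy is lost when a single sensor type is removed from the full configuration. The following sketch computes that leave-one-out drop from the accuracy column above; the identifiers `SENSORS`, `ACCURACY`, and `leave_one_out_drop` are hypothetical, and this analysis is an illustration rather than one reported by the authors.

```python
SENSORS = ("light", "temp_humidity", "branch_angle")

# Accuracy column of Table 4, keyed by the set of sensor types enabled.
ACCURACY = {
    frozenset({"light"}): 0.73,
    frozenset({"temp_humidity"}): 0.71,
    frozenset({"branch_angle"}): 0.76,
    frozenset({"temp_humidity", "branch_angle"}): 0.84,
    frozenset({"light", "temp_humidity"}): 0.82,
    frozenset({"light", "branch_angle"}): 0.80,
    frozenset(SENSORS): 0.91,
}


def leave_one_out_drop(sensor):
    """Accuracy lost when one sensor type is removed from the full set."""
    full = frozenset(SENSORS)
    without = full - {sensor}
    # Round to two decimals to match the table's precision.
    return round(ACCURACY[full] - ACCURACY[without], 2)


if __name__ == "__main__":
    for sensor in SENSORS:
        print(f"without {sensor}: accuracy drops by {leave_one_out_drop(sensor):.2f}")
```

On these figures the drop is 0.07 without light, 0.11 without temperature and humidity, and 0.09 without branch growth angle, so every sensor type contributes when the other two are present. Because each value in the table is rounded to two decimals, small differences between these drops should be read with caution.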

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

