Article

PCC-YOLO: A Fruit Tree Trunk Recognition Algorithm Based on YOLOv8

Yajie Zhang, Weiliang Jin, Baoxing Gu, Guangzhao Tian, Qiuxia Li, Baohua Zhang and Guanghao Ji
1 College of Engineering, Nanjing Agricultural University, Nanjing 210031, China
2 School of Smart Agriculture (Artificial Intelligence), Nanjing Agricultural University, Nanjing 210031, China
* Author to whom correspondence should be addressed.
Agriculture 2025, 15(16), 1786; https://doi.org/10.3390/agriculture15161786
Submission received: 22 July 2025 / Revised: 14 August 2025 / Accepted: 19 August 2025 / Published: 21 August 2025
(This article belongs to the Section Artificial Intelligence and Digital Agriculture)

Abstract

With the development of smart agriculture, the precise identification of fruit tree trunks by orchard management robots has become a key technology for achieving autonomous navigation. To address the difficulty of distinguishing tree trunks from their background in orchards, this study introduces PCC-YOLO (PENet, CoT-Net, and Coord-SE attention-based YOLOv8), a new trunk detection model based on YOLOv8. A pyramid enhancement network (PENet) is introduced into YOLOv8 to strengthen feature extraction under low-contrast conditions, a context transformer module (CoT-Net) is used to improve global perception, and a combination of coordinate attention (Coord-Att) and SENetV2 is employed to optimize target localization accuracy. Experimental results show that PCC-YOLO achieves a mean average precision (mAP) of 82.6% on a self-built orchard dataset (5000 images) and a detection speed of 143.36 FPS, a 4.8-percentage-point mAP improvement over the baseline YOLOv8 model, while maintaining a low computational load (7.8 GFLOPs). Compared with the baseline YOLOv8 and other common YOLO variants, the model offers a superior balance of accuracy, speed, and computational cost, providing an efficient solution for the real-time autonomous navigation of orchard management robots.

1. Introduction

Smart agriculture, also known as the Third Agricultural Revolution, is transforming traditional agriculture by utilizing new technologies to achieve high-quality and labor-saving agricultural production. Orchard intelligence is a key approach in advancing the orchard industry, improving production quality, and addressing labor shortages [1,2,3,4,5]. Navigation is crucial for orchard management robots to operate in orchards, and accurate identification of fruit tree trunks is essential for autonomous navigation [6].
Traditional trunk detection methods include multi-camera and ultrasonic sensor fusion, laser scanning, and others, as proposed by researchers such as Chen et al., Maeyama et al., Andersen et al., Shalal et al., Gimenez et al., and Jiang and Ahamed. Each method has limitations, such as sensitivity to lighting changes and complex backgrounds, and the need for significant computational resources and professional software [7,8,9,10,11,12,13,14,15,16,17,18]. In dense orchard environments, complex backgrounds can scatter or weaken ultrasonic signals, interfering with effective trunk detection. In regions with extensive vegetation, laser signals may be obstructed, leading to incomplete point cloud data. Additionally, processing point cloud data from laser scanning requires substantial computational resources and specialized software [19,20,21].
In summary, smart agriculture is transforming traditional agriculture by utilizing new technologies to improve efficiency, enhance product quality, and expand industrial scale. Accurate identification of fruit tree trunks is crucial for autonomous navigation and for addressing labor shortages in the orchard industry. Traditional trunk detection methods, such as multi-camera and ultrasonic sensor fusion, laser scanning, and RGB-D imaging, exhibit limitations that must be addressed to fully realize the potential of smart agriculture.
YOLO has also been applied to detecting and measuring fruit trees. However, traditional CNNs have limitations in feature reuse and in controlling computational redundancy [22]. Joseph Redmon and colleagues improved the YOLO detector with YOLO9000, which achieved an average precision of 19.7 on the ImageNet detection validation set [23].
Ziang Cao introduced an exponential moving average (EMA) mechanism and proposed a detection model named YOLOv8-Trunk, which achieved a final accuracy of 92.7% in detecting vine trunks [24]. Zhao Yinghua et al. collected samples of roadside tree trunks and applied transfer learning to train a YOLOv3 model, achieving a recognition accuracy of 93.7%, a recall of 92.4%, and a per-image recognition time of 0.014 s [25]. Tolga Özer collected a cherry tree dataset and trained YOLOv5m, YOLOv5s, and YOLOv5x models on it, obtaining F1 scores of 94.20%, 98.0%, and 95.9%, respectively [26]. Youliang Chen improved the YOLOv4 model and trained it on UAV images of bayberry trees, achieving a detection accuracy of 97.78% and a recall of 98.16% [27]. Han Sun proposed a method that combines binocular vision with semantic segmentation to detect and measure pear orchard trunks, building a deep learning model based on PSPNet [28]. Jianjun Zhou collected and processed orchard images to train a YOLOv3 model, achieving a final accuracy of 90.00% for extracting the orchard center line [29].
While existing studies have made progress, they are rarely optimized specifically for robustness under low contrast and complex lighting variations, which is crucial for all-weather operation in orchards [30,31]. This paper proposes an improved YOLOv8 solution based on a real orchard environment: a pyramid enhancement network (PENet) is introduced to improve feature extraction in low-contrast environments, the context transformer module (CoT-Net) is adopted to enhance global perception, and Coord-SE (a module combining Coord-Att and SENetV2) is used to optimize target localization accuracy.
The main contributions of this paper are as follows:
  • A thorough and realistic orchard dataset was developed. To guarantee the generality of the experimental model, the dataset covers different seasons (winter and spring), diverse lighting conditions, and fruit trees of varying sizes (large, medium, and small), ensuring its comprehensiveness and broad applicability.
  • A multi-module fusion strategy was presented to attain multi-dimensional performance improvements. The YOLOv8 model was optimized to mitigate poor contrast and intricate lighting variations: the PENet and CoT-Net modules were introduced, and the Coord-SE module was proposed to improve model robustness while preserving a lightweight architecture.
  • Traditional methods employing LiDAR for fruit tree identification are expensive. This study presents an economical, lightweight, and highly robust method for fruit tree detection by orchard robots that is resistant to occlusion and to variations in lighting conditions.

2. Materials and Methods

2.1. Experimental Data

The image dataset used in this study was collected at the Baima Experimental Base of Nanjing Agricultural University in Nanjing, Jiangsu Province. To make the data collection process more closely resemble the real conditions under which an orchard management robot operates between rows of trees, a dedicated data collection platform was constructed. This platform includes a visual data acquisition vehicle chassis and a camera module, which were used to collect images of chestnut tree trunks.
The data collection platform was placed between tree rows and remotely controlled to move forward at a speed of 1 km/h. The image acquisition setup is shown in Figure 1. The dimensions of the vehicle chassis are 500 mm (length) × 350 mm (width) × 200 mm (height), and the camera is mounted at a height of 300 mm. The camera module has a resolution of 12 megapixels and a frame rate of 60 frames per second. A total of ten video segments were recorded. Sample images from the videos are shown in Figure 2.
The orchard environment is relatively complex, and orchard management robots are usually required to work continuously throughout the day. To ensure the richness and accuracy of the data, collection was carried out under different lighting conditions and during different seasons. Additionally, data was collected for tree trunks appearing in large, medium, and small scales within the field of view, as shown in Figure 3.
In this study, the annotation tool LabelImg was used to label the tree trunks in the Pascal VOC format, generating .xml annotation files, with the single category labeled as “tree.” A total of 1000 original images of tree trunks were collected, covering various working conditions, lighting scenarios, and weather conditions. Since the row spacing in this study is relatively wide, most trees appear as medium or small targets in the field of view of the orchard management robot. To enrich the dataset, improve its generalization, and prevent overfitting, mosaic data augmentation was applied: for each augmentation, four images were read, subjected to transformations such as flipping, scaling, and color variation, and stacked together to form a new image, as shown in Figure 4. After augmentation, the dataset expanded to 5000 images, divided into 3500 for training, 1000 for validation, and 500 for testing, following a roughly 7:2:1 ratio.
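To illustrate the augmentation step described above, the following is a minimal sketch of a mosaic-style augmentation that stacks four randomly transformed images into one, assuming images are loaded with OpenCV; the flip probability, scaling range, brightness-jitter range, 640 × 640 output size, and the function name mosaic_augment are illustrative assumptions, not the exact settings used in this study. Remapping of bounding-box labels to the new tile positions, which the real pipeline would also need, is omitted.

```python
import random

import cv2
import numpy as np


def mosaic_augment(paths, out_size=640):
    """Stack four randomly transformed images into one mosaic image.

    `paths` is a list of four image file paths; the transformation ranges
    below are illustrative placeholders, not the paper's exact settings.
    """
    half = out_size // 2
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    corners = [(0, 0), (0, half), (half, 0), (half, half)]  # (row, col) of each tile

    for (r, c), path in zip(corners, paths):
        img = cv2.imread(path)
        if random.random() < 0.5:                      # random horizontal flip
            img = cv2.flip(img, 1)
        scale = random.uniform(0.8, 1.2)               # random scaling
        img = cv2.resize(img, None, fx=scale, fy=scale)
        img = cv2.resize(img, (half, half))            # fit the tile size
        hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.float32)
        hsv[..., 2] *= random.uniform(0.7, 1.3)        # simple brightness jitter
        img = cv2.cvtColor(np.clip(hsv, 0, 255).astype(np.uint8), cv2.COLOR_HSV2BGR)
        canvas[r:r + half, c:c + half] = img           # place the tile on the canvas

    return canvas
```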

2.2. Improved YOLOv8 Algorithm

2.2.1. Model Selection

This experiment uses the YOLOv8 model as the baseline. YOLO has been widely applied in agricultural production, and in many of the complex environments found in orchards, its lightweight design and stability enable it to outperform other models. The YOLOv8 structure is shown in Figure 5.
The YOLOv8 network architecture is composed of four main parts: the input, the backbone, the neck (feature fusion layer), and the head (detection layer). Its backbone network adopts an improved CSPDarknet53 structure and introduces the C2f module (cross-stage partial connections with 2-fold feature extraction) [32]. This module uses a multi-branch gradient flow design to combine shallow and deep features, retaining rich feature information while reducing computational cost. C2f replaces the E-ELAN module used in YOLOv7 and removes the focus module, instead using standard convolutional layers to simplify the architecture and improve inference speed.
YOLOv8 continues to use PAN-FPN for feature fusion but improves the way features are concatenated by eliminating the redundant 1 × 1 convolution layers from YOLOv7. It directly merges multi-scale features and uses a dual-path (top-down and bottom-up) structure for better small-object detection. The detection head in YOLOv8 adopts a decoupled head design, performing anchor-free prediction. It separates classification and regression tasks: classification uses VFL (varifocal loss), and regression uses DFL + CIoU loss for higher accuracy. In contrast, YOLOv7 uses a coupled head with anchor-based prediction. In terms of training strategies, YOLOv8 uses dynamic mosaic augmentation, which is disabled in the later stages of training to reduce the risk of overfitting. It also introduces self-adversarial training (SAT) to enhance model robustness.
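For reference, the CIoU regression loss mentioned above is commonly defined in the object detection literature as follows (a standard formulation, not reproduced from this paper), where $b$ and $b^{gt}$ are the centers of the predicted and ground-truth boxes, $\rho$ is their Euclidean distance, and $c$ is the diagonal length of the smallest box enclosing both:

$$\mathcal{L}_{\mathrm{CIoU}} = 1 - \mathrm{IoU} + \frac{\rho^2\left(b, b^{gt}\right)}{c^2} + \alpha v, \quad v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2, \quad \alpha = \frac{v}{\left(1 - \mathrm{IoU}\right) + v}$$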
Overall, YOLOv8 offers a better balance between accuracy and speed than YOLOv7. It achieves a higher mAP on the COCO dataset, faster inference, and improves small-object detection AP by over 20%, making it well suited for orchard environments where small targets are common.

2.2.2. Model Construction

This model uses YOLOv8 as the baseline. At the very front of the model, a pyramid enhancement network (PENet) is added to reduce the complexity of the input images, thereby improving the model’s ability to detect targets. In the backbone, the Darknet bottleneck modules of the last two C2f units are replaced with a variant transformer structure called CoT-Net (contextual transformer network). This modification enhances the model’s capacity to extract boundary features during training, improves detection speed, and makes the model lighter. Prior to the detection head, a coordinate attention mechanism (Coord-Att) is introduced and combined with a SENetV2 module through a two-layer convolution. This fusion leverages the coordinate attention’s ability to highlight spatial details and the channel attention’s focus on surface-level image features, together enhancing the model’s accuracy. The architecture diagram of the improved YOLOv8 is shown in Figure 6.

2.2.3. Element Replacement

In YOLOv8, the C2f module introduces residual connections, allowing gradients to bypass some convolutional layers and flow directly, which eases the problem of vanishing gradients in deep networks. The structure includes two 3 × 3 convolution layers with a residual skip connection between them, similar to the bottleneck design in ResNet [32]. However, it does not use a 1 × 1 convolution to adjust channel numbers; everything is handled within the 3 × 3 convolutions. This limits the model’s ability to extract higher-level features and weakens its global perception, which impacts detection accuracy in complex scenes such as orchards. Within the backbone, the last two C2f modules output feature maps at smaller scales (40 × 40 and 20 × 20), where the bottleneck’s feature extraction ability is even weaker and its sensitivity to changes in neighboring-region features is low. To improve this, the bottleneck in the last two C2f units is replaced with a variant transformer structure called CoT-Net (contextual transformer network) to enhance global perception [33].
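A minimal PyTorch sketch of the bottleneck described above, i.e., two 3 × 3 convolutions with a residual skip connection; the BatchNorm/SiLU pairing and the class names (ConvBnAct, Bottleneck) follow common YOLO-style implementations and are assumptions, not an excerpt of the YOLOv8 source.

```python
import torch
import torch.nn as nn


class ConvBnAct(nn.Module):
    """3x3 convolution followed by BatchNorm and SiLU (YOLO-style block)."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))


class Bottleneck(nn.Module):
    """Two 3x3 convolutions with a residual connection when enabled."""
    def __init__(self, channels, shortcut=True):
        super().__init__()
        self.cv1 = ConvBnAct(channels, channels)
        self.cv2 = ConvBnAct(channels, channels)
        self.add = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y
```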
The self-attention mechanism of traditional transformer structures relies on isolated query–key interactions to create attention matrices, where each query is compared to individual keys independently. This ignores contextual relationships among neighboring keys and leads to a lack of explicit local-context modeling, e.g., missing details such as object edges or local structures in visual tasks. CoT-Net introduces context-enhanced self-attention by applying a 3 × 3 convolution to encode local context in the input keys, capturing spatial relationships between neighboring keys and creating a static context representation. This simulates local information extraction, similar to traditional convolutions, but focuses on contextualizing the keys to provide prior information for the subsequent attention. Next, the static context keys are concatenated with the queries and passed through two consecutive 1 × 1 convolutions to generate a dynamic attention matrix that models global relationships. Finally, the static and dynamic contexts are fused, preserving local structure information while capturing long-range dependencies and resulting in a more comprehensive feature representation. Both structures are visualized in Figure 7.
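A simplified PyTorch sketch of the static/dynamic-context flow described above. The exact channel reductions, grouped convolutions, and local-matrix attention of the published CoT block are omitted, and the sigmoid gating used here is a simplification, so this illustrates the idea rather than reproducing the authors’ implementation.

```python
import torch
import torch.nn as nn


class CoTBlock(nn.Module):
    """Simplified contextual transformer block: a static 3x3 context on the
    keys, a dynamic attention map from [static context, query], and fusion."""
    def __init__(self, channels):
        super().__init__()
        # static context: 3x3 convolution over the keys (the input itself)
        self.key_embed = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # value embedding
        self.value_embed = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        # two 1x1 convolutions turn [static keys, query] into a dynamic attention map
        self.attention = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1),
        )

    def forward(self, x):
        k_static = self.key_embed(x)                       # local (static) context
        v = self.value_embed(x)
        attn = self.attention(torch.cat([k_static, x], dim=1))
        k_dynamic = torch.sigmoid(attn) * v                # dynamic, context-aware aggregation
        return k_static + k_dynamic                        # fuse static and dynamic contexts
```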

2.2.4. Unit Optimization

Because PCC-YOLO needs to be deployed on orchard management robots with limited local computing power, it must balance a lightweight design with high accuracy. The orchard environment is complex, tree growth conditions vary, and the field of view mainly contains small and medium-sized targets. To address these challenges, we propose a module, Coord-SE, that combines coordinate attention (Coord-Att) with channel attention (SENetV2). This module is placed prior to the detection head: after features are extracted from the backbone, they first pass through this module and then enter the detection head. Coordinate attention retains spatial coordinate encoding, capturing both channel and positional information. Its core function is to decompose channel attention into spatial encodings along the horizontal and vertical directions to more efficiently capture long-range dependencies [34].
First, the feature map extracted by the backbone is globally pooled along the horizontal (X-axis) and vertical (Y-axis) directions, producing two independent 1D feature vectors; for a feature map of height H and width W, horizontal pooling generates a C × H × 1 vector and vertical pooling generates a C × 1 × W vector. These two vectors are concatenated and passed through a shared 1 × 1 convolution and a nonlinear activation function (such as sigmoid) to generate an intermediate feature representing coordinate attention. This intermediate feature is split into horizontal and vertical attention weight matrices, and each attention matrix is applied to the original feature map by weighting, enhancing spatially sensitive feature responses. Finally, the weighted horizontal and vertical feature maps are combined to form the final attention-enhanced features, preserving precise positional information. The coordinate attention structure is shown in Figure 8.
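A PyTorch sketch of the coordinate attention computation described above, following the standard formulation from the coordinate attention paper (directional pooling, a shared 1 × 1 bottleneck, and per-direction sigmoid weights applied multiplicatively); the reduction ratio and the class name CoordAtt are assumptions, and this is not the authors’ exact module.

```python
import torch
import torch.nn as nn


class CoordAtt(nn.Module):
    """Coordinate attention: directional pooling along H and W, a shared 1x1
    bottleneck, and per-direction sigmoid weights applied to the input."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # (B, C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        x_h = self.pool_h(x)                             # (B, C, H, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)         # (B, C, W, 1)
        y = self.act(self.bn1(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)         # split back into H and W parts
        a_h = torch.sigmoid(self.conv_h(y_h))            # (B, C, H, 1) weights
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (B, C, 1, W) weights
        return x * a_h * a_w                             # directionally re-weighted features
```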
SENetV2 is an improved version of the classic squeeze-and-excitation network (SENet). Its main goal is to enhance the joint learning ability of channel features and global representations through a multi-branch dense layer design [35]. The core idea of SENet is to strengthen the modeling of dependencies between channels using squeeze-and-excitation (SE) blocks. The key innovation is the adaptive recalibration of each channel’s feature response, which significantly improves the network’s ability to represent features. Stacking SE blocks forms the SENet architecture, which effectively improves the model’s generalization across different datasets. The structures of SENet and SENetV2 are shown in Figure 9.
SENetV2 retains the residual modules from the original SENet but integrates a multi-branch structure inspired by ResNeXt. It builds branch structures through multiple parallel fully connected layers (dense layers), with each branch learning different global feature representations. It also adds a feature fusion mechanism that weights or concatenates the outputs of these branches to form richer global representations. These are then combined with the original SE module’s channel attention features, enhancing the model’s ability to capture complex patterns. This provides SENetV2 stronger global representation capability, which helps better identify fruit trees in the complex environment of orchards.
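A hedged PyTorch sketch of the multi-branch squeeze-and-excitation idea described above; the branch count, reduction ratio, concatenation-based fusion, and the class name MultiBranchSE are assumptions approximating SENetV2’s aggregated dense layers, not the reference implementation.

```python
import torch
import torch.nn as nn


class MultiBranchSE(nn.Module):
    """SENetV2-style squeeze-and-excitation with several parallel FC branches
    whose outputs are fused before channel re-weighting."""
    def __init__(self, channels, reduction=16, branches=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # squeeze: global average pooling
        mid = max(4, channels // reduction)
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Linear(channels, mid), nn.ReLU(inplace=True))
            for _ in range(branches)
        ])
        self.fuse = nn.Sequential(                       # excitation from fused branches
            nn.Linear(mid * branches, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        s = self.pool(x).view(b, c)                      # (B, C) channel descriptor
        z = torch.cat([branch(s) for branch in self.branches], dim=1)
        w = self.fuse(z).view(b, c, 1, 1)                # per-channel weights
        return x * w
```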
By decomposing spatial dimensions into independent horizontal and vertical encodings, Coord-Att explicitly models positional information, solving the problem of traditional channel attention ignoring spatial relationships and allowing more accurate localization of object edges in detection tasks. In our dataset, tree trunk colors are similar to those of the ground, making trunk edges difficult to detect and lowering recognition accuracy; coordinate attention improves the precision of trunk edge localization and thus enhances detection accuracy. The introduction of SENetV2 dense layers allows the model not only to focus on channel importance (as in SENet) but also to learn semantic relationships across regions (such as contextual relationships between parts of objects). At the same time, the multi-branch design uses parameter sharing (e.g., sharing weights in some layers) or low-dimensional projection, adding very few extra parameters compared to traditional multilayer perceptrons (MLPs). This makes SENetV2 suitable for lightweight deployment and reduces the computing power requirements of orchard management robots.
Finally, a two-layer convolution is used to connect the two modules—coordinate attention and SENetV2—into a single structure. These mechanisms complement each other: coordinate attention focuses on spatial position modeling, while SENetV2 emphasizes global semantic aggregation and is more attentive to shallow image features. Combining their strengths results in higher recognition accuracy.

2.2.5. PENet Preprocessing

PENet is a pyramid enhancement network designed for object detection in low-light conditions. Its main idea is to improve image recognizability through multi-scale decomposition and detail enhancement, thereby boosting the performance of detection models [36]. PENet first enhances the input image to produce high-quality feature maps. These enhanced features are then fed into the backbone of YOLOv8, which uses its multi-scale prediction heads to output detection results. This design enables the detector to directly benefit from the enhanced details, showing significant improvement in scenarios with occlusion or low contrast—such as in this experiment, where tree trunks and the ground display very similar color tones. The structural diagram of PENet is shown in Figure 10.
PENet first uses a Laplacian pyramid to decompose the input image into four components of different resolutions (e.g., high-frequency details, low-frequency background). High-frequency components capture edges, textures, and other detail features (like object outlines), which are often lost under low-light or low-contrast conditions, making detection difficult. Low-frequency components preserve overall lighting and color distribution, helping restore the image’s basic brightness and contrast.
With this layered approach, PENet can individually enhance features at different frequencies, avoiding the noise introduced by global enhancement. For the decomposed multi-scale components, PENet includes specialized enhancement modules. High-frequency enhancement uses convolutional layers or attention mechanisms (e.g., coordinate attention) to strengthen edge and texture details. Low-frequency correction adjusts the lighting distribution in low-frequency components using methods like adaptive histogram equalization or brightness mapping to improve overall visibility. Cross-scale feature fusion then recombines the enhanced components from different resolutions into a final image that is both detail-rich and well-lit.
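PENet itself is a learned enhancement network; the following is only a classical, non-learned sketch of the decompose, enhance, and rebuild idea described above, using an OpenCV Laplacian pyramid with CLAHE on the low-frequency residual and a fixed detail gain as illustrative stand-ins for PENet’s learned enhancement modules. The level count, gain value, and function names are assumptions.

```python
import cv2
import numpy as np


def laplacian_pyramid(img, levels=4):
    """Decompose an image into `levels` high-frequency layers plus a
    low-frequency residual using a Laplacian pyramid."""
    gaussian = [img.astype(np.float32)]
    for _ in range(levels):
        gaussian.append(cv2.pyrDown(gaussian[-1]))
    laplacian = []
    for i in range(levels):
        up = cv2.pyrUp(gaussian[i + 1],
                       dstsize=(gaussian[i].shape[1], gaussian[i].shape[0]))
        laplacian.append(gaussian[i] - up)               # high-frequency detail at level i
    return laplacian, gaussian[-1]                       # detail layers + low-frequency residual


def enhance_and_rebuild(img, levels=4, detail_gain=1.5):
    """Illustrative enhancement: boost high-frequency detail layers and
    equalize the low-frequency residual, then reconstruct the image."""
    laplacian, low = laplacian_pyramid(img, levels)
    # low-frequency correction: CLAHE on the luminance of the residual
    low_u8 = np.clip(low, 0, 255).astype(np.uint8)
    lab = cv2.cvtColor(low_u8, cv2.COLOR_BGR2LAB)
    lab[..., 0] = cv2.createCLAHE(clipLimit=2.0).apply(lab[..., 0])
    low = cv2.cvtColor(lab, cv2.COLOR_LAB2BGR).astype(np.float32)
    # reconstruct from the coarsest level upward, amplifying the detail layers
    out = low
    for lap in reversed(laplacian):
        out = cv2.pyrUp(out, dstsize=(lap.shape[1], lap.shape[0]))
        out = out + detail_gain * lap
    return np.clip(out, 0, 255).astype(np.uint8)
```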
PENet uses pyramid decomposition to process high-frequency details (such as object edges) and low-frequency lighting (such as background brightness) simultaneously, avoiding the limitations of traditional single-method enhancement techniques. Compared to end-to-end joint training approaches (like directly modifying the detection network), PENet functions as an independent preprocessing module, making it more flexible and more easily adaptable to different detectors, while also improving inference speed.

2.3. Experimental Platform and Parameter Settings

2.3.1. Experimental Platform

The experiments were conducted on a desktop computer with the following configuration: an AMD Ryzen 5 5600 six-core CPU (AMD Corporation, Santa Clara, CA, USA), 16 GB of RAM, an NVIDIA GeForce RTX 4060 Ti GPU, and the Windows 11 (64-bit) operating system. The deep-learning environment included Python 3.10.16, PyTorch 2.3.1, Torchvision 0.18.1, and CUDA 12.6. During training, stochastic gradient descent (SGD) was used to update the network parameters.

2.3.2. Parameter Settings

The training parameters for the PCC-YOLO network were as follows (a training-call sketch is given after the list):
  • Batch size: 16.
  • Number of training epochs: 300.
  • Initial learning rate: 0.001.
  • Momentum: 0.95.
  • Weight decay: 0.0004.
  • Optimizer: SGD.
  • Image size: 640 × 640.
  • Dataset caching: enabled (true).
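The listed hyperparameters map onto a standard training call as in the sketch below, which assumes the Ultralytics YOLO Python interface; the baseline weights file yolov8n.pt and the dataset description file trunk.yaml are placeholders, and training the modified PCC-YOLO architecture would additionally require the custom model definition, which is not shown here.

```python
from ultralytics import YOLO

# Baseline YOLOv8 weights as a starting point (file name is illustrative).
model = YOLO("yolov8n.pt")

# Training call mirroring the parameters listed above; "trunk.yaml" is a
# placeholder dataset configuration file, not a file named in the paper.
model.train(
    data="trunk.yaml",
    epochs=300,
    batch=16,
    imgsz=640,
    optimizer="SGD",
    lr0=0.001,
    momentum=0.95,
    weight_decay=0.0004,
    cache=True,
)
```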

2.3.3. Evaluation Criteria for Tree Trunk Recognition

The model was evaluated using Precision (P), Frames Per Second (FPS), Mean Average Precision (mAP), Giga Floating-Point Operations (GFLOPs), Mean Squared Error (MSE), and Root Mean Square Error (RMSE). The formulas for Precision, FPS, mAP, MSE, and RMSE are given in Equations (1) through (5). MSE and RMSE are computed with IoU (Intersection over Union)-based matching, assessing only pairs of predicted and ground-truth boxes whose IoU exceeds 0.5. These measures complement the conventional mAP metric by offering a direct assessment of bounding box localization precision: mAP emphasizes whether a detection occurs, while MSE and RMSE, measured in pixels, concentrate on the accuracy of localization.
$$P = \frac{TP}{TP + FP} \times 100\% \tag{1}$$
$$FPS = \frac{N}{\text{Total time}} \tag{2}$$
$$mAP = \frac{\int_{0}^{1} P(R)\,dR}{N} \tag{3}$$
$$MSE = \frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \tag{4}$$
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2} \tag{5}$$
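A sketch of the IoU-matched localization error computation described above; the greedy one-to-one matching strategy and the use of corner coordinates as the error terms are assumptions, since the text states only that prediction and ground-truth pairs with IoU above 0.5 are evaluated.

```python
import numpy as np


def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as [x1, y1, x2, y2] in pixels."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def localization_mse_rmse(pred_boxes, gt_boxes, iou_thresh=0.5):
    """Match each prediction to an unused ground-truth box with IoU above
    iou_thresh (greedy, best IoU per prediction), then compute MSE/RMSE over
    the corner-coordinate errors of the matched pairs, in pixels."""
    errors, used = [], set()
    for p in pred_boxes:
        best_j, best_iou = -1, iou_thresh
        for j, g in enumerate(gt_boxes):
            if j in used:
                continue
            v = iou(p, g)
            if v > best_iou:
                best_j, best_iou = j, v
        if best_j >= 0:
            used.add(best_j)
            errors.extend((np.asarray(p, float) - np.asarray(gt_boxes[best_j], float)).tolist())
    if not errors:
        return float("nan"), float("nan")
    mse = float(np.mean(np.square(errors)))
    return mse, float(np.sqrt(mse))


# Example with hypothetical boxes:
# mse, rmse = localization_mse_rmse([[10, 10, 50, 90]], [[12, 11, 48, 92]])
```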

3. Results

3.1. Comparison of Different Network Models

To evaluate the advantages of PCC-YOLO over baseline models, comparative experiments were conducted under the same hardware conditions using YOLOv8n, YOLOv6, and YOLOv5s. The results are shown in Table 1. Among all models, PCC-YOLO performed the best, achieving the highest detection speed at 143.36 frames per second. While it slightly increased the number of floating-point operations compared to the original YOLOv8, it showed significant improvements in average precision, frame rate, and recall. Although YOLOv5s achieved a slightly higher precision of 0.82 compared to PCC-YOLO’s 0.81, PCC-YOLO maintained 98% of YOLOv5s’s precision with only 41% of its computational load, making it more suitable for deployment in orchard management robots, where hardware resources are limited. PCC-YOLO also exhibited the lowest MSE and RMSE values among the compared models, demonstrating superior bounding box localization accuracy. Overall, PCC-YOLO stands out as the most efficient and practical model.
To more clearly demonstrate PCC-YOLO’s performance, a set of test images from the dataset was used for comparison, focusing on tree recognition accuracy. Four images were selected under varying conditions: sunny winter with angled sunlight, sunny winter without angled sunlight, overcast winter, and sunny spring. These images also included targets of small, medium, and large sizes to showcase the model’s robustness. Table 2 illustrates that YOLOv8n, YOLOv6, and YOLOv5s encountered differing levels of missed and false detections. The statistics reveal that the false positive rate of PCC-YOLO is 2.6%, whereas YOLOv8n, YOLOv6, and YOLOv5s demonstrate rates of 23.6%, 18.4%, and 21%, respectively, particularly in low-light conditions or when a greater number of trees were present. This highlights their limitations in extracting tree trunk features and detecting boundaries between the trunks and the ground. Even under optimal illumination, where fruit trees are sparse and more discernible, YOLOv5s still fails to detect certain targets, demonstrating its inadequate detection capability for small objects.
In contrast, PCC-YOLO consistently and accurately detected fruit trees even under low-contrast and poor lighting conditions. It also performed well across different seasons, proving its versatility and accuracy.

3.2. Comparison of Different Attention Mechanisms

Because the contrast between the tree trunks and the ground in this dataset is low, commonly used attention mechanisms that generally work well in other scenarios actually perform worse than the original YOLOv8 in this orchard environment, as shown in Table 3. Additionally, adding attention mechanisms increases the model’s complexity, which raises the requirements for local deployment devices.
From Table 3, it is clear that, apart from YOLOv8-Coord-SE, the models with added attention mechanisms generally saw a drop in precision, with only YOLOv8-SEAM showing a slight improvement. Moreover, their MSE showed no improvement and, in certain instances, worsened. This suggests that most attention mechanisms are not suitable for this scenario. When comparing YOLOv8-SEAM and YOLOv8-Coord-SE, YOLOv8-Coord-SE not only achieved greater improvements in precision and recall but also maintained a better detection speed, and it demonstrates a clear advantage in bounding box localization accuracy.
Compared to the original YOLOv8, the YOLOv8-Coord-SE model experienced a small drop in speed due to the added attention mechanism, but its speed remained close to the original. Meanwhile, precision and mAP increased by 1.7 and 3.1 percentage points, respectively, and MSE dropped from 11.2 to 6.2 px. In YOLOv8-Coord-SE, the combination of coordinate attention and channel attention effectively merges their strengths, considering both spatial information and surface features of the image. This makes it superior in identifying the edges of tree trunks in this orchard environment, thereby improving both detection accuracy and recall. As a result, YOLOv8-Coord-SE was ultimately chosen for the tree trunk recognition task due to its superior performance.

3.3. CoT-Net Effectiveness Validation

CoT-Net integrates local and contextual information to improve modeling capability. To assess the actual performance gain of CoT-Net over the conventional transformer self-attention structure for visual feature extraction and object recognition, we performed a comparative analysis of YOLOv8-cot, YOLOv8, and YOLOv8-tr. The conventional transformer self-attention structure and CoT-Net were inserted at identical network locations, with the results presented in Table 4.
The incorporation of the conventional transformer self-attention structure yielded only a marginal enhancement over the original YOLOv8 model. Because this structure neglects the contextual interactions of adjacent keys, it failed to sufficiently capture the boundaries between tree trunks and the ground in this study; moreover, it increased the model’s computational load, a negative effect. CoT-Net improves on conventional transformer self-attention by emphasizing variations in features within neighboring regions, thereby mitigating computational redundancy. This not only improves the model’s accuracy but also facilitates model lightweighting.

3.4. Ablation Experiment

PCC-YOLO introduces several improvements compared with the original YOLOv8 model. Firstly, the model adopts YOLOv8 as the baseline and incorporates the pyramid enhancement network (PENet) at the front end. Secondly, a variant of the transformer architecture, the contextual transformer network (CoT-Net), replaces the Darknet bottleneck in the last two C2f blocks of the backbone. Finally, the coordinate attention mechanism (Coord-Att) is introduced prior to the head and connected to a SENetV2 module through a two-layer convolution to enhance the model’s accuracy. To more clearly analyze the performance improvements of PCC-YOLO over the original YOLOv8, we conducted ablation experiments with five experimental groups, with results shown in Table 5. These outcomes demonstrate the impact of the individual improvements and the overall performance enhancement of the PCC-YOLO algorithm.
According to Table 5, PCC-YOLO achieved a precision of 81%, which is 5 percentage points higher than YOLOv8, 3.3 points higher than YOLOv8-Coord-SE, 2 points higher than YOLOv8-cot, and 3.8 points higher than YOLOv8-lap. In terms of mAP, it is higher by 4.8, 1.7, 3, and 2 percentage points, respectively. The recognition performance is thus significantly improved over the traditional YOLOv8. Meanwhile, its floating-point operations are slightly higher than those of the original YOLOv8 but, considering the improvement in precision and mAP, remain at a relatively low level, 0.6 GFLOPs lower than YOLOv8-Coord-SE. PCC-YOLO’s detection speed reaches 143.36 f/s; although this is slower than YOLOv8-lap’s 192 f/s, it offers a clear advantage in average precision. Regarding MSE and RMSE, each enhanced model improved on the original model, but PCC-YOLO exhibited the largest improvement, achieving the highest bounding box localization accuracy. The visualization of the ablation experiment data is shown in Figure 11 and Figure 12. Taking all factors into account, PCC-YOLO was chosen as the final improved model.

4. Discussion

This study introduced three improved models: the pyramid enhancement network (PENet), the context transformer module (CoT-Net), and the optimization strategy of combining a coordinate attention mechanism (Coord-Att) with a channel attention mechanism (SENetV2), all of which achieved satisfactory results.
In previous studies, Qianren Guo et al. introduced a new adaptive weighted feature pyramid network [37]. The incorporation of the pyramid network improved underwater target detection, where illumination is minimal and the delineation between identified objects and their surroundings is ambiguous. These conditions resemble those in this study, demonstrating that introducing a pyramid network can augment feature extraction when the target exhibits low contrast with the environment. Adding attention mechanisms to the head is presently a prevalent approach for enhancing baseline models, allowing the model to concentrate more effectively on target features; in intricate agricultural settings, this is advantageous for recognizing crops, pests, weeds, and other entities [38,39,40]. Liu et al. proposed a camellia fruit-picking robot trunk detection method based on improved YOLOv7 [41], adding attention mechanisms to the head to increase trunk feature extraction ability. Recently, to address the constraints of YOLO in managing occlusions, irregular structures, and background noise, transformers and their variants have been extensively combined with YOLO for agricultural image processing tasks [42,43,44]. Liu et al. employed the Swin-B Transformer to augment the precision of machine-based strawberry ripeness detection in intricate environments [45]; Zhang et al. refined the YOLOv5 algorithm to facilitate the identification and localization of target objects, including tree trunks, pedestrians, and supports in orchards, substituting the bottleneck network in the C3 module of the original YOLOv5 model with the lightweight GhostNet V2 to diminish network parameters and model size [46]. Our study replaced the bottleneck network in the final two C2f modules of the backbone with CoT-Net, a variant transformer architecture. This reduced the number of network parameters and the model size while enhancing recognition performance, illustrating the efficacy of this enhancement.
Ling et al. proposed a jujube tree recognition method based on YOLOv8 with an attention mechanism [47], improving the recognition rate while keeping the model lightweight. However, their method performs poorly with multi-scale inputs, has limited feature fusion capability, and struggles with recognition under interference from branches. Our study employs the CoT-Net variant transformer architecture together with PENet, which provide superior performance in this context. Fei Su et al. proposed a method based on improved YOLOv5s for trunk and obstacle detection in semi-structured apple orchards [48]. Their study also considered changes in recognition rate caused by contrast differences between fruit trees and the ground in different seasons, but only the first two rows of trees were annotated, and recognition of the back rows was poor. Our study achieves better recognition of obscured trees in the back rows, with an overall recognition rate of 81%, and exhibits higher precision for orchard autonomous navigation, demonstrating the research value of this work.
Although the model in this paper improved the accuracy and speed of fruit tree recognition, it has some limitations: (1) The introduced Coord-SE module (coordinate attention mechanism and channel attention mechanism) improved the recognition rate but increased floating-point operations and inference time; (2) the dataset was collected from chestnut trees in the orchard at Nanjing Agricultural University’s Baima base and lacks verification for other tree species; (3) the model’s recognition rate for distant, heavily obscured trees requires improvement. Under special lighting, distant trees blocked by branches can still cause some missed and false detections. Future research will add datasets of other tree species to enrich the dataset, use better-performing modules in the model to improve fruit tree recognition accuracy, and replace complex parts of the model with lightweight structures to balance speed and recognition.

5. Conclusions

This study addresses the challenges of tree trunk recognition in complex orchard environments by proposing an accurate fruit tree trunk detection algorithm based on an improved YOLOv8 framework, termed PCC-YOLO. The algorithm incorporates the pyramid enhancement network (PENet), the context transformer module (CoT-Net), and an optimization strategy that integrates the coordinate attention mechanism (Coord-Att) with the channel attention mechanism (SENetV2). These enhancements significantly improve the model’s detection accuracy and real-time performance in complex orchard environments.
The module optimization has yielded notable improvements. The PCC-YOLO model demonstrates superior performance across key evaluation metrics, including mean average precision (mAP), precision (P), and detection speed (FPS). Compared with traditional object detection models such as YOLOv5s, the improved model maintains or enhances detection accuracy while substantially reducing computational complexity—for example, its floating-point operations (FLOPs) amount to only 41% of those required by YOLOv5s—making it more suitable for deployment on mobile devices with limited computing resources.
This study yields significant practical implications. The research team constructed a real-world orchard trunk dataset comprising 5000 labeled images, which provides a valuable data foundation for future studies. The improved model effectively mitigates the performance degradation of traditional visual algorithms under challenging conditions such as occlusion and varying illumination, without relying on costly sensors. This advancement offers reliable technical support for the autonomous navigation of orchard management robots. Furthermore, the model’s lightweight design (7.8 GFLOPs) enables efficient execution on embedded devices, satisfying the low-power and high real-time requirements typical of agricultural applications.
In conclusion, PCC-YOLO provides an efficient, accurate, and low-cost vision-based solution for autonomous navigation in orchards. By obviating the need for expensive sensors like LiDAR, our approach significantly lowers the barrier for the practical deployment of agricultural robots, thereby contributing to the advancement of smart farming.

Author Contributions

Conceptualization, Y.Z. and B.G.; methodology, Y.Z. and B.G.; data curation, W.J. and Q.L.; validation, Y.Z., G.T. and W.J.; writing—original draft preparation, Y.Z. and B.G.; writing—review and editing, Y.Z., B.Z., G.J. and W.J.; funding acquisition, Y.Z.; visualization, Y.Z. and B.G.; supervision, Y.Z. and B.G.; project administration, G.T. and B.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Nanjing Agricultural University’s 2024 “National Student Innovation Research and Entrepreneurship Training” (202410307044Z).

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to the fact that these data are also used for other research in the laboratory.

Acknowledgments

The authors would like to thank all contributors to this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, Z.; Granland, K.; Tang, Y.; Chen, C. HOB-CNNv2: Deep learning based detection of extremely occluded tree branches and reference to the dominant tree image. Comput. Electron. Agric. 2024, 218, 108727. [Google Scholar] [CrossRef]
  2. Mandal, S.; Yadav, A.; Panme, F.A.; Devi, K.M.; S.M., S.K. Adaption of smart applications in agriculture to enhance production. Smart Agric. Technol. 2024, 7, 100431. [Google Scholar] [CrossRef]
  3. Li, D.; Nanseki, T.; Chomei, Y.; Kuang, J. A review of smart agriculture and production practices in Japanese large-scale rice farming. J. Sci. Food Agric. 2022, 103, 1609–1620. [Google Scholar] [CrossRef]
  4. Abbasi, R.; Martinez, P.; Ahmad, R. The digitization of agricultural industry—A systematic literature review on agriculture 4.0. Smart Agric. Technol. 2022, 2, 100042. [Google Scholar] [CrossRef]
  5. Javaid, M.; Haleem, A.; Singh, R.P.; Suman, R. Enhancing smart farming through the applications of Agriculture 4.0 technologies. Int. J. Intell. Netw. 2022, 3, 150–164. [Google Scholar] [CrossRef]
  6. Lv, J.; Zhao, D.-A.; Ji, W.; Chen, Y.; Zhang, Y. Research on trunk and branch recognition method of apple harvesting robot. In Proceedings of the 2012 International Conference on Measurement, Information and Control, Harbin, China, 18–20 May 2012; pp. 474–478. [Google Scholar] [CrossRef]
  7. Chen, X.; Wang, S.; Zhang, B.; Luo, L. Multi-feature fusion tree trunk detection and orchard mobile robot localization using camera/ultrasonic sensors. Comput. Electron. Agric. 2018, 147, 91–108. [Google Scholar] [CrossRef]
  8. Maeyama, S.; Ohya, A.; Yuta, S. Positioning by tree detection sensor and dead reckoning for outdoor navigation of a mobile robot. In Proceedings of the 1994 IEEE International Conference on MFI ‘94. Multisensor Fusion and Integration for Intelligent Systems, Las Vegas, NV, USA, 2–5 October 1994; pp. 653–660. [Google Scholar] [CrossRef]
  9. Andersen, J.C.; Ravn, O.; Andersen, N.A. Autonomous rule-based robot navigation in orchards. IFAC Proc. Vol. 2010, 43, 43–48. [Google Scholar] [CrossRef]
  10. Shalal, N.; Low, T.; McCarthy, C.; Hancock, N. Orchard mapping and mobile robot localisation using on-board camera and laser scanner data fusion—Part A: Tree detection. Comput. Electron. Agric. 2015, 119, 254–266. [Google Scholar] [CrossRef]
  11. Gimenez, J.; Sansoni, S.; Tosetti, S.; Capraro, F.; Carelli, R. Trunk detection in tree crops using RGB-D images for structure-based ICM-SLAM. Comput. Electron. Agric. 2022, 199, 107099. [Google Scholar] [CrossRef]
  12. Lamprecht, S.; Stoffels, J.; Dotzler, S.; Haß, E.; Udelhoven, T. aTrunk—An ALS-Based Trunk Detection Algorithm. Remote Sens. 2015, 7, 9975–9997. [Google Scholar] [CrossRef]
  13. Wang, J.; Chen, X.; Cao, L.; An, F.; Chen, B.; Xue, L.; Yun, T. Individual Rubber Tree Segmentation Based on Ground-Based LiDAR Data and Faster R-CNN of Deep Learning. Forests 2019, 10, 793. [Google Scholar] [CrossRef]
  14. Zhu, D.; Liu, X.; Zheng, Y.; Xu, L.; Huang, Q. Improved Tree Segmentation Algorithm Based on Backpack-LiDAR Point Cloud. Forests 2024, 15, 136. [Google Scholar] [CrossRef]
  15. Fei, Z.; Vougioukas, S. Row-sensing templates: A generic 3D sensor-based approach to robot localization with respect to orchard row centerlines. J. Field Robot. 2022, 39, 712–738. [Google Scholar] [CrossRef]
  16. Jiang, A.; Ahamed, T. Navigation of an Autonomous Spraying Robot for Orchard Operations Using LiDAR for Tree Trunk Detection. Sensors 2023, 23, 4808. [Google Scholar] [CrossRef]
  17. Zambre, Y.; Rajkitkul, E.; Mohan, A.; Peeples, J. Spatial Transformer Network YOLO Model for Agricultural Object Detection (Version 2). arXiv 2024, arXiv:2407.21652. [Google Scholar]
  18. Huang, P.; Huang, P.; Wang, Z.; Wu, X.; Liu, J.; Zhu, L. Deep-Learning-Based Trunk Perception with Depth Estimation and DWA for Robust Navigation of Robotics in Orchards. Agronomy 2023, 13, 1084. [Google Scholar] [CrossRef]
  19. Escolà, A.; Planas, S.; Rosell, J.R.; Pomar, J.; Camp, F.; Solanelles, F.; Gracia, F.; Llorens, J.; Gil, E. Performance of an Ultrasonic Ranging Sensor in Apple Tree Canopies. Sensors 2011, 11, 2459–2477. [Google Scholar] [CrossRef]
  20. Murtiyoso, A.; Cabo, C.; Singh, A.; Obaya, D.P.; Cherlet, W.; Stoddart, J.; Fol, C.R.; Schwenke, M.B.; Rehush, N.; Stereńczak, K.; et al. A Review of Software Solutions to Process Ground-based Point Clouds in Forest Applications. Curr. For. Rep. 2024, 10, 401–419. [Google Scholar] [CrossRef]
  21. Deng, S.; Xu, Q.; Yue, Y.; Jing, S.; Wang, Y. Individual tree detection and segmentation from unmanned aerial vehicle-LiDAR data based on a trunk point distribution indicator. Comput. Electron. Agric. 2024, 218, 108717. [Google Scholar] [CrossRef]
  22. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  23. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar]
  24. Cao, Z.; Gong, C.; Meng, J.; Liu, L.; Rao, Y.; Hou, W. Orchard Vision Navigation Line Extraction Based on YOLOv8-Trunk Detection. IEEE Access 2024, 12, 104126–104137. [Google Scholar] [CrossRef]
  25. Yinghua, Z.; Yu, T.; Xingxing, L. Urban Street tree Recognition Method Based on Machine Vision. In Proceedings of the 2021 6th International Conference on Intelligent Computing and Signal Processing (ICSP), Xi’an, China, 9–11 April 2021; pp. 967–971. [Google Scholar]
  26. Ozer, T.; Akdogan, C.; Cengiz, E.; Kelek, M.M.; Yildirim, K.; Oguz, Y.; Akkoc, H. Cherry Tree Detection with Deep Learning. In Proceedings of the 2022 Innovations in Intelligent Systems and Applications Conference (ASYU), Antalya, Turkey, 7–9 September 2022; pp. 1–4. [Google Scholar]
  27. Chen, Y.; Xu, H.; Zhang, X.; Gao, P.; Xu, Z.; Huang, X. An object detection method for bayberry trees based on an improved YOLO algorithm. Int. J. Digit. Earth 2023, 16, 781–805. [Google Scholar] [CrossRef]
  28. Sun, H.; Xue, J.; Zhang, Y.; Li, H.; Liu, R.; Song, Y.; Liu, S. Novel method of rapid and accurate tree trunk location in pear orchard combining stereo vision and semantic segmentation. Measurement 2025, 242, 116127. [Google Scholar] [CrossRef]
  29. Zhou, J.; Geng, S.; Qiu, Q.; Shao, Y.; Zhang, M. A Deep-Learning Extraction Method for Orchard Visual Navigation Lines. Agriculture 2022, 12, 1650. [Google Scholar] [CrossRef]
  30. Katsura, H.; Miura, J.; Hild, M.; Shirai, Y. A View-Based Outdoor Navigation Using Object Recognition Robust to Changes of Weather and Seasons. J. Robot. Soc. Jpn. 2005, 23, 75–83. [Google Scholar] [CrossRef]
  31. Brown, J.; Paudel, A.; Biehler, D.; Thompson, A.; Karkee, M.; Grimm, C.; Davidson, J.R. Tree detection and in-row localization for autonomous precision orchard management. Comput. Electron. Agric. 2024, 227, 109454. [Google Scholar] [CrossRef]
  32. Kumar, P.; Kumar, V. Exploring the Frontier of Object Detection: A Deep Dive into YOLOv8 and the COCO Dataset. In Proceedings of the 2023 IEEE International Conference on Computer Vision and Machine Intelligence (CVMI), Gwalior, India, 10–11 December 2023; pp. 1–6. [Google Scholar]
  33. Li, Y.; Yao, T.; Pan, Y.; Mei, T. Contextual Transformer Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 1489–1500. [Google Scholar] [CrossRef]
  34. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design (Version 1). arXiv 2021, arXiv:2103.02907. [Google Scholar]
  35. Narayanan, M. SENetV2: Aggregated dense layer for channelwise and global representations (Version 1). arXiv 2023, arXiv:2311.10807. [Google Scholar]
  36. Yin, X.; Yu, Z.; Fei, Z.; Lv, W.; Gao, X. PE-YOLO: Pyramid Enhancement Network for Dark Object Detection. In Proceedings of the 32nd International Conference on Artificial Neural Networks, Heraklion, Crete, Greece, 26–29 September 2023. [Google Scholar]
  37. Guo, Q.; Wang, Y.; Zhang, Y.; Qin, H.; Jiang, Y. AWF-YOLO: Enhanced underwater object detection with adaptive weighted feature pyramid network. Complex Eng. Syst. 2023, 3, 16. [Google Scholar] [CrossRef]
  38. Diao, Z.; Guo, P.; Zhang, B.; Zhang, D.; Yan, J.; He, Z.; Zhao, S.; Zhao, C.; Zhang, J. Navigation line extraction algorithm for corn spraying robot based on improved YOLOv8s network. Comput. Electron. Agric. 2023, 212, 108049. [Google Scholar] [CrossRef]
  39. Chen, J.; Wang, H.; Zhang, H.; Luo, T.; Wei, D.; Long, T.; Wang, Z. Weed detection in sesame fields using a YOLO model with an enhanced attention mechanism and feature fusion. Comput. Electron. Agric. 2022, 202, 107412. [Google Scholar] [CrossRef]
  40. Praveen, S.; Jung, Y. CBAM-STN-TPS-YOLO: Enhancing Agricultural Object Detection through Spatially Adaptive Attention Mechanisms (Version 1). arXiv 2025, arXiv:2506.07357. [Google Scholar]
  41. Liu, Y.; Wang, H.; Liu, Y.; Luo, Y.; Li, H.; Chen, H.; Liao, K.; Li, L. A Trunk Detection Method for Camellia oleifera Fruit Harvesting Robot Based on Improved YOLOv7. Forests 2023, 14, 1453. [Google Scholar] [CrossRef]
  42. Ramalingam, K.; Pazhanivelan, P.; Jagadeeswaran, R.; Prabu, P.C. YOLO deep learning algorithm for object detection in agriculture: A review. J. Agric. Eng. 2024, 55. [Google Scholar] [CrossRef]
  43. Badgujar, C.M.; Poulose, A.; Gan, H. Agricultural Object Detection with You Look Only Once (YOLO) Algorithm: A Bibliometric and Systematic Literature Review (Version 1). arXiv 2024, arXiv:2401.10379. [Google Scholar] [CrossRef]
  44. Tang, Z.; Lu, J.; Chen, Z.; Qi, F.; Zhang, L. Improved Pest-YOLO: Real-time pest detection based on efficient channel attention mechanism and transformer encoder. Ecol. Inform. 2023, 78, 102340. [Google Scholar] [CrossRef]
  45. Liu, H.; Wang, X.; Zhao, F.; Yu, F.; Lin, P.; Gan, Y.; Ren, X.; Chen, Y.; Tu, J. Upgrading swin-B transformer-based model for accurately identifying ripe strawberries by coupling task-aligned one-stage object detection mechanism. Comput. Electron. Agric. 2024, 218, 108674. [Google Scholar] [CrossRef]
  46. Zhang, J.; Tian, M.; Yang, Z.; Li, J.; Zhao, L. An improved target detection method based on YOLOv5 in natural orchard environments. Comput. Electron. Agric. 2024, 219, 108780. [Google Scholar] [CrossRef]
  47. Ling, S.; Wang, N.; Li, J.; Ding, L. Accurate Recognition of Jujube Tree Trunks Based on Contrast Limited Adaptive Histogram Equalization Image Enhancement and Improved YOLOv8. Forests 2024, 15, 625. [Google Scholar] [CrossRef]
  48. Su, F.; Zhao, Y.; Shi, Y.; Zhao, D.; Wang, G.; Yan, Y.; Zu, L.; Chang, S. Tree Trunk and Obstacle Detection in Apple Orchard Based on Improved YOLOv5s Model. Agronomy 2022, 12, 2427. [Google Scholar] [CrossRef]
Figure 1. Orchard management robot data collection platform.
Figure 2. Data collection video screenshot.
Figure 3. Tree trunk images of fruit trees. (a) Image with strong lighting; (b) image with weak lighting; (c) image of medium and small-scale tree trunks; (d) image of large-scale tree trunks.
Figure 4. Tree trunk dataset. (a,b) Augmented dataset images.
Figure 5. YOLOv8 architecture diagram.
Figure 6. Architecture diagram of PCC-YOLO.
Figure 7. Transformer. (a) Self-attention mechanism of traditional transformer structure; (b) CoT-Net structure.
Figure 8. Diagram of the coordinate attention mechanism.
Figure 9. Architecture. (a) The SENet module; (b) the SENetV2 module.
Figure 10. PENet network architecture.
Figure 11. Recognition effect of different models in ablation experiments.
Figure 12. Curves of mAP@0.5 and mAP@0.5:0.95 for different models in the ablation experiment.
Table 1. Performance comparison of PCC-YOLO with other object detection models.

Model       P       FPS/(f·s−1)   mAP@0.5   GFLOPs   MSE/px   RMSE/px
YOLOv5s     0.820   85.19         0.724     18.9     12.20    3.39
YOLOv6      0.750   132.56        0.772     11.8     11.10    3.34
YOLOv8n     0.760   127.27        0.778     6.8      11.20    3.34
PCC-YOLO    0.810   143.36        0.826     7.8      5.02     2.24
Table 2. Comparison of recognition effects of different models in different environments.
[Image grid: detection results under conditions A–D (columns) for the ground truth and the PCC-YOLO, YOLOv8n, YOLOv6, and YOLOv5s models (rows).]
A: oblique sunlight on a clear winter day; B: the absence of oblique sunlight on a clear winter day; C: overcast winter conditions; D: springtime conditions.
Table 3. Performance comparison of four attention mechanism modules.

Model              P       mAP@0.5   FPS/(f·s−1)   MSE/px   RMSE/px
YOLOv8             0.760   0.778     127.27        11.2     3.34
YOLOv8-SENetV2     0.733   0.751     117.39        14.3     3.79
YOLOv8-CGA         0.753   0.763     112.77        12.7     3.57
YOLOv8-SEAM        0.768   0.783     104.08        11.2     3.34
YOLOv8-Coord-SE    0.777   0.809     126.69        6.2      2.49
YOLOv8 is the original model; YOLOv8-SENetV2 adds a channel attention mechanism prior to the detection head, based on the original YOLOv8; YOLOv8-CGA adds a cascaded group attention mechanism prior to the detection head, based on the original YOLOv8; YOLOv8-SEAM replaces the original C2f with the spatially enhanced attention module in the detection head, based on the original YOLOv8; YOLOv8-Coord-SE introduces a fusion of the coordinate attention mechanism and SENetV2 prior to the detection head, based on the original YOLOv8.
Table 4. Comparison between CoT-Net and conventional transformer self-attention.

Model        P       FPS/(f·s−1)   mAP@0.5   GFLOPs
YOLOv8       0.760   127.27        0.778     6.8
YOLOv8-tr    0.765   109.45        0.781     6.9
YOLOv8-cot   0.790   112.77        0.796     6.4
YOLOv8-cot introduces CoT-Net into the last two C2f layers of the original YOLOv8 model; YOLOv8-tr introduces the traditional transformer self-attention structure into the last two C2f layers of the original YOLOv8 model.
Table 5. PCC-YOLO ablation experiment.

Model              P       FPS/(f·s−1)   mAP@0.5   GFLOPs   MSE/px   RMSE/px
YOLOv8             0.760   127.27        0.778     6.8      11.2     3.34
YOLOv8-Coord-SE    0.777   126.69        0.809     8.4      6.18     2.49
YOLOv8-cot         0.790   112.77        0.796     6.4      8.02     2.83
YOLOv8-lap         0.772   192.31        0.806     6.8      7.44     2.73
PCC-YOLO           0.810   143.36        0.826     7.8      5.04     2.24
YOLOv8-Coord-SE is based on the original YOLOv8 and introduces a fusion of the coordinate attention mechanism and SENetV2 prior to the detection head; YOLOv8-cot adds CoT-Net in the last two C2f modules of the original YOLOv8; YOLOv8-lap adds PENet before the input image, based on the original YOLOv8.
