Article

YOLOv8-MSP-PD: A Lightweight YOLOv8-Based Detection Method for Jinxiu Malus Fruit in Field Conditions

Yi Liu, Xiang Han, Hongjian Zhang, Shuangxi Liu, Wei Ma, Yinfa Yan, Linlin Sun, Linlong Jing, Yongxian Wang and Jinxing Wang

1 College of Mechanical and Electronic Engineering, Shandong Agricultural University, Taian 271018, China
2 Shandong Key Laboratory of Intelligent Production Technology and Equipment for Facility Horticulture, Taian 271018, China
3 Institute of Urban Agriculture, Chinese Academy of Agricultural Sciences, Chengdu 610213, China
* Authors to whom correspondence should be addressed.
Agronomy 2025, 15(7), 1581; https://doi.org/10.3390/agronomy15071581
Submission received: 27 May 2025 / Revised: 20 June 2025 / Accepted: 26 June 2025 / Published: 28 June 2025
(This article belongs to the Section Precision and Digital Agriculture)

Abstract

Accurate detection of Jinxiu Malus fruits in unstructured orchard environments is hampered by frequent overlap, occlusion, and variable illumination. To address these challenges, we propose YOLOv8-MSP-PD (YOLOv8 with Multi-Scale Pyramid Fusion and Proportional Distance IoU), a lightweight model built on an enhanced YOLOv8 architecture. We replace the backbone with MobileNetV4, incorporating unified inverted bottleneck (UIB) modules and depth-wise separable convolutions for efficient feature extraction. We introduce a spatial pyramid pooling fast cross-stage partial connections (SPPFCSPC) module for multi-scale feature fusion and a modified proportional distance IoU (MPD-IoU) loss to optimize bounding-box regression. Finally, layer-adaptive magnitude pruning (LAMP) combined with knowledge distillation compresses the model while retaining performance. On our custom Jinxiu Malus dataset, YOLOv8-MSP-PD achieves a mean average precision (mAP) of 92.2% (1.6% gain over baseline), reduces floating-point operations (FLOPs) by 59.9%, and shrinks to 2.2 MB. Five-fold cross-validation confirms stability, and comparisons with Faster R-CNN and SSD demonstrate superior accuracy and efficiency. This work offers a practical vision solution for agricultural robots and guidance for lightweight detection in precision agriculture.

1. Introduction

Jinxiu Malus fruit (Malus spp.) is a member of the Rosaceae family and Malus genus. Rich in phenolic compounds and bioactive nutrients, it exhibits notable health benefits, including anticancer, antiviral, antimicrobial, antioxidant, and cardiovascular protective properties [1,2]. Characterized by dense clusters with high fruit density per plant, this cultivar presents significant challenges for traditional manual harvesting methods, which are labor-intensive, inefficient, and economically unsustainable. These limitations constitute a critical bottleneck for industrial-scale cultivation. Consequently, the development of intelligent harvesting technologies that simultaneously reduce operational costs and enhance picking efficiency holds substantial practical significance for industrial advancement.
Fruit detection algorithms, as the perceptual core of intelligent harvesting systems, directly determine operational efficiency through their detection accuracy. However, conventional methods rely on hand-crafted features and classifiers, such as color, edge, texture, and local descriptors. Their recognition accuracy is limited by suboptimal feature extraction, and their poor robustness makes them unsuitable for real-time orchard deployment [3,4,5]. Recent advancements in deep learning-based object detection algorithms (e.g., the YOLO and SSD series) have demonstrated remarkable progress in field fruit recognition [6,7,8,9]: Du et al. developed the DSW-YOLO model based on YOLOv7, achieving 94.6% mAP for strawberry detection in unobstructed environments [10]; Wu et al. enhanced YOLOv8 by integrating CBAM attention mechanisms and CARAFE up-sampling, attaining 94.3% mAP for apple detection under complex conditions [11]; Sun et al. combined HS-FPAN feature fusion with knowledge distillation to compress the model to 58.99% of its original size (3.51 MB) while maintaining 85.40% mAP for jujube detection [12]; Yang et al. proposed the MobileNet-MCA backbone, incorporating partial convolution (PConv) into E-ELAN modules with progressive rectification, achieving 97.65% mAP for strawberry detection within a 3.58 MB model [13]. Nevertheless, the applicability of these approaches to densely clustered Jinxiu Malus fruit under challenging field conditions (e.g., foliage occlusion, illumination variation) remains unverified.
This study presents YOLOv8-MSP-PD, a lightweight detection model built on an enhanced YOLOv8 framework for precise localization of Jinxiu Malus fruit in complex orchard environments. The technical framework comprises three innovations: replacement of the CSPDarknet backbone with MobileNetV4, incorporating Unified Inverted Bottleneck (UIB) modules to boost feature extraction efficiency; introduction of the Spatial Pyramid Pooling Fast Cross-Stage Partial Connections (SPPFCSPC) module for enhanced multi-scale feature fusion and boundary delineation; and fusion of layer-adaptive magnitude pruning (LAMP) with knowledge distillation to reduce model parameters and computational complexity for lightweight deployment.

2. Materials and Methods

2.1. Dataset Construction and Augmentation

The dataset for this study was collected in an apple orchard in Taian, Shandong Province, China (37°21′ N, 112°30′ E), focusing on Jinxiu Malus fruit. To accommodate the complex conditions encountered by a harvesting robot, images were captured in September 2024 using a Nikon D850 digital camera (Nikon Corporation, Tokyo, Japan) from multiple viewpoints at distances ranging from 0.5 to 1.5 m. In total, 1225 JPEG images (3468 × 4624 pixels) were acquired under varied orchard conditions, including backlighting, single-fruit and cluster-fruit scenarios, close-up and wide-angle shots, fruit overlap, and occlusion by branches and leaves. The dataset was then partitioned into training, validation, and test sets at a ratio of 7:2:1. Example samples are shown in Figure 1.
To address class imbalance and improve model robustness and generalization, we applied extensive data augmentation. Specifically, we used horizontal and vertical flips, gamma correction, adaptive histogram equalization, and random brightness adjustment to enhance contrast and help the model better capture fruit features. In addition, to simulate motion blur caused by wind-induced foliage movement or slight camera shake, we applied a randomized motion-blur filter. After augmentation, the dataset comprised a total of 6125 images, including the original 1225 and 4900 augmented samples. Finally, these were split into 4287 training, 1225 validation, and 613 test images.
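The augmentation operators described above can be expressed, for illustration, as an Albumentations pipeline. The operators mirror the ones listed in the text, but every probability and parameter value below is an assumption, since the paper does not report them.

```python
# Illustrative Albumentations pipeline matching the operators listed above; every
# probability and parameter value is assumed, since the paper does not report them.
import albumentations as A
import numpy as np

augment = A.Compose(
    [
        A.HorizontalFlip(p=0.5),                       # horizontal flip
        A.VerticalFlip(p=0.5),                         # vertical flip
        A.RandomGamma(gamma_limit=(80, 120), p=0.3),   # gamma correction
        A.CLAHE(clip_limit=2.0, p=0.3),                # adaptive histogram equalization
        A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.0, p=0.3),
        A.MotionBlur(blur_limit=7, p=0.2),             # wind/camera-shake motion blur
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

image = np.random.randint(0, 255, (640, 640, 3), dtype=np.uint8)  # stand-in image
bboxes = [[0.52, 0.48, 0.10, 0.12]]                    # one YOLO-format box (cx, cy, w, h)
out = augment(image=image, bboxes=bboxes, class_labels=[0])
aug_image, aug_boxes = out["image"], out["bboxes"]
```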

2.2. Field Detection Method for Jinxiu Malus Fruits

The conventional YOLOv8 framework employs Cross-Stage Partial Darknet (CSPDarknet) as its backbone network. While demonstrating strong local feature extraction capabilities, CSPDarknet exhibits limitations in modeling global contextual dependencies, leading to suboptimal differentiation between target fruits and occlusions in complex orchard environments. Furthermore, existing multi-scale feature fusion approaches suffer from insufficient detail representation in high-level features and inadequate semantic information in low-level features, resulting in limited adaptability to varying object scales. Consequently, traditional convolutional networks are prone to missed and false detections in complex orchard scenarios. To address these challenges, this study replaces the backbone with an improved MobileNet fourth version (MobileNetV4), which leverages Unified Inverted Bottleneck (UIB) modules and optimized depth-wise separable convolutions. This design significantly reduces computational complexity while effectively aggregating local texture details of fruit surfaces and global contextual information from occluded regions, achieving an optimal balance between efficiency and accuracy. High-performance inference is ensured even on resource-constrained embedded devices through this design [14,15]. To further enhance multi-scale feature fusion, we integrate the Spatial Pyramid Pooling Cross-Stage Partial Connections (SPPFCSPC) module into the backbone architecture. This module combines multi-scale pooling operations to capture global contextual information and employs a Cross-Stage Partial (CSP) structure for feature refinement. By adaptively aggregating multi-scale convolutional kernels, the SPPFCSPC optimizes information flow while reducing redundancy, thereby improving the model’s ability to delineate object boundaries and manage complex backgrounds. The architecture of the improved YOLOv8-MSP detection model for Jinxiu Malus fruit is shown in Figure 2.

2.2.1. Optimization of Backbone Network for Feature Extraction

To address CSPDarknet’s limitations in modeling global context dependencies, we replace it with MobileNetV4 as the YOLOv8 backbone (Figure 3). The MobileNetV4 architecture employs a progressive layer-wise feature extraction strategy that balances computational efficiency and robust feature representation, making it suitable for resource-constrained applications [15,16]. The network architecture begins with an initial convolutional layer using a 3 × 3 kernel with a stride of 3 for down-sampling, expanding the channel dimension to 32. Subsequently, a lightweight shallow module composed of two convolutional layers extracts and enhances low-level features while minimizing computational overhead. The intermediate feature module dynamically adjusts the channel dimension from 32 to 96 and then compresses it to 64, optimizing feature quality and parameter efficiency. At the core of the network, the UIB blocks are stacked in the deep feature module, leveraging depth-wise separable convolutions and skip connections to enhance high-order semantic modeling. Finally, a tail feature adjustment module applies two convolutional layers to expand channels from 128 to 960 and further to 1280, generating high-dimensional feature representations for downstream classification or detection tasks.
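As a point of reference, the depth-wise separable inverted-bottleneck pattern that the UIB block generalizes can be sketched in PyTorch as follows. Channel sizes, the activation choice, and the single depth-wise stage are illustrative simplifications, not the exact MobileNetV4 configuration.

```python
import torch
import torch.nn as nn

class InvertedBottleneck(nn.Module):
    """Simplified inverted bottleneck with a depth-wise separable convolution.
    The actual MobileNetV4 UIB adds optional depth-wise convs before and after
    the expansion; channel sizes here are illustrative only."""
    def __init__(self, c_in: int, c_out: int, expand: int = 4, stride: int = 1):
        super().__init__()
        c_mid = c_in * expand
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 1, bias=False),                # 1x1 expansion
            nn.BatchNorm2d(c_mid), nn.ReLU6(inplace=True),
            nn.Conv2d(c_mid, c_mid, 3, stride, 1,
                      groups=c_mid, bias=False),                  # 3x3 depth-wise conv
            nn.BatchNorm2d(c_mid), nn.ReLU6(inplace=True),
            nn.Conv2d(c_mid, c_out, 1, bias=False),               # 1x1 projection
            nn.BatchNorm2d(c_out),
        )
        self.use_skip = stride == 1 and c_in == c_out

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_skip else y                      # skip connection

x = torch.randn(1, 64, 80, 80)
print(InvertedBottleneck(64, 64)(x).shape)   # torch.Size([1, 64, 80, 80])
```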

2.2.2. SPPFCSPC Multi-Scale Feature Fusion Module

To enhance multi-scale feature fusion and overcome the shortcomings of insufficient detail in high-level maps and weak semantics in low-level maps, we integrate the Spatial Pyramid Pooling Cross-Stage Partial Connections (SPPFCSPC) module into the MobileNetV4-enhanced YOLOv8 backbone. Unlike the conventional SPPF, SPPFCSPC combines serial multi-scale spatial pyramid pooling with a CSP deep-branch structure: Identical pooling kernels process the feature map at successive scales, and the resulting feature channels are then reorganized and optimized via the CSP architecture. This design simultaneously captures global context and reinforces local detail, substantially reducing channel redundancy and accelerating feature flow [17]. The architectures of the SPPF and SPPFCSPC modules are illustrated in Figure 4.
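A minimal PyTorch sketch of this serial-pooling-plus-CSP idea is given below. The hidden channel width, kernel sizes, and number of fusion convolutions are assumptions made for illustration rather than the exact SPPFCSPC layout used in the paper.

```python
import torch
import torch.nn as nn

def conv_bn_act(c_in, c_out, k=1, s=1):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(inplace=True),
    )

class SPPFCSPC(nn.Module):
    """Sketch of SPPF plus cross-stage-partial fusion: one branch applies the same
    5x5 max-pool three times in series (growing receptive fields), the other is a
    plain shortcut; both are concatenated and fused."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hid = c_out // 2
        self.branch_main = nn.Sequential(conv_bn_act(c_in, c_hid),
                                         conv_bn_act(c_hid, c_hid, 3),
                                         conv_bn_act(c_hid, c_hid))
        self.branch_short = conv_bn_act(c_in, c_hid)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
        self.fuse_main = nn.Sequential(conv_bn_act(4 * c_hid, c_hid),
                                       conv_bn_act(c_hid, c_hid, 3))
        self.fuse_out = conv_bn_act(2 * c_hid, c_out)

    def forward(self, x):
        m = self.branch_main(x)
        p1 = self.pool(m)
        p2 = self.pool(p1)
        p3 = self.pool(p2)                      # serial pooling reuses one kernel size
        m = self.fuse_main(torch.cat([m, p1, p2, p3], dim=1))
        return self.fuse_out(torch.cat([m, self.branch_short(x)], dim=1))

print(SPPFCSPC(256, 256)(torch.randn(1, 256, 20, 20)).shape)
```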

2.2.3. Model Pruning

In resource-constrained embedded systems with limited storage and computational capabilities, it is imperative to minimize both the floating-point operations (FLOPs) and parameter scale of neural networks. To address this challenge, we implement Layer-adaptive Sparsity for Magnitude-based Pruning (LAMP) [18,19], a refined optimization technique that preserves model expressiveness while compressing parameters. LAMP operates by scoring the importance of each weight connection based on its squared magnitude and subsequently eliminating low-scoring redundant weights [13]. The core scoring function is defined as
$$\mathrm{score}(\mu; W) = \frac{W[\mu]^2}{\sum_{\nu \in L} W[\nu]^2} \quad (1)$$
where $W[\mu]^2$ denotes the squared magnitude of the target weight connection $\mu$, and $\sum_{\nu \in L} W[\nu]^2$ represents the sum of squared magnitudes of all weights within the same layer L.
After computing the importance scores by Equation (1), the pruning threshold for each layer is determined as follows. The scores within a given layer are sorted in descending order, and the threshold is set at the value that retains a predefined retention ratio (e.g., retaining 50% of the weights yields approximately a 2.0× reduction in FLOPs). These retention ratios, corresponding to target Speed_up factors of 1.5, 2.0, 2.5, and 3.0, are selected empirically via a grid search on a held-out validation set. No closed-form criterion is used; each threshold is chosen to optimize validation mAP under the desired sparsity level.
As Equation (1) indicates, weights with higher magnitudes receive elevated scores, reflecting their greater contribution to model outputs, whereas low-scoring weights are pruned as non-essential. During implementation, the weights of each layer are first flattened into a one-dimensional vector and sorted in descending order of their scores. To prevent a catastrophic model collapse due to excessive pruning, a layer-wise sparsity constraint is applied, ensuring at least one connection is retained per layer while adhering to a global sparsity threshold. Finally, the pruned network undergoes fine-tuning to recover performance losses caused by parameter removal.
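A simplified sketch of this scoring-and-pruning procedure, following Equation (1) as stated above, might look as follows in PyTorch. The global quantile threshold, the restriction to Conv2d layers, and the toy model are illustrative assumptions; the real implementation additionally fine-tunes the pruned network.

```python
import torch
import torch.nn as nn

def lamp_scores(weight: torch.Tensor) -> torch.Tensor:
    """Per-layer importance scores following Equation (1): each weight's squared
    magnitude divided by the layer's total squared magnitude."""
    sq = weight.detach().flatten() ** 2
    return sq / sq.sum()

def global_magnitude_prune(model: nn.Module, sparsity: float = 0.5):
    """Rank all conv weights by their per-layer score and zero the lowest fraction
    globally; the paper tuned retention ratios per Speed_up target instead of
    fixing a single sparsity value."""
    scores, tensors = [], []
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            scores.append(lamp_scores(m.weight))
            tensors.append(m.weight)
    all_scores = torch.cat(scores)
    threshold = torch.quantile(all_scores, sparsity)        # global score cut-off
    for w, s in zip(tensors, scores):
        mask = (s >= threshold).reshape(w.shape).float()
        mask.view(-1)[s.argmax()] = 1.0                      # keep at least one weight per layer
        w.data.mul_(mask)                                    # zero out pruned connections

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Conv2d(16, 32, 3))
global_magnitude_prune(model, sparsity=0.5)
```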

2.2.4. Model Distillation

Knowledge distillation is a widely used technique for compressing models and enhancing performance, particularly effective for improving the accuracy of lightweight, single-object detection models. Given that model pruning may lead to performance degradation, this study introduces a distillation strategy to transfer knowledge from the unpruned teacher model to the pruned student model, effectively restoring and improving its detection capability [20,21,22]. The optimized YOLOv8-MSP model is selected as the teacher model, while the pruned version serves as the student. To ensure compatibility between the two networks during feature extraction, the spatial dimensions and channel numbers of their corresponding feature maps are aligned. Both models are then fed the same batch of samples in parallel, as shown in Figure 5.
During the classification stage, both teacher and student models generate binary classification maps. The binary cross-entropy (BCE) loss is calculated for each class at each location, weighted by an importance coefficient, and summed to form the classification distillation loss:
$$L_{\mathrm{cls}}^{\mathrm{dis}}(x) = \sum_{i=1}^{n} \sum_{j=1}^{k} w_{i,j}\, L_{\mathrm{BCE}}\!\left(p_{i,j}^{t}, p_{i,j}^{s}\right)$$
where $p_{i,j}^{t}$ and $p_{i,j}^{s}$ are the predicted probabilities (after the sigmoid) from the teacher and student models at location i for class j, and $w_{i,j}$ is the corresponding importance weight.
The BCE loss $L_{\mathrm{BCE}}$ is defined as
$$L_{\mathrm{BCE}}\!\left(p_{i,j}^{t}, p_{i,j}^{s}\right) = -\left[\left(1 - p_{i,j}^{t}\right)\log\!\left(1 - p_{i,j}^{s}\right) + p_{i,j}^{t}\log p_{i,j}^{s}\right]$$
For localization, the Intersection over Union (IoU) between the predicted bounding boxes of the teacher and student models is computed. A weighted IoU-based loss is then applied:
$$L_{\mathrm{loc}}^{\mathrm{dis}}(x) = \sum_{i=1}^{n} \max_{j}\left(w_{i,j}\right)\left(1 - u_i\right)$$
where $u_i$ denotes the IoU between the teacher's and the student's predicted boxes at location i.
By combining the classification and localization distillation losses, the student model effectively inherits the predictive capabilities of the teacher, resulting in enhanced detection accuracy even after pruning.
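The combined distillation objective can be sketched as below. Tensor shapes, the one-to-one box pairing, and the simple IoU helper are illustrative assumptions, since the actual pipeline first aligns the teacher and student feature maps.

```python
import torch
import torch.nn.functional as F

def distillation_loss(t_cls, s_cls, t_box, s_box, weights):
    """Sketch of the combined objective: weighted BCE between teacher/student class
    maps plus a weighted (1 - IoU) term between their predicted boxes."""
    # classification distillation: teacher probabilities act as soft targets
    cls_loss = (weights * F.binary_cross_entropy(s_cls, t_cls, reduction="none")).sum()
    # localization distillation: penalize boxes that diverge from the teacher's
    iou = pairwise_iou(s_box, t_box)                         # one IoU per location
    loc_loss = (weights.max(dim=1).values * (1.0 - iou)).sum()
    return cls_loss + loc_loss

def pairwise_iou(a, b):
    """IoU of matched (x1, y1, x2, y2) boxes, one pair per row."""
    lt = torch.max(a[:, :2], b[:, :2])
    rb = torch.min(a[:, 2:], b[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area_a = (a[:, 2:] - a[:, :2]).prod(dim=1)
    area_b = (b[:, 2:] - b[:, :2]).prod(dim=1)
    return inter / (area_a + area_b - inter + 1e-7)

n, k = 4, 1                                                  # locations, classes (illustrative)
t_cls, s_cls = torch.rand(n, k), torch.rand(n, k)
t_box = torch.tensor([[0., 0., 2., 2.]]).repeat(n, 1)
s_box = torch.tensor([[0., 0., 1.5, 2.]]).repeat(n, 1)
print(distillation_loss(t_cls, s_cls, t_box, s_box, torch.ones(n, k)))
```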

2.2.5. MPD-IoU Loss Function

In complex orchard environments, background clutter and uneven annotation quality directly affect the convergence stability and localization accuracy of the model. To address these issues, we replace the conventional CIoU loss with the modified proportional distance IoU loss (MPD-IoU loss) [22,23], which enhances sensitivity to small bounding-box deviations while streamlining computation. The MPD-IoU loss is defined as
$$L_{\mathrm{MPD\text{-}IoU}} = 1 - \frac{|A \cap B|}{|A \cup B|} + \frac{d_1^2}{w^2 + h^2} + \frac{d_2^2}{w^2 + h^2}$$
where $|A \cap B|$ is the area of intersection between the predicted box B and the ground-truth box A; $|A \cup B|$ is the area of their union; w and h are the width and height of the ground-truth box A; $(c_x^A, c_y^A)$ and $(c_x^B, c_y^B)$ are the center coordinates of boxes A and B, respectively; and $d_1^2$ is the squared Euclidean distance between the two box centers:
$$d_1^2 = \left(c_x^A - c_x^B\right)^2 + \left(c_y^A - c_y^B\right)^2$$
Let $(x_{\min}, y_{\min})$ and $(x_{\max}, y_{\max})$ denote the top-left and bottom-right corners of the smallest enclosing box covering both A and B. Its width $w_e$ and height $h_e$ are defined as
$$w_e = x_{\max} - x_{\min}, \qquad h_e = y_{\max} - y_{\min}$$
and the squared diagonal $d_2^2$ of this enclosing box is
$$d_2^2 = w_e^2 + h_e^2$$
Compared with the CIoU loss, MPD-IoU directly minimizes the normalized distance terms between the predicted and ground-truth boxes without extra hyperparameters such as angle or aspect-ratio terms, thereby reducing redundant computation. This formulation not only preserves high robustness under occlusion, deformation, or poorly defined boundaries but also conforms more accurately to complex object contours, enhancing both detection accuracy and generalization.
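A minimal sketch of this loss, following the definitions above, is given below. The corner-format box convention and the small epsilon term are assumptions made for illustration.

```python
import torch

def mpd_iou_loss(pred, gt, eps=1e-7):
    """Sketch of the MPD-IoU loss above; boxes are (x1, y1, x2, y2). d1 is the squared
    centre distance, d2 the squared diagonal of the smallest enclosing box, both
    normalised by the ground-truth box's w^2 + h^2 as stated in the text."""
    lt = torch.max(pred[:, :2], gt[:, :2])
    rb = torch.min(pred[:, 2:], gt[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    union = ((pred[:, 2:] - pred[:, :2]).prod(dim=1)
             + (gt[:, 2:] - gt[:, :2]).prod(dim=1) - inter)
    iou = inter / (union + eps)

    c_pred = (pred[:, :2] + pred[:, 2:]) / 2
    c_gt = (gt[:, :2] + gt[:, 2:]) / 2
    d1_sq = ((c_pred - c_gt) ** 2).sum(dim=1)            # squared centre distance

    enc_wh = torch.max(pred[:, 2:], gt[:, 2:]) - torch.min(pred[:, :2], gt[:, :2])
    d2_sq = (enc_wh ** 2).sum(dim=1)                     # enclosing-box diagonal squared

    gt_wh_sq = ((gt[:, 2:] - gt[:, :2]) ** 2).sum(dim=1)
    return 1 - iou + d1_sq / (gt_wh_sq + eps) + d2_sq / (gt_wh_sq + eps)

pred = torch.tensor([[10., 10., 50., 60.]])
gt = torch.tensor([[12., 8., 52., 58.]])
print(mpd_iou_loss(pred, gt))
```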

2.3. Model Training and Evaluation Metrics

2.3.1. Experimental Platform

Experiments were conducted on an Intel Xeon Silver 4314 CPU @ 2.40 GHz with an NVIDIA Quadro RTX 4000 GPU. The development platform comprised Windows 10 running the PyTorch deep-learning framework (CUDA Toolkit 11.7) and Python 3.8.8.
Prior to training, all input images were resized to 640 × 640 pixels. We initialized the batch size at 8 and optimized model weights using stochastic gradient descent (SGD), with an initial learning rate of 0.01 and a momentum factor of 0.937. An SGD variant with automatic batch-size adjustment tied to the learning-rate schedule was employed. Weight decay was set to 0.0005, and the learning rate was decayed every ten epochs. Training proceeded for 150 epochs to ensure thorough convergence and fitting.
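For reference, these hyperparameters map onto the Ultralytics training interface roughly as follows. The model and dataset YAML file names are placeholders, and the snippet is a sketch of the configuration rather than the authors' exact training script.

```python
# Sketch of the training configuration above using the Ultralytics interface;
# "yolov8-msp.yaml" and "malus.yaml" are placeholder names for the modified model
# definition and the dataset description, not files shipped with the library.
from ultralytics import YOLO

model = YOLO("yolov8-msp.yaml")          # hypothetical custom model configuration
model.train(
    data="malus.yaml",                   # hypothetical dataset configuration
    imgsz=640,                           # images resized to 640 x 640
    batch=8,
    epochs=150,
    optimizer="SGD",
    lr0=0.01,                            # initial learning rate
    momentum=0.937,
    weight_decay=0.0005,
)
```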

2.3.2. Evaluation Index

We employ precision (P), recall (R), mean average precision (mAP), and model size to evaluate the Jinxiu Malus fruit detection models. To further characterize computational complexity, floating-point operations (FLOPs) are also reported. Precision P reflects the proportion of true positives among all predicted positives and is defined as
$$P = \frac{TP}{TP + FP} \times 100\%$$
where TP is the number of true positive detections and FP is the number of false positives.
Recall R measures the proportion of true positives detected out of all actual positives:
$$R = \frac{TP}{TP + FN} \times 100\%$$
where FN denotes the number of false negatives.
Precision and recall directly indicate the model’s detection capability. In addition, average precision (AP) and mean average precision (mAP) summarize performance across varying confidence thresholds and serve as key statistics for overall model assessment. AP is defined as the area under the precision–recall curve
$$AP = \int_{0}^{1} P(R)\, \mathrm{d}R \times 100\%$$
$$mAP = \frac{1}{N} \sum_{j=1}^{N} \int_{0}^{1} P_j(R)\, \mathrm{d}R \times 100\%$$
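A small numerical sketch of these definitions is shown below; the counts and the sampled precision-recall points are made up purely to illustrate the computation.

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int):
    """Precision and recall as defined above, in percent."""
    p = tp / (tp + fp) * 100
    r = tp / (tp + fn) * 100
    return p, r

def average_precision(precisions, recalls):
    """Area under the precision-recall curve (trapezoidal rule), in percent;
    mAP would average this value over all classes."""
    order = np.argsort(recalls)
    return np.trapz(np.asarray(precisions)[order], np.asarray(recalls)[order]) * 100

print(precision_recall(tp=90, fp=8, fn=10))                       # illustrative counts
print(average_precision([1.0, 0.95, 0.90, 0.80], [0.0, 0.4, 0.7, 0.9]))
```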

2.3.3. K-Fold Cross-Validation Method

To rigorously evaluate the stability, generalization performance, and risk of overfitting of the model under different data distributions, as well as to reduce random errors arising from dataset partitioning, this study employs K-Fold Cross-Validation. The training dataset is partitioned into K subsets of equal size, each mutually exclusive. In each iteration, one subset is selected as the validation set, and the remaining K − 1 subsets are used to train the model. The training and validation process is repeated K times. By averaging the results from these K iterations, the stability and generalization capabilities of the model across different data distributions are effectively assessed, reducing overfitting risk. This approach not only provides a comprehensive evaluation of model performance but also minimizes bias introduced by any single arbitrary dataset split. The mathematical expression for K-fold cross-validation is as follows:
$$\bar{M} = \frac{1}{K} \sum_{i=1}^{K} M_i$$
where $M_i$ represents the performance metric (e.g., mAP or loss) from the i-th fold, $\bar{M}$ is the average performance across all folds, and K denotes the total number of folds used.
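The 5-fold protocol used later in Section 3.5 can be sketched as follows; the shuffling seed and the stubbed train_and_evaluate function are placeholders for the full training and validation runs.

```python
import numpy as np
from sklearn.model_selection import KFold

def train_and_evaluate(train_ids, val_ids):
    """Placeholder for a full training plus validation run; returns the fold's mAP."""
    return 91.0 + np.random.rand()       # dummy value for illustration only

image_ids = np.arange(6125)              # one id per image in the augmented dataset
fold_map = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(image_ids):
    fold_map.append(train_and_evaluate(image_ids[train_idx], image_ids[val_idx]))

print(f"mean mAP = {np.mean(fold_map):.2f} +/- {np.std(fold_map):.2f}")
```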

3. Results

3.1. Comparison of Different Backbone Networks

To identify the optimal backbone network for feature extraction within the YOLOv8 model, we compared the performance of several lightweight architectures. Mainstream lightweight backbones, including EfficientFormerV2, FasterNet, StarNet, and MobileNetV3, were integrated individually into the YOLOv8 framework, while the neck and detection head remained consistent across experiments. Performance comparisons were conducted on the same dataset using precision (P), recall (R), mean average precision (mAP), floating-point operations (FLOPs), parameter count, and model size as the primary evaluation metrics. Table 1 summarizes the results, providing a comprehensive assessment of the feature extraction capabilities of the different backbone networks.
Compared to the original YOLOv8 model, the tested backbones demonstrated distinct strengths and weaknesses. Specifically, EfficientFormerV2 and MobileNetV3 performed slightly below the baseline YOLOv8, possibly due to architectural mismatches or suboptimal optimization strategies relative to the dataset. As shown in Table 1, YOLOv8-FasterNet achieved a lower computational load of just 8.2 GFLOPs; however, its precision, recall, and mAP decreased to 86.3%, 84.5%, and 87.8%, respectively, suggesting that excessive lightweighting compromises detection accuracy. Conversely, although YOLOv8-StarNet increased the parameter count and model size, it did not significantly improve detection performance, indicating inherent limitations in its feature extraction ability. These observations highlight that variations in computational complexity and model size among backbone networks significantly influence detection speed and deployment efficiency; selecting a backbone therefore requires carefully balancing detection accuracy against available computational resources. YOLOv8-MobileNetV4 achieved improvements in precision (91.5%), recall (89.1%), and mAP (91.2%) relative to the original YOLOv8 model. Moreover, throughout the 150-epoch training cycle, all backbone variants displayed a monotonically increasing trend in validation mAP (Figure 6), an uncommon pattern indicative of robust model convergence. This consistent upward trajectory can be attributed to two primary factors: the substantial diversity and scale of our field-collected dataset, comprising over six thousand images that capture variations in illumination, fruit orientation, and background complexity, and the synergistic effect of a comprehensive data augmentation strategy (including mosaic stitching and geometric and photometric transformations) combined with regularization techniques (weight decay of 1 × 10⁻⁴ and label smoothing of 0.05). Crucially, adopting a convergence-focused training regimen without early stopping ensured that each model fully leveraged its learning capacity without exhibiting signs of overfitting. Collectively, these design elements demonstrate the superior convergence properties and training stability of the YOLOv8-MobileNetV4 architecture and reliably drove progressive performance gains throughout training, validating our selection of this backbone configuration as optimal for achieving both high accuracy and computational efficiency.

3.2. Ablation Experiments

Ablation experiments were conducted to systematically evaluate the effectiveness of each proposed improvement strategy. Starting from the baseline YOLOv8 model, three enhancements (the MobileNetV4 backbone, the SPPFCSPC module, and the MPD-IoU loss) were integrated individually and in combined configurations. Each variant was trained under identical experimental settings. The results are summarized in Table 2.
When assessed individually, each enhancement exhibited distinct impacts. Integrating MobileNetV4 alone improved precision and recall to 91.5% and 89.1%, respectively, and increased mAP by 0.6% to 91.2%. However, this came at the cost of significantly higher computational demands: FLOPs increased by 13.1 GFLOPs, parameter count by 2.0 × 10⁶, and model size by 5.4 MB. In contrast, applying only the SPPFCSPC module marginally improved mAP by 0.3% to 90.9% but unexpectedly reduced precision and recall to 85.9% and 85.1%, respectively. This limited gain was accompanied by an increase of 0.7 GFLOPs in FLOPs, 1.6 × 10⁶ parameters, and 3.2 MB in model size, indicating low standalone efficiency of SPPFCSPC. Conversely, MPD-IoU loss alone notably boosted mAP by 2.5% to 93.1%, with precision and recall rising to 91.3% and 89.0%, highlighting its superiority in localization accuracy.
Pairwise combinations produced varied outcomes. The integration of SPPFCSPC and MPD-IoU provided a balanced trade-off between accuracy and computational efficiency, achieving precision of 92.4%, recall of 89.1%, and an mAP of 91.5%, as shown in Table 2. However, combining MobileNetV4 with MPD-IoU led to an mAP of 91.7%, which was lower than when MPD-IoU was applied individually, and increased resource demands. Meanwhile, the combination of MobileNetV4 and SPPFCSPC resulted in higher computational complexity of 23.2 GFLOPs and increased model size to 13.4 MB but only achieved moderate improvement in mAP at 92.2%. Finally, when all three enhancements were integrated, the model achieved the highest precision of 92.9%, recall of 91.2%, and mAP of 93.5%. However, this optimal configuration also increased computational complexity to 23.2 GFLOPs, parameters to 6.5 × 10⁶, and model size to 13.4 MB. Overall, although the simultaneous integration of all three improvements delivered superior detection accuracy, it significantly increased computational complexity and storage requirements. Therefore, to achieve an optimal balance between accuracy and computational efficiency suitable for practical agricultural applications, subsequent experiments introduced pruning and knowledge distillation techniques.

3.3. Model Lightweight Experiment

Although the ablation studies confirmed that the improved Jinxiu Malus fruit detection model achieves higher accuracy, the addition of the MobileNetV4 and SPPFCSPC modules increased overall model size. To examine how varying pruning intensities affect performance, we implemented LAMP-based channel pruning on the model. Figure 7 illustrates channel weight distributions before and after pruning, demonstrating that a substantial number of redundant parameters were removed, channel weights became more balanced, and the channels contributing most to performance were retained. Here, the "Speed_up factor" denotes the theoretical reduction in floating-point operations (FLOPs) obtained by dividing the baseline model's FLOPs by those of the pruned model. We evaluated four pruning intensities (Speed_up factors of 1.5, 2.0, 2.5, and 3.0), as summarized in Table 3. Although these factors imply proportional inference-time improvements, the actual speed-up measured on an NVIDIA Quadro RTX 4000 GPU (batch size 1, input 640 × 640) was 1.4, 1.7, 2.1, and 2.6, respectively, owing to overheads such as kernel launches and memory access that do not scale with FLOPs. Nevertheless, the theoretical Speed_up remains a reliable benchmark, closely mirroring real-world trends and facilitating fair comparison across pruning levels.
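Measured speed-up of this kind is typically obtained by timing warmed-up inference on the target GPU. A rough sketch of such a measurement is given below, with the run count, input size, and warm-up length chosen arbitrarily for illustration.

```python
import time
import torch

def measure_latency(model, runs: int = 100, size=(1, 3, 640, 640)):
    """Average per-image inference time in milliseconds (batch size 1, 640 x 640).
    Warm-up and synchronisation matter because kernel-launch and memory overheads
    do not shrink in proportion to FLOPs."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(size, device=device)
    with torch.no_grad():
        for _ in range(10):              # warm-up iterations
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1000

# measured_speed_up = measure_latency(baseline_model) / measure_latency(pruned_model)
```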
The results in Table 3 show that pruning to Speed_up 1.5 and 2.0 preserves detection accuracy while markedly reducing parameter count and model size, whereas pruning to 3.0 causes a pronounced accuracy drop. Balancing accuracy against compactness, we therefore chose the model pruned at Speed_up 2.5 as the optimal configuration.
To further improve the detection performance of the pruned model, it was optimized with a feature-mimicking knowledge distillation method. The teacher model is the original YOLOv8-MSP model before pruning, and the student model is the pruned model with a Speed_up of 2.5. After distillation, the student model maintained the same computational cost and parameter count as the pruned version while boosting precision to 92.5%, recall to 90.8%, and mean average precision to 92.2%. This demonstrates that knowledge distillation compensates for pruning-induced accuracy loss, delivering a lightweight model that retains high detection performance. These findings validate the combined use of LAMP-based pruning and knowledge distillation for efficient deployment in resource-constrained orchard environments.

3.4. Comparative Experiments with Different Models

To comprehensively evaluate the performance of the improved model, we conducted comparative experiments against several mainstream object detection frameworks, including Faster R-CNN, SSD, YOLOv5, YOLOv8, YOLOv9, DETR, and RTMDet. The improved model, incorporating both pruning and knowledge distillation, is referred to as YOLOv8-MSP-PD (Pruning and Distillation). All models were trained and evaluated on the same custom dataset. Evaluation metrics included precision, recall, mAP, FLOPs, and model size. The results of these comparative experiments are presented in Table 4.
The results in Table 4 show that Faster R-CNN, as a two-stage detection algorithm, achieved a precision of 78.8%, a recall of 71.4%, and a mean average precision (mAP) of 82.6%. Its computational complexity reached 23.2 GFLOPs, with a parameter count of 65 million and a model size of 260 megabytes. These resource demands are significantly higher than those of one-stage detectors such as YOLOv5 and YOLOv8, making Faster R-CNN unsuitable for lightweight and real-time detection applications. SSD achieved an mAP of 79.3%, with 23 million parameters, 14.2 GFLOPs, and a model size of 92 MB. Although SSD demonstrates some improvement over Faster R-CNN in terms of resource consumption, its detection accuracy remains relatively low.
In contrast, YOLOv5 achieved an mAP of 90.3%, with 1.2 million parameters, 12.3 GFLOPs, and a model size of 4.8 MB, striking a favorable balance between accuracy and compactness. YOLOv8 further improved performance, with an mAP of 90.6%, a computational complexity of 9.4 GFLOPs, and a model size of 6.3 MB.
The improved YOLOv8-MSP-PD model demonstrated superior performance across all evaluation metrics, achieving a precision of 92.5%, a recall of 90.8%, and a mean average precision of 92.2%. Its computational complexity was 7.9 GFLOPs, with 1.2 million parameters and a model size of only 2.2 megabytes. Compared with Faster R-CNN, SSD, YOLOv5, and YOLOv8, the mean average precision improved by 9.6%, 12.9%, 1.9%, and 1.6%, respectively, while the computational complexity and model size were markedly reduced. These results highlight its suitability for real-time fruit detection in resource-constrained environments.
Although both YOLOv9 and DETR achieved mean average precisions close to the improved model, with YOLOv9 reaching 91.7% and DETR 92.5%, their computational complexities were substantially higher, at 102.4 GFLOPs and 280.7 GFLOPs, respectively. Parameter counts and model sizes were also much larger, with YOLOv9 at 48 megabytes and DETR at 130 megabytes, making them less suitable for deployment where computational resources are limited. RTMDet’s mean average precision was slightly lower than the YOLO series, and its computational complexity and model size also exceeded those of the proposed model.
Overall, YOLOv8-MSP-PD achieves high detection accuracy while significantly reducing model parameters and storage requirements, providing a feasible solution for real-time detection in complex orchard environments. To further validate the lightweight model’s adaptability under various background conditions, we tested it on diverse Jinxiu Malus fruit images, highlighting missed detections and false positives with yellow circles and blue rectangles, respectively, as illustrated in Figure 8.
As shown in Figure 8, detection networks exhibited notable differences in performance under challenging backgrounds. YOLOv8-MSP-PD demonstrated outstanding detection ability and robustness across complex scenarios involving leaf occlusion, fruit overlap, and variable lighting, consistently outperforming other models. In scenes with significant leaf occlusion, both YOLOv8-MSP-PD and DETR displayed strong resistance to interference, accurately detecting most partially obscured fruits. In contrast, Faster R-CNN, SSD, YOLOv5, YOLOv8, and RTMDet experienced high rates of missed detections. In overlapping fruit scenarios, YOLOv8-MSP-PD maintained high detection accuracy, effectively distinguishing and localizing overlapping fruits, while other models, including DETR, missed some overlapped targets; SSD and RTMDet showed particularly high miss rates. In scenarios without significant occlusion, Faster R-CNN and SSD frequently produced false positives by misclassifying background elements as fruit, while RTMDet continued to miss some targets. Other models, such as YOLOv5, YOLOv8, and DETR, performed generally well under these simpler conditions. Notably, YOLOv8-MSP-PD achieved zero missed or false detections in such cases, underscoring its superior robustness and precision. While DETR performed well under occlusion and complex conditions, it still exhibited some missed or false detections. In summary, YOLOv8-MSP-PD consistently maintained the most stable and superior detection performance across a variety of complex field environments.

3.5. Evaluation of Model Generalization Capability

To systematically evaluate the model’s stability and generalization performance across diverse data partitions and mitigate stochastic errors inherent to single-split validation, this study adopted 5-fold cross-validation (K = 5), with the workflow illustrated in Figure 9. The 5-fold approach optimizes the trade-off between estimation accuracy and robustness relative to 3-fold cross-validation while maintaining computational efficiency compared to 10-fold validation. Performance metrics, including mAP, Precision, Recall, and Loss values, were calculated for each fold. Statistical analysis of metric means and standard deviations was conducted to quantify model consistency and reliability under varying data distributions, as summarized in Table 5.
The YOLOv8-MSP-PD model exhibited exceptional stability and generalizability across all cross-validation folds. Quantitative results demonstrated a Precision of 90.04%, an average Recall of 89.08%, and an mAP of 91.48%. Notably, low standard deviations of 0.46%, 0.58%, and 0.54% for Precision, Recall, and mAP, respectively, confirmed minimal performance variability under heterogeneous data conditions. While inter-fold variations were observed, such as the maximum mAP of 92.1% in Fold 2 and the minimum mAP of 90.7% in Fold 3, the overall metric fluctuations remained marginal, underscoring the model’s robustness to dataset shifts. Further analysis revealed the model’s proficiency in handling challenging agricultural scenarios, including occluded fruits, foliage obstructions, and variable illumination. Despite a strong overall performance, a slight reduction in Recall (88.2% in Fold 3) was noted under extreme occlusion or abrupt lighting changes, likely attributable to limited representation of such edge cases in the training corpus. These systematic validations not only substantiated the model’s detection accuracy and operational stability but also identified potential areas for enhancement in complex field environments.

4. Discussion

In this study, we developed YOLOv8-MSP-PD, a lightweight and efficient object detection model specifically tailored for detecting Jinxiu Malus fruit in complex orchard environments. The proposed model integrates three significant enhancements into the original YOLOv8 framework: the MobileNetV4 backbone, the SPPFCSPC multi-scale feature fusion module, and the MPD-IoU loss function. Experimental comparisons demonstrated clear advantages of our approach in terms of both accuracy and computational efficiency.
When compared with recent related studies, our model exhibited noteworthy advancements. Du et al. developed the DSW-YOLO model, achieving a mean average precision (mAP) of 94.6% for strawberry detection under relatively unobstructed conditions [10]. Similarly, Wu et al. enhanced YOLOv8 by incorporating the CBAM attention mechanism and CARAFE up-sampling, obtaining a high detection performance (mAP of 94.3%) for apples [11]. Although these approaches yielded high accuracy, their effectiveness in heavily occluded and densely clustered scenarios, such as those frequently encountered in Jinxiu Malus fruit detection, remains uncertain, particularly given their relatively high computational complexity. Sun et al. successfully reduced the model size to 3.51 MB through a combination of HS-FPAN feature fusion and knowledge distillation, achieving an 85.40% mAP for jujube detection, although this was comparatively lower in accuracy than our proposed model [12]. Yang et al. presented a MobileNet-MCA backbone enhanced with partial convolutions (PConv) and achieved a remarkable accuracy (97.65% mAP) within a compact model size of 3.58 MB [13]; however, its performance has not been validated under densely occluded orchard conditions. By contrast, the YOLOv8-MSP-PD model developed in our study achieves a highly competitive accuracy of 92.2% mAP while significantly reducing the model size to only 2.2 MB, surpassing conventional lightweight detection models such as YOLOv5, YOLOv8, SSD, and Faster R-CNN in terms of both detection accuracy and computational efficiency.
The effectiveness of our proposed improvements was further verified through systematic ablation experiments and rigorous 5-fold cross-validation. The ablation analysis indicated that integrating MobileNetV4, the SPPFCSPC module, and MPD-IoU loss significantly enhanced feature extraction efficiency and improved localization accuracy, albeit at the cost of increased computational requirements. To address this concern, we applied layer-adaptive magnitude pruning (LAMP) combined with knowledge distillation, successfully compressing the model parameters without compromising its detection accuracy. Consequently, the proposed YOLOv8-MSP-PD achieves an optimal balance between model compactness and performance, making it highly suitable for resource-constrained agricultural applications, particularly in real-time edge computing scenarios.
Nevertheless, there remain several limitations that should be acknowledged. Due to experimental constraints, this study did not empirically evaluate key performance metrics such as real-world inference speed (frames per second, FPS), power consumption, and memory usage on actual edge computing devices (e.g., Nvidia Jetson or RTX 4050 GPUs). Although theoretical reductions in floating-point operations (FLOPs) and model size suggest promising results for practical edge deployment, empirical validation on actual hardware remains essential. Additionally, the pruning and knowledge distillation strategies implemented in this study might introduce the risk of underfitting or reduced accuracy, particularly when generalizing to orchards with more diverse or complex environmental conditions.
Looking ahead, we plan to deploy and rigorously test the YOLOv8-MSP-PD model on real edge-computing platforms to obtain empirical measurements of inference speed, energy efficiency, and memory usage under realistic operational conditions. Moreover, to thoroughly assess the model's generalization capability, we will extend validation beyond our current dataset, which was collected from a single cultivar in one orchard, to include multiple geographic regions, a variety of Malus cultivars, and different climatic conditions, thereby ensuring its robustness and broad applicability in diverse agricultural settings.

5. Conclusions

This study proposed YOLOv8-MSP-PD, a lightweight object detection model designed for accurate detection of Jinxiu Malus fruits in complex orchard environments. To address challenges such as fruit overlap, occlusion, and illumination variation, the model balances detection performance and computational efficiency through several architectural innovations. Specifically, the MobileNetV4 backbone enhanced feature extraction efficiency by incorporating unified inverted bottleneck modules and optimized depth-wise separable convolutions. The SPPFCSPC module improved the representation of fruit boundaries and fine details through multi-scale feature fusion and cross-stage partial connections. The MPD-IoU loss function further refined localization accuracy. Additionally, LAMP pruning combined with knowledge distillation significantly reduced the number of model parameters and storage requirements. Experimental results on the custom Jinxiu Malus dataset demonstrated that YOLOv8-MSP-PD achieved a mean average precision (mAP) of 92.2%, representing a 1.6% improvement over the original YOLOv8. Moreover, the model's floating-point operations, parameter count, and storage size were reduced by 59.9%, 60.0%, and 65.1%, respectively. Under challenging field conditions, such as overlapping fruits, occluding foliage, and dynamic lighting, the model consistently outperformed mainstream detectors, including Faster R-CNN and SSD, in both detection accuracy and operational efficiency. This work provides a practical solution for intelligent fruit monitoring in orchard scenarios and offers a technical reference for precision monitoring of other crops. Future work will focus on extending the model's adaptability to diverse cultivars and environmental conditions while optimizing real-time performance for deployment on edge-computing platforms.

Author Contributions

Conceptualization and writing—original draft preparation, Y.L. and X.H.; methodology and investigation, Y.L., H.Z., and S.L.; writing—review and editing, X.H., Y.W., and J.W.; visualization and validation, Y.Y., L.S., and L.J.; funding acquisition, H.Z. and J.W.; supervision and project administration, J.W., Y.W., H.Z., and W.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was financially supported by the project grant of the Shandong Province Key R&D Plan (2022CXGC020701, 2023TZXD061), the China Agriculture Research System (CARS-27), the Shandong Province “University Youth Innovation Team” Program (2023KJ160), and the Postdoctoral Innovation Project of Shandong Province (SDCX-ZG-202503062). Special thanks to all the individuals and organizations that provided data support for this article.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Acknowledgments

We gratefully acknowledge the editors and anonymous reviewers for their constructive comments, which we have carefully taken into account during revision.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zeng, X.; Li, H.; Jiang, W.; Li, Q.; Xi, Y.; Wang, X.; Li, J. Phytochemical compositions, health-promoting properties and food applications of crabapples: A review. Food Chem. 2022, 386, 132789. [Google Scholar] [CrossRef] [PubMed]
  2. Han, M.; Li, G.; Liu, X.; Li, A.; Mao, P.; Liu, P.; Li, H. Phenolic profile, antioxidant activity and anti-proliferative activity of crabapple fruits. Hortic. Plant J. 2019, 5, 155–163. [Google Scholar] [CrossRef]
  3. Lin, G.; Tang, Y.; Zou, X.; Li, J.; Xiong, J. In-field citrus detection and localization based on RGB-D image analysis. Biosyst. Eng. 2019, 186, 34–44. [Google Scholar] [CrossRef]
  4. Lin, G.; Tang, Y.; Zou, X.; Xiong, J.; Fang, Y. Color-, depth-, and shape-based three-dimensional fruit detection. Precis. Agric. 2020, 21, 1–17. [Google Scholar] [CrossRef]
  5. Liu, X.; Zhao, D.; Jia, W.; Ji, W.; Sun, Y. A detection method for apple fruits based on color and shape features. IEEE Access 2019, 7, 67923–67933. [Google Scholar] [CrossRef]
  6. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Computer Vision—ECCV 2016, Proceedings of the 14th European Conference on Computer Vision (ECCV 2016), Part I, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  7. Tu, S.; Pang, J.; Liu, H.; Zhuang, N.; Chen, Y.; Zheng, C.; Wan, H.; Xue, Y. Passion fruit detection and counting based on multi-scale Faster R-CNN using RGB-D images. Precis. Agric. 2020, 21, 1072–1091. [Google Scholar] [CrossRef]
  8. Wan, S.; Goudos, S. Faster R-CNN for multi-class fruit detection using a robotic vision system. Comput. Netw. 2020, 168, 107036. [Google Scholar] [CrossRef]
  9. Tang, Y.; Qiu, J.; Zhang, Y.; Wu, D.; Cao, Y.; Zhao, K.; Zhu, L. Optimization strategies of fruit detection to overcome the challenge of unstructured background in field orchard environment: A review. Precis. Agric. 2023, 24, 1183–1219. [Google Scholar] [CrossRef]
  10. Du, X.; Cheng, H.; Ma, Z.; Lu, W.; Wang, M.; Meng, Z.; Jiang, C.; Hong, F. DSW-YOLO: A detection method for ground-planted strawberry fruits under different occlusion levels. Comput. Electron. Agric. 2023, 214, 108304. [Google Scholar] [CrossRef]
  11. Wu, H.; Mo, X.; Wen, S.; Wu, K.; Ye, Y.; Wang, Y.; Zhang, Y. DNE-YOLO: A method for apple fruit detection in diverse natural environments. J. King Saud Univ.—Comput. Inf. Sci. 2024, 36, 102220. [Google Scholar] [CrossRef]
  12. Sun, H.; Ren, R.; Zhang, S.; Tan, C.; Jing, J. Maturity detection of ‘Huping’ jujube fruits in natural environment using YOLO-FHLD. Smart Agric. Technol. 2024, 9, 100670. [Google Scholar] [CrossRef]
  13. Yang, H.; Yang, L.; Wu, T.; Yuan, Y.; Li, J.; Li, P. MFD-YOLO: A fast and lightweight model for strawberry growth state detection. Comput. Electron. Agric. 2025, 234, 110177. [Google Scholar] [CrossRef]
  14. Qin, D.; Leichner, C.; Delakis, M.; Fornoni, M.; Luo, S.; Yang, F.; Wang, W.; Banbury, C.; Ye, C.; Akin, B.; et al. MobileNetV4: Universal models for the mobile ecosystem. In Computer Vision—ECCV 2024, 18th European Conference, Milan, Italy, 29 September–4 October 2024, Proceedings, Part XL; Springer Nature: Cham, Switzerland, 2024; pp. 78–96. [Google Scholar]
  15. Zhang, H.; Li, G.; Wan, D.; Wang, Z.; Dong, J.; Lin, S.; Deng, L.; Liu, H. DS-YOLO: A dense small object detection algorithm based on inverted bottleneck and multi-scale fusion network. Biomim. Intell. Robot. 2024, 4, 100190. [Google Scholar] [CrossRef]
  16. Shi, J.; Sun, D.; Kieu, M.; Guo, B.; Gao, M. An enhanced detector for vulnerable road users using infrastructure-sensors-enabled device. Sensors 2023, 24, 59. [Google Scholar] [CrossRef] [PubMed]
  17. Yan, J.; Zhou, Z.; Zhou, D.; Su, B.; Xuanyuan, Z.; Tang, J.; Lai, Y.; Chen, J.; Liang, W. Underwater object detection algorithm based on attention mechanism and cross-stage partial fast spatial pyramidal pooling. Front. Mar. Sci. 2022, 9, 1056300. [Google Scholar] [CrossRef]
  18. Xu, K.; Wang, Z.; Geng, X.; Wu, M.; Li, X.; Lin, W. Efficient joint optimization of layer-adaptive weight pruning in deep neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2023), Paris, France, 1–8 October 2023; pp. 17447–17457. [Google Scholar]
  19. Wang, D.; He, D. Channel-pruned YOLOv5s-based deep learning approach for rapid and accurate apple fruitlet detection before fruit thinning. Biosyst. Eng. 2021, 210, 271–281. [Google Scholar] [CrossRef]
  20. Nie, Y.; Lai, H.; Gao, G. DSOD-YOLO: A lightweight dual feature extraction method for small target detection. Digit. Signal Process. 2025, 164, 105268. [Google Scholar] [CrossRef]
  21. Liao, H.; Wang, G.; Jin, S.; Liu, Y.; Sun, W.; Yang, S.; Wang, L. HCRP-YOLO: A lightweight algorithm for potato defect detection. Smart Agric. Technol. 2025, 10, 100849. [Google Scholar] [CrossRef]
  22. Yang, J.; Zhang, G.; Ge, Y.; Shi, J.; Wang, Y.; Li, J. Multi-scale plastic lunch box surface defect detection based on dynamic convolution. IEEE Access 2024, 12, 120064–120076. [Google Scholar] [CrossRef]
  23. Wang, Y.; Zhang, Y.; Zhang, L.; Wan, Y.; Chen, Z.; Xu, Y.; Cao, R.; Zhao, L.; Yang, Y.; Yu, X. A feature enhancement network based on image partitioning in a multi-branch encoder–decoder architecture. Knowl.-Based Syst. 2025, 311, 113120. [Google Scholar] [CrossRef]
Figure 1. Images of Jinxiu Malus fruits under different conditions.
Figure 2. Overview of the YOLOv8-MSP network architecture, showing the Backbone, Neck, and Head modules. Color-coded blocks denote specific operations: gray blocks represent standard convolution (Conv2d–BatchNorm2d–SiLU); light-yellow blocks are C2f modules (Conv–Split–Bottleneck×2–Concat–Conv); green blocks indicate the Split operation within C2f; sky-blue blocks denote feature concatenation; pale-green blocks are nearest-neighbor up-sampling; purple blocks are Unified Inverted Bottleneck (UIB) modules; and dark-blue blocks represent the SPPFCSPC pooling block.
Figure 3. Network architecture of MobileNetV4.
Figure 4. The structure of SPPF and SPPFCSPC.
Figure 5. Knowledge distillation flowchart.
Figure 6. The mAP change curves of different backbone models.
Figure 7. Comparison of each channel before and after pruning.
Figure 8. Comparison of detection effects of seven models. Yellow circles indicate instances where the model failed to detect an existing fruit; blue rectangles indicate cases where a non-fruit region was incorrectly identified as fruit.
Figure 9. A 5-fold cross-validation schematic.
Table 1. Detection results with different backbone networks.

| Model | P/% | R/% | mAP/% | FLOPs/G | Parameters | Model Size/MB |
|---|---|---|---|---|---|---|
| YOLOv8 | 90.9 | 88.7 | 90.6 | 9.4 | 3.0 × 10⁶ | 6.3 |
| YOLOv8-EfficientFormerV2 | 88.7 | 87.2 | 89.5 | 9.4 | 4.0 × 10⁶ | 8.8 |
| YOLOv8-FasterNet | 86.3 | 84.5 | 87.8 | 8.2 | 3.5 × 10⁶ | 7.2 |
| YOLOv8-StarNet | 89.1 | 86.7 | 89.0 | 18.5 | 6.0 × 10⁶ | 10.8 |
| YOLOv8-MobileNetV3 | 88.2 | 86.8 | 89.3 | 12.5 | 5.5 × 10⁶ | 10.5 |
| YOLOv8-MobileNetV4 | 91.5 | 89.1 | 91.2 | 22.5 | 5.0 × 10⁶ | 11.7 |
Table 2. The ablation experiments of the MobileNetV4 module, SPPFCSPC module, and MPD-IoU loss function.

| MobileNetV4 | SPPFCSPC | MPD-IoU | P/% | R/% | mAP/% | FLOPs/G | Parameters | Model Size/MB |
|---|---|---|---|---|---|---|---|---|
| ✗ | ✗ | ✗ | 90.9 | 88.7 | 90.6 | 9.4 | 3.0 × 10⁶ | 6.3 |
| ✔ | ✗ | ✗ | 91.5 | 89.1 | 91.2 | 22.5 | 5.0 × 10⁶ | 11.7 |
| ✗ | ✔ | ✗ | 85.9 | 85.1 | 90.9 | 10.1 | 4.6 × 10⁶ | 9.5 |
| ✗ | ✗ | ✔ | 91.3 | 89.0 | 93.1 | 9.4 | 3.0 × 10⁶ | 6.3 |
| ✔ | ✔ | ✗ | 90.7 | 87.8 | 92.2 | 23.2 | 6.5 × 10⁶ | 13.4 |
| ✗ | ✔ | ✔ | 92.4 | 89.1 | 91.5 | 10.1 | 4.6 × 10⁶ | 9.5 |
| ✔ | ✗ | ✔ | 91.1 | 84.1 | 91.7 | 22.5 | 5.0 × 10⁶ | 11.7 |
| ✔ | ✔ | ✔ | 92.9 | 91.2 | 93.5 | 23.2 | 6.5 × 10⁶ | 13.4 |

Note: “✗” indicates that the improvement strategy was not adopted; “✔” indicates that the improvement strategy was adopted.
Table 3. Results of pruning and knowledge distillation.

| Model | P/% | R/% | mAP/% | FLOPs/G | Parameters | Model Size/MB |
|---|---|---|---|---|---|---|
| Original | 92.9 | 91.2 | 93.5 | 23.2 | 6.5 × 10⁶ | 13.4 |
| Speed_up 1.5 | 91.2 | 89.1 | 90.8 | 14.2 | 2.3 × 10⁶ | 5.0 |
| Speed_up 2.0 | 91.8 | 90.3 | 91.0 | 9.3 | 1.2 × 10⁶ | 2.9 |
| Speed_up 2.5 | 90.1 | 89.5 | 90.6 | 7.9 | 0.9 × 10⁶ | 2.2 |
| Speed_up 3.0 | 87.6 | 85.6 | 86.2 | 7.1 | 0.7 × 10⁶ | 1.9 |
| Knowledge Distillation | 92.5 | 90.8 | 92.2 | 7.9 | 0.9 × 10⁶ | 2.2 |
Table 4. Comparison of detection performance of different networks.

| Model | P/% | R/% | mAP/% | FLOPs/G | Parameters | Model Size/MB |
|---|---|---|---|---|---|---|
| Faster R-CNN | 78.8 | 71.4 | 82.6 | 23.2 | 6.5 × 10⁷ | 260.0 |
| SSD | 88.6 | 80.1 | 79.3 | 14.2 | 2.3 × 10⁷ | 92.0 |
| YOLOv5 | 90.1 | 88.2 | 90.3 | 12.3 | 1.2 × 10⁶ | 4.8 |
| YOLOv8 | 90.9 | 88.7 | 90.6 | 9.4 | 3.0 × 10⁶ | 6.3 |
| YOLOv8-MSP-PD | 92.5 | 90.8 | 92.2 | 7.9 | 1.2 × 10⁶ | 2.2 |
| YOLOv9 | 91.8 | 89.5 | 91.7 | 102.4 | 2.5 × 10⁷ | 48.0 |
| DETR | 92.1 | 89.2 | 92.5 | 280.7 | 6.7 × 10⁷ | 130.0 |
| RTMDet | 89.3 | 87.6 | 89.9 | 28.6 | 1.1 × 10⁷ | 22.1 |
Table 5. Comparison of 5-fold cross-validation results.

| Validation Fold | P/% | R/% | mAP/% |
|---|---|---|---|
| 1st iteration | 89.8 | 88.9 | 91.3 |
| 2nd iteration | 90.6 | 89.7 | 92.1 |
| 3rd iteration | 89.4 | 88.2 | 90.7 |
| 4th iteration | 90.3 | 89.4 | 91.8 |
| 5th iteration | 90.1 | 89.2 | 91.5 |
| Mean ± Std | 90.04 ± 0.46 | 89.08 ± 0.58 | 91.48 ± 0.54 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
