YOLO-MCS: A Lightweight Loquat Object Detection Algorithm in Orchard Environments

Zhou, Wei; Gao, Leina; Sun, Fuchun; Bian, Yuechao

doi:10.3390/agriculture16020262

Open AccessArticle

YOLO-MCS: A Lightweight Loquat Object Detection Algorithm in Orchard Environments

¹

School of Mechanical Engineering, Chengdu University, Chengdu 610106, China

²

Entrepreneurship College, Chengdu University, Chengdu 610106, China

³

Institute for Advanced Study, Chengdu University, Chengdu 610106, China

^*

Author to whom correspondence should be addressed.

Agriculture 2026, 16(2), 262; https://doi.org/10.3390/agriculture16020262

Submission received: 19 October 2025 / Revised: 11 December 2025 / Accepted: 12 December 2025 / Published: 21 January 2026

(This article belongs to the Special Issue Cutting-Edge Technology in Agricultural Robotics: Sensing and Actuation)

Download

Browse Figures

Versions Notes

Abstract

To address the challenges faced by loquat detection algorithms in orchard settings—including complex backgrounds, severe branch and leaf occlusion, and inaccurate identification of densely clustered fruits—which lead to high computational complexity, insufficient real-time performance, and limited recognition accuracy, this study proposed a lightweight detection model based on the YOLO-MCS architecture. First, to address fruit occlusion by branches and leaves, the backbone network adopts the lightweight EfficientNet-b0 architecture. Leveraging its composite model scaling feature, this significantly reduces computational costs while balancing speed and accuracy. Second, to deal with inaccurate recognition of densely clustered fruits, the C2f module is enhanced. Spatial Channel Reconstruction Convolution (SCConv) optimizes and reconstructs the bottleneck structure of the C2f module, accelerating inference while improving the model’s multi-scale feature extraction capabilities. Finally, to overcome interference from complex natural backgrounds in loquat fruit detection, this study introduces the SimAm module during the initial detection phase. Its feature recalibration strategy enhances the model’s ability to focus on target regions. According to the experimental results, the improved YOLO-MCS model outperformed the original YOLOv8 model in terms of Precision (P) and mean Average Precision (mAP) by 1.3% and 2.2%, respectively. Additionally, the model reduced GFLOPs computation by 34.1% and Params by 43.3%. Furthermore, in tests under complex weather conditions and with interference factors such as leaf occlusion, branch occlusion, and fruit mutual occlusion, the YOLO-MCS model demonstrated significant robustness, achieving mAP of 89.9% in the loquat recognition task. The exceptional performance serves as a robust technical base on the development and research of intelligent systems for harvesting loquats.

Keywords:

loquat; YOLOv8; target recognition; deep learning; lightweight

1. Introduction

In intelligent mechanical harvesting operations, fruit recognition technology provides discriminative power for enhancing crop management efficiency and optimizing resource allocation [1,2]. Loquat (Eriobotrya japonica (Thunb.) Lindl.) is an important economic crop, is widely cultivated in regions such as Sichuan, Fujian, and Zhejiang Province. It not only holds a significant position in agricultural production but also plays a vital role in promoting regional economic development and increasing farmers’ income [3,4]. Traditional loquat recognition and harvesting techniques primarily rely on manual inspection and picking, which has obvious efficiency limitations. On the other hand, it is difficult to adapt to the requirements of modern agricultural large-scale production. Fruit loss caused by untimely harvesting further increases the overall production cost [5,6]. Additionally, loquat recognition often operates under natural lighting conditions. It also encounters problems like leaf obstruction and fruit overlap. Under such scenarios, traditional image analysis methods and machine learning-based recognition technologies face significant challenges, and it is difficult to achieve precise localization and identification of loquats. Against this backdrop, the development of intelligent recognition algorithms that are highly accurate, fast, and lightweight has become a key technological breakthrough in improving the automation level of loquat harvesting equipment.

Over the past decade, the field of agricultural computer vision has witnessed remarkable progress. Deep learning architectures are based on convolutional neural networks (CNNs). They have emerged as the mainstream technology for fruit detection tasks. Significant breakthroughs have been achieved in enhancing detection accuracy and system reliability, providing effective technical support for automated fruit harvesting [7,8,9,10]. However, existing methods still exhibit notable disparities and room for optimization in balancing lightweight design, detection speed, and recognition accuracy.

The YOLO-PEM model proposed by Jing et al. [11] integrates PConv operations and EMA attention mechanisms into its backbone network while employing the MPDIoU loss function. It demonstrates high computational efficiency in peach (Prunus persica (L.) Batsch) recognition tasks, though its lightweight convolutional structure may somewhat limit feature extraction capabilities. Deng et al. [12] constructed the YOLOv7-BiGS model, which achieves precise recognition of citrus (Citrus reticulata Blanco) targets by introducing BiFormer attention modules and GSConv convolutions. However, the model’s inference speed in practical deployment has not been fully evaluated. Yu et al. [13] restructured the YOLOv5s backbone based on MobileNet, significantly reducing model complexity and boosting detection speed. However, its accuracy stability in complex orchard scenarios with severe occlusions remains to be enhanced. Lü et al. [14] embedded the MobileOne module, Coordinate Attention (CA) mechanism, and lightweight SPPFCSPC structure into YOLOv7, achieving 97.2% detection accuracy in grape (Vitis vinifera L.) recognition. However, the model’s overall parameter count and computational cost remain high, limiting its potential application on mobile harvesting equipment. Sun et al. [15] constructed the YOLO-P model by introducing a shuffle module, CBAM attention mechanism, and Hard-Swish activation function, achieving 97.6% mAP in pear (Pyrus pyrifolia (Burm.f.) Nakai.) detection. However, its insufficient lightweight design struggles to meet real-time operational demands. Liu et al. [16] developed the Faster-YOLO-AP model, utilizing PDWFasterNet and Deep-Weakly Separable Convolution (DWSConv) for lightweight optimization and acceleration. However, it did not sufficiently address maintaining recognition accuracy for apple (Malus domestica Borkh.) targets in complex environments.

In summary, the research by Jing et al. [11], Deng et al. [12] and Yu et al. [13] has advanced fruit recognition technology. They demonstrated advantages in detection accuracy, speed, or lightweight design individually, but none achieved a systematic balance among these three aspects. Particularly in complex orchard scenarios, existing models exhibit significant shortcomings in the coordinated optimization of real-time performance and robustness. Loquat harvesting operations demand detection models that simultaneously exhibit high accuracy, fast speed, and lightweight characteristics, but there are currently relatively few studies on this type of fruit. However, challenges such as overlapping branches and leaves, along with fruit occlusion in natural growing environments, further increase recognition difficulty. To address these challenges, this study proposed a lightweight detection model based on YOLO-MCS, aiming to better balance accuracy, speed, and model complexity. Through structural optimization and modular innovation, the model significantly reduces computational costs while maintaining high detection performance, thus meeting the real-time operational demands of loquat harvesting machinery in complex environments. This research not only provides an efficient and reliable solution for loquat recognition but also contributes to enhancing agricultural productivity and advancing the intelligent and modern development of the fruit tree industry. Here, “MCS” represents the initial letters of the model’s three core modules: “M” originates from the MBConv core component of EfficientNet-b0, “C” denotes the C2f_SCConv module, and “S” stands for the SimAm attention mechanism.

This study’s core research contents are summarized as follows.

(1): Lightweight Backbone Network: Replacing the original main network with EfficientNet-b0 addresses fruit occlusion by branches and leaves while reducing model parameters and computational load, balancing detection speed and accuracy.
(2): Feature Extraction Optimization: Replacing the bottleneck structure in the C2f module with SCConv modules enhances the model’s multi-scale feature extraction capability for densely clustered fruits by reducing spatial and channel redundancy.
(3): Feature Focus Enhancement: Introduces the SimAm attention mechanism in the neck. Through feature recalibration strategies, it strengthens the model’s ability to focus on loquat targets within complex backgrounds.

2. Material Preparation and Experimental Process

This section will first detail the experimental hardware configuration, software environment, and parameter settings to ensure the reproducibility of the experimental process. Subsequently, the evaluation criteria and relevant metrics for algorithm assessment are presented. Finally, the material preparation and complete experimental workflow required for this study are elaborated on, including the acquisition specifications of the loquat image dataset used in the experiments, as well as the technical pathways and experimental methods of the YOLOv8 (Ultralytics v8.3.139) baseline and the improved lightweight YOLO-MCS algorithm employed for analyzing and processing the dataset.

2.1. Experimental Platform and Parameter Settings

This study was conducted in a deep learning environment with specific hardware and software configurations. The hardware includes 32 GB memory, Intel Core i5-14600KF CPU, and NVIDIA RTX4060 GPU. The software consists of Windows 10 operating system, Python 3.9, PyTorch 2.3.0, and CUDA 12.8. The experimental parameters are shown in Table 1. All comparative experiments were conducted under identical training conditions. These conditions include consistent dataset splits, training epochs, and learning rate scheduling strategies. The optimizers and their hyperparameters (e.g., learning rate, weight decay) were also kept consistent across all experiments. This controlled variable approach ensures that any observed performance differences objectively reflect the inherent effectiveness of each model architecture, thereby guaranteeing the reliability of the comparative results.

2.2. Evaluation Criteria

To comprehensively assess the YOLO-MCS model’s performance in loquat orchard detection tasks, this study developed a framework balancing detection precision and computational efficiency. All experimental models were validated using five core metrics: Precision (P), Recall (R), mean Average Precision (mAP), Giga Floating-Point Operations per Second (GFLOPs), and Number of Parameters (Params) [17]. The first three accuracy-focused indicators evaluate the model’s detection capability. Specifically, they measure the model’s ability to recognize loquat targets and precisely localize them in complex natural orchard environments. GFLOPs and Params serve as benchmarks for lightweight design and computational efficiency. They quantify computational overhead and structural scale, reflecting the model’s deployment viability on resource-constrained embedded devices of orchard robots. Collectively, these five metrics constitute a balanced and practical assessment system for this study, enabling the analysis of the YOLO-MCS model to concurrently account for detection effectiveness, lightweight properties, and real-time applicability. The corresponding calculation formulas about P, R, and mAP are presented in Equations (1)–(4).

P = \frac{T P}{T P + F P}

(1)

R = \frac{T P}{T P + F N}

(2)

A P = \int_{0}^{1} P (R) d R

(3)

m A P = \frac{1}{X} \sum_{i = 1}^{X} A P_{i} \times 100 %

(4)

In the formulas: TP denotes the count of correctly identified loquat targets; FP is the number of negative classes identified as positive ones; FN is the count of positive classes identified as negative ones; AP represents the area under the P-R curve; X stands for the total number of samples within the dataset; AP_i denotes the average precision for the i-th class; mAP denotes the mean Average Precision at an IoU threshold of 0.5, where all detections with intersection-over-union ratios exceeding this threshold are counted as correct predictions, making it suitable for rapid performance evaluation under relatively lenient matching criteria.

2.3. Loquat Image Collection

This study’s loquat dataset was collected at the Loquat Industrial Park in Longquanyi District, Chengdu City, Sichuan Province. The collection period was from 5 April to 1 May 2025, between 10:00 and 18:00. Images were captured using an iPhone 13 Pro Max (Apple Inc., Cupertino, CA, USA), with a resolution of 4032 × 3024 pixels and saved in JPG format. The loquat industrial park employs dwarf loquat trees with fixed-distance and high-density planting structures, which facilitate the harvesting of loquat fruits. Natural scenes feature complex background noise and severe mutual occlusion between loquat targets. To address this, various shooting conditions were considered during image capture, including different angles, distances (10–100 cm), light intensities (front lighting and back lighting), and occlusion degrees (branch occlusion, fruit mutual occlusion, leaf occlusion). This significantly improved the representational diversity of the dataset and effectively enhanced the model’s generalization performance. Images with high blur and similarity, or incorrect capture were removed from the dataset, resulting in 325 high-quality original loquat fruit images. The loquat image dataset is shown in Figure 1.

2.4. Dataset Creation

The image annotation for this study was performed independently by a single researcher. This approach eliminated inconsistencies between annotators at the source; furthermore, all annotations underwent unified review to ensure the accuracy of the benchmark dataset. The LabelImg (v1.8.6) software was used to manually annotate the images of loquat fruits collected under natural conditions. The category of the bounding box attribute was set as “Loquat”, and the txt label file containing the information of the bounding box position was obtained. During the data collection phase, environmental noise, lighting changes, and other factors caused interference. To address this, this study systematically expanded the original loquat dataset. Diverse data augmentation strategies were introduced to enable the model to fully learn the multidimensional feature representations of loquat fruits in complex natural scenes. This significantly improved the model’s generalization performance across different environments. Specifically, this study employed five data augmentation methods: image sharpening; horizontal and vertical flipping; random rotation (0–360 degrees); Gaussian noise injection; and dynamic brightness adjustment. After data augmentation, the dataset was expanded to 1950 high-quality images. The completed dataset was randomly divided into training, validation, and test sets in an 8:1:1 ratio, resulting in 1560 images for the training set, 195 images for the validation set, and 195 images for the test set, as shown in Table 2. Figure 2 shows typical samples processed using different augmentation methods.

2.5. YOLOv8n Model

The YOLOv8 object detection framework includes five models (n, s, m, l, and x), each customized to meet distinct application needs [18]. These models exhibit an increasing trend in network depth and complexity, with corresponding improvements in detection precision. Among them, the YOLOv8n model achieves an optimal balance between parameter count and detection precision. Based on these characteristics, the present investigation incorporated YOLOv8n as the base model for the loquat fruit recognition after comprehensive consideration. As shown in Figure 3, YOLOv8n adopts a streamlined and efficient network architecture, significantly reducing computational intricacy while maintaining detection capability.

The YOLOv8n network architecture adopts a four-module design, comprising a data input layer, a feature extraction backbone, a feature fusion neck, and a task decoupling head [19]. The feature extraction backbone is composed of five standard convolutional layers, four C2f modules, and one multi-scale pooling structure. The C2f modules are optimized based on the ELAN architecture of YOLOv7. They increase the number of cross-layer connection branches to significantly improve the efficiency of gradient information flow. This constructs feature learning units with stronger representational capabilities [20]. The multi-scale pooling structure adopts a cascaded spatial pyramid pooling strategy to effectively integrate feature information from different receptive fields. The feature fusion neck employs a Path Aggregation Network (PAN) to facilitate multi-level feature interaction, significantly promoting the model’s detection capability for multi-scale objects [21]. The task processing head adopts a decoupled structure, featuring distinct branches for classification and regression. This design is compared to traditional coupled structures. It achieves simultaneous improvements in detection precision and efficiency through task-specific processing mechanisms.

2.6. YOLO-MCS Model

The YOLOv8n algorithm combines high detection precision with a lightweight model. However, its detection performance still has certain limitations in orchard scenarios. These limitations are particularly evident in complex situations, such as dense loquat distribution, mutual fruit occlusion, and overlap with branches. This study entails the implementation of specific optimizations within the YOLOv8n framework to improve the algorithm’s efficacy in detecting loquat targets within orchard environments. Three specific optimizations were implemented for the YOLOv8n framework. Firstly, the initial backbone network was replaced with the lightweight EfficientNet-b0 architecture, which significantly reduces model complexity. Secondly, SCConv convolutional layers were embedded in the feature extraction module to suppress redundant features via a feature re-calibration mechanism. Finally, the SimAm module was integrated into the neck network to enhance the response to key features. The improved YOLO-MCS model maintains real-time performance while effectively improving detection precision and robustness. Its overall architecture is shown in Figure 4.

2.6.1. EfficientNet-b0 Feature Extraction Network

Terminal devices for loquat recognition have constrained computational capabilities. Thus, it is imperative to evaluate both recognition precision and overall model performance. The key requirement is to balance these two factors. Efficient networks such as GhostNet [22], ShuffleNetV2 [23], and MobilenetV3 [24] achieve model compression through structural optimization and employ mechanisms of reuse and reorganization to enhance their ability to represent high-dimensional nonlinear features, thereby significantly improving the computational efficiency of the algorithm. In comparison, the EfficientNet-b0 model stands out for its small parameter count and high recognition precision [25]. This advantage primarily stems from the composite scaling method of the EfficientNet-b0 model. The model uses a composite scaling coefficient (φ) to simultaneously optimize three dimensions: network depth, width, and input image resolution. The composite scaling formula for the EfficientNet-b0 model is shown in Equation (5).

d = α^{φ}, w = β^{φ}, r = γ^{φ}, s . t . \{α \cdot β^{2} \cdot γ^{2} \approx 2, α \geq 1, β \geq 1, γ \geq 1}

(5)

In this architectural configuration, parameters d, w, and r denote the scaling coefficients for network depth, channel width, and input resolution, respectively. Hyperparameters α, β, and γ are optimized via neural architecture search. The EfficientNet-b0 model mainly consists of convolutional layers, mobile inverted bottleneck convolutions (MBConv), pooling layers, and fully connected layers [26], with its network structure outlined in Table 3.

According to the analysis of the architecture parameters in Table 1, the central element of the EfficientNet-b0 model is the MBConv, whose detailed structure is shown in Figure 5. The MBConv module initially employs convolution on the input feature map to modify its channel dimension. Subsequently, depthwise convolution is utilized to perform spatial filtering on the features with reduced dimensions, significantly decreasing the number of model parameters. Thereafter, the channel attention mechanism, known as the SE module, is integrated to enhance the response strength of the loquat’s key features through feature re-calibration [27]. Finally, the feature representation undergoes dimensional restoration through pointwise convolution, followed by application of dead connections and skip connections.

2.6.2. C2f_SCConv Convolution Module

This research incorporates a spatial recombination convolution (SCConv) module [28] into the head. The bottleneck unit of the standard C2f architecture is substituted to achieve this. This enhancement aims to improve the detection model’s representation capabilities. It effectively reduces information redundancy during feature extraction through feature decoupling and channel recombination techniques, significantly boosting the model’s recognition precision. Specific improvements to the module are shown in Figure 6.

The SCConv module jointly optimizes spatial and channel dimensions to significantly reduce feature redundancy, achieving lightweight design while maintaining model precision. Its specific architecture is shown in Figure 7.

The SCConv module adopts a dual-branch feature optimization architecture, whose core components include a spatial reconstruction unit (SRU) and a channel reconstruction unit (CRU). The SRU extracts importance coefficients of feature maps through group normalization. It separates useful and redundant feature maps, then performs cross-reconstruction on these features. Finally, it concatenates the reconstructed features to generate spatially optimized feature maps. The SRU effectively minimizes redundancy in the spatial dimension while boosting the expressive capability of features. Meanwhile, the CRU separates the input features by channel. It uses group-wise convolution (GWC) to capture high-level information and point-wise convolution (PWC) to obtain detailed information. Next, it employs global average pooling (GAP) to generate channel descriptors and calculates weights via SoftMax. Finally, it fuses weighted information from different channels to generate channel-optimized features. The CRU effectively reduces channel redundancy, enhances feature representation capabilities, lowers computational costs, and improves model inference speed through its “segmentation-transformation-fusion” approach.

2.6.3. SimAm Attention Mechanism Module

SimAm is an unparameterized attention module that innovatively achieves collaborative optimization of spatial and channel attention. It creates a three-dimensional weight distribution model, assigning distinct saliency weights to each neuron in the feature map. This significantly enhances feature representation capability without introducing extra parameters [29]. The module’s detailed architecture is shown in Figure 8.

In SimAm, an energy function et characterizes each neuron, whose definition is shown in Equations (6)–(10).

e_{t} (w_{t}, b_{t}, y, x_{i}) = {(y_{t} - \hat{t})}^{2} + \frac{\sum_{i = 1}^{M - 1} {(y_{0} - {\hat{x}}_{i})}^{2} + λ w_{t}^{2}}{M - 1}

(6)

w_{t} = - \frac{2 (t - μ_{t})}{{(t - μ_{t})}^{2} + 2 σ_{t}^{2} + 2 λ}

(7)

b_{t} = - \frac{1}{2} (t + μ_{t}) w_{t}

(8)

σ_{t}^{2} = \frac{1}{N} \sum_{i = 1}^{N} {(x_{i} - μ_{t})}^{2}

(9)

μ_{t} = \frac{1}{N} \sum_{i = 1}^{N} x_{i}

(10)

Among these, w_t and b_t represent the weights and bias terms; y denotes the expected output of the target neuron; x_i refers to the input features from other neurons; y_t and y₀ represent the output of the target neuron and the expected output of other neurons in the input features, respectively;

\hat{t}

and

x_{i}

denote the linear transformations of the input features t and x_i on the same channel for the target neuron and the i-th other neuron with respect to w_t and b_t; N denotes the number of energy functions; M denotes the total number of neurons in the channel; μ_t and

σ_{t}^{2}

represent the mean activation and dispersion measure, respectively, across the channel containing the target neuron; λ is the regularization coefficient.

This study chose to integrate the SimAm into the neck network of YOLOv8n, primarily based on the following three considerations.

(1): The loquat fruit dataset faces challenges like dense occlusions, small target sizes, and complex environments. Incorporating the attention mechanism helps minimize interference from irrelevant factors. It also retains the key feature information of detected targets and significantly reduces both false negative and false positive rates.
(2): SimAm is a highly efficient, unparameterized attention module that seamlessly integrates with various architectures. It can be widely applied to different convolutional neural network designs. This significantly enhances the model’s overall performance.
(3): In the YOLOv8n network, the Neck structure is a pivotal component for feature processing. It is located between the backbone network and the output layer, with the function of integrating target features extracted by the backbone. Integrating the SimAm attention module into the Neck structure effectively reduces background noise. It also enhances multi-scale feature fusion and improves detection performance for small objects [30].

3. Experimental Results and Analysis

To systematically evaluate the YOLO-MCS model’s effectiveness, lightweight properties, and practical applicability in orchard scenarios, this study adopts a hierarchical experimental design. The design progresses from single-factor validation to comprehensive performance assessment, consisting of four components.

(1): Comparative analysis of different attention mechanisms: To validate SimAm’s suitability for loquat detection tasks.
(2): Comparison of multiple backbone networks: To justify the adoption of EfficientNet-b0 as a lightweight feature extractor.
(3): Ablation studies: To examine the independent and synergistic contributions of EfficientNet-b0, SCConv, and SimAm.
(4): Performance comparison with multiple YOLO versions: To assess the YOLO-MCS model’s competitiveness.

3.1. Contrastive Investigation of Attention Mechanisms

This research assessed the practical effectiveness of the SimAm attention mechanism. It compared SimAm’s performance with five other attention mechanisms: CBAM [31], Efficient Channel Attention (ECA) [32], Global Attention (GAM) [33], Squeeze-and-Excitation (SE) [34], and Leverage Separable Kernel Attention (LSKA) [35]. The experiment adopted the control variable method. Only the attention module was sequentially embedded in the neck network of the YOLOv8n model, while other network structural parameters remained unchanged.

Detailed experimental comparison results are shown in Table 4. SimAm achieved the most significant improvement in model metrics among all attention mechanisms. Precision and mAP increased by 1.3% and 2.2%, respectively. Meanwhile, GFLOPs and Params decreased by 34.1% and 43.3%, respectively. These results indicate that the SimAm attention mechanism is most suitable for this research.

3.2. Comparison Experiment Between Different Backbone Networks

This experiment first compared the performance parameters of YOLOv8 with different backbone networks. To further verify the practical effectiveness, it then intuitively compared the actual detection effects of the models equipped with these backbones on three raw images.

3.2.1. Performance Comparison Across Different Backbone Networks

This study comprehensively evaluates the impact of various lightweight backbone architectures. It focuses on two key aspects: object detection performance and computational efficiency. The evaluation is conducted through systematic comparative experiments. Based on multiple control experiments, the study further analyzes the structural adaptability of these backbones and their potential optimization space. Detailed performance comparison data are shown in Table 5.

MobileNetv3 [36] adopts a reverse residual structure and linear bottleneck mechanism. Its design balances strong feature extraction capabilities with low computational cost. Experimental results show that integrating it into the backbone network effectively reduces model size and parameter count. However, detection accuracy declines significantly. This suggests MobileNetv3 lacks sufficient representational power in complex orchard scenarios. ShuffleNetv2 shows a similar trend. Its target recognition accuracy decreases, failing to achieve effective model optimization. GhostNetv2’s bottleneck structure incorporates DFC attention mechanisms to enhance intermediate layer feature expression. Theoretically, it is suitable for building efficient lightweight backbones. However, experimental results reveal a trade-off: while it compresses GFLOPs, detection accuracy drops simultaneously. This indicates that feature compression strategies weaken the critical spatial details required for object detection. Notably, when GhostNetv2 is combined with this study’s proposed “feature refinement-focusing” system (comprising SCConv and SimAm), its inherent feature compression may interact with SCConv’s redundancy elimination mechanism. This interaction produces a cumulative effect, further degrading detail information crucial for bounding box regression. As a result, SimAm struggles to fully compensate, leading to a decline in mAP performance. This phenomenon reflects a structural mismatch between GhostNetv2 and detection-oriented feature enhancement modules.

Comprehensive comparisons reveal that EfficientNet-b0 exhibits superior structural adaptability in this study’s application scenarios. It constructs a hierarchical multi-scale feature pyramid through a composite scaling strategy and MBConv modules. This feature pyramid synergizes effectively with SCConv and SimAm. Natural orchard environments are characterized by dense occlusions and significant lighting variations. The synergy between the feature pyramid and the two modules improves bounding box localization accuracy and object discrimination capabilities in such scenarios. Ultimately, the optimized EfficientNet-b0 backbone network used in this study achieves dual improvements: Precision and mAP are both enhanced, while GFLOPs and Params are reduced. This demonstrates an efficient balance between lightweight architecture and performance enhancement.

3.2.2. Comparison of Loquat Detection Images Under Different Backbone Networks

To intuitively compare the performance differences among different backbone networks, three original loquat test images were selected (as shown in Figure 9a). After adopting different backbone networks, the detection and recognition results of loquats in the three original images are presented in Figure 9b–f, respectively.

White circles in the figures indicate undetected loquat targets, while blue circles represent cases where the model misclassifies multiple connected loquat targets as a single one. Mutual occlusion between loquat fruits causes their boundaries to blur, making it easy for the model to misjudge multiple overlapping fruits as a single target during detection. Figure 9b uses the YOLOv8n baseline. The blue circles in the middle and right images indicate misdetections where multiple loquats are detected as one. Figure 9c adopts the MobileNetv3 backbone network: white circles in the left and middle images represent unrecognized loquat targets, and the blue circle in the right image indicates a misdetection. Figure 9d uses ShuffleNetv2, with missed detections in the left and middle images and a misdetection in the right image. Figure 9e adopts GhostNetv2, with misdetections in the middle and right images. Figure 9f uses EfficientNet-b0, and no missed detections or misdetections are observed in the left, middle, or right images. This indicates that the improved backbone network based on EfficientNet-b0 performs better in handling occlusion between loquats.

3.3. Ablation Experiment

To evaluate the actual impact of each enhanced module and their combinations on the model’s overall performance, this study conducted ablation experiments. The original YOLOv8n model was adopted as the baseline (shown in Experiment 1 of Table 6). Three key components were first individually integrated into YOLOv8n: the lightweight EfficientNet-b0 network (Experiment 2 of Table 6), the SimAm attention mechanism (Experiment 3 of Table 6), and the SCConv convolution module (Experiment 4 of Table 6). Subsequently, EfficientNet-b0 and SimAm were combined and added to the model. Finally, the performance parameters were examined after all three key components were integrated. As shown in Experiment 6 of Table 6, the YOLO-MCS model, improved from the original YOLOv8n, achieved enhancements in both Precision (P) and mean Average Precision (mAP), along with significant reductions in GFLOPs and Params.

In Experiment 2 of Table 6, the main feature extraction network of YOLOv8 was replaced with the lightweight EfficientNet-b0 module. While the Recall remained unchanged, the mean Average Precision (mAP) decreased by 4.5% and the Precision dropped by 11.1%, with GFLOPs and Params reduced by 2.5 G and 1.1 M, respectively. The core reason for EfficientNet-b0’s trade-off between computational efficiency and Precision/mAP metrics lies in the balance between model complexity and feature extraction capability: EfficientNet-b0 significantly reduces computational load and parameter count through lightweight designs such as separable convolutions. However, its limited feature extraction capability struggles to fully capture the diverse characteristics of loquat fruits, leading to insufficient discrimination power for blurry targets and complex backgrounds, which ultimately manifests as decreased Precision and mAP values.

In Experiment 3 of Table 6, the SimAm attention mechanism was solely integrated into the YOLOv8 baseline network. Results indicate that this module can enhance the focusing effect on small targets like loquats, but its improvement on overall performance remains limited. In Experiment 4 of Table 6, the SCConv module was introduced into the baseline model, leading to a certain enhancement in overall detection performance; however, the improvement effect on GFLOPs and Params—indicators reflecting lightweight characteristics—was not significant.

Experiment 5 in Table 6 was built upon Experiment 2, with SimAm added to the Neck section. Compared to Experiment 1, the Precision decreased by 5.0%, the Recall dropped by 2.6%, and the mAP reduced by 1.9%, while GFLOPs decreased by 2.4 G and Params reduced by 1.2 M. In contrast to Experiment 2, the Precision increased by 6.1%, the Recall decreased by 2.6%, the mAP improved by 2.6%, and the floating-point operations increased by 0.1 G. It demonstrated that the introduction of SimAm effectively enhances the model’s focus on small targets.

Experiment 6 in Table 6 further optimized the C2f structure and integrated the SCConv module based on Experiment 5, resulting in improved Precision and mAP, as well as reduced GFLOPs and Params. Compared with the baseline YOLOv8n model, the YOLO-MCS model incorporating the three key components (EfficientNet-b0, SimAm, and SCConv) achieves lightweight improvements while simultaneously enhancing detection precision.

In conclusion, compared with the baseline YOLOv8n, the optimized YOLO-MCS model achieves significant improvements in key metrics: Precision (+1.3%), mAP (+2.2%), while reducing GFLOPs by 34.1% and Params by 43.3%. This dual enhancement (performance boost + lightweight design) is attributed to the synergistic integration of three core components.

(1): EfficientNet-b0 Backbone: Utilizing composite scaling, it cuts computational overhead while balancing detection speed and accuracy.
(2): SimAm Attention Mechanism: Via feature recalibration, it strengthens the model’s focus on loquat targets in complex backgrounds, reducing irrelevant interference.
(3): SCConv Module: Suppresses spatial and channel redundancy to output more discriminative features, directly improving classification and localization accuracy.

Ultimately, the synergistic effect of these components enables YOLO-MCS to achieve remarkable overall detection performance under lightweight constraints. This fully validates the effectiveness and rationality of the proposed optimization strategy, laying a foundation for its practical application in agricultural loquat detection.

3.4. Comparison of Different Models

In this experiment, performance comparison was conducted between different YOLO versions and the YOLO-MCS model. Additionally, the Area Under the Curve (AUC) of the Precision-Recall (P-R) curves under the corresponding operating environments was plotted, followed by visual analysis.

3.4.1. Performance Comparison of Different YOLO Versions

To substantiate the enhanced efficacy of the improved YOLO-MCS model in loquat fruit recognition, this study compared its performance with several mainstream models. These models include Faster R-CNN [37], YOLO-Lite [38], YOLOv5 [39], YOLOv6 [40], YOLOv7 [41], YOLOv8 [42], YOLOv9 [43], YOLOv10 [44], YOLOv11 [45], and YOLOv12 [46].

The comparison experiment was conducted strictly in accordance with the parameter settings in Table 1, with the training iteration count fixed at 300. To ensure the objectivity and accuracy of model performance evaluation, two key controls were implemented: first, avoiding interference from other confounding factors; second, ensuring experimental data truly reflect performance differences among different model architectures. A set of standardized configurations was applied throughout the experiment to achieve these goals.

The results of the mainstream detection model performance comparison are detailed in Table 7. According to the experimental findings, the YOLO-MCS model outperforms other mainstream models in Precision and mAP, while requiring fewer parameters and floating-point operations. Compared with 10 mainstream algorithms, the Precision of YOLO-MCS shows the following improvements: (1) Faster R-CNN (+23.7%); (2) YOLO-Lite (+2.1%); (3) YOLOv5 (+3.3%); (4) YOLOv6 (+5.1%); (5) YOLOv7 (+1.8%); (6) YOLOv8 (+1.3%); (7) YOLOv9 (+9.3%); (8) YOLOv10 (+4.3%); (9)YOLOv11 (+9.9%); (10) YOLOv12 (+8.6%). The mAP comparison displays the following results: (1) Faster R-CNN (−2.5%); (2) YOLO-Lite (−2.8%); (3) YOLOv5 (+3.0%); (4) YOLOv6 (+4.3%); (5) YOLOv7 (+2.7%); (6) YOLOv8 (+2.2%); (7) YOLOv9 (−1.8%); (8) YOLOv10 (+7.6%); (9) YOLOv11 (+3.7%); (10) YOLOv12 (+5.0%). The computational cost of GFLOPs and Params shows the following reductions: (1) 364.8 G/135.4 M (vs. Faster R-CNN); (2) 10.0 G/2.0 M (vs. YOLO-Lite); (3) 1.8 G/0.8 M (vs. YOLOv5); (4) 6.5 G/2.5 M (vs. YOLOv6); (5) 99.9 G/26.0 M (vs. YOLOv7); (6) 2.8 G/1.3 M (vs. YOLOv8); (7) 96.9 G/23.6 M (vs. YOLOv9); (8) 53.5 G/13.6 M (vs. YOLOv10); (9) 45.4 G/10.8 M (vs. YOLOv11); (10) 43.2 G/10.2 M (vs. YOLOv12).

Based on these performance metrics, the YOLO-MCS algorithm demonstrates significantly superior comprehensive performance. It provides strong support for the deployment of loquat detection models on mobile devices.

3.4.2. Visual Analytics

To more intuitively illustrate the effectiveness of the model improvements, a comparison of Precision-Recall (P-R) curves was conducted between the improved YOLO-MCS model and other mainstream YOLO models. As clearly presented in the P-R curve (Figure 10), the YOLO-MCS model proposed in this study exhibits outstanding performance in loquat fruit detection tasks. Specifically, YOLO-MCS-AUC achieves an mAP of 0.930, which significantly surpasses the mainstream baseline models: YOLOv5 (0.900), YOLOv6 (0.887), YOLOv7 (0.903), YOLOv8 (0.908) and newer model variants: YOLOv11 (0.893) and YOLOv12 (0.880).

While YOLO-Lite (0.958) and YOLOv9 (0.948) yield marginally higher mAP values, YOLO-MCS further enhances model lightweighting without compromising its competitive mAP performance. Through dedicated optimization of its network structure, YOLO-MCS strikes a more optimal balance between model lightweighting and detection accuracy. This result fully validates the efficacy of the proposed improvement strategy, confirming that the YOLO-MCS model is particularly well-suited for agricultural computer vision tasks such as loquat fruit recognition.

4. Discussion

To address the complex background interference and target occlusion issues faced by loquat harvesting machinery in orchard environments, this study systematically evaluated the recognition performance of the improved YOLO-MCS detection model under different meteorological conditions and occlusion scenarios. Cross-validation is employed to fully verify the model’s generalization capability by partitioning the dataset into training, validation, and independent test sets. All reported performance metrics were calculated based on the independent test set that did not participate in the training process, ensuring the objectivity and universality of the evaluation results.

As shown in Figure 11, the experimental data are derived from loquat fruit images captured under varying lighting conditions, shooting angles, and occlusion levels. The structure and parameters of the YOLO-MCS model remain consistent in this study; however, evaluation results across different loquat datasets indicate significant differences in Precision and Recall, while the mAP shows relatively minor variations. According to Table 8, under the three distinct conditions, the model’s Precision values are 87.3%, 87.0%, and 85.4%, respectively, with Recall rates of 81.7%, 89.0%, and 87.8%.

This phenomenon primarily stems from inherent differences in the datasets themselves: variations exist across datasets in terms of collection environments (e.g., lighting conditions, background complexity), target characteristics (e.g., fruit size, distribution density, occlusion level), and annotation standards and quality. These factors collectively impact model performance: complex datasets containing more occluded and small target fruits lead to missed detections, thereby reducing Recall; datasets with complex backgrounds or interfering objects similar to fruits are prone to false detections, thus decreasing Precision. As a comprehensive metric, mAP directly reflects the model’s ability to adapt to the challenges presented by specific test datasets. The mAP values in Table 8 are 88.7%, 89.9%, and 89.9%, respectively, which demonstrates the model’s robustness.

Specifically, the EfficientNet-b0 backbone network constructs hierarchical multi-scale features on the basis of lightweight design through balanced compound scaling, which is the fundamental reason for the model’s effective handling of leaf occlusion and lighting variations. The SCConv module significantly enhances the model’s feature discrimination capability for dense clustered targets by synergistically optimizing spatial and channel redundancy. Meanwhile, the SimAm attention mechanism accurately focuses on key fruit regions through parameter-free energy function optimization, enabling the model to maintain high detection precision even in complex backgrounds. The synergistic effects of these improved modules systematically enhance the model’s robustness across multiple aspects, including feature extraction, redundancy suppression, and target focusing. Despite variations in detection results across different datasets, the model’s overall performance remains excellent. This experiment indicates that the proposed model achieves high and stable mAP in loquat fruit recognition under complex natural environments.

5. Conclusions

This study addresses the key technical challenges in loquat target detection under complex orchard scenarios, including low computational efficiency, high memory consumption, and insufficient recognition precision. Based on the YOLOv8n network framework, a lightweight improved model named YOLO-MCS was proposed. The model’s improvements involve three key aspects: (1) Reconstructing the feature extraction backbone using the lightweight EfficientNet-b0 module; (2) Enhancing feature representation by introducing the efficient, parameter-free, and lightweight SimAm module at the Neck layer; (3) Reducing feature redundancy by replacing the standard C2f bottleneck structure with SCConv convolutions.

The improved lightweight YOLO-MCS model achieves Precision, Recall, mean Average Precision (mAP), GFLOPs and Params of 93.8%, 87.6%, 93.0%, 5.4 G, and 1.7 M, respectively. Compared with the baseline YOLOv8n model, the YOLO-MCS model proposed in this study significantly enhances model efficiency while maintaining excellent detection capabilities. Specifically, the Precision is increased by 1.3%, the Recall remains unchanged, and the mAP is improved by 2.2%; in addition, the GFLOPs and Params are reduced by 34.1% and 43.3%, respectively. Furthermore, the results of detection experiments conducted under different lighting conditions, different perspectives, and different obstruction environments indicate that the YOLO-MCS model exhibits high and stable mAP. Compared with other models, it demonstrates the optimal comprehensive performance in loquat detection tasks under natural scenes. While ensuring detection precision, the model realizes lightweight design, thereby achieving an efficient balance and possessing significant advantages.

The proposed YOLO-MCS model demonstrates effective performance in loquat recognition, though room for improvement remains. The existing dataset only contains images of mature loquats with golden-yellow skin from a single variety. It lacks images of unripe (completely green) and semi-ripe (partially green) loquats. Furthermore, it lacks images of loquats from orchard environments in different regions. This limitation may affect the model’s ability to recognize loquats of varying ripeness and different varieties.

Meanwhile, the evaluation metrics in this study primarily focus on precision-related indicators (Precision, Recall, mAP) and lightweight metrics (GFLOPs, Params), without yet incorporating real-time-related FPS (Frames Per Second) metrics. Since inference speed is a crucial measure of orchard robots’ online detection capabilities, the absence of FPS testing also represents one of the current limitations of this research. Therefore, subsequent research will focus on expanding the collection of loquat samples across different maturity stages and images from orchards in diverse geographical regions. This will enhance the model’s versatility and adaptability in detecting loquats at various growth phases. Efforts will also be made to facilitate the practical deployment and performance evaluation of the model on orchard robots or smart terminals, ultimately achieving truly robust real-time detection across different orchards.

Author Contributions

Conceptualization, W.Z. and L.G.; methodology, W.Z. and L.G.; software, W.Z. and Y.B.; validation, F.S. and Y.B.; formal analysis, L.G. and W.Z.; investigation, Y.B.; resources, L.G. and F.S.; data curation, Y.B.; writing—original draft preparation, W.Z.; writing—review and editing, W.Z. and L.G.; funding acquisition, L.G. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the special topic on innovation and entrepreneurship education in 2024 of CC National Mass Innovation Space of Chengdu University (project number: ccyg202401008) and Chengdu University’s Graduate Education and Teaching Excellence Project (project number: 2025YL001).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Liu, S.; Xue, J.; Zhang, T.; Lv, P.; Qin, H.; Zhao, T. Research progress and prospect of key technologies of fruit target recognition for robotic fruit picking. Front. Plant Sci. 2024, 15, 1423338. [Google Scholar] [CrossRef]
Yang, Y.; Han, Y.; Li, S.; Yang, Y.; Zhang, M.; Li, H. Vision based fruit recognition and positioning technology for harvesting robots. Comput. Electron. Agric. 2023, 213, 108258. [Google Scholar] [CrossRef]
Ariza-Sentís, M.; Vélez, S.; Baja, H.; Valenti, R.G.; Valente, J. An aerial framework for Multi-View grape bunch detection and route Optimization using ACO. Comput. Electron. Agric. 2024, 221, 108972. [Google Scholar] [CrossRef]
Testolin, R.; Ferguson, A. Kiwifruit (Actinidia spp.) production and marketing in Italy. J. Crop Hortic. Sci. 2009, 37, 1–32. [Google Scholar] [CrossRef]
Chen, C.; Lu, J.; Zhou, M.; Yi, J.; Liao, M.; Gao, Z. A YOLOv3-based computer vision system for identification of tea buds and the picking point. Comput. Electron. Agric. 2022, 198, 107116. [Google Scholar] [CrossRef]
Liu, H.; Zhou, L.; Zhao, J.; Wang, F.; Yang, J.; Liang, K.; Li, Z. Deep-learning-based accurate identification of warehouse goods for robot picking operations. Sustainability 2022, 14, 7781. [Google Scholar]
He, W.; Gage, J.L.; Rellán-Álvarez, R.; Xiang, L. Swin-Roleaf: A new method for characterizing leaf azimuth angle in large-scale maize plants. Comput. Electron. Agric. 2024, 224, 109120. [Google Scholar] [CrossRef]
Hua, X.; Li, H.; Zeng, J.; Han, C.; Chen, T.; Tang, L.; Luo, Y. A review of target recognition technology for fruit picking robots: From digital image processing to deep learning. Appl. Sci. 2023, 13, 4160. [Google Scholar] [CrossRef]
Tang, Y.; Chen, M.; Wang, C.; Luo, L.; Li, J.; Lian, G.; Zou, X. Recognition and localization methods for vision-based fruit picking robots: A review. Front. Plant Sci. 2020, 11, 510. [Google Scholar] [CrossRef]
Tulbure, A.-A.; Tulbure, A.-A.; Dulf, E.-H. A review on modern defect detection models using DCNNs–Deep convolutional neural networks. J. Adv. Res. 2022, 35, 33–48. [Google Scholar] [CrossRef] [PubMed]
Jing, J.; Zhang, S.; Sun, H.; Ren, R.; Cui, T. YOLO-PEM: A lightweight detection method for young “Okubo” peaches in complex orchard environments. Agronomy 2024, 14, 1757. [Google Scholar] [CrossRef]
Deng, F.; Chen, J.; Fu, L.; Zhong, J.; Qiaoi, W.; Luo, J.; Li, J.; Li, N. Real-time citrus variety detection in orchards based on complex scenarios of improved YOLOv7. Front. Plant Sci. 2024, 15, 1381694. [Google Scholar] [CrossRef]
Yu, K.; Tang, G.; Chen, W.; Hu, S.; Li, Y.; Gong, H. MobileNet-YOLO v5s: An improved lightweight method for real-time detection of sugarcane stem nodes in complex natural environments. IEEE Access 2023, 11, 104070–104083. [Google Scholar] [CrossRef]
Sun, F.; Lv, Q.; Bian, Y.; He, R.; Lv, D.; Gao, L.; Wu, H.; Li, X. Grape Target Detection Method in Orchard Environment Based on Improved YOLOv7. Agronomy 2024, 15, 42. [Google Scholar] [CrossRef]
Sun, H.; Wang, B.; Xue, J. YOLO-P: An efficient method for pear fast detection in complex orchard picking environment. Front. Plant Sci. 2023, 13, 1089454. [Google Scholar] [CrossRef] [PubMed]
Liu, Z.; Abeyrathna, R.R.D.; Sampurno, R.M.; Nakaguchi, V.M.; Ahamed, T. Faster-YOLO-AP: A lightweight apple detection algorithm based on improved YOLOv8 with a new efficient PDWConv in orchard. Comput. Electron. Agric. 2024, 223, 109118. [Google Scholar] [CrossRef]
Lv, Q.; Sun, F.; Bian, Y.; Wu, H.; Li, X.; Zhou, J. A Lightweight Citrus Object Detection Method in Complex Environments. Agriculture 2025, 15, 1046. [Google Scholar] [CrossRef]
Varghese, R.; Sambath, M. Yolov8: A novel object detection algorithm with enhanced performance and robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 18–19 April 2024; pp. 1–6. [Google Scholar]
Qi, C.; Nyalala, I.; Chen, K. Detecting the early flowering stage of tea chrysanthemum using the F-YOLO model. Agronomy 2021, 11, 834. [Google Scholar] [CrossRef]
Sun, Y.; Li, Y.; Li, S.; Duan, Z.; Ning, H.; Zhang, Y. PBA-YOLOv7: An object detection method based on an improved YOLOv7 network. Appl. Sci. 2023, 13, 10436. [Google Scholar] [CrossRef]
Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
Huang, M.; Mi, W.; Wang, Y. Edgs-yolov8: An improved YOLOv8 lightweight uav detection model. Drones 2024, 8, 337. [Google Scholar] [CrossRef]
Ma, B.; Hua, Z.; Wen, Y.; Deng, H.; Zhao, Y.; Pu, L.; Song, H. Using an improved lightweight YOLOv8 model for real-time detection of multi-stage apple fruit in complex orchard environments. Artif. Intell. Agric. 2024, 11, 70–82. [Google Scholar]
Shi, Y.; Qing, S.; Zhao, L.; Wang, F.; Yuwen, X.; Qu, M. Yolo-peach: A high-performance lightweight yolov8s-based model for accurate recognition and enumeration of peach seedling fruits. Agronomy 2024, 14, 1628. [Google Scholar]
Atila, Ü.; Uçar, M.; Akyol, K.; Uçar, E. Plant leaf disease classification using EfficientNet deep learning model. Ecol. Inform. 2021, 61, 101182. [Google Scholar] [CrossRef]
Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
Li, J.; Wen, Y.; He, L. Scconv: Spatial and channel reconstruction convolution for feature redundancy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 6153–6162. [Google Scholar]
Yang, L.; Zhang, R.-Y.; Li, L.; Xie, X. Simam: A simple, parameter-free attention module for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 11863–11874. [Google Scholar]
Sun, D.; Zhang, K.; Zhong, H.; Xie, J.; Xue, X.; Yan, M.; Wu, W.; Li, J. Efficient tobacco pest detection in complex environments using an enhanced YOLOv8 model. Agriculture 2024, 14, 353. [Google Scholar] [CrossRef]
Ma, R.; Wang, J.; Zhao, W.; Guo, H.; Dai, D.; Yun, Y.; Li, L.; Hao, F.; Bai, J.; Ma, D. Identification of maize seed varieties using MobileNetV2 with improved attention mechanism CBAM. Agriculture 2022, 13, 11. [Google Scholar] [CrossRef]
Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
Liu, Y.; Shao, Z.; Hoffmann, N. Global Attention Mechanism: Retain Information to Enhance Channel-Spatial Interactions. arXiv 2021, arXiv:2112.05561. [Google Scholar] [CrossRef]
Wang, Y.; Deng, H.; Wang, Y.; Song, L.; Ma, B.; Song, H. CenterNet-LW-SE net: Integrating lightweight CenterNet and channel attention mechanism for the detection of Camellia oleifera fruits. Multimed. Tools Appl. 2024, 83, 68585–68603. [Google Scholar] [CrossRef]
Lau, K.W.; Po, L.-M.; Rehman, Y.A.U. Large separable kernel attention: Rethinking the large kernel attention design in cnn. Expert Syst. Appl. 2024, 236, 121352. [Google Scholar] [CrossRef]
Howard, A.; Sandler, M.; Chu, G.; Chen, L.-C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
Li, J.; Zhu, Z.; Liu, H.; Su, Y.; Deng, L. Strawberry R-CNN: Recognition and counting model of strawberry based on improved faster R-CNN. Ecol. Inform. 2023, 77, 102210. [Google Scholar]
Huang, R.; Pedoeem, J.; Chen, C. YOLO-LITE: A real-time object detection algorithm optimized for non-GPU computers. In Proceedings of the 2018 IEEE International Conference on Big Data (Big Data), Seattle, WA, USA, 10–13 December 2018; pp. 2503–2510. [Google Scholar]
Malta, A.; Mendes, M.; Farinha, T. Augmented reality maintenance assistant using yolov5. Appl. Sci. 2021, 11, 4758. [Google Scholar] [CrossRef]
Norkobil Saydirasulovich, S.; Abdusalomov, A.; Jamil, M.K.; Nasimov, R.; Kozhamzharova, D.; Cho, Y.-I. A YOLOv6-based improved fire detection approach for smart city environments. Sensors 2023, 23, 3161. [Google Scholar] [CrossRef] [PubMed]
Wu, D.; Jiang, S.; Zhao, E.; Liu, Y.; Zhu, H.; Wang, W.; Wang, R. Detection of Camellia oleifera fruit in complex scenes by using YOLOv7 and data augmentation. Appl. Sci. 2022, 12, 11318. [Google Scholar] [CrossRef]
Ma, N.; Su, Y.; Yang, L.; Li, Z.; Yan, H. Wheat seed detection and counting method based on improved YOLOv8 model. Sensors 2024, 24, 1654. [Google Scholar] [CrossRef]
Wang, Y.; Rong, Q.; Hu, C. Ripe tomato detection algorithm based on improved YOLOv9. Plants 2024, 13, 3253. [Google Scholar] [CrossRef] [PubMed]
Li, A.; Wang, C.; Ji, T.; Wang, Q.; Zhang, T. D3-YOLOv10: Improved YOLOv10-based lightweight tomato detection algorithm under facility scenario. Agriculture 2024, 14, 2268. [Google Scholar] [CrossRef]
Teng, H.; Wang, Y.; Li, W.; Chen, T.; Liu, Q. Advancing Rice Disease Detection in Farmland with an Enhanced YOLOv11 Algorithm. Sensors 2025, 25, 3056. [Google Scholar] [CrossRef]
Yin, X.; Zhao, Z.; Weng, L. MAS-YOLO: A Lightweight Detection Algorithm for PCB Defect Detection Based on Improved YOLOv12. Appl. Sci. 2025, 15, 6238. [Google Scholar] [CrossRef]

Figure 1. Loquat dataset collected under different conditions: (a) Isolated fruit; (b) Cluster of fruits; (c) Sunny day; (d) Cloudy day; (e) Front lighting; (f) Back lighting; (g) Obstruction; (h) Ultra-long distance. These categories are used to reflect the model’s adaptability under complex conditions such as lighting variations, occlusions, and changes in viewing distance, providing critical foundations for subsequent training and performance evaluation.

Figure 2. Image enhancement of loquat fruit: (a) Raw images; (b) Flip; (c) Random rotation; (d) Random noise; (e) Sharpen; (f) Different brightness. These augmentation strategies aim to simulate variations in viewpoint, imaging noise, and lighting conditions that may occur in orchard environments. This enhances the model’s robustness and generalization capabilities in complex scenarios, providing a more diverse sample distribution for subsequent training.

Figure 3. YOLOv8n network structure diagram.

Figure 4. YOLO-MCS network structure diagram.

Figure 5. MBConv module structure.

Figure 6. Comparative diagram of bottleneck architectures with and without C2f enhancement: (a) Baseline bottleneck; (b) Improved bottleneck.

Figure 7. SCConv module structure diagram.

Figure 8. SimAm attention mechanism structure diagram.

Figure 9. Detection results for different network backbones: (a) Raw images; (b) YOLOv8n; (c) Mo bileNetv3; (d) ShuffleNetv2; (e) GhostNetv2; (f) EfficientNet-b0. White circles indicate genuine targets not recognized by the model, while blue circles denote false positive regions. This visualization highlights differences among model backbones in feature extraction capability, false negative rate, and false positive rate, providing a visual basis for selecting the optimal backbone architecture.

Figure 10. Precision-Recall curves of the proposed YOLO-MCS model and several representative YOLO variants under identical testing conditions. The P-R curve illustrates the trade-off between Precision and Recall at different confidence thresholds, with a position further to the upper right indicating superior overall detection performance. The legend lists the corresponding mAP values for each model, enabling direct comparison of detection robustness and highlighting the relative advantages of the proposed method.

Figure 11. Visual examples of detection results for the proposed model under various challenging conditions. Subfigures (a–c) illustrate performance under different lighting environments, including strong illumination, backlighting, and low-light scenes. Subfigures (d–f) show detection outcomes from different camera perspectives, reflecting viewpoint variations commonly encountered in orchard operations. Subfigures (g–i) present cases with different levels of occlusion caused by leaves, branches, or overlapping fruits. These examples demonstrate the model’s robustness to illumination changes, viewpoint shifts, and occlusion interference, highlighting its capability to adapt to diverse real-world orchard scenarios.

Table 1. Experimental parameters.

Training Parameters	Numerical Value
Image size	300
Batch size	640 × 640
Initial learning rate	0.01
Optimizer	SGD
Momentum	0.937
Multi-threaded	16

Table 2. Dataset Division.

Category	Original Image	Image Enhancement	Training Set	Validation Set	Test Set
Number	325	1625	1560	195	195

Note: This table presents the proportion and sample size of the loquat dataset across the training, validation, and test sets. The training set is used for model parameter learning, the validation set for participation and early stopping decisions, and the test set for objectively evaluating the model’s final performance on unseen data. Structured data partitioning ensures controllability during model training and enhances the model’s generalization capability.

Table 3. EfficientNet-b0 lightweight module structure table.

	Type	Resolution	Channels	Layers
1	Conv 3×3	224 × 224	32	1
2	MBConv1,k3×3	112 × 112	16	1
3	MBConv6,k3×3	112 ×112	24	2
4	MBConv6,k5×5	56 × 56	40	2
5	MBConv6,k5×5	28 × 28	80	3
6	MBConv6,k5×5	14 × 14	112	3
7	MBConv6,k5×5	14 × 14	192	4
8	MBConv6,k3×3	7 × 7	320	1
9	Conv 1×1 & Pooling & FC	7 × 7	1280	1

Note: Pooling indicates the subsampling layer operation, and FC signifies the fully connected layer transformation.

Table 4. Model detection results with different attention mechanisms introduced.

Models	Precision (%)	Recall (%)	mAP (%)	GFLOPs (G)	Params (M)
YOLOv8n	92.5	87.6	90.8	8.2	3.0
CBAM	86.8	84.2	88.0	5.5	1.7
SE	89.8	86.3	86.9	5.4	1.7
ECA	85.8	84.3	89.0	5.4	1.7
GAM	86.8	87.2	88.5	6.5	2.2
LSKA	88.7	80.6	88.2	5.6	1.7
SimAm	93.8	87.6	93.0	5.4	1.7

Note: This table compares the changes in detection performance after introducing different attention mechanisms into the model. The listed metrics include core evaluation Parameters such as Precision, Recall, and mAP. By comparing these metrics, the contributions of each attention module in feature extraction and information filtering can be observed, thereby analyzing their impact on the model’s overall recognition capability and generalization performance. This provides quantitative evidence for selecting the optimal attention structure.

Table 5. Comparison of different main stems.

Backbone Network	Precision (%)	Recall (%)	mAP (%)	GFLOPs (G)	Params (M)
YOLOv8n	92.5	87.6	90.8	8.2	3.0
MobileNetv3	90.3	77.3	87.4	5.8	2.4
ShuffleNetv2	83.4	84.8	89.6	4.7	1.5
GhostNetv2	82.1	84.6	87.3	7.7	3.5
EfficientNet-b0	93.8	87.6	93.0	5.4	1.7

Note: This table provides a comparative evaluation of different backbone networks in terms of detection accuracy (Precision, Recall, and mAP) and computational complexity (GFLOPs and Params). These metrics reflect each backbone’s feature extraction capability and efficiency, allowing assessment of the trade-off between model performance and lightweight design. The comparison offers a direct basis for selecting an appropriate backbone for the proposed detection framework.

Table 6. Ablation experiment results.

	Efficient Net-b0	Sim Am	SC Conv	P (%)	R (%)	mAP (%)	GFLOPs (G)	Params (M)
1	✗	✗	✗	92.5	87.6	90.8	8.2	3.0
2	✓	✗	✗	81.4	87.6	86.3	5.7	1.9
3	✗	✓	✗	91.3	87.6	91.1	8.3	2.9
4	✗	✗	✓	88.8	91.2	92.3	7.8	2.8
5	✓	✓	✗	87.5	85.0	88.9	5.8	1.8
6	✓	✓	✓	93.8	87.6	93.0	5.4	1.7

Note: This table summarizes the ablation experiments conducted to evaluate the contribution of each module—EfficientNet-b0 backbone, SimAm, and SCConv—to the overall detection performance. “✗” indicates that the module is not included in the model, while “✓” indicates that it is. The reported metrics (Precision, Recall, and mAP) reflect variations in detection accuracy, while GFLOPs and Params are used to quantify the computational cost. By comparing different module combinations, the table reveals how each component influences accuracy–efficiency trade-offs and helps identify the most effective configuration for the proposed detection network.

Table 7. Comparison with other YOLO models.

Models	Precision (%)	Recall (%)	mAP (%)	GFLOPs (G)	Params (M)
Faster R-CNN	70.1	98.0	95.5	370.2	137.1
YOLO-Lite	91.7	94.0	95.8	15.4	3.7
YOLOv5	90.5	85.4	90.0	7.2	2.5
YOLOv6	88.7	84.7	88.7	11.9	4.2
YOLOv7	92.0	82.2	90.3	105.3	27.7
YOLOv8	92.5	87.6	90.8	8.2	3.0
YOLOv9	84.5	95.2	94.8	102.3	25.3
YOLOv10	89.5	73.0	85.4	58.9	15.3
YOLOv11	83.9	88.6	89.3	50.8	12.5
YOLOv12	85.2	87.4	88.0	48.6	11.9
YOLO-MCS	93.8	87.6	93.0	5.4	1.7

Note: This table presents a comparative evaluation of the proposed YOLO-MCS model against classical detection networks (e.g., Faster R-CNN) and multiple YOLO series models (YOLOv5–YOLOv12). The metrics include Precision, Recall, and mAP, which assess detection accuracy, while GFLOPs and Params reflect computational complexity. By comparing these models under consistent experimental settings, the table highlights differences in accuracy–efficiency trade-offs and demonstrates the advantages of the proposed method in achieving high accuracy with lower computational cost.

Table 8. Test results under different conditions.

Model	Conditions	P (%)	R (%)	mAP (%)
YOLO-MCS	Different light conditions	87.3	81.7	88.7
	Different perspectives	87.0	89.0	89.9
	Different obstruction	85.4	87.8	89.9

Note: This table summarizes the detection performance of the YOLO-MCS model under three representative real-world conditions: varying lighting environments, different camera perspectives, and diverse obstruction levels. Precision, Recall, and mAP are reported to reflect the model’s stability across these scenarios. The results demonstrate the model’s ability to maintain stable and high mAP despite illumination changes, viewpoint shifts, and occlusion interference, highlighting its robustness and generalization capability in complex orchard environments.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhou, W.; Gao, L.; Sun, F.; Bian, Y. YOLO-MCS: A Lightweight Loquat Object Detection Algorithm in Orchard Environments. Agriculture 2026, 16, 262. https://doi.org/10.3390/agriculture16020262

AMA Style

Zhou W, Gao L, Sun F, Bian Y. YOLO-MCS: A Lightweight Loquat Object Detection Algorithm in Orchard Environments. Agriculture. 2026; 16(2):262. https://doi.org/10.3390/agriculture16020262

Chicago/Turabian Style

Zhou, Wei, Leina Gao, Fuchun Sun, and Yuechao Bian. 2026. "YOLO-MCS: A Lightweight Loquat Object Detection Algorithm in Orchard Environments" Agriculture 16, no. 2: 262. https://doi.org/10.3390/agriculture16020262

APA Style

Zhou, W., Gao, L., Sun, F., & Bian, Y. (2026). YOLO-MCS: A Lightweight Loquat Object Detection Algorithm in Orchard Environments. Agriculture, 16(2), 262. https://doi.org/10.3390/agriculture16020262

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

YOLO-MCS: A Lightweight Loquat Object Detection Algorithm in Orchard Environments

Abstract

1. Introduction

2. Material Preparation and Experimental Process

2.1. Experimental Platform and Parameter Settings

2.2. Evaluation Criteria

2.3. Loquat Image Collection

2.4. Dataset Creation

2.5. YOLOv8n Model

2.6. YOLO-MCS Model

2.6.1. EfficientNet-b0 Feature Extraction Network

2.6.2. C2f_SCConv Convolution Module

2.6.3. SimAm Attention Mechanism Module

3. Experimental Results and Analysis

3.1. Contrastive Investigation of Attention Mechanisms

3.2. Comparison Experiment Between Different Backbone Networks

3.2.1. Performance Comparison Across Different Backbone Networks

3.2.2. Comparison of Loquat Detection Images Under Different Backbone Networks

3.3. Ablation Experiment

3.4. Comparison of Different Models

3.4.1. Performance Comparison of Different YOLO Versions

3.4.2. Visual Analytics

4. Discussion

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI